Science.gov

Sample records for functional gene annotation

  1. FUNCTIONAL ANNOTATION OF OIL PALM GENES USING AN AUTOMATED BIOINFORMATICS APPROACH FUNCTIONAL ANNOTATION OF OIL PALM

    E-print Network

    Sinskey, Anthony J.

    FUNCTIONAL ANNOTATION OF OIL PALM GENES USING AN AUTOMATED BIOINFORMATICS APPROACH 35 FUNCTIONAL ANNOTATION OF OIL PALM GENES USING AN AUTOMATED BIOINFORMATICS APPROACH LAURA B WILLIS*; PHILIP A LESSARDBank, and duplicate entries were eliminated by pairwise BLAST searches, resulting in a collection of unique oil palm

  2. Combining gene sequence similarity and textual information for gene function annotation in the literature

    E-print Network

    Kihara, Daisuke

    Ontology terms for target genes. Empirical studies on two testbeds demonstrate that the combination as an important information source for gene function annotation. The automatic process of function annotation from

  3. Exploiting ontology graph for predicting sparsely annotated gene function

    PubMed Central

    Wang, Sheng; Cho, Hyunghoon; Zhai, ChengXiang; Berger, Bonnie; Peng, Jian

    2015-01-01

    Motivation: Systematically predicting gene (or protein) function based on molecular interaction networks has become an important tool in refining and enhancing the existing annotation catalogs, such as the Gene Ontology (GO) database. However, functional labels with only a few (<10) annotated genes, which constitute about half of the GO terms in yeast, mouse and human, pose a unique challenge in that any prediction algorithm that independently considers each label faces a paucity of information and thus is prone to capture non-generalizable patterns in the data, resulting in poor predictive performance. There exist a variety of algorithms for function prediction, but none properly address this ‘overfitting’ issue of sparsely annotated functions, or do so in a manner scalable to tens of thousands of functions in the human catalog. Results: We propose a novel function prediction algorithm, clusDCA, which transfers information between similar functional labels to alleviate the overfitting problem for sparsely annotated functions. Our method is scalable to datasets with a large number of annotations. In a cross-validation experiment in yeast, mouse and human, our method greatly outperformed previous state-of-the-art function prediction algorithms in predicting sparsely annotated functions, without sacrificing the performance on labels with sufficient information. Furthermore, we show that our method can accurately predict genes that will be assigned a functional label that has no known annotations, based only on the ontology graph structure and genes associated with other labels, which further suggests that our method effectively utilizes the similarity between gene functions. Availability and implementation: https://github.com/wangshenguiuc/clusDCA. Contact: jianpeng@illinois.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26072504

  4. Measuring semantic similarities by combining gene ontology annotations and gene co-function networks

    SciTech Connect

    Peng, Jiajie; Uygun, Sahra; Kim, Taehyong; Wang, Yadong; Rhee, Seung Y.; Chen, Jin

    2015-02-14

    Background: Gene Ontology (GO) has been used widely to study functional relationships between genes. The current semantic similarity measures rely only on GO annotations and GO structure. This limits the power of GO-based similarity because of the limited proportion of genes that are annotated to GO in most organisms. Results: We introduce a novel approach called NETSIM (network-based similarity measure) that incorporates information from gene co-function networks in addition to using the GO structure and annotations. Using metabolic reaction maps of yeast, Arabidopsis, and human, we demonstrate that NETSIM can improve the accuracy of GO term similarities. We also demonstrate that NETSIM works well even for genomes with sparser gene annotation data. We applied NETSIM on large Arabidopsis gene families such as cytochrome P450 monooxygenases to group the members functionally and show that this grouping could facilitate functional characterization of genes in these families. Conclusions: Using NETSIM as an example, we demonstrated that the performance of a semantic similarity measure could be significantly improved after incorporating genome-specific information. NETSIM incorporates both GO annotations and gene co-function network data as a priori knowledge in the model. Therefore, functional similarities of GO terms that are not explicitly encoded in GO but are relevant in a taxon-specific manner become measurable when GO annotations are limited.

  5. Measuring semantic similarities by combining gene ontology annotations and gene co-function networks

    DOE PAGESBeta

    Peng, Jiajie; Uygun, Sahra; Kim, Taehyong; Wang, Yadong; Rhee, Seung Y.; Chen, Jin

    2015-02-14

    Background: Gene Ontology (GO) has been used widely to study functional relationships between genes. The current semantic similarity measures rely only on GO annotations and GO structure. This limits the power of GO-based similarity because of the limited proportion of genes that are annotated to GO in most organisms. Results: We introduce a novel approach called NETSIM (network-based similarity measure) that incorporates information from gene co-function networks in addition to using the GO structure and annotations. Using metabolic reaction maps of yeast, Arabidopsis, and human, we demonstrate that NETSIM can improve the accuracy of GO term similarities. We also demonstratemore »that NETSIM works well even for genomes with sparser gene annotation data. We applied NETSIM on large Arabidopsis gene families such as cytochrome P450 monooxygenases to group the members functionally and show that this grouping could facilitate functional characterization of genes in these families. Conclusions: Using NETSIM as an example, we demonstrated that the performance of a semantic similarity measure could be significantly improved after incorporating genome-specific information. NETSIM incorporates both GO annotations and gene co-function network data as a priori knowledge in the model. Therefore, functional similarities of GO terms that are not explicitly encoded in GO but are relevant in a taxon-specific manner become measurable when GO annotations are limited.« less

  6. Combining evidence, biomedical literature and statistical dependence: new insights for functional annotation of gene sets

    PubMed Central

    Aubry, Marc; Monnier, Annabelle; Chicault, Celine; de Tayrac, Marie; Galibert, Marie-Dominique; Burgun, Anita; Mosser, Jean

    2006-01-01

    Background Large-scale genomic studies based on transcriptome technologies provide clusters of genes that need to be functionally annotated. The Gene Ontology (GO) implements a controlled vocabulary organised into three hierarchies: cellular components, molecular functions and biological processes. This terminology allows a coherent and consistent description of the knowledge about gene functions. The GO terms related to genes come primarily from semi-automatic annotations made by trained biologists (annotation based on evidence) or text-mining of the published scientific literature (literature profiling). Results We report an original functional annotation method based on a combination of evidence and literature that overcomes the weaknesses and the limitations of each approach. It relies on the Gene Ontology Annotation database (GOA Human) and the PubGene biomedical literature index. We support these annotations with statistically associated GO terms and retrieve associative relations across the three GO hierarchies to emphasise the major pathways involved by a gene cluster. Both annotation methods and associative relations were quantitatively evaluated with a reference set of 7397 genes and a multi-cluster study of 14 clusters. We also validated the biological appropriateness of our hybrid method with the annotation of a single gene (cdc2) and that of a down-regulated cluster of 37 genes identified by a transcriptome study of an in vitro enterocyte differentiation model (CaCo-2 cells). Conclusion The combination of both approaches is more informative than either separate approach: literature mining can enrich an annotation based only on evidence. Text-mining of the literature can also find valuable associated MEDLINE references that confirm the relevance of the annotation. Eventually, GO terms networks can be built with associative relations in order to highlight cooperative and competitive pathways and their connected molecular functions. PMID:16674810

  7. Expression profiling of hypothetical genes in Desulfovibrio vulgaris leads to improved functional annotation

    SciTech Connect

    Elias, Dwayne A.; Mukhopadhyay, Aindrila; Joachimiak, Marcin P.; Drury, Elliott C.; Redding, Alyssa M.; Yen, Huei-Che B.; Fields, Matthew W.; Hazen, Terry C.; Arkin, Adam P.; Keasling, Jay D.; Wall, Judy D.

    2008-10-27

    Hypothetical and conserved hypothetical genes account for>30percent of sequenced bacterial genomes. For the sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough, 347 of the 3634 genes were annotated as conserved hypothetical (9.5percent) along with 887 hypothetical genes (24.4percent). Given the large fraction of the genome, it is plausible that some of these genes serve critical cellular roles. The study goals were to determine which genes were expressed and provide a more functionally based annotation. To accomplish this, expression profiles of 1234 hypothetical and conserved genes were used from transcriptomic datasets of 11 environmental stresses, complemented with shotgun LC-MS/MS and AMT tag proteomic data. Genes were divided into putatively polycistronic operons and those predicted to be monocistronic, then classified by basal expression levels and grouped according to changes in expression for one or multiple stresses. 1212 of these genes were transcribed with 786 producing detectable proteins. There was no evidence for expression of 17 predicted genes. Except for the latter, monocistronic gene annotation was expanded using the above criteria along with matching Clusters of Orthologous Groups. Polycistronic genes were annotated in the same manner with inferences from their proximity to more confidently annotated genes. Two targeted deletion mutants were used as test cases to determine the relevance of the inferred functional annotations.

  8. Expression profiling of hypothetical genes in Desulfovibrio vulgaris leads to improved functional annotation

    PubMed Central

    Elias, Dwayne A.; Mukhopadhyay, Aindrila; Joachimiak, Marcin P.; Drury, Elliott C.; Redding, Alyssa M.; Yen, Huei-Che B.; Fields, Matthew W.; Hazen, Terry C.; Arkin, Adam P.; Keasling, Jay D.; Wall, Judy D.

    2009-01-01

    Hypothetical (HyP) and conserved HyP genes account for >30% of sequenced bacterial genomes. For the sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough, 347 of the 3634 genes were annotated as conserved HyP (9.5%) along with 887 HyP genes (24.4%). Given the large fraction of the genome, it is plausible that some of these genes serve critical cellular roles. The study goals were to determine which genes were expressed and provide a more functionally based annotation. To accomplish this, expression profiles of 1234 HyP and conserved genes were used from transcriptomic datasets of 11 environmental stresses, complemented with shotgun LC–MS/MS and AMT tag proteomic data. Genes were divided into putatively polycistronic operons and those predicted to be monocistronic, then classified by basal expression levels and grouped according to changes in expression for one or multiple stresses. One thousand two hundred and twelve of these genes were transcribed with 786 producing detectable proteins. There was no evidence for expression of 17 predicted genes. Except for the latter, monocistronic gene annotation was expanded using the above criteria along with matching Clusters of Orthologous Groups. Polycistronic genes were annotated in the same manner with inferences from their proximity to more confidently annotated genes. Two targeted deletion mutants were used as test cases to determine the relevance of the inferred functional annotations. PMID:19293273

  9. The Gene Ontology's Reference Genome Project: a unified framework for functional annotation across species.

    PubMed

    2009-07-01

    The Gene Ontology (GO) is a collaborative effort that provides structured vocabularies for annotating the molecular function, biological role, and cellular location of gene products in a highly systematic way and in a species-neutral manner with the aim of unifying the representation of gene function across different organisms. Each contributing member of the GO Consortium independently associates GO terms to gene products from the organism(s) they are annotating. Here we introduce the Reference Genome project, which brings together those independent efforts into a unified framework based on the evolutionary relationships between genes in these different organisms. The Reference Genome project has two primary goals: to increase the depth and breadth of annotations for genes in each of the organisms in the project, and to create data sets and tools that enable other genome annotation efforts to infer GO annotations for homologous genes in their organisms. In addition, the project has several important incidental benefits, such as increasing annotation consistency across genome databases, and providing important improvements to the GO's logical structure and biological content. PMID:19578431

  10. Algal functional annotation tool

    SciTech Connect

    Lopez, D.; Casero, D.; Cokus, S. J.; Merchant, S. S.; Pellegrini, M.

    2012-07-01

    The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of genes on KEGG pathway maps and batch gene identifier conversion.

  11. Evaluation of a Hybrid Approach Using UBLAST and BLASTX for Metagenomic Sequences Annotation of Specific Functional Genes

    PubMed Central

    Yang, Ying; Jiang, Xiao-Tao; Zhang, Tong

    2014-01-01

    The fast development of next generation sequencing (NGS) has dramatically increased the application of metagenomics in various aspects. Functional annotation is a major step in the metagenomics studies. Fast annotation of functional genes has been a challenge because of the deluge of NGS data and expanding databases. A hybrid annotation pipeline proposed previously for taxonomic assignments was evaluated in this study for metagenomic sequences annotation of specific functional genes, such as antibiotic resistance genes, arsenic resistance genes and key genes in nitrogen metabolism. The hybrid approach using UBLAST and BLASTX is 44–177 times faster than direct BLASTX in the annotation using the small protein database for the specific functional genes, with the cost of missing a small portion (<1.8%) of target sequences compared with direct BLASTX hits. Different from direct BLASTX, the time required for specific functional genes annotation using the hybrid annotation pipeline depends on the abundance for the target genes. Thus this hybrid annotation pipeline is more suitable in specific functional genes annotation than in comprehensive functional genes annotation. PMID:25347677

  12. Algal functional annotation tool

    Energy Science and Technology Software Center (ESTSC)

    2012-07-12

    Abstract BACKGROUND: Progress in genome sequencing is proceeding at an exponential pace, and several new algal genomes are becoming available every year. One of the challenges facing the community is the association of protein sequences encoded in the genomes with biological function. While most genome assembly projects generate annotations for predicted protein sequences, they are usually limited and integrate functional terms from a limited number of databases. Another challenge is the use of annotations tomore »interpret large lists of 'interesting' genes generated by genome-scale datasets. Previously, these gene lists had to be analyzed across several independent biological databases, often on a gene-by-gene basis. In contrast, several annotation databases, such as DAVID, integrate data from multiple functional databases and reveal underlying biological themes of large gene lists. While several such databases have been constructed for animals, none is currently available for the study of algae. Due to renewed interest in algae as potential sources of biofuels and the emergence of multiple algal genome sequences, a significant need has arisen for such a database to process the growing compendiums of algal genomic data. DESCRIPTION: The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of genes on KEGG pathway maps and batch gene identifier conversion. CONCLUSIONS: The Algal Functional Annotation Tool aims to provide an integrated data-mining environment for algal genomics by combining data from multiple annotation databases into a centralized tool. This site is designed to expedite the process of functional annotation and the interpretation of gene lists, such as those derived from high-throughput RNA-seq experiments. The tool is publicly available at http://pathways.mcdb.ucla.edu.« less

  13. Algal functional annotation tool

    SciTech Connect

    2012-07-12

    Abstract BACKGROUND: Progress in genome sequencing is proceeding at an exponential pace, and several new algal genomes are becoming available every year. One of the challenges facing the community is the association of protein sequences encoded in the genomes with biological function. While most genome assembly projects generate annotations for predicted protein sequences, they are usually limited and integrate functional terms from a limited number of databases. Another challenge is the use of annotations to interpret large lists of 'interesting' genes generated by genome-scale datasets. Previously, these gene lists had to be analyzed across several independent biological databases, often on a gene-by-gene basis. In contrast, several annotation databases, such as DAVID, integrate data from multiple functional databases and reveal underlying biological themes of large gene lists. While several such databases have been constructed for animals, none is currently available for the study of algae. Due to renewed interest in algae as potential sources of biofuels and the emergence of multiple algal genome sequences, a significant need has arisen for such a database to process the growing compendiums of algal genomic data. DESCRIPTION: The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of genes on KEGG pathway maps and batch gene identifier conversion. CONCLUSIONS: The Algal Functional Annotation Tool aims to provide an integrated data-mining environment for algal genomics by combining data from multiple annotation databases into a centralized tool. This site is designed to expedite the process of functional annotation and the interpretation of gene lists, such as those derived from high-throughput RNA-seq experiments. The tool is publicly available at http://pathways.mcdb.ucla.edu.

  14. Comparative analysis of grapevine whole-genome gene predictions, functional annotation, categorization and integration of the predicted gene sequences

    PubMed Central

    2012-01-01

    Background The first draft assembly and gene prediction of the grapevine genome (8X base coverage) was made available to the scientific community in 2007, and functional annotation was developed on this gene prediction. Since then additional Sanger sequences were added to the 8X sequences pool and a new version of the genomic sequence with superior base coverage (12X) was produced. Results In order to more efficiently annotate the function of the genes predicted in the new assembly, it is important to build on as much of the previous work as possible, by transferring 8X annotation of the genome to the 12X version. The 8X and 12X assemblies and gene predictions of the grapevine genome were compared to answer the question, “Can we uniquely map 8X predicted genes to 12X predicted genes?” The results show that while the assemblies and gene structure predictions are too different to make a complete mapping between them, most genes (18,725) showed a one-to-one relationship between 8X predicted genes and the last version of 12X predicted genes. In addition, reshuffled genomic sequence structures appeared. These highlight regions of the genome where the gene predictions need to be taken with caution. Based on the new grapevine gene functional annotation and in-depth functional categorization, twenty eight new molecular networks have been created for VitisNet while the existing networks were updated. Conclusions The outcomes of this study provide a functional annotation of the 12X genes, an update of VitisNet, the system of the grapevine molecular networks, and a new functional categorization of genes. Data are available at the VitisNet website (http://www.sdstate.edu/ps/research/vitis/pathways.cfm). PMID:22554261

  15. Visual annotation display (VLAD): a tool for finding functional themes in lists of genes.

    PubMed

    Richardson, Joel E; Bult, Carol J

    2015-10-01

    Experiments that employ genome scale technology platforms frequently result in lists of tens to thousands of genes with potential significance to a specific biological process or disease. Searching for biologically relevant connections among the genes or gene products in these lists is a common data analysis task. We have implemented a software application for uncovering functional themes in sets of genes based on their annotations to bio-ontologies, such as the gene ontology and the mammalian phenotype ontology. The application, called VisuaL Annotation Display (VLAD), performs a statistical analysis to test for the enrichment of ontology terms in a set of genes submitted by a researcher. The results for each analysis using VLAD includes a table of ontology terms, sorted in decreasing order of significance. Each row contains the term, statistics such as the number of annotated terms, the p value, etc., and the symbols of annotated genes. An accompanying graphical display shows portions of the ontology hierarchy, where node sizes are scaled based on p values. Although numerous ontology term enrichment programs already exist, VLAD is unique in that it allows users to upload their own annotation files and ontologies for customized term enrichment analyses, supports the analysis of multiple gene sets at once, provides interfaces to customize graphical output, and is tightly integrated with functional and biological details about mouse genes in the Mouse Genome Informatics (MGI) database. VLAD is available as a web-based application from the MGI web site ( http://proto.informatics.jax.org/prototypes/vlad/ ). PMID:26047590

  16. Gene fusions and gene duplications: relevance to genomic annotation and functional analysis

    PubMed Central

    Serres, Margrethe H; Riley, Monica

    2005-01-01

    Background Escherichia coli a model organism provides information for annotation of other genomes. Our analysis of its genome has shown that proteins encoded by fused genes need special attention. Such composite (multimodular) proteins consist of two or more components (modules) encoding distinct functions. Multimodular proteins have been found to complicate both annotation and generation of sequence similar groups. Previous work overstated the number of multimodular proteins in E. coli. This work corrects the identification of modules by including sequence information from proteins in 50 sequenced microbial genomes. Results Multimodular E. coli K-12 proteins were identified from sequence similarities between their component modules and non-fused proteins in 50 genomes and from the literature. We found 109 multimodular proteins in E. coli containing either two or three modules. Most modules had standalone sequence relatives in other genomes. The separated modules together with all the single (un-fused) proteins constitute the sum of all unimodular proteins of E. coli. Pairwise sequence relationships among all E. coli unimodular proteins generated 490 sequence similar, paralogous groups. Groups ranged in size from 92 to 2 members and had varying degrees of relatedness among their members. Some E. coli enzyme groups were compared to homologs in other bacterial genomes. Conclusion The deleterious effects of multimodular proteins on annotation and on the formation of groups of paralogs are emphasized. To improve annotation results, all multimodular proteins in an organism should be detected and when known each function should be connected with its location in the sequence of the protein. When transferring functions by sequence similarity, alignment locations must be noted, particularly when alignments cover only part of the sequences, in order to enable transfer of the correct function. Separating multimodular proteins into module units makes it possible to generate protein groups related by both sequence and function, avoiding mixing of unrelated sequences. Organisms differ in sizes of groups of sequence-related proteins. A sample comparison of orthologs to selected E. coli paralogous groups correlates with known physiological and taxonomic relationships between the organisms. PMID:15757509

  17. Gene annotation and functional analysis of a newly sequenced Synechococcus strain.

    PubMed

    Li, Y; Rao, N N; Yang, Y; Zhang, Y; Gu, Y N

    2015-01-01

    Synechococcus sp PCC 7336 represents a newly sequenced strain, and its genome is obviously different from that of other Synechococcus strains. In this analysis, local alignment and annotation databases were constructed and combined with various bioinformatic tools to carry out gene annotation and functional analysis of this strain. From this analysis, we identified 5096 protein-coding genes and 47 RNA genes. Of these, 116 genes that were classified into 9 categories were associated with photosynthesis, and type V polymerase proteins that were identified are unique for this strain. An additional 107 genes were closely related to signal transduction pathways, which primarily comprised parts of two-component regulatory systems. Gene ontogeny analysis showed that 2377 genes were annotated with a total number of 9791 functional categories, and specifically that 41 genes distributed in 4 protein complexes were involved in oxidative phosphorylation. Clusters of orthologous groups classification showed that there were 1463 homologous proteins associated with 17 specific metabolic pathways, and that most of the proteins participated in primary metabolic processes such as binding and catalysis. The phylogenetic tree based on 16S rRNA sequences indicated that Synechococcus PCC 7336 is highly likely to represent a new branch. PMID:26505391

  18. Finding New Order in Biological Functions from the Network Structure of Gene Annotations.

    PubMed

    Glass, Kimberly; Girvan, Michelle

    2015-11-01

    The Gene Ontology (GO) provides biologists with a controlled terminology that describes how genes are associated with functions and how functional terms are related to one another. These term-term relationships encode how scientists conceive the organization of biological functions, and they take the form of a directed acyclic graph (DAG). Here, we propose that the network structure of gene-term annotations made using GO can be employed to establish an alternative approach for grouping functional terms that captures intrinsic functional relationships that are not evident in the hierarchical structure established in the GO DAG. Instead of relying on an externally defined organization for biological functions, our approach connects biological functions together if they are performed by the same genes, as indicated in a compendium of gene annotation data from numerous different sources. We show that grouping terms by this alternate scheme provides a new framework with which to describe and predict the functions of experimentally identified sets of genes. PMID:26588252

  19. Finding New Order in Biological Functions from the Network Structure of Gene Annotations

    PubMed Central

    Glass, Kimberly; Girvan, Michelle

    2015-01-01

    The Gene Ontology (GO) provides biologists with a controlled terminology that describes how genes are associated with functions and how functional terms are related to one another. These term-term relationships encode how scientists conceive the organization of biological functions, and they take the form of a directed acyclic graph (DAG). Here, we propose that the network structure of gene-term annotations made using GO can be employed to establish an alternative approach for grouping functional terms that captures intrinsic functional relationships that are not evident in the hierarchical structure established in the GO DAG. Instead of relying on an externally defined organization for biological functions, our approach connects biological functions together if they are performed by the same genes, as indicated in a compendium of gene annotation data from numerous different sources. We show that grouping terms by this alternate scheme provides a new framework with which to describe and predict the functions of experimentally identified sets of genes. PMID:26588252

  20. Expression profiling of hypothetical genes in Desulfovibrio vulgaris leads to improved functional annotation

    SciTech Connect

    Elias, Dwayne A.; Mukhopadhyay, Aindrila; Joachimiak, Marcine P.; Drury, Elliott C.; Redding, Alyssa M.; Yen, Huei-Che B.; Fields, Matthew; Hazen, Terry C.; Arkin, Adam P.; Keasling, Jay D.; Wall, Judy D.

    2009-03-17

    Hypothetical (HyP) and conserved HyP genes account for >30% of sequenced bacterial genomes. For the sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough, 347 of the 3634 genes were annotated as conserved HyP (9.5%) along with 887 HyP genes (24.4%). Given the large fraction of the genome, it is plausible that some of these genes serve critical cellular roles. The study goals were to determine which genes were expressed and provide a more functionally based annotation. To accomplish this, expression profiles of 1234 HyP and conserved genes were used from transcriptomic datasets of 11 environmental stresses, complemented with shotgun LC–MS/MS and AMT tag proteomic data. Genes were divided into putatively polycistronic operons and those predicted to be monocistronic, then classified by basal expression levels and grouped according to changes in expression for one or multiple stresses. One thousand two hundred and twelve of these genes were transcribed with 786 producing detectable proteins. There was no evidence for expression of 17 predicted genes.

  1. Genome, Functional Gene Annotation, and Nuclear Transformation of the Heterokont Oleaginous Alga Nannochloropsis oceanica CCMP1779

    PubMed Central

    Tsai, Chia-Hong; Bullard, Blair; Cornish, Adam J.; Harvey, Christopher; Reca, Ida-Barbara; Thornburg, Chelsea; Achawanantakun, Rujira; Buehl, Christopher J.; Campbell, Michael S.; Cavalier, David; Childs, Kevin L.; Clark, Teresa J.; Deshpande, Rahul; Erickson, Erika; Armenia Ferguson, Ann; Handee, Witawas; Kong, Que; Li, Xiaobo; Liu, Bensheng; Lundback, Steven; Peng, Cheng; Roston, Rebecca L.; Sanjaya; Simpson, Jeffrey P.; TerBush, Allan; Warakanont, Jaruswan; Zäuner, Simone; Farre, Eva M.; Hegg, Eric L.; Jiang, Ning; Kuo, Min-Hao; Lu, Yan; Niyogi, Krishna K.; Ohlrogge, John; Osteryoung, Katherine W.; Shachar-Hill, Yair; Sears, Barbara B.; Sun, Yanni; Takahashi, Hideki; Yandell, Mark; Shiu, Shin-Han; Benning, Christoph

    2012-01-01

    Unicellular marine algae have promise for providing sustainable and scalable biofuel feedstocks, although no single species has emerged as a preferred organism. Moreover, adequate molecular and genetic resources prerequisite for the rational engineering of marine algal feedstocks are lacking for most candidate species. Heterokonts of the genus Nannochloropsis naturally have high cellular oil content and are already in use for industrial production of high-value lipid products. First success in applying reverse genetics by targeted gene replacement makes Nannochloropsis oceanica an attractive model to investigate the cell and molecular biology and biochemistry of this fascinating organism group. Here we present the assembly of the 28.7 Mb genome of N. oceanica CCMP1779. RNA sequencing data from nitrogen-replete and nitrogen-depleted growth conditions support a total of 11,973 genes, of which in addition to automatic annotation some were manually inspected to predict the biochemical repertoire for this organism. Among others, more than 100 genes putatively related to lipid metabolism, 114 predicted transcription factors, and 109 transcriptional regulators were annotated. Comparison of the N. oceanica CCMP1779 gene repertoire with the recently published N. gaditana genome identified 2,649 genes likely specific to N. oceanica CCMP1779. Many of these N. oceanica–specific genes have putative orthologs in other species or are supported by transcriptional evidence. However, because similarity-based annotations are limited, functions of most of these species-specific genes remain unknown. Aside from the genome sequence and its analysis, protocols for the transformation of N. oceanica CCMP1779 are provided. The availability of genomic and transcriptomic data for Nannochloropsis oceanica CCMP1779, along with efficient transformation protocols, provides a blueprint for future detailed gene functional analysis and genetic engineering of Nannochloropsis species by a growing academic community focused on this genus. PMID:23166516

  2. Integrating biological knowledge based on functional annotations for biclustering of gene expression data.

    PubMed

    Nepomuceno, Juan A; Troncoso, Alicia; Nepomuceno-Chamorro, Isabel A; Aguilar-Ruiz, Jesús S

    2015-05-01

    Gene expression data analysis is based on the assumption that co-expressed genes imply co-regulated genes. This assumption is being reformulated because the co-expression of a group of genes may be the result of an independent activation with respect to the same experimental condition and not due to the same regulatory regime. For this reason, traditional techniques are recently being improved with the use of prior biological knowledge from open-access repositories together with gene expression data. Biclustering is an unsupervised machine learning technique that searches patterns in gene expression data matrices. A scatter search-based biclustering algorithm that integrates biological information is proposed in this paper. In addition to the gene expression data matrix, the input of the algorithm is only a direct annotation file that relates each gene to a set of terms from a biological repository where genes are annotated. Two different biological measures, FracGO and SimNTO, are proposed to integrate this information by means of its addition to-be-optimized fitness function in the scatter search scheme. The measure FracGO is based on the biological enrichment and SimNTO is based on the overlapping among GO annotations of pairs of genes. Experimental results evaluate the proposed algorithm for two datasets and show the algorithm performs better when biological knowledge is integrated. Moreover, the analysis and comparison between the two different biological measures is presented and it is concluded that the differences depend on both the data source and how the annotation file has been built in the case GO is used. It is also shown that the proposed algorithm obtains a greater number of enriched biclusters than other classical biclustering algorithms typically used as benchmark and an analysis of the overlapping among biclusters reveals that the biclusters obtained present a low overlapping. The proposed methodology is a general-purpose algorithm which allows the integration of biological information from several sources and can be extended to other biclustering algorithms based on the optimization of a merit function. PMID:25843807

  3. Gene Ontology annotations and resources.

    PubMed

    Blake, J A; Dolan, M; Drabkin, H; Hill, D P; Li, Ni; Sitnikov, D; Bridges, S; Burgess, S; Buza, T; McCarthy, F; Peddinti, D; Pillai, L; Carbon, S; Dietze, H; Ireland, A; Lewis, S E; Mungall, C J; Gaudet, P; Chrisholm, R L; Fey, P; Kibbe, W A; Basu, S; Siegele, D A; McIntosh, B K; Renfro, D P; Zweifel, A E; Hu, J C; Brown, N H; Tweedie, S; Alam-Faruque, Y; Apweiler, R; Auchinchloss, A; Axelsen, K; Bely, B; Blatter, M -C; Bonilla, C; Bouguerleret, L; Boutet, E; Breuza, L; Bridge, A; Chan, W M; Chavali, G; Coudert, E; Dimmer, E; Estreicher, A; Famiglietti, L; Feuermann, M; Gos, A; Gruaz-Gumowski, N; Hieta, R; Hinz, C; Hulo, C; Huntley, R; James, J; Jungo, F; Keller, G; Laiho, K; Legge, D; Lemercier, P; Lieberherr, D; Magrane, M; Martin, M J; Masson, P; Mutowo-Muellenet, P; O'Donovan, C; Pedruzzi, I; Pichler, K; Poggioli, D; Porras Millán, P; Poux, S; Rivoire, C; Roechert, B; Sawford, T; Schneider, M; Stutz, A; Sundaram, S; Tognolli, M; Xenarios, I; Foulgar, R; Lomax, J; Roncaglia, P; Khodiyar, V K; Lovering, R C; Talmud, P J; Chibucos, M; Giglio, M Gwinn; Chang, H -Y; Hunter, S; McAnulla, C; Mitchell, A; Sangrador, A; Stephan, R; Harris, M A; Oliver, S G; Rutherford, K; Wood, V; Bahler, J; Lock, A; Kersey, P J; McDowall, D M; Staines, D M; Dwinell, M; Shimoyama, M; Laulederkind, S; Hayman, T; Wang, S -J; Petri, V; Lowry, T; D'Eustachio, P; Matthews, L; Balakrishnan, R; Binkley, G; Cherry, J M; Costanzo, M C; Dwight, S S; Engel, S R; Fisk, D G; Hitz, B C; Hong, E L; Karra, K; Miyasato, S R; Nash, R S; Park, J; Skrzypek, M S; Weng, S; Wong, E D; Berardini, T Z; Huala, E; Mi, H; Thomas, P D; Chan, J; Kishore, R; Sternberg, P; Van Auken, K; Howe, D; Westerfield, M

    2013-01-01

    The Gene Ontology (GO) Consortium (GOC, http://www.geneontology.org) is a community-based bioinformatics resource that classifies gene product function through the use of structured, controlled vocabularies. Over the past year, the GOC has implemented several processes to increase the quantity, quality and specificity of GO annotations. First, the number of manual, literature-based annotations has grown at an increasing rate. Second, as a result of a new 'phylogenetic annotation' process, manually reviewed, homology-based annotations are becoming available for a broad range of species. Third, the quality of GO annotations has been improved through a streamlined process for, and automated quality checks of, GO annotations deposited by different annotation groups. Fourth, the consistency and correctness of the ontology itself has increased by using automated reasoning tools. Finally, the GO has been expanded not only to cover new areas of biology through focused interaction with experts, but also to capture greater specificity in all areas of the ontology using tools for adding new combinatorial terms. The GOC works closely with other ontology developers to support integrated use of terminologies. The GOC supports its user community through the use of e-mail lists, social media and web-based resources. PMID:23161678

  4. Gene Ontology Annotations and Resources

    PubMed Central

    2013-01-01

    The Gene Ontology (GO) Consortium (GOC, http://www.geneontology.org) is a community-based bioinformatics resource that classifies gene product function through the use of structured, controlled vocabularies. Over the past year, the GOC has implemented several processes to increase the quantity, quality and specificity of GO annotations. First, the number of manual, literature-based annotations has grown at an increasing rate. Second, as a result of a new ‘phylogenetic annotation’ process, manually reviewed, homology-based annotations are becoming available for a broad range of species. Third, the quality of GO annotations has been improved through a streamlined process for, and automated quality checks of, GO annotations deposited by different annotation groups. Fourth, the consistency and correctness of the ontology itself has increased by using automated reasoning tools. Finally, the GO has been expanded not only to cover new areas of biology through focused interaction with experts, but also to capture greater specificity in all areas of the ontology using tools for adding new combinatorial terms. The GOC works closely with other ontology developers to support integrated use of terminologies. The GOC supports its user community through the use of e-mail lists, social media and web-based resources. PMID:23161678

  5. The Saccharomyces Genome Database: Gene Product Annotation of Function, Process, and Component.

    PubMed

    Cherry, J Michael

    2015-01-01

    An ontology is a highly structured form of controlled vocabulary. Each entry in the ontology is commonly called a term. These terms are used when talking about an annotation. However, each term has a definition that, like the definition of a word found within a dictionary, provides the complete usage and detailed explanation of the term. It is critical to consult a term's definition because the distinction between terms can be subtle. The use of ontologies in biology started as a way of unifying communication between scientific communities and to provide a standard dictionary for different topics, including molecular functions, biological processes, mutant phenotypes, chemical properties and structures. The creation of ontology terms and their definitions often requires debate to reach agreement but the result has been a unified descriptive language used to communicate knowledge. In addition to terms and definitions, ontologies require a relationship used to define the type of connection between terms. In an ontology, a term can have more than one parent term, the term above it in an ontology, as well as more than one child, the term below it in the ontology. Many ontologies are used to construct annotations in the Saccharomyces Genome Database (SGD), as in all modern biological databases; however, Gene Ontology (GO), a descriptive system used to categorize gene function, is the most extensively used ontology in SGD annotations. Examples included in this protocol illustrate the structure and features of this ontology. PMID:26631125

  6. Computational algorithms to predict Gene Ontology annotations

    PubMed Central

    2015-01-01

    Background Gene function annotations, which are associations between a gene and a term of a controlled vocabulary describing gene functional features, are of paramount importance in modern biology. Datasets of these annotations, such as the ones provided by the Gene Ontology Consortium, are used to design novel biological experiments and interpret their results. Despite their importance, these sources of information have some known issues. They are incomplete, since biological knowledge is far from being definitive and it rapidly evolves, and some erroneous annotations may be present. Since the curation process of novel annotations is a costly procedure, both in economical and time terms, computational tools that can reliably predict likely annotations, and thus quicken the discovery of new gene annotations, are very useful. Methods We used a set of computational algorithms and weighting schemes to infer novel gene annotations from a set of known ones. We used the latent semantic analysis approach, implementing two popular algorithms (Latent Semantic Indexing and Probabilistic Latent Semantic Analysis) and propose a novel method, the Semantic IMproved Latent Semantic Analysis, which adds a clustering step on the set of considered genes. Furthermore, we propose the improvement of these algorithms by weighting the annotations in the input set. Results We tested our methods and their weighted variants on the Gene Ontology annotation sets of three model organism genes (Bos taurus, Danio rerio and Drosophila melanogaster ). The methods showed their ability in predicting novel gene annotations and the weighting procedures demonstrated to lead to a valuable improvement, although the obtained results vary according to the dimension of the input annotation set and the considered algorithm. Conclusions Out of the three considered methods, the Semantic IMproved Latent Semantic Analysis is the one that provides better results. In particular, when coupled with a proper weighting policy, it is able to predict a significant number of novel annotations, demonstrating to actually be a helpful tool in supporting scientists in the curation process of gene functional annotations. PMID:25916950

  7. Insyght: navigating amongst abundant homologues, syntenies and gene functional annotations in bacteria, it's that symbol!

    PubMed Central

    Lacroix, Thomas; Loux, Valentin; Gendrault, Annie; Hoebeke, Mark; Gibrat, Jean-François

    2014-01-01

    High-throughput techniques have considerably increased the potential of comparative genomics whilst simultaneously posing many new challenges. One of those challenges involves efficiently mining the large amount of data produced and exploring the landscape of both conserved and idiosyncratic genomic regions across multiple genomes. Domains of application of these analyses are diverse: identification of evolutionary events, inference of gene functions, detection of niche-specific genes or phylogenetic profiling. Insyght is a comparative genomic visualization tool that combines three complementary displays: (i) a table for thoroughly browsing amongst homologues, (ii) a comparator of orthologue functional annotations and (iii) a genomic organization view designed to improve the legibility of rearrangements and distinctive loci. The latter display combines symbolic and proportional graphical paradigms. Synchronized navigation across multiple species and interoperability between the views are core features of Insyght. A gene filter mechanism is provided that helps the user to build a biologically relevant gene set according to multiple criteria such as presence/absence of homologues and/or various annotations. We illustrate the use of Insyght with scenarios. Currently, only Bacteria and Archaea are supported. A public instance is available at http://genome.jouy.inra.fr/Insyght. The tool is freely downloadable for private data set analysis. PMID:25249626

  8. Insyght: navigating amongst abundant homologues, syntenies and gene functional annotations in bacteria, it's that symbol!

    PubMed

    Lacroix, Thomas; Loux, Valentin; Gendrault, Annie; Hoebeke, Mark; Gibrat, Jean-François

    2014-12-01

    High-throughput techniques have considerably increased the potential of comparative genomics whilst simultaneously posing many new challenges. One of those challenges involves efficiently mining the large amount of data produced and exploring the landscape of both conserved and idiosyncratic genomic regions across multiple genomes. Domains of application of these analyses are diverse: identification of evolutionary events, inference of gene functions, detection of niche-specific genes or phylogenetic profiling. Insyght is a comparative genomic visualization tool that combines three complementary displays: (i) a table for thoroughly browsing amongst homologues, (ii) a comparator of orthologue functional annotations and (iii) a genomic organization view designed to improve the legibility of rearrangements and distinctive loci. The latter display combines symbolic and proportional graphical paradigms. Synchronized navigation across multiple species and interoperability between the views are core features of Insyght. A gene filter mechanism is provided that helps the user to build a biologically relevant gene set according to multiple criteria such as presence/absence of homologues and/or various annotations. We illustrate the use of Insyght with scenarios. Currently, only Bacteria and Archaea are supported. A public instance is available at http://genome.jouy.inra.fr/Insyght. The tool is freely downloadable for private data set analysis. PMID:25249626

  9. Statistical analysis of genomic protein family and domain controlled annotations for functional investigation of classified gene lists

    PubMed Central

    Masseroli, Marco; Bellistri, Elisa; Franceschini, Andrea; Pinciroli, Francesco

    2007-01-01

    Background The increasing protein family and domain based annotations constitute important information to understand protein functions and gain insight into relations among their codifying genes. To allow analyzing of gene proteomic annotations, we implemented novel modules within GFINDer, a Web system we previously developed that dynamically aggregates functional and phenotypic annotations of user-uploaded gene lists and allows performing their statistical analysis and mining. Results Exploiting protein information in Pfam and InterPro databanks, we developed and added in GFINDer original modules specifically devoted to the exploration and analysis of functional signatures of gene protein products. They allow annotating numerous user-classified nucleotide sequence identifiers with controlled information on related protein families, domains and functional sites, classifying them according to such protein annotation categories, and statistically analyzing the obtained classifications. In particular, when uploaded nucleotide sequence identifiers are subdivided in classes, the Statistics Protein Families&Domains module allows estimating relevance of Pfam or InterPro controlled annotations for the uploaded genes by highlighting protein signatures significantly more represented within user-defined classes of genes. In addition, the Logistic Regression module allows identifying protein functional signatures that better explain the considered gene classification. Conclusion Novel GFINDer modules provide genomic protein family and domain analyses supporting better functional interpretation of gene classes, for instance defined through statistical and clustering analyses of gene expression results from microarray experiments. They can hence help understanding fundamental biological processes and complex cellular mechanisms influenced by protein domain composition, and contribute to unveil new biomedical knowledge about the codifying genes. PMID:17430558

  10. Identification and functional annotation of lncRNA genes with hypermethylation in colorectal cancer.

    PubMed

    Liao, Qi; He, Weiling; Liu, Jianfa; Cen, Yi; Luo, Liang; Yu, Chengliang; Li, Yang; Chen, Sitong; Duan, Shiwei

    2015-11-10

    Colorectal cancer (CRC) is one of the leading causes of mortality worldwide. DNA methylation is an important epigenetic modification for CRC. Although currently a number of studies about DNA methylation of protein coding genes have been carried out, only a few are about the methylation of genes encoding the long noncoding RNAs (lncRNAs). In this study, we identified 761 lncRNA genes with DNA hypermethylation in CRC using a free MethylCap-seq dataset. Integration of lncRNA expression and methylation datasets showed that the expression of lncRNAs is negatively correlated with DNA methylation (p<0.01). Co-methylation network was also constructed to annotate the functions of unknown lncRNAs. Our results showed that a total of 364 lncRNAs were annotated with at least one GO biological process term. The current data-mining work is likely to provide informative clues for biological researchers to further understand the role of lncRNAs in the development of CRC. PMID:26172871

  11. Functional Annotation of Genes Overlapping Copy Number Variants in Autistic Patients: Focus on Axon Pathfinding

    PubMed Central

    Sbacchi, Silvia; Acquadro, Francesco; Calò, Ignazio; Calì, Francesco; Romano, Valentino

    2010-01-01

    We have used Gene Ontology (GO) and pathway analyses to uncover the common functions associated to the genes overlapping Copy Number Variants (CNVs) in autistic patients. Our source of data were four published studies [1-4]. We first applied a two-step enrichment strategy for autism-specific genes. We fished out from the four mentioned studies a list of 2928 genes overall overlapping 328 CNVs in patients and we first selected a sub-group of 2044 genes after excluding those ones that are also involved in CNVs reported in the Database of Genomic Variants (enrichment step 1). We then selected from the step 1-enriched list a sub-group of 514 genes each of which was found to be deleted or duplicated in at least two patients (enrichment step 2). The number of statistically significant processes and pathways identified by the Database for Annotation, Visualization and Integrated Discovery and Ingenuity Pathways Analysis softwares with the step 2-enriched list was significantly higher compared to the step 1-enriched list. In addition, statistically significant GO terms, biofunctions and pathways related to nervous system development and function were exclusively identified by the step 2-enriched list of genes. Interestingly, 21 genes were associated to axon growth and pathfinding. The latter genes and other ones associated to nervous system in this study represent a new set of autism candidate genes deserving further investigation. In summary, our results suggest that the autism’s “connectivity genes” in some patients affect very early phases of neurodevelopment, i.e., earlier than synaptogenesis. PMID:20885821

  12. Annotation of gene function in citrus using gene expression information and co-expression networks

    PubMed Central

    2014-01-01

    Background The genus Citrus encompasses major cultivated plants such as sweet orange, mandarin, lemon and grapefruit, among the world’s most economically important fruit crops. With increasing volumes of transcriptomics data available for these species, Gene Co-expression Network (GCN) analysis is a viable option for predicting gene function at a genome-wide scale. GCN analysis is based on a “guilt-by-association” principle whereby genes encoding proteins involved in similar and/or related biological processes may exhibit similar expression patterns across diverse sets of experimental conditions. While bioinformatics resources such as GCN analysis are widely available for efficient gene function prediction in model plant species including Arabidopsis, soybean and rice, in citrus these tools are not yet developed. Results We have constructed a comprehensive GCN for citrus inferred from 297 publicly available Affymetrix Genechip Citrus Genome microarray datasets, providing gene co-expression relationships at a genome-wide scale (33,000 transcripts). The comprehensive citrus GCN consists of a global GCN (condition-independent) and four condition-dependent GCNs that survey the sweet orange species only, all citrus fruit tissues, all citrus leaf tissues, or stress-exposed plants. All of these GCNs are clustered using genome-wide, gene-centric (guide) and graph clustering algorithms for flexibility of gene function prediction. For each putative cluster, gene ontology (GO) enrichment and gene expression specificity analyses were performed to enhance gene function, expression and regulation pattern prediction. The guide-gene approach was used to infer novel roles of genes involved in disease susceptibility and vitamin C metabolism, and graph-clustering approaches were used to investigate isoprenoid/phenylpropanoid metabolism in citrus peel, and citric acid catabolism via the GABA shunt in citrus fruit. Conclusions Integration of citrus gene co-expression networks, functional enrichment analysis and gene expression information provide opportunities to infer gene function in citrus. We present a publicly accessible tool, Network Inference for Citrus Co-Expression (NICCE, http://citrus.adelaide.edu.au/nicce/home.aspx), for the gene co-expression analysis in citrus. PMID:25023870

  13. Human cell adhesion molecules: annotated functional subtypes and overrepresentation of addiction-associated genes.

    PubMed

    Zhong, Xiaoming; Drgonova, Jana; Li, Chuan-Yun; Uhl, George R

    2015-09-01

    Human cell adhesion molecules (CAMs) are essential for proper development, modulation, and maintenance of interactions between cells and cell-to-cell (and matrix-to-cell) communication about these interactions. Despite the differential functional significance of these roles, there have been surprisingly few systematic studies to enumerate the universe of CAMs and identify specific CAMs in distinct functions. In this paper, we update and review the set of human genes likely to encode CAMs with searches of databases, literature reviews, and annotations. We describe likely CAMs and functional subclasses, including CAMs that have a primary function in information exchange (iCAMs), CAMs involved in focal adhesions, CAM gene products that are preferentially involved with stereotyped and morphologically identifiable connections between cells (e.g., adherens junctions, gap junctions), and smaller numbers of CAM genes in other classes. We discuss a novel proposed mechanism involving selective anchoring of the constituents of iCAM-containing lipid rafts in zones of close neuronal apposition to membranes expressing iCAM binding partners. We also discuss data from genetic and genomic studies of addiction in humans and mouse models to highlight the ways in which CAM variation may contribute to a specific brain-based disorder such as addiction. Specific examples include changes in CAM mRNA splicing mediated by differences in the addiction-associated splicing regulator RBFOX1/A2BP1 and CAM expression in dopamine neurons. PMID:25988664

  14. Systematic condition-dependent annotation of metabolic genes

    E-print Network

    Sharan, Roded

    Systematic condition-dependent annotation of metabolic genes Tomer Shlomi,1,4 Markus Herrgard,2, Israel The task of deriving a functional annotation for genes is complex as their involvement in various the function of yeast genes and derive a condition-dependent annotation (CDA) for their involvement in major

  15. Comparison of functional gene annotation of Toxascaris leonina and Toxocara canis using CLC genomics workbench.

    PubMed

    Kim, Ki Uk; Park, Sang Kyun; Kang, Shin Ae; Park, Mi Kyung; Cho, Min Kyoung; Jung, Ho-Jin; Kim, Kyung-Yun; Yu, Hak Sun

    2013-10-01

    The ascarids, Toxocara canis and Toxascaris leonina, are probably the most common gastrointestinal helminths encountered in dogs. In order to understand biological differences of 2 ascarids, we analyzed gene expression profiles of female adults of T. canis and T. leonina using CLC Genomics Workbench, and the results were compared with those of free-living nematode Caenorhabditis elegans. A total of 2,880 and 7,949 ESTs were collected from T. leonina and T. canis, respectively. The length of ESTs ranged from 106 to 4,637 bp with an average insert size of 820 bp. Overall, our results showed that most functional gene annotations of 2 ascarids were quite similar to each other in 3 major categories, i.e., cellular component, biological process, and molecular function. Although some different transcript expression categories were found, the distance was short and it was not enough to explain their different lifestyles. However, we found distinguished transcript differences between ascarid parasites and free-living nematodes. Understanding evolutionary genetic changes might be helpful for studies of the lifestyle and evolution of parasites. PMID:24327777

  16. Quality of Computationally Inferred Gene Ontology Annotations

    PubMed Central

    Škunca, Nives; Altenhoff, Adrian; Dessimoz, Christophe

    2012-01-01

    Gene Ontology (GO) has established itself as the undisputed standard for protein function annotation. Most annotations are inferred electronically, i.e. without individual curator supervision, but they are widely considered unreliable. At the same time, we crucially depend on those automated annotations, as most newly sequenced genomes are non-model organisms. Here, we introduce a methodology to systematically and quantitatively evaluate electronic annotations. By exploiting changes in successive releases of the UniProt Gene Ontology Annotation database, we assessed the quality of electronic annotations in terms of specificity, reliability, and coverage. Overall, we not only found that electronic annotations have significantly improved in recent years, but also that their reliability now rivals that of annotations inferred by curators when they use evidence other than experiments from primary literature. This work provides the means to identify the subset of electronic annotations that can be relied upon—an important outcome given that >98% of all annotations are inferred without direct curation. PMID:22693439

  17. Functional Annotation of Hierarchical Modularity

    PubMed Central

    Padmanabhan, Kanchana; Wang, Kuangyu; Samatova, Nagiza F.

    2012-01-01

    In biological networks of molecular interactions in a cell, network motifs that are biologically relevant are also functionally coherent, or form functional modules. These functionally coherent modules combine in a hierarchical manner into larger, less cohesive subsystems, thus revealing one of the essential design principles of system-level cellular organization and function–hierarchical modularity. Arguably, hierarchical modularity has not been explicitly taken into consideration by most, if not all, functional annotation systems. As a result, the existing methods would often fail to assign a statistically significant functional coherence score to biologically relevant molecular machines. We developed a methodology for hierarchical functional annotation. Given the hierarchical taxonomy of functional concepts (e.g., Gene Ontology) and the association of individual genes or proteins with these concepts (e.g., GO terms), our method will assign a Hierarchical Modularity Score (HMS) to each node in the hierarchy of functional modules; the HMS score and its value measure functional coherence of each module in the hierarchy. While existing methods annotate each module with a set of “enriched” functional terms in a bag of genes, our complementary method provides the hierarchical functional annotation of the modules and their hierarchically organized components. A hierarchical organization of functional modules often comes as a bi-product of cluster analysis of gene expression data or protein interaction data. Otherwise, our method will automatically build such a hierarchy by directly incorporating the functional taxonomy information into the hierarchy search process and by allowing multi-functional genes to be part of more than one component in the hierarchy. In addition, its underlying HMS scoring metric ensures that functional specificity of the terms across different levels of the hierarchical taxonomy is properly treated. We have evaluated our method using Saccharomyces cerevisiae data from KEGG and MIPS databases and several other computationally derived and curated datasets. The code and additional supplemental files can be obtained from http://code.google.com/p/functional-annotation-of-hierarchical-modularity/ (Accessed 2012 March 13). PMID:22496762

  18. Functional annotation of novel lineage-specific genes using co-expression and promoter analysis

    PubMed Central

    2010-01-01

    Background The diversity of placental architectures within and among mammalian orders is believed to be the result of adaptive evolution. Although, the genetic basis for these differences is unknown, some may arise from rapidly diverging and lineage-specific genes. Previously, we identified 91 novel lineage-specific transcripts (LSTs) from a cow term-placenta cDNA library, which are excellent candidates for adaptive placental functions acquired by the ruminant lineage. The aim of the present study was to infer functions of previously uncharacterized lineage-specific genes (LSGs) using co-expression, promoter, pathway and network analysis. Results Clusters of co-expressed genes preferentially expressed in liver, placenta and thymus were found using 49 previously uncharacterized LSTs as seeds. Over-represented composite transcription factor binding sites (TFBS) in promoters of clustered LSGs and known genes were then identified computationally. Functions were inferred for nine previously uncharacterized LSGs using co-expression analysis and pathway analysis tools. Our results predict that these LSGs may function in cell signaling, glycerophospholipid/fatty acid metabolism, protein trafficking, regulatory processes in the nucleus, and processes that initiate parturition and immune system development. Conclusions The placenta is a rich source of lineage-specific genes that function in the adaptive evolution of placental architecture and functions. We have shown that co-expression, promoter, and gene network analyses are useful methods to infer functions of LSGs with heretofore unknown functions. Our results indicate that many LSGs are involved in cellular recognition and developmental processes. Furthermore, they provide guidance for experimental approaches to validate the functions of LSGs and to study their evolution. PMID:20214810

  19. Morgan's legacy: fruit flies and the functional annotation of conserved genes.

    PubMed

    Bellen, Hugo J; Yamamoto, Shinya

    2015-09-24

    In 1915, "The Mechanism of Mendelian Heredity" was published by four prominent Drosophila geneticists. They discovered that genes form linkage groups on chromosomes inherited in a Mendelian fashion and laid the genetic foundation that promoted Drosophila as a model organism. Flies continue to offer great opportunities, including studies in the field of functional genomics. PMID:26406362

  20. SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data

    E-print Network

    Botstein, David

    , Christian A. Rees2 , J. Michael Cherry2 , David Botstein2 , Patrick O. Brown1,3 and Ash A. Alizadeh1 1 ABSTRACT The explosion in the number of functional genomic datasets generated with tools such as DNA micro number of conditions using a simple and intuitive interface. SOURCE provides content both in gene and cDNA

  1. In Silico Functional Annotation of Genomic Variation.

    PubMed

    Butkiewicz, Mariusz; Bush, William S

    2016-01-01

    This unit describes the concepts and practical techniques for annotating genomic variants in the human genome to estimate their functional significance. With the rapid increase of available whole exome and whole genome sequencing information for human studies, annotation techniques have become progressively more important for highlighting and prioritizing nucleotide variants and their potential impact on genes and other genetic constructs. Here, we present an overview of different types of variant annotation approaches and elaborate on their foundations, assumptions, and the downstream consequences of their use. Computational approaches and tools to assign annotations and to identify variants are reviewed. Further, the general philosophy of assigning potential function to a genetic change within the biological context of a disease is discussed. © 2016 by John Wiley & Sons, Inc. PMID:26724722

  2. Improving gene annotation of complete viral genomes

    PubMed Central

    Mills, Ryan; Rozanov, Michael; Lomsadze, Alexandre; Tatusova, Tatiana; Borodovsky, Mark

    2003-01-01

    Gene annotation in viruses often relies upon similarity search methods. These methods possess high specificity but some genes may be missed, either those unique to a particular genome or those highly divergent from known homologs. To identify potentially missing viral genes we have analyzed all complete viral genomes currently available in GenBank with a specialized and augmented version of the gene finding program GeneMarkS. In particular, by implementing genome-specific self-training protocols we have better adjusted the GeneMarkS statistical models to sequences of viral genomes. Hundreds of new genes were identified, some in well studied viral genomes. For example, a new gene predicted in the genome of the Epstein–Barr virus was shown to encode a protein similar to ?-herpesvirus minor tegument protein UL14 with heat shock functions. Convincing evidence of this similarity was obtained after only 12 PSI-BLAST iterations. In another example, several iterations of PSI-BLAST were required to demonstrate that a gene predicted in the genome of Alcelaphine herpesvirus 1 encodes a BALF1-like protein which is thought to be involved in apoptosis regulation and, potentially, carcinogenesis. New predictions were used to refine annotations of viral genomes in the RefSeq collection curated by the National Center for Biotechnology Information. Importantly, even in those cases where no sequence similarities were detected, GeneMarkS significantly reduced the number of primary targets for experimental characterization by identifying the most probable candidate genes. The new genome annotations were stored in VIOLIN, an interactive database which provides access to similarity search tools for up-to-date analysis of predicted viral proteins. PMID:14627837

  3. The GOA database: gene Ontology annotation updates for 2015.

    PubMed

    Huntley, Rachael P; Sawford, Tony; Mutowo-Meullenet, Prudence; Shypitsyna, Aleksandra; Bonilla, Carlos; Martin, Maria J; O'Donovan, Claire

    2015-01-01

    The Gene Ontology Annotation (GOA) resource (http://www.ebi.ac.uk/GOA) provides evidence-based Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB). Manual annotations provided by UniProt curators are supplemented by manual and automatic annotations from model organism databases and specialist annotation groups. GOA currently supplies 368 million GO annotations to almost 54 million proteins in more than 480,000 taxonomic groups. The resource now provides annotations to five times the number of proteins it did 4 years ago. As a member of the GO Consortium, we adhere to the most up-to-date Consortium-agreed annotation guidelines via the use of quality control checks that ensures that the GOA resource supplies high-quality functional information to proteins from a wide range of species. Annotations from GOA are freely available and are accessible through a powerful web browser as well as a variety of annotation file formats. PMID:25378336

  4. Metagenomic gene annotation by a homology-independent approach

    SciTech Connect

    Froula, Jeff; Zhang, Tao; Salmeen, Annette; Hess, Matthias; Kerfeld, Cheryl A.; Wang, Zhong; Du, Changbin

    2011-06-02

    Fully understanding the genetic potential of a microbial community requires functional annotation of all the genes it encodes. The recently developed deep metagenome sequencing approach has enabled rapid identification of millions of genes from a complex microbial community without cultivation. Current homology-based gene annotation fails to detect distantly-related or structural homologs. Furthermore, homology searches with millions of genes are very computational intensive. To overcome these limitations, we developed rhModeller, a homology-independent software pipeline to efficiently annotate genes from metagenomic sequencing projects. Using cellulases and carbonic anhydrases as two independent test cases, we demonstrated that rhModeller is much faster than HMMER but with comparable accuracy, at 94.5percent and 99.9percent accuracy, respectively. More importantly, rhModeller has the ability to detect novel proteins that do not share significant homology to any known protein families. As {approx}50percent of the 2 million genes derived from the cow rumen metagenome failed to be annotated based on sequence homology, we tested whether rhModeller could be used to annotate these genes. Preliminary results suggest that rhModeller is robust in the presence of missense and frameshift mutations, two common errors in metagenomic genes. Applying the pipeline to the cow rumen genes identified 4,990 novel cellulases candidates and 8,196 novel carbonic anhydrase candidates.In summary, we expect rhModeller to dramatically increase the speed and quality of metagnomic gene annotation.

  5. Structural and functional annotation of the porcine immunome

    PubMed Central

    2013-01-01

    Background The domestic pig is known as an excellent model for human immunology and the two species share many pathogens. Susceptibility to infectious disease is one of the major constraints on swine performance, yet the structure and function of genes comprising the pig immunome are not well-characterized. The completion of the pig genome provides the opportunity to annotate the pig immunome, and compare and contrast pig and human immune systems. Results The Immune Response Annotation Group (IRAG) used computational curation and manual annotation of the swine genome assembly 10.2 (Sscrofa10.2) to refine the currently available automated annotation of 1,369 immunity-related genes through sequence-based comparison to genes in other species. Within these genes, we annotated 3,472 transcripts. Annotation provided evidence for gene expansions in several immune response families, and identified artiodactyl-specific expansions in the cathelicidin and type 1 Interferon families. We found gene duplications for 18 genes, including 13 immune response genes and five non-immune response genes discovered in the annotation process. Manual annotation provided evidence for many new alternative splice variants and 8 gene duplications. Over 1,100 transcripts without porcine sequence evidence were detected using cross-species annotation. We used a functional approach to discover and accurately annotate porcine immune response genes. A co-expression clustering analysis of transcriptomic data from selected experimental infections or immune stimulations of blood, macrophages or lymph nodes identified a large cluster of genes that exhibited a correlated positive response upon infection across multiple pathogens or immune stimuli. Interestingly, this gene cluster (cluster 4) is enriched for known general human immune response genes, yet contains many un-annotated porcine genes. A phylogenetic analysis of the encoded proteins of cluster 4 genes showed that 15% exhibited an accelerated evolution as compared to 4.1% across the entire genome. Conclusions This extensive annotation dramatically extends the genome-based knowledge of the molecular genetics and structure of a major portion of the porcine immunome. Our complementary functional approach using co-expression during immune response has provided new putative immune response annotation for over 500 porcine genes. Our phylogenetic analysis of this core immunome cluster confirms rapid evolutionary change in this set of genes, and that, as in other species, such genes are important components of the pig’s adaptation to pathogen challenge over evolutionary time. These comprehensive and integrated analyses increase the value of the porcine genome sequence and provide important tools for global analyses and data-mining of the porcine immune response. PMID:23676093

  6. Genetic Annotation of Gain-Of-Function Screens Using RNA Interference and in Situ Hybridization of Candidate Genes in the Drosophila Wing

    PubMed Central

    Molnar, Cristina; Casado, Mar; López-Varea, Ana; Cruz, Cristina; de Celis, Jose F.

    2012-01-01

    Gain-of-function screens in Drosophila are an effective method with which to identify genes that affect the development of particular structures or cell types. It has been found that a fraction of 2–10% of the genes tested, depending on the particularities of the screen, results in a discernible phenotype when overexpressed. However, it is not clear to what extent a gain-of-function phenotype generated by overexpression is informative about the normal function of the gene. Thus, very few reports attempt to correlate the loss- and overexpression phenotype for collections of genes identified in gain-of-function screens. In this work we use RNA interference and in situ hybridization to annotate a collection of 123 P-GS insertions that in combination with different Gal4 drivers affect the size and/or patterning of the wing. We identify the gene causing the overexpression phenotype by expressing, in a background of overexpression, RNA interference for the genes affected by each P-GS insertion. Then, we compare the loss and gain-of-function phenotypes obtained for each gene and relate them to its expression pattern in the wing disc. We find that 52% of genes identified by their overexpression phenotype are required during normal development. However, only in 9% of the cases analyzed was there some complementarity between the gain- and loss-of-function phenotype, suggesting that, in general, the overexpression phenotypes would not be indicative of the normal requirements of the gene. PMID:22798488

  7. Functional Annotation, Genome Organization and Phylogeny of the Grapevine (Vitis vinifera) Terpene Synthase Gene Family Based on Genome Assembly, FLcDNA Cloning, and Enzyme Assays

    PubMed Central

    2010-01-01

    Background Terpenoids are among the most important constituents of grape flavour and wine bouquet, and serve as useful metabolite markers in viticulture and enology. Based on the initial 8-fold sequencing of a nearly homozygous Pinot noir inbred line, 89 putative terpenoid synthase genes (VvTPS) were predicted by in silico analysis of the grapevine (Vitis vinifera) genome assembly [1]. The finding of this very large VvTPS family, combined with the importance of terpenoid metabolism for the organoleptic properties of grapevine berries and finished wines, prompted a detailed examination of this gene family at the genomic level as well as an investigation into VvTPS biochemical functions. Results We present findings from the analysis of the up-dated 12-fold sequencing and assembly of the grapevine genome that place the number of predicted VvTPS genes at 69 putatively functional VvTPS, 20 partial VvTPS, and 63 VvTPS probable pseudogenes. Gene discovery and annotation included information about gene architecture and chromosomal location. A dense cluster of 45 VvTPS is localized on chromosome 18. Extensive FLcDNA cloning, gene synthesis, and protein expression enabled functional characterization of 39 VvTPS; this is the largest number of functionally characterized TPS for any species reported to date. Of these enzymes, 23 have unique functions and/or phylogenetic locations within the plant TPS gene family. Phylogenetic analyses of the TPS gene family showed that while most VvTPS form species-specific gene clusters, there are several examples of gene orthology with TPS of other plant species, representing perhaps more ancient VvTPS, which have maintained functions independent of speciation. Conclusions The highly expanded VvTPS gene family underpins the prominence of terpenoid metabolism in grapevine. We provide a detailed experimental functional annotation of 39 members of this important gene family in grapevine and comprehensive information about gene structure and phylogeny for the entire currently known VvTPS gene family. PMID:20964856

  8. Gene calling and bacterial genome annotation with BG7.

    PubMed

    Tobes, Raquel; Pareja-Tobes, Pablo; Manrique, Marina; Pareja-Tobes, Eduardo; Kovach, Evdokim; Alekhin, Alexey; Pareja, Eduardo

    2015-01-01

    New massive sequencing technologies are providing many bacterial genome sequences from diverse taxa but a refined annotation of these genomes is crucial for obtaining scientific findings and new knowledge. Thus, bacterial genome annotation has emerged as a key point to investigate in bacteria. Any efficient tool designed specifically to annotate bacterial genomes sequenced with massively parallel technologies has to consider the specific features of bacterial genomes (absence of introns and scarcity of nonprotein-coding sequence) and of next-generation sequencing (NGS) technologies (presence of errors and not perfectly assembled genomes). These features make it convenient to focus on coding regions and, hence, on protein sequences that are the elements directly related with biological functions. In this chapter we describe how to annotate bacterial genomes with BG7, an open-source tool based on a protein-centered gene calling/annotation paradigm. BG7 is specifically designed for the annotation of bacterial genomes sequenced with NGS. This tool is sequence error tolerant maintaining their capabilities for the annotation of highly fragmented genomes or for annotating mixed sequences coming from several genomes (as those obtained through metagenomics samples). BG7 has been designed with scalability as a requirement, with a computing infrastructure completely based on cloud computing (Amazon Web Services). PMID:25343866

  9. Metabolomics as a Hypothesis-Generating Functional Genomics Tool for the Annotation of Arabidopsis thaliana Genes of “Unknown Function

    PubMed Central

    Quanbeck, Stephanie M.; Brachova, Libuse; Campbell, Alexis A.; Guan, Xin; Perera, Ann; He, Kun; Rhee, Seung Y.; Bais, Preeti; Dickerson, Julie A.; Dixon, Philip; Wohlgemuth, Gert; Fiehn, Oliver; Barkan, Lenore; Lange, Iris; Lange, B. Markus; Lee, Insuk; Cortes, Diego; Salazar, Carolina; Shuman, Joel; Shulaev, Vladimir; Huhman, David V.; Sumner, Lloyd W.; Roth, Mary R.; Welti, Ruth; Ilarslan, Hilal; Wurtele, Eve S.; Nikolau, Basil J.

    2012-01-01

    Metabolomics is the methodology that identifies and measures global pools of small molecules (of less than about 1,000?Da) of a biological sample, which are collectively called the metabolome. Metabolomics can therefore reveal the metabolic outcome of a genetic or environmental perturbation of a metabolic regulatory network, and thus provide insights into the structure and regulation of that network. Because of the chemical complexity of the metabolome and limitations associated with individual analytical platforms for determining the metabolome, it is currently difficult to capture the complete metabolome of an organism or tissue, which is in contrast to genomics and transcriptomics. This paper describes the analysis of Arabidopsis metabolomics data sets acquired by a consortium that includes five analytical laboratories, bioinformaticists, and biostatisticians, which aims to develop and validate metabolomics as a hypothesis-generating functional genomics tool. The consortium is determining the metabolomes of Arabidopsis T-DNA mutant stocks, grown in standardized controlled environment optimized to minimize environmental impacts on the metabolomes. Metabolomics data were generated with seven analytical platforms, and the combined data is being provided to the research community to formulate initial hypotheses about genes of unknown function (GUFs). A public database (www.PlantMetabolomics.org) has been developed to provide the scientific community with access to the data along with tools to allow for its interactive analysis. Exemplary datasets are discussed to validate the approach, which illustrate how initial hypotheses can be generated from the consortium-produced metabolomics data, integrated with prior knowledge to provide a testable hypothesis concerning the functionality of GUFs. PMID:22645570

  10. Dependence Relationships between Gene Ontology Terms based on TIGR Gene Product Annotations

    E-print Network

    Borgelt, Christian

    vocabularies for the designations of cellular components, molecular functions, and biological processes usedDependence Relationships between Gene Ontology Terms based on TIGR Gene Product Annotations Anand for the representation and processing of information about gene products and functions. It provides controlled

  11. Detection of gene annotations and protein-protein interaction associated disorders through transitive relationships between integrated annotations

    PubMed Central

    2015-01-01

    Background Increasingly high amounts of heterogeneous and valuable controlled biomolecular annotations are available, but far from exhaustive and scattered in many databases. Several annotation integration and prediction approaches have been proposed, but these issues are still unsolved. We previously created a Genomic and Proteomic Knowledge Base (GPKB) that efficiently integrates many distributed biomolecular annotation and interaction data of several organisms, including 32,956,102 gene annotations, 273,522,470 protein annotations and 277,095 protein-protein interactions (PPIs). Results By comprehensively leveraging transitive relationships defined by the numerous association data integrated in GPKB, we developed a software procedure that effectively detects and supplement consistent biomolecular annotations not present in the integrated sources. According to some defined logic rules, it does so only when the semantic type of data and of their relationships, as well as the cardinality of the relationships, allow identifying molecular biology compliant annotations. Thanks to controlled consistency and quality enforced on data integrated in GPKB, and to the procedures used to avoid error propagation during their automatic processing, we could reliably identify many annotations, which we integrated in GPKB. They comprise 3,144 gene to pathway and 21,942 gene to biological function annotations of many organisms, and 1,027 candidate associations between 317 genetic disorders and 782 human PPIs. Overall estimated recall and precision of our approach were 90.56 % and 96.61 %, respectively. Co-functional evaluation of genes with known function showed high functional similarity between genes with new detected and known annotation to the same pathway; considering also the new detected gene functional annotations enhanced such functional similarity, which resembled the one existing between genes known to be annotated to the same pathway. Strong evidence was also found in the literature for the candidate associations detected between Cystic fibrosis disorder and the PPIs between the CFTR_HUMAN, DERL1_HUMAN, RNF5_HUMAN, AHSA1_HUMAN and GOPC_HUMAN proteins, and between the CHIP_HUMAN and HSP7C_HUMAN proteins. Conclusions Although identified gene annotations and PPI-genetic disorder candidate associations require biological validation, our approach intrinsically provides their in silico evidence based on available data. Public availability within the GPKB (http://www.bioinformatics.deib.polimi.it/GPKB/) of all identified and integrated annotations offers a valuable resource fostering new biomedical-molecular knowledge discoveries. PMID:26046679

  12. Annotation and functional assignment of the genes for the C30 carotenoid pathways from the genomes of two bacteria: Bacillus indicus and Bacillus firmus.

    PubMed

    Steiger, Sabine; Perez-Fons, Laura; Cutting, Simon M; Fraser, Paul D; Sandmann, Gerhard

    2015-01-01

    Bacillus indicus and Bacillus firmus synthesize C30 carotenoids via farnesyl pyrophosphate, forming apophytoene as the first committed step in the pathway. The products of the pathways were methyl 4'-[6-O-acyl-glycosyl)oxy]-4,4'-diapolycopen-4-oic acid and 4,4'-diapolycopen-4,4'-dioic acid with putative glycosyl esters. The genomes of both bacteria were sequenced, and the genes for their early terpenoid and specific carotenoid pathways annotated. All genes for a functional 1-deoxy-d-xylulose 5-phosphate synthase pathway were identified in both species, whereas genes of the mevalonate pathway were absent. The genes for specific carotenoid synthesis and conversion were found on gene clusters which were organized differently in the two species. The genes involved in the formation of the carotenoid cores were assigned by functional complementation in Escherichia coli. This bacterium was co-transformed with a plasmid mediating the formation of the putative substrate and a second plasmid with the gene of interest. Carotenoid products in the transformants were determined by HPLC. Using this approach, we identified the genes for a 4,4'-diapophytoene synthase (crtM), 4,4'-diapophytoene desaturase (crtNa), 4,4'-diapolycopene ketolase (crtNb) and 4,4'-diapolycopene aldehyde oxidase (crtNc). The three crtN genes were closely related and belonged to the crtI gene family with a similar reaction mechanism of their enzyme products. Additional genes encoding glycosyltransferases and acyltransferases for the modification of the carotenoid skeleton of the diapolycopenoic acids were identified by comparison with the corresponding genes from other bacteria. PMID:25326460

  13. The Disease and Gene Annotations (DGA): an annotation resource for human disease.

    PubMed

    Peng, Kai; Xu, Wei; Zheng, Jianyong; Huang, Kegui; Wang, Huisong; Tong, Jiansong; Lin, Zhifeng; Liu, Jun; Cheng, Wenqing; Fu, Dong; Du, Pan; Kibbe, Warren A; Lin, Simon M; Xia, Tian

    2013-01-01

    Disease and Gene Annotations database (DGA, http://dga.nubic.northwestern.edu) is a collaborative effort aiming to provide a comprehensive and integrative annotation of the human genes in disease network context by integrating computable controlled vocabulary of the Disease Ontology (DO version 3 revision 2510, which has 8043 inherited, developmental and acquired human diseases), NCBI Gene Reference Into Function (GeneRIF) and molecular interaction network (MIN). DGA integrates these resources together using semantic mappings to build an integrative set of disease-to-gene and gene-to-gene relationships with excellent coverage based on current knowledge. DGA is kept current by periodically reparsing DO, GeneRIF, and MINs. DGA provides a user-friendly and interactive web interface system enabling users to efficiently query, download and visualize the DO tree structure and annotations as a tree, a network graph or a tabular list. To facilitate integrative analysis, DGA provides a web service Application Programming Interface for integration with external analytic tools. PMID:23197658

  14. The disease and gene annotations (DGA): an annotation resource for human disease

    PubMed Central

    Peng, Kai; Xu, Wei; Zheng, Jianyong; Huang, Kegui; Wang, Huisong; Tong, Jiansong; Lin, Zhifeng; Liu, Jun; Cheng, Wenqing; Fu, Dong; Du, Pan; Kibbe, Warren A.; Lin, Simon M.; Xia, Tian

    2013-01-01

    Disease and Gene Annotations database (DGA, http://dga.nubic.northwestern.edu) is a collaborative effort aiming to provide a comprehensive and integrative annotation of the human genes in disease network context by integrating computable controlled vocabulary of the Disease Ontology (DO version 3 revision 2510, which has 8043 inherited, developmental and acquired human diseases), NCBI Gene Reference Into Function (GeneRIF) and molecular interaction network (MIN). DGA integrates these resources together using semantic mappings to build an integrative set of disease-to-gene and gene-to-gene relationships with excellent coverage based on current knowledge. DGA is kept current by periodically reparsing DO, GeneRIF, and MINs. DGA provides a user-friendly and interactive web interface system enabling users to efficiently query, download and visualize the DO tree structure and annotations as a tree, a network graph or a tabular list. To facilitate integrative analysis, DGA provides a web service Application Programming Interface for integration with external analytic tools. PMID:23197658

  15. Functional Annotation of Rheumatoid Arthritis and Osteoarthritis Associated Genes by Integrative Genome-Wide Gene Expression Profiling Analysis

    PubMed Central

    Li, Zhan-Chun; Xiao, Jie; Peng, Jin-Liang; Chen, Jian-Wei; Ma, Tao; Cheng, Guang-Qi; Dong, Yu-Qi; Wang, Wei-li; Liu, Zu-De

    2014-01-01

    Background Rheumatoid arthritis (RA) and osteoarthritis (OA) are two major types of joint diseases that share multiple common symptoms. However, their pathological mechanism remains largely unknown. The aim of our study is to identify RA and OA related-genes and gain an insight into the underlying genetic basis of these diseases. Methods We collected 11 whole genome-wide expression profiling datasets from RA and OA cohorts and performed a meta-analysis to comprehensively investigate their expression signatures. This method can avoid some pitfalls of single dataset analyses. Results and Conclusion We found that several biological pathways (i.e., the immunity, inflammation and apoptosis related pathways) are commonly involved in the development of both RA and OA. Whereas several other pathways (i.e., vasopressin-related pathway, regulation of autophagy, endocytosis, calcium transport and endoplasmic reticulum stress related pathways) present significant difference between RA and OA. This study provides novel insights into the molecular mechanisms underlying this disease, thereby aiding the diagnosis and treatment of the disease. PMID:24551036

  16. GFam: a platform for automatic annotation of gene families

    PubMed Central

    Sasidharan, Rajkumar; Nepusz, Tamás; Swarbreck, David; Huala, Eva; Paccanaro, Alberto

    2012-01-01

    We have developed GFam, a platform for automatic annotation of gene/protein families. GFam provides a framework for genome initiatives and model organism resources to build domain-based families, derive meaningful functional labels and offers a seamless approach to propagate functional annotation across periodic genome updates. GFam is a hybrid approach that uses a greedy algorithm to chain component domains from InterPro annotation provided by its 12 member resources followed by a sequence-based connected component analysis of un-annotated sequence regions to derive consensus domain architecture for each sequence and subsequently generate families based on common architectures. Our integrated approach increases sequence coverage by 7.2 percentage points and residue coverage by 14.6 percentage points higher than the coverage relative to the best single-constituent database within InterPro for the proteome of Arabidopsis. The true power of GFam lies in maximizing annotation provided by the different InterPro data sources that offer resource-specific coverage for different regions of a sequence. GFam’s capability to capture higher sequence and residue coverage can be useful for genome annotation, comparative genomics and functional studies. GFam is a general-purpose software and can be used for any collection of protein sequences. The software is open source and can be obtained from http://www.paccanarolab.org/software/gfam/. PMID:22790981

  17. A guide to best practices for Gene Ontology (GO) manual annotation

    PubMed Central

    Balakrishnan, Rama; Harris, Midori A.; Huntley, Rachael; Van Auken, Kimberly; Cherry, J. Michael

    2013-01-01

    The Gene Ontology Consortium (GOC) is a community-based bioinformatics project that classifies gene product function through the use of structured controlled vocabularies. A fundamental application of the Gene Ontology (GO) is in the creation of gene product annotations, evidence-based associations between GO definitions and experimental or sequence-based analysis. Currently, the GOC disseminates 126 million annotations covering >374 000 species including all the kingdoms of life. This number includes two classes of GO annotations: those created manually by experienced biocurators reviewing the literature or by examination of biological data (1.1 million annotations covering 2226 species) and those generated computationally via automated methods. As manual annotations are often used to propagate functional predictions between related proteins within and between genomes, it is critical to provide accurate consistent manual annotations. Toward this goal, we present here the conventions defined by the GOC for the creation of manual annotation. This guide represents the best practices for manual annotation as established by the GOC project over the past 12 years. We hope this guide will encourage research communities to annotate gene products of their interest to enhance the corpus of GO annotations available to all. Database URL: http://www.geneontology.org PMID:23842463

  18. KEGG as a reference resource for gene and protein annotation

    PubMed Central

    Kanehisa, Minoru; Sato, Yoko; Kawashima, Masayuki; Furumichi, Miho; Tanabe, Mao

    2016-01-01

    KEGG (http://www.kegg.jp/ or http://www.genome.jp/kegg/) is an integrated database resource for biological interpretation of genome sequences and other high-throughput data. Molecular functions of genes and proteins are associated with ortholog groups and stored in the KEGG Orthology (KO) database. The KEGG pathway maps, BRITE hierarchies and KEGG modules are developed as networks of KO nodes, representing high-level functions of the cell and the organism. Currently, more than 4000 complete genomes are annotated with KOs in the KEGG GENES database, which can be used as a reference data set for KO assignment and subsequent reconstruction of KEGG pathways and other molecular networks. As an annotation resource, the following improvements have been made. First, each KO record is re-examined and associated with protein sequence data used in experiments of functional characterization. Second, the GENES database now includes viruses, plasmids, and the addendum category for functionally characterized proteins that are not represented in complete genomes. Third, new automatic annotation servers, BlastKOALA and GhostKOALA, are made available utilizing the non-redundant pangenome data set generated from the GENES database. As a resource for translational bioinformatics, various data sets are created for antimicrobial resistance and drug interaction networks. PMID:26476454

  19. High-resolution functional annotation of human transcriptome: predicting isoform functions by a

    E-print Network

    Zhou, Xianghong Jasmine

    function prediction, isoform function prediction faces a unique chal- lenge: the lack of the training dataHigh-resolution functional annotation of human transcriptome: predicting isoform functions mechanism for generating functional diversity in genes. However, little is known about the precise functions

  20. The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology.

    PubMed

    Camon, Evelyn; Magrane, Michele; Barrell, Daniel; Lee, Vivian; Dimmer, Emily; Maslen, John; Binns, David; Harte, Nicola; Lopez, Rodrigo; Apweiler, Rolf

    2004-01-01

    The Gene Ontology Annotation (GOA) database (http://www.ebi.ac.uk/GOA) aims to provide high-quality electronic and manual annotations to the UniProt Knowledgebase (Swiss-Prot, TrEMBL and PIR-PSD) using the standardized vocabulary of the Gene Ontology (GO). As a supplementary archive of GO annotation, GOA promotes a high level of integration of the knowledge represented in UniProt with other databases. This is achieved by converting UniProt annotation into a recognized computational format. GOA provides annotated entries for nearly 60,000 species (GOA-SPTr) and is the largest and most comprehensive open-source contributor of annotations to the GO Consortium annotation effort. By integrating GO annotations from other model organism groups, GOA consolidates specialized knowledge and expertise to ensure the data remain a key reference for up-to-date biological information. Furthermore, the GOA database fully endorses the Human Proteomics Initiative by prioritizing the annotation of proteins likely to benefit human health and disease. In addition to a non-redundant set of annotations to the human proteome (GOA-Human) and monthly releases of its GO annotation for all species (GOA-SPTr), a series of GO mapping files and specific cross-references in other databases are also regularly distributed. GOA can be queried through a simple user-friendly web interface or downloaded in a parsable format via the EBI and GO FTP websites. The GOA data set can be used to enhance the annotation of particular model organism or gene expression data sets, although increasingly it has been used to evaluate GO predictions generated from text mining or protein interaction experiments. In 2004, the GOA team will build on its success and will continue to supplement the functional annotation of UniProt and work towards enhancing the ability of scientists to access all available biological information. Researchers wishing to query or contribute to the GOA project are encouraged to email: goa@ebi.ac.uk. PMID:14681408

  1. Critical Assessment of Function Annotation Meeting, 2011

    SciTech Connect

    Friedberg, Iddo

    2015-01-21

    The Critical Assessment of Function Annotation meeting was held July 14-15, 2011 at the Austria Conference Center in Vienna, Austria. There were 73 registered delegates at the meeting. We thank the DOE for this award. It helped us organize and support a scientific meeting AFP 2011 as a special interest group (SIG) meeting associated with the ISMB 2011 conference. The conference was held in Vienna, Austria, in July 2011. The AFP SIG was held on July 15-16, 2011 (immediately preceding the conference). The meeting consisted of two components, the first being a series of talks (invited and contributed) and discussion sections dedicated to protein function research, with an emphasis on the theory and practice of computational methods utilized in functional annotation. The second component provided a large-scale assessment of computational methods through participation in the Critical Assessment of Functional Annotation (CAFA).

  2. Gene Model Annotations for Drosophila melanogaster: Impact of High-Throughput Data

    PubMed Central

    Matthews, Beverley B.; dos Santos, Gilberto; Crosby, Madeline A.; Emmert, David B.; St. Pierre, Susan E.; Gramates, L. Sian; Zhou, Pinglei; Schroeder, Andrew J.; Falls, Kathleen; Strelets, Victor; Russo, Susan M.; Gelbart, William M.

    2015-01-01

    We report the current status of the FlyBase annotated gene set for Drosophila melanogaster and highlight improvements based on high-throughput data. The FlyBase annotated gene set consists entirely of manually annotated gene models, with the exception of some classes of small non-coding RNAs. All gene models have been reviewed using evidence from high-throughput datasets, primarily from the modENCODE project. These datasets include RNA-Seq coverage data, RNA-Seq junction data, transcription start site profiles, and translation stop-codon read-through predictions. New annotation guidelines were developed to take into account the use of the high-throughput data. We describe how this flood of new data was incorporated into thousands of new and revised annotations. FlyBase has adopted a philosophy of excluding low-confidence and low-frequency data from gene model annotations; we also do not attempt to represent all possible permutations for complex and modularly organized genes. This has allowed us to produce a high-confidence, manageable gene annotation dataset that is available at FlyBase (http://flybase.org). Interesting aspects of new annotations include new genes (coding, non-coding, and antisense), many genes with alternative transcripts with very long 3? UTRs (up to 15–18 kb), and a stunning mismatch in the number of male-specific genes (approximately 13% of all annotated gene models) vs. female-specific genes (less than 1%). The number of identified pseudogenes and mutations in the sequenced strain also increased significantly. We discuss remaining challenges, for instance, identification of functional small polypeptides and detection of alternative translation starts. PMID:26109357

  3. Gene Model Annotations for Drosophila melanogaster: Impact of High-Throughput Data.

    PubMed

    Matthews, Beverley B; Dos Santos, Gilberto; Crosby, Madeline A; Emmert, David B; St Pierre, Susan E; Gramates, L Sian; Zhou, Pinglei; Schroeder, Andrew J; Falls, Kathleen; Strelets, Victor; Russo, Susan M; Gelbart, William M

    2015-08-01

    We report the current status of the FlyBase annotated gene set for Drosophila melanogaster and highlight improvements based on high-throughput data. The FlyBase annotated gene set consists entirely of manually annotated gene models, with the exception of some classes of small non-coding RNAs. All gene models have been reviewed using evidence from high-throughput datasets, primarily from the modENCODE project. These datasets include RNA-Seq coverage data, RNA-Seq junction data, transcription start site profiles, and translation stop-codon read-through predictions. New annotation guidelines were developed to take into account the use of the high-throughput data. We describe how this flood of new data was incorporated into thousands of new and revised annotations. FlyBase has adopted a philosophy of excluding low-confidence and low-frequency data from gene model annotations; we also do not attempt to represent all possible permutations for complex and modularly organized genes. This has allowed us to produce a high-confidence, manageable gene annotation dataset that is available at FlyBase (http://flybase.org). Interesting aspects of new annotations include new genes (coding, non-coding, and antisense), many genes with alternative transcripts with very long 3' UTRs (up to 15-18 kb), and a stunning mismatch in the number of male-specific genes (approximately 13% of all annotated gene models) vs. female-specific genes (less than 1%). The number of identified pseudogenes and mutations in the sequenced strain also increased significantly. We discuss remaining challenges, for instance, identification of functional small polypeptides and detection of alternative translation starts. PMID:26109357

  4. Mining the Gene Wiki for functional genomic knowledge

    PubMed Central

    2011-01-01

    Background Ontology-based gene annotations are important tools for organizing and analyzing genome-scale biological data. Collecting these annotations is a valuable but costly endeavor. The Gene Wiki makes use of Wikipedia as a low-cost, mass-collaborative platform for assembling text-based gene annotations. The Gene Wiki is comprised of more than 10,000 review articles, each describing one human gene. The goal of this study is to define and assess a computational strategy for translating the text of Gene Wiki articles into ontology-based gene annotations. We specifically explore the generation of structured annotations using the Gene Ontology and the Human Disease Ontology. Results Our system produced 2,983 candidate gene annotations using the Disease Ontology and 11,022 candidate annotations using the Gene Ontology from the text of the Gene Wiki. Based on manual evaluations and comparisons to reference annotation sets, we estimate a precision of 90-93% for the Disease Ontology annotations and 48-64% for the Gene Ontology annotations. We further demonstrate that this data set can systematically improve the results from gene set enrichment analyses. Conclusions The Gene Wiki is a rapidly growing corpus of text focused on human gene function. Here, we demonstrate that the Gene Wiki can be a powerful resource for generating ontology-based gene annotations. These annotations can be used immediately to improve workflows for building curated gene annotation databases and knowledge-based statistical analyses. PMID:22165947

  5. Semantically Improved Genome-Wide Prediction of Gene Ontology Annotations

    E-print Network

    Tagliasacchi, Marco

    on the automatic prediction of Gene Ontology annotations based on the singular value decomposition (SVDSemantically Improved Genome-Wide Prediction of Gene Ontology Annotations Marco Masseroli, Marco by means of controlled terminologies and ontologies are extremely valuable, in particular for computational

  6. Predicting Novel Human Gene Ontology Annotations Using Semantic Analysis

    PubMed Central

    Done, Bogdan; Khatri, Purvesh; Done, Arina; Draghici, Sorin

    2013-01-01

    The correct interpretation of many molecular biology experiments depends in an essential way on the accuracy and consistency of the existing annotation databases. Such databases are meant to act as repositories for our biological knowledge as we acquire and refine it. Hence, by definition, they are incomplete at any given time. In this paper, we describe a technique that improves our previous method for predicting novel GO annotations by extracting implicit semantic relationships between genes and functions. In this work, we use a vector space model and a number of weighting schemes in addition to our previous latent semantic indexing approach. The technique described here is able to take into consideration the hierarchical structure of the Gene Ontology (GO) and can weight differently GO terms situated at different depths. The prediction abilities of 15 different weighting schemes are compared and evaluated. Nine such schemes were previously used in other problem domains, while six of them are introduced in this paper. The best weighting scheme was a novel scheme, n2tn. Out of the top 50 functional annotations predicted using this weighting scheme, we found support in the literature for 84 percent of them, while 6 percent of the predictions were contradicted by the existing literature. For the remaining 10 percent, we did not find any relevant publications to confirm or contradict the predictions. The n2tn weighting scheme also outperformed the simple binary scheme used in our previous approach. PMID:20150671

  7. Comprehensive comparative homeobox gene annotation in human and mouse

    PubMed Central

    Wilming, Laurens G.; Boychenko, Veronika; Harrow, Jennifer L.

    2015-01-01

    Homeobox genes are a group of genes coding for transcription factors with a DNA-binding helix-turn-helix structure called a homeodomain and which play a crucial role in pattern formation during embryogenesis. Many homeobox genes are located in clusters and some of these, most notably the HOX genes, are known to have antisense or opposite strand long non-coding RNA (lncRNA) genes that play a regulatory role. Because automated annotation of both gene clusters and non-coding genes is fraught with difficulty (over-prediction, under-prediction, inaccurate transcript structures), we set out to manually annotate all homeobox genes in the mouse and human genomes. This includes all supported splice variants, pseudogenes and both antisense and flanking lncRNAs. One of the areas where manual annotation has a significant advantage is the annotation of duplicated gene clusters. After comprehensive annotation of all homeobox genes and their antisense genes in human and in mouse, we found some discrepancies with the current gene set in RefSeq regarding exact gene structures and coding versus pseudogene locus biotype. We also identified previously un-annotated pseudogenes in the DUX, Rhox and Obox gene clusters, which helped us re-evaluate and update the gene nomenclature in these regions. We found that human homeobox genes are enriched in antisense lncRNA loci, some of which are known to play a role in gene or gene cluster regulation, compared to their mouse orthologues. Of the annotated set of 241 human protein-coding homeobox genes, 98 have an antisense locus (41%) while of the 277 orthologous mouse genes, only 62 protein coding gene have an antisense locus (22%), based on publicly available transcriptional evidence. PMID:26412852

  8. Dependence Relationships between Gene Ontology Terms based TIGR Gene Product Annotations

    E-print Network

    Borgelt, Christian

    , biological processes annotation of genes and gene products. These constitute three separate ontologiesDependence Relationships between Gene Ontology Terms based TIGR Gene Product Annotations Anand.de/~borgelt/ Abstract Gene Ontology important tool representation and processing information about gene products

  9. HMM-Based Gene Annotation Methods

    SciTech Connect

    Haussler, David; Hughey, Richard; Karplus, Keven

    1999-09-20

    Development of new statistical methods and computational tools to identify genes in human genomic DNA, and to provide clues to their functions by identifying features such as transcription factor binding sites, tissue, specific expression and splicing patterns, and remove homologies at the protein level with genes of known function.

  10. Functional annotation of the human retinal pigment epithelium transcriptome

    PubMed Central

    Booij, Judith C; van Soest, Simone; Swagemakers, Sigrid MA; Essing, Anke HW; Verkerk, Annemieke JMH; van der Spek, Peter J; Gorgels, Theo GMF; Bergen, Arthur AB

    2009-01-01

    Background To determine level, variability and functional annotation of gene expression of the human retinal pigment epithelium (RPE), the key tissue involved in retinal diseases like age-related macular degeneration and retinitis pigmentosa. Macular RPE cells from six selected healthy human donor eyes (aged 63–78 years) were laser dissected and used for 22k microarray studies (Agilent technologies). Data were analyzed with Rosetta Resolver, the web tool DAVID and Ingenuity software. Results In total, we identified 19,746 array entries with significant expression in the RPE. Gene expression was analyzed according to expression levels, interindividual variability and functionality. A group of highly (n = 2,194) expressed RPE genes showed an overrepresentation of genes of the oxidative phosphorylation, ATP synthesis and ribosome pathways. In the group of moderately expressed genes (n = 8,776) genes of the phosphatidylinositol signaling system and aminosugars metabolism were overrepresented. As expected, the top 10 percent (n = 2,194) of genes with the highest interindividual differences in expression showed functional overrepresentation of the complement cascade, essential in inflammation in age-related macular degeneration, and other signaling pathways. Surprisingly, this same category also includes the genes involved in Bruch's membrane (BM) composition. Among the top 10 percent of genes with low interindividual differences, there was an overrepresentation of genes involved in local glycosaminoglycan turnover. Conclusion Our study expands current knowledge of the RPE transcriptome by assigning new genes, and adding data about expression level and interindividual variation. Functional annotation suggests that the RPE has high levels of protein synthesis, strong energy demands, and is exposed to high levels of oxidative stress and a variable degree of inflammation. Our data sheds new light on the molecular composition of BM, adjacent to the RPE, and is useful for candidate retinal disease gene identification or gene dose-dependent therapeutic studies. PMID:19379482

  11. Software Suite for Gene and Protein Annotation Prediction and Similarity Search.

    PubMed

    Chicco, Davide; Masseroli, Marco

    2015-01-01

    In the computational biology community, machine learning algorithms are key instruments for many applications, including the prediction of gene-functions based upon the available biomolecular annotations. Additionally, they may also be employed to compute similarity between genes or proteins. Here, we describe and discuss a software suite we developed to implement and make publicly available some of such prediction methods and a computational technique based upon Latent Semantic Indexing (LSI), which leverages both inferred and available annotations to search for semantically similar genes. The suite consists of three components. BioAnnotationPredictor is a computational software module to predict new gene-functions based upon Singular Value Decomposition of available annotations. SimilBio is a Web module that leverages annotations available or predicted by BioAnnotationPredictor to discover similarities between genes via LSI. The suite includes also SemSim, a new Web service built upon these modules to allow accessing them programmatically. We integrated SemSim in the Bio Search Computing framework (http://www.bioinformatics.deib. polimi.it/bio-seco/seco/), where users can exploit the Search Computing technology to run multi-topic complex queries on multiple integrated Web services. Accordingly, researchers may obtain ranked answers involving the computation of the functional similarity between genes in support of biomedical knowledge discovery. PMID:26357324

  12. CvManGO, a method for leveraging computational predictions to improve literature-based Gene Ontology annotations.

    PubMed

    Park, Julie; Costanzo, Maria C; Balakrishnan, Rama; Cherry, J Michael; Hong, Eurie L

    2012-01-01

    The set of annotations at the Saccharomyces Genome Database (SGD) that classifies the cellular function of S. cerevisiae gene products using Gene Ontology (GO) terms has become an important resource for facilitating experimental analysis. In addition to capturing and summarizing experimental results, the structured nature of GO annotations allows for functional comparison across organisms as well as propagation of functional predictions between related gene products. Due to their relevance to many areas of research, ensuring the accuracy and quality of these annotations is a priority at SGD. GO annotations are assigned either manually, by biocurators extracting experimental evidence from the scientific literature, or through automated methods that leverage computational algorithms to predict functional information. Here, we discuss the relationship between literature-based and computationally predicted GO annotations in SGD and extend a strategy whereby comparison of these two types of annotation identifies genes whose annotations need review. Our method, CvManGO (Computational versus Manual GO annotations), pairs literature-based GO annotations with computational GO predictions and evaluates the relationship of the two terms within GO, looking for instances of discrepancy. We found that this method will identify genes that require annotation updates, taking an important step towards finding ways to prioritize literature review. Additionally, we explored factors that may influence the effectiveness of CvManGO in identifying relevant gene targets to find in particular those genes that are missing literature-supported annotations, but our survey found that there are no immediately identifiable criteria by which one could enrich for these under-annotated genes. Finally, we discuss possible ways to improve this strategy, and the applicability of this method to other projects that use the GO for curation. DATABASE URL: http://www.yeastgenome.org. PMID:22434836

  13. Evolutionary Trace Annotation of Protein Function in the Structural Proteome

    PubMed Central

    Erdin, Serkan; Ward, R. Matthew; Venner, Eric

    2010-01-01

    By design, structural genomics (SG) solves many structures that cannot be assigned function based on homology to known proteins. Alternative function annotation methods are therefore needed and this study focuses on function prediction with three-dimensional (3D) templates: small structural motifs built of just a few functionally critical residues. Although experimentally proven functional residues are scarce, we show here that Evolutionary Trace (ET) rankings of residue importance are sufficient to build 3D templates, match them, and then assign Gene Ontology (GO) functions in enzymes and non-enzymes alike. In a high specificity mode, this Evolutionary Trace Annotation (ETA) method covered half (53%) of the 2384 annotated SG protein controls. Three-quarters (76%) of predictions were both correct and complete. The positive predictive value for all GO depths (all-depth PPV) was 84%, and it rose to 94% over GO depths 1– 3 (depth 3 PPV). In a high sensitivity mode coverage rose significantly (84%) while accuracy fell moderately: 68% of predictions were both correct and complete, all-depth PPV was 75%, and depth 3 PPV was 86%. These data concur with prior mutational experiments showing that ET rank information identifies key functional determinants in proteins. In practice, ETA predicted functions in 42% of 3461 un-annotated SG proteins. In 529 cases—including 280 non-enzymes and 21 for metal ion ligands—the expected accuracy is 84% at any GO depth and 94% down to GO depth 3, while for the remaining 931 the expected accuracies are 60% and 71%, respectively. Thus local structural comparisons of evolutionarily important residues can help decipher protein functions to known reliability levels and without prior assumption on functional mechanisms. ETA is available at http://mammoth.bcm.tmc.edu/eta. PMID:20036248

  14. Transcriptome assembly, gene annotation and tissue gene expression atlas of the rainbow trout.

    PubMed

    Salem, Mohamed; Paneru, Bam; Al-Tobasei, Rafet; Abdouni, Fatima; Thorgaard, Gary H; Rexroad, Caird E; Yao, Jianbo

    2015-01-01

    Efforts to obtain a comprehensive genome sequence for rainbow trout are ongoing and will be complemented by transcriptome information that will enhance genome assembly and annotation. Previously, transcriptome reference sequences were reported using data from different sources. Although the previous work added a great wealth of sequences, a complete and well-annotated transcriptome is still needed. In addition, gene expression in different tissues was not completely addressed in the previous studies. In this study, non-normalized cDNA libraries were sequenced from 13 different tissues of a single doubled haploid rainbow trout from the same source used for the rainbow trout genome sequence. A total of ~1.167 billion paired-end reads were de novo assembled using the Trinity RNA-Seq assembler yielding 474,524 contigs > 500 base-pairs. Of them, 287,593 had homologies to the NCBI non-redundant protein database. The longest contig of each cluster was selected as a reference, yielding 44,990 representative contigs. A total of 4,146 contigs (9.2%), including 710 full-length sequences, did not match any mRNA sequences in the current rainbow trout genome reference. Mapping reads to the reference genome identified an additional 11,843 transcripts not annotated in the genome. A digital gene expression atlas revealed 7,678 housekeeping and 4,021 tissue-specific genes. Expression of about 16,000-32,000 genes (35-71% of the identified genes) accounted for basic and specialized functions of each tissue. White muscle and stomach had the least complex transcriptomes, with high percentages of their total mRNA contributed by a small number of genes. Brain, testis and intestine, in contrast, had complex transcriptomes, with a large numbers of genes involved in their expression patterns. This study provides comprehensive de novo transcriptome information that is suitable for functional and comparative genomics studies in rainbow trout, including annotation of the genome. PMID:25793877

  15. Transcriptome Assembly, Gene Annotation and Tissue Gene Expression Atlas of the Rainbow Trout

    PubMed Central

    Salem, Mohamed; Paneru, Bam; Al-Tobasei, Rafet; Abdouni, Fatima; Thorgaard, Gary H.; Rexroad, Caird E.; Yao, Jianbo

    2015-01-01

    Efforts to obtain a comprehensive genome sequence for rainbow trout are ongoing and will be complemented by transcriptome information that will enhance genome assembly and annotation. Previously, transcriptome reference sequences were reported using data from different sources. Although the previous work added a great wealth of sequences, a complete and well-annotated transcriptome is still needed. In addition, gene expression in different tissues was not completely addressed in the previous studies. In this study, non-normalized cDNA libraries were sequenced from 13 different tissues of a single doubled haploid rainbow trout from the same source used for the rainbow trout genome sequence. A total of ~1.167 billion paired-end reads were de novo assembled using the Trinity RNA-Seq assembler yielding 474,524 contigs > 500 base-pairs. Of them, 287,593 had homologies to the NCBI non-redundant protein database. The longest contig of each cluster was selected as a reference, yielding 44,990 representative contigs. A total of 4,146 contigs (9.2%), including 710 full-length sequences, did not match any mRNA sequences in the current rainbow trout genome reference. Mapping reads to the reference genome identified an additional 11,843 transcripts not annotated in the genome. A digital gene expression atlas revealed 7,678 housekeeping and 4,021 tissue-specific genes. Expression of about 16,000–32,000 genes (35–71% of the identified genes) accounted for basic and specialized functions of each tissue. White muscle and stomach had the least complex transcriptomes, with high percentages of their total mRNA contributed by a small number of genes. Brain, testis and intestine, in contrast, had complex transcriptomes, with a large numbers of genes involved in their expression patterns. This study provides comprehensive de novo transcriptome information that is suitable for functional and comparative genomics studies in rainbow trout, including annotation of the genome. PMID:25793877

  16. News from PSB 2003 Statistically rigorous electronic gene annotation and

    E-print Network

    Spang, Rainer

    , Werner G.Krebs, Philip E. Bourne, UCSD Allows automatic extension of existing ontologies Needs: Cluster and classification of protein data bank sequences using gene ontology terms A Piecewise subtractive quasi electronic gene annotation and classification of protein data bank sequences using gene ontology terms

  17. Drosophila Gene Expression Pattern Annotation Using Sparse Features and Term-Term Interactions

    E-print Network

    Ji, Shuiwang

    Drosophila Gene Expression Pattern Annotation Using Sparse Features and Term-Term Interactions, Arizona State University, Tempe, AZ 85287 ABSTRACT The Drosophila gene expression pattern images document functions, interaction, and networks during Drosophila embryogene- sis. To provide text-based pattern

  18. A robust data-driven approach for gene ontology annotation

    PubMed Central

    Li, Yanpeng; Yu, Hong

    2014-01-01

    Gene ontology (GO) and GO annotation are important resources for biological information management and knowledge discovery, but the speed of manual annotation became a major bottleneck of database curation. BioCreative IV GO annotation task aims to evaluate the performance of system that automatically assigns GO terms to genes based on the narrative sentences in biomedical literature. This article presents our work in this task as well as the experimental results after the competition. For the evidence sentence extraction subtask, we built a binary classifier to identify evidence sentences using reference distance estimator (RDE), a recently proposed semi-supervised learning method that learns new features from around 10 million unlabeled sentences, achieving an F1 of 19.3% in exact match and 32.5% in relaxed match. In the post-submission experiment, we obtained 22.1% and 35.7% F1 performance by incorporating bigram features in RDE learning. In both development and test sets, RDE-based method achieved over 20% relative improvement on F1 and AUC performance against classical supervised learning methods, e.g. support vector machine and logistic regression. For the GO term prediction subtask, we developed an information retrieval-based method to retrieve the GO term most relevant to each evidence sentence using a ranking function that combined cosine similarity and the frequency of GO terms in documents, and a filtering method based on high-level GO classes. The best performance of our submitted runs was 7.8% F1 and 22.2% hierarchy F1. We found that the incorporation of frequency information and hierarchy filtering substantially improved the performance. In the post-submission evaluation, we obtained a 10.6% F1 using a simpler setting. Overall, the experimental analysis showed our approaches were robust in both the two tasks. PMID:25425037

  19. Automatic extraction of gene ontology annotation and its correlation with clusters in protein networks

    PubMed Central

    Daraselia, Nikolai; Yuryev, Anton; Egorov, Sergei; Mazo, Ilya; Ispolatov, Iaroslav

    2007-01-01

    Background Uncovering cellular roles of a protein is a task of tremendous importance and complexity that requires dedicated experimental work as well as often sophisticated data mining and processing tools. Protein functions, often referred to as its annotations, are believed to manifest themselves through topology of the networks of inter-proteins interactions. In particular, there is a growing body of evidence that proteins performing the same function are more likely to interact with each other than with proteins with other functions. However, since functional annotation and protein network topology are often studied separately, the direct relationship between them has not been comprehensively demonstrated. In addition to having the general biological significance, such demonstration would further validate the data extraction and processing methods used to compose protein annotation and protein-protein interactions datasets. Results We developed a method for automatic extraction of protein functional annotation from scientific text based on the Natural Language Processing (NLP) technology. For the protein annotation extracted from the entire PubMed, we evaluated the precision and recall rates, and compared the performance of the automatic extraction technology to that of manual curation used in public Gene Ontology (GO) annotation. In the second part of our presentation, we reported a large-scale investigation into the correspondence between communities in the literature-based protein networks and GO annotation groups of functionally related proteins. We found a comprehensive two-way match: proteins within biological annotation groups form significantly denser linked network clusters than expected by chance and, conversely, densely linked network communities exhibit a pronounced non-random overlap with GO groups. We also expanded the publicly available GO biological process annotation using the relations extracted by our NLP technology. An increase in the number and size of GO groups without any noticeable decrease of the link density within the groups indicated that this expansion significantly broadens the public GO annotation without diluting its quality. We revealed that functional GO annotation correlates mostly with clustering in a physical interaction protein network, while its overlap with indirect regulatory network communities is two to three times smaller. Conclusion Protein functional annotations extracted by the NLP technology expand and enrich the existing GO annotation system. The GO functional modularity correlates mostly with the clustering in the physical interaction network, suggesting that the essential role of structural organization maintained by these interactions. Reciprocally, clustering of proteins in physical interaction networks can serve as an evidence for their functional similarity. PMID:17620146

  20. A model selection criterion for model-based clustering of annotated gene expression data.

    PubMed

    Gallopin, Mélina; Celeux, Gilles; Jaffrézic, Florence; Rau, Andrea

    2015-11-01

    In co-expression analyses of gene expression data, it is often of interest to interpret clusters of co-expressed genes with respect to a set of external information, such as a potentially incomplete list of functional properties for which a subset of genes may be annotated. Based on the framework of finite mixture models, we propose a model selection criterion that takes into account such external gene annotations, providing an efficient tool for selecting a relevant number of clusters and clustering model. This criterion, called the integrated completed annotated likelihood (ICAL), is defined by adding an entropy term to a penalized likelihood to measure the concordance between a clustering partition and the external annotation information. The ICAL leads to the choice of a model that is more easily interpretable with respect to the known functional gene annotations. We illustrate the interest of this model selection criterion in conjunction with Gaussian mixture models on simulated gene expression data and on real RNA-seq data. PMID:26461845

  1. GeneDB—an annotation database for pathogens

    PubMed Central

    Logan-Klumpler, Flora J.; De Silva, Nishadi; Boehme, Ulrike; Rogers, Matthew B.; Velarde, Giles; McQuillan, Jacqueline A.; Carver, Tim; Aslett, Martin; Olsen, Christian; Subramanian, Sandhya; Phan, Isabelle; Farris, Carol; Mitra, Siddhartha; Ramasamy, Gowthaman; Wang, Haiming; Tivey, Adrian; Jackson, Andrew; Houston, Robin; Parkhill, Julian; Holden, Matthew; Harb, Omar S.; Brunk, Brian P.; Myler, Peter J.; Roos, David; Carrington, Mark; Smith, Deborah F.; Hertz-Fowler, Christiane; Berriman, Matthew

    2012-01-01

    GeneDB (http://www.genedb.org) is a genome database for prokaryotic and eukaryotic pathogens and closely related organisms. The resource provides a portal to genome sequence and annotation data, which is primarily generated by the Pathogen Genomics group at the Wellcome Trust Sanger Institute. It combines data from completed and ongoing genome projects with curated annotation, which is readily accessible from a web based resource. The development of the database in recent years has focused on providing database-driven annotation tools and pipelines, as well as catering for increasingly frequent assembly updates. The website has been significantly redesigned to take advantage of current web technologies, and improve usability. The current release stores 41 data sets, of which 17 are manually curated and maintained by biologists, who review and incorporate data from the scientific literature, as well as other sources. GeneDB is primarily a production and annotation database for the genomes of predominantly pathogenic organisms. PMID:22116062

  2. Drosophila Gene Expression Pattern Annotation through Multi-Instance

    E-print Network

    Ji, Shuiwang

    Drosophila Gene Expression Pattern Annotation through Multi-Instance Multi-Label Learning Ying-Xin Li, Shuiwang Ji, Sudhir Kumar, Jieping Ye, and Zhi-Hua Zhou Abstract--In the studies of DrosophilaExpress database (a digital library of standardized Drosophila gene expression pattern images) reveal

  3. Dizeez: an online game for human gene-disease annotation.

    PubMed

    Loguercio, Salvatore; Good, Benjamin M; Su, Andrew I

    2013-01-01

    Structured gene annotations are a foundation upon which many bioinformatics and statistical analyses are built. However the structured annotations available in public databases are a sparse representation of biological knowledge as a whole. The rate of biomedical data generation is such that centralized biocuration efforts struggle to keep up. New models for gene annotation need to be explored that expand the pace at which we are able to structure biomedical knowledge. Recently, online games have emerged as an effective way to recruit, engage and organize large numbers of volunteers to help address difficult biological challenges. For example, games have been successfully developed for protein folding (Foldit), multiple sequence alignment (Phylo) and RNA structure design (EteRNA). Here we present Dizeez, a simple online game built with the purpose of structuring knowledge of gene-disease associations. Preliminary results from game play online and at scientific conferences suggest that Dizeez is producing valid gene-disease annotations not yet present in any public database. These early results provide a basic proof of principle that online games can be successfully applied to the challenge of gene annotation. Dizeez is available at http://genegames.org. PMID:23951102

  4. CATH: comprehensive structural and functional annotations for genome sequences

    PubMed Central

    Sillitoe, Ian; Lewis, Tony E.; Cuff, Alison; Das, Sayoni; Ashford, Paul; Dawson, Natalie L.; Furnham, Nicholas; Laskowski, Roman A.; Lee, David; Lees, Jonathan G.; Lehtinen, Sonja; Studer, Romain A.; Thornton, Janet; Orengo, Christine A.

    2015-01-01

    The latest version of the CATH-Gene3D protein structure classification database (4.0, http://www.cathdb.info) provides annotations for over 235 000 protein domain structures and includes 25 million domain predictions. This article provides an update on the major developments in the 2 years since the last publication in this journal including: significant improvements to the predictive power of our functional families (FunFams); the release of our ‘current’ putative domain assignments (CATH-B); a new, strictly non-redundant data set of CATH domains suitable for homology benchmarking experiments (CATH-40) and a number of improvements to the web pages. PMID:25348408

  5. CATH: comprehensive structural and functional annotations for genome sequences.

    PubMed

    Sillitoe, Ian; Lewis, Tony E; Cuff, Alison; Das, Sayoni; Ashford, Paul; Dawson, Natalie L; Furnham, Nicholas; Laskowski, Roman A; Lee, David; Lees, Jonathan G; Lehtinen, Sonja; Studer, Romain A; Thornton, Janet; Orengo, Christine A

    2015-01-01

    The latest version of the CATH-Gene3D protein structure classification database (4.0, http://www.cathdb.info) provides annotations for over 235,000 protein domain structures and includes 25 million domain predictions. This article provides an update on the major developments in the 2 years since the last publication in this journal including: significant improvements to the predictive power of our functional families (FunFams); the release of our 'current' putative domain assignments (CATH-B); a new, strictly non-redundant data set of CATH domains suitable for homology benchmarking experiments (CATH-40) and a number of improvements to the web pages. PMID:25348408

  6. Re-annotation of genome microbial CoDing-Sequences: finding new genes and inaccurately annotated genes

    PubMed Central

    2002-01-01

    Background Analysis of any newly sequenced bacterial genome starts with the identification of protein-coding genes. Despite the accumulation of multiple complete genome sequences, which provide useful comparisons with close relatives among other organisms during the annotation process, accurate gene prediction remains quite difficult. A major reason for this situation is that genes are tightly packed in prokaryotes, resulting in frequent overlap. Thus, detection of translation initiation sites and/or selection of the correct coding regions remain difficult unless appropriate biological knowledge (about the structure of a gene) is imbedded in the approach. Results We have developed a new program that automatically identifies biologically significant candidate genes in a bacterial genome. Twenty-six complete prokaryotic genomes were analyzed using this tool, and the accuracy of gene finding was assessed by comparison with existing annotations. This analysis revealed that, despite the enormous effort of genome program annotators, a small but not negligible number of genes annotated within the framework of sequencing projects are likely to be partially inaccurate or plainly wrong. Moreover, the analysis of several putative new genes shows that, as expected, many short genes have escaped annotation. In most cases, these new genes revealed frameshifts that could be either artifacts or genuine frameshifts. Some entirely unexpected new genes have also been identified. This allowed us to get a more complete picture of prokaryotic genomes. The results of this procedure are progressively integrated into the SWISS-PROT reference databank. Conclusions The results described in the present study show that our procedure is very satisfactory in terms of gene finding accuracy. Except in few cases, discrepancies between our results and annotations provided by individual authors can be accounted for by the nature of each annotation process or by specific characteristics of some genomes. This stresses that close cooperation between scientists, regular update and curation of the findings in databases are clearly required to reduce the level of errors in genome annotation (and hence in reducing the unfortunate spreading of errors through centralized data libraries). PMID:11879526

  7. De Novo Assembly, Functional Annotation and Comparative Analysis of Withania somnifera Leaf and Root Transcriptomes to Identify Putative Genes Involved in the Withanolides Biosynthesis

    PubMed Central

    Gupta, Parul; Goel, Ridhi; Pathak, Sumya; Srivastava, Apeksha; Singh, Surya Pratap; Sangwan, Rajender Singh; Asif, Mehar Hasan; Trivedi, Prabodh Kumar

    2013-01-01

    Withania somnifera is one of the most valuable medicinal plants used in Ayurvedic and other indigenous medicine systems due to bioactive molecules known as withanolides. As genomic information regarding this plant is very limited, little information is available about biosynthesis of withanolides. To facilitate the basic understanding about the withanolide biosynthesis pathways, we performed transcriptome sequencing for Withania leaf (101L) and root (101R) which specifically synthesize withaferin A and withanolide A, respectively. Pyrosequencing yielded 8,34,068 and 7,21,755 reads which got assembled into 89,548 and 1,14,814 unique sequences from 101L and 101R, respectively. A total of 47,885 (101L) and 54,123 (101R) could be annotated using TAIR10, NR, tomato and potato databases. Gene Ontology and KEGG analyses provided a detailed view of all the enzymes involved in withanolide backbone synthesis. Our analysis identified members of cytochrome P450, glycosyltransferase and methyltransferase gene families with unique presence or differential expression in leaf and root and might be involved in synthesis of tissue-specific withanolides. We also detected simple sequence repeats (SSRs) in transcriptome data for use in future genetic studies. Comprehensive sequence resource developed for Withania, in this study, will help to elucidate biosynthetic pathway for tissue-specific synthesis of secondary plant products in non-model plant organisms as well as will be helpful in developing strategies for enhanced biosynthesis of withanolides through biotechnological approaches. PMID:23667511

  8. Proteomic Detection of Non-Annotated Protein-Coding Genes in Pseudomonas fluorescens Pf0-1

    SciTech Connect

    Kim, Wook; Silby, Mark W.; Purvine, Samuel O.; Nicoll, Julie S.; Hixson, Kim K.; Monroe, Matthew E.; Nicora, Carrie D.; Lipton, Mary S.; Levy, Stuart B.

    2009-12-24

    Genome sequences are annotated by computational prediction of coding sequences, followed by similarity searches such as BLAST, which provide a layer of (possible) functional information. While the existence of processes such as alternative splicing complicates matters for eukaryote genomes, the view of bacterial genomes as a linear series of closely spaced genes leads to the assumption that computational annotations which predict such arrangements completely describe the coding capacity of bacterial genomes. We undertook a proteomic study to identify proteins expressed by Pseudomonas fluorescens Pf0-1 from genes which were not predicted during the genome annotation. Mapping peptides to the Pf0-1 genome sequence identified sixteen non-annotated protein-coding regions, of which nine were antisense to predicted genes, six were intergenic, and one read in the same direction as an annotated gene but in a different frame. The expression of all but one of the newly discovered genes was verified by RT-PCR. Few clues as to the function of the new genes were gleaned from informatic analyses, but potential orthologues in other Pseudomonas genomes were identified for eight of the new genes. The 16 newly identified genes improve the quality of the Pf0-1 genome annotation, and the detection of antisense protein-coding genes indicates the under-appreciated complexity of bacterial genome organization.

  9. Detecting Coevolution of Functionally Related Proteins for Automated Protein Annotation

    E-print Network

    Stormo, Gary

    that proteins rarely act in isolation suggests an extended annotation approach where the function of a knownDetecting Coevolution of Functionally Related Proteins for Automated Protein Annotation Alan L St. Louis, Missouri stormo@ural.wustl.edu Abstract ­ Sequence similarity based protein clustering

  10. Lynx web services for annotations and systems analysis of multi-gene disorders

    PubMed Central

    Sulakhe, Dinanath; Taylor, Andrew; Balasubramanian, Sandhya; Feng, Bo; Xie, Bingqing; Börnigen, Daniela; Dave, Utpal J.; Foster, Ian T.; Gilliam, T. Conrad; Maltsev, Natalia

    2014-01-01

    Lynx is a web-based integrated systems biology platform that supports annotation and analysis of experimental data and generation of weighted hypotheses on molecular mechanisms contributing to human phenotypes and disorders of interest. Lynx has integrated multiple classes of biomedical data (genomic, proteomic, pathways, phenotypic, toxicogenomic, contextual and others) from various public databases as well as manually curated data from our group and collaborators (LynxKB). Lynx provides tools for gene list enrichment analysis using multiple functional annotations and network-based gene prioritization. Lynx provides access to the integrated database and the analytical tools via REST based Web Services (http://lynx.ci.uchicago.edu/webservices.html). This comprises data retrieval services for specific functional annotations, services to search across the complete LynxKB (powered by Lucene), and services to access the analytical tools built within the Lynx platform. PMID:24948611

  11. Lynx web services for annotations and systems analysis of multi-gene disorders.

    PubMed

    Sulakhe, Dinanath; Taylor, Andrew; Balasubramanian, Sandhya; Feng, Bo; Xie, Bingqing; Börnigen, Daniela; Dave, Utpal J; Foster, Ian T; Gilliam, T Conrad; Maltsev, Natalia

    2014-07-01

    Lynx is a web-based integrated systems biology platform that supports annotation and analysis of experimental data and generation of weighted hypotheses on molecular mechanisms contributing to human phenotypes and disorders of interest. Lynx has integrated multiple classes of biomedical data (genomic, proteomic, pathways, phenotypic, toxicogenomic, contextual and others) from various public databases as well as manually curated data from our group and collaborators (LynxKB). Lynx provides tools for gene list enrichment analysis using multiple functional annotations and network-based gene prioritization. Lynx provides access to the integrated database and the analytical tools via REST based Web Services (http://lynx.ci.uchicago.edu/webservices.html). This comprises data retrieval services for specific functional annotations, services to search across the complete LynxKB (powered by Lucene), and services to access the analytical tools built within the Lynx platform. PMID:24948611

  12. Draft Genome Sequence and Gene Annotation of Stemphylium lycopersici Strain CIDEFI-216

    PubMed Central

    Franco, Mario E. E.; López, Silvina; Medina, Rocio; Saparrat, Mario C. N.

    2015-01-01

    Stemphylium lycopersici is a plant-pathogenic fungus that is widely distributed throughout the world. In tomatoes, it is one of the etiological agents of gray leaf spot disease. Here, we report the first draft genome sequence of S. lycopersici, including its gene structure and functional annotation. PMID:26404600

  13. CATH FunFHMMer web server: protein functional annotations using functional family assignments

    PubMed Central

    Das, Sayoni; Sillitoe, Ian; Lee, David; Lees, Jonathan G.; Dawson, Natalie L.; Ward, John; Orengo, Christine A.

    2015-01-01

    The widening function annotation gap in protein databases and the increasing number and diversity of the proteins being sequenced presents new challenges to protein function prediction methods. Multidomain proteins complicate the protein sequence–structure–function relationship further as new combinations of domains can expand the functional repertoire, creating new proteins and functions. Here, we present the FunFHMMer web server, which provides Gene Ontology (GO) annotations for query protein sequences based on the functional classification of the domain-based CATH-Gene3D resource. Our server also provides valuable information for the prediction of functional sites. The predictive power of FunFHMMer has been validated on a set of 95 proteins where FunFHMMer performs better than BLAST, Pfam and CDD. Recent validation by an independent international competition ranks FunFHMMer as one of the top function prediction methods in predicting GO annotations for both the Biological Process and Molecular Function Ontology. The FunFHMMer web server is available at http://www.cathdb.info/search/by_funfhmmer. PMID:25964299

  14. Algal Functional Annotation Tool from the DOE-UCLA Institute for Genomics and Proteomics

    DOE Data Explorer

    Lopez, David

    The Algal Functional Annotation Tool is a bioinformatics resource to visualize pathway maps, identify enriched biological terms, or convert gene identifiers to elucidate biological function in silico. These types of analysis have been catered to support lists of gene identifiers, such as those coming from transcriptome gene expression analysis. By analyzing the functional annotation of an interesting set of genes, common biological motifs may be elucidated and a first-pass analysis can point further research in the right direction. Currently, the following databases have been parsed, processed, and added to the tool: 1( Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathways Database, 2) MetaCyc Encyclopedia of Metabolic Pathways, 3) Panther Pathways Database, 4) Reactome Pathways Database, 5) Gene Ontology, 6) MapMan Ontology, 7) KOG (Eukaryotic Clusters of Orthologous Groups), 5)Pfam, 6) InterPro.

  15. Functional annotation of introns in mitochondrial genome - a brief review.

    PubMed

    Anandakumar, Shanmugam; Ravindran, Suda Parimala; Shanmughavel, Piramanayagam

    2016-03-01

    The present study is to decipher the non-coding regions present in mitochondrial genomes that cause diseases in humans and predict their functional roles through comparative genomics approach followed by functional annotation of these segments. PMID:24845436

  16. Comparative analysis of functional metagenomic annotation and the mappability of short reads.

    PubMed

    Carr, Rogan; Borenstein, Elhanan

    2014-01-01

    To assess the functional capacities of microbial communities, including those inhabiting the human body, shotgun metagenomic reads are often aligned to a database of known genes. Such homology-based annotation practices critically rely on the assumption that short reads can map to orthologous genes of similar function. This assumption, however, and the various factors that impact short read annotation, have not been systematically evaluated. To address this challenge, we generated an extremely large database of simulated reads (totaling 15.9 Gb), spanning over 500,000 microbial genes and 170 curated genomes and including, for many genomes, every possible read of a given length. We annotated each read using common metagenomic protocols, fully characterizing the effect of read length, sequencing error, phylogeny, database coverage, and mapping parameters. We additionally rigorously quantified gene-, genome-, and protocol-specific annotation biases. Overall, our findings provide a first comprehensive evaluation of the capabilities and limitations of functional metagenomic annotation, providing crucial goal-specific best-practice guidelines to inform future metagenomic research. PMID:25148512

  17. ncFANs: a web server for functional annotation of long non-coding RNAs

    PubMed Central

    Liao, Qi; Xiao, Hui; Bu, Dechao; Xie, Chaoyong; Miao, Ruoyu; Luo, Haitao; Zhao, Guoguang; Yu, Kuntao; Zhao, Haitao; Skogerbø, Geir; Chen, Runsheng; Wu, Zhongdao; Liu, Changning; Zhao, Yi

    2011-01-01

    Recent interest in the non-coding transcriptome has resulted in the identification of large numbers of long non-coding RNAs (lncRNAs) in mammalian genomes, most of which have not been functionally characterized. Computational exploration of the potential functions of these lncRNAs will therefore facilitate further work in this field of research. We have developed a practical and user-friendly web interface called ncFANs (non-coding RNA Function ANnotation server), which is the first web service for functional annotation of human and mouse lncRNAs. On the basis of the re-annotated Affymetrix microarray data, ncFANs provides two alternative strategies for lncRNA functional annotation: one utilizing three aspects of a coding-non-coding gene co-expression (CNC) network, the other identifying condition-related differentially expressed lncRNAs. ncFANs introduces a highly efficient way of re-using the abundant pre-existing microarray data. The present version of ncFANs includes re-annotated CDF files for 10 human and mouse Affymetrix microarrays, and the server will be continuously updated with more re-annotated microarray platforms and lncRNA data. ncFANs is freely accessible at http://www.ebiomed.org/ncFANs/ or http://www.noncode.org/ncFANs/. PMID:21715382

  18. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools

    PubMed Central

    Lamesch, Philippe; Berardini, Tanya Z.; Li, Donghui; Swarbreck, David; Wilks, Christopher; Sasidharan, Rajkumar; Muller, Robert; Dreher, Kate; Alexander, Debbie L.; Garcia-Hernandez, Margarita; Karthikeyan, Athikkattuvalasu S.; Lee, Cynthia H.; Nelson, William D.; Ploetz, Larry; Singh, Shanker; Wensel, April; Huala, Eva

    2012-01-01

    The Arabidopsis Information Resource (TAIR, http://arabidopsis.org) is a genome database for Arabidopsis thaliana, an important reference organism for many fundamental aspects of biology as well as basic and applied plant biology research. TAIR serves as a central access point for Arabidopsis data, annotates gene function and expression patterns using controlled vocabulary terms, and maintains and updates the A. thaliana genome assembly and annotation. TAIR also provides researchers with an extensive set of visualization and analysis tools. Recent developments include several new genome releases (TAIR8, TAIR9 and TAIR10) in which the A. thaliana assembly was updated, pseudogenes and transposon genes were re-annotated, and new data from proteomics and next generation transcriptome sequencing were incorporated into gene models and splice variants. Other highlights include progress on functional annotation of the genome and the release of several new tools including Textpresso for Arabidopsis which provides the capability to carry out full text searches on a large body of research literature. PMID:22140109

  19. Bio301: A Web-Based EST Annotation Pipeline That Facilitates Functional Comparison Studies

    PubMed Central

    Chen, Yen-Chen; Chen, Yun-Ching; Lin, Wen-Dar; Hsiao, Chung-Der; Chiu, Hung-Wen; Ho, Jan-Ming

    2012-01-01

    In this postgenomic era, a huge volume of information derived from expressed sequence tags (ESTs) has been constructed for functional description of gene expression profiles. Comparative studies have become more and more important to researchers of biology. In order to facilitate these comparative studies, we have constructed a user-friendly EST annotation pipeline with comparison tools on an integrated EST service website, Bio301. Bio301 includes regular EST preprocessing, BLAST similarity search, gene ontology (GO) annotation, statistics reporting, a graphical GO browsing interface, and microarray probe selection tools. In addition, Bio301 is equipped with statistical library comparison functions using multiple EST libraries based on GO annotations for mining meaningful biological information. PMID:25969743

  20. Automated Eukaryotic Gene Structure Annotation Using EVidenceModeler and the Program to Assemble Spliced Alignments

    SciTech Connect

    Haas, B J; Salzberg, S L; Zhu, W; Pertea, M; Allen, J E; Orvis, J; White, O; Buell, C R; Wortman, J R

    2007-12-10

    EVidenceModeler (EVM) is presented as an automated eukaryotic gene structure annotation tool that reports eukaryotic gene structures as a weighted consensus of all available evidence. EVM, when combined with the Program to Assemble Spliced Alignments (PASA), yields a comprehensive, configurable annotation system that predicts protein-coding genes and alternatively spliced isoforms. Our experiments on both rice and human genome sequences demonstrate that EVM produces automated gene structure annotation approaching the quality of manual curation.

  1. A computational and experimental approach to validating annotations and gene predictions in

    E-print Network

    Yandell, Mark

    genome concluded that the annotated 13,659 genes in the 3.1 release likely constitute 95% of all proteinA computational and experimental approach to validating annotations and gene predictions-coding genes it contains remains a matter of debate; the number of computational gene predictions greatly

  2. COGNIZER: A Framework for Functional Annotation of Metagenomic Datasets

    PubMed Central

    Bose, Tungadri; Haque, Mohammed Monzoorul; Reddy, CVSK; Mande, Sharmila S.

    2015-01-01

    Background Recent advances in sequencing technologies have resulted in an unprecedented increase in the number of metagenomes that are being sequenced world-wide. Given their volume, functional annotation of metagenomic sequence datasets requires specialized computational tools/techniques. In spite of having high accuracy, existing stand-alone functional annotation tools necessitate end-users to perform compute-intensive homology searches of metagenomic datasets against "multiple" databases prior to functional analysis. Although, web-based functional annotation servers address to some extent the problem of availability of compute resources, uploading and analyzing huge volumes of sequence data on a shared public web-service has its own set of limitations. In this study, we present COGNIZER, a comprehensive stand-alone annotation framework which enables end-users to functionally annotate sequences constituting metagenomic datasets. The COGNIZER framework provides multiple workflow options. A subset of these options employs a novel directed-search strategy which helps in reducing the overall compute requirements for end-users. The COGNIZER framework includes a cross-mapping database that enables end-users to simultaneously derive/infer KEGG, Pfam, GO, and SEED subsystem information from the COG annotations. Results Validation experiments performed with real-world metagenomes and metatranscriptomes, generated using diverse sequencing technologies, indicate that the novel directed-search strategy employed in COGNIZER helps in reducing the compute requirements without significant loss in annotation accuracy. A comparison of COGNIZER's results with pre-computed benchmark values indicate the reliability of the cross-mapping database employed in COGNIZER. Conclusion The COGNIZER framework is capable of comprehensively annotating any metagenomic or metatranscriptomic dataset from varied sequencing platforms in functional terms. Multiple search options in COGNIZER provide end-users the flexibility of choosing a homology search protocol based on available compute resources. The cross-mapping database in COGNIZER is of high utility since it enables end-users to directly infer/derive KEGG, Pfam, GO, and SEED subsystem annotations from COG categorizations. Furthermore, availability of COGNIZER as a stand-alone scalable implementation is expected to make it a valuable annotation tool in the field of metagenomic research. Availability and Implementation A Linux implementation of COGNIZER is freely available for download from the following links: http://metagenomics.atc.tcs.com/cognizer, https://metagenomics.atc.tcs.com/function/cognizer. PMID:26561344

  3. Disentangling the Effects of Colocalizing Genomic Annotations to Functionally Prioritize Non-coding Variants within Complex-Trait Loci.

    PubMed

    Trynka, Gosia; Westra, Harm-Jan; Slowikowski, Kamil; Hu, Xinli; Xu, Han; Stranger, Barbara E; Klein, Robert J; Han, Buhm; Raychaudhuri, Soumya

    2015-07-01

    Identifying genomic annotations that differentiate causal from trait-associated variants is essential to fine mapping disease loci. Although many studies have identified non-coding functional annotations that overlap disease-associated variants, these annotations often colocalize, complicating the ability to use these annotations for fine mapping causal variation. We developed a statistical approach (Genomic Annotation Shifter [GoShifter]) to assess whether enriched annotations are able to prioritize causal variation. GoShifter defines the null distribution of an annotation overlapping an allele by locally shifting annotations; this approach is less sensitive to biases arising from local genomic structure than commonly used enrichment methods that depend on SNP matching. Local shifting also allows GoShifter to identify independent causal effects from colocalizing annotations. Using GoShifter, we confirmed that variants in expression quantitative trail loci drive gene-expression changes though DNase-I hypersensitive sites (DHSs) near transcription start sites and independently through 3' UTR regulation. We also showed that (1) 15%-36% of trait-associated loci map to DHSs independently of other annotations; (2) loci associated with breast cancer and rheumatoid arthritis harbor potentially causal variants near the summits of histone marks rather than full peak bodies; (3) variants associated with height are highly enriched in embryonic stem cell DHSs; and (4) we can effectively prioritize causal variation at specific loci. PMID:26140449

  4. RNAmmer: consistent and rapid annotation of ribosomal RNA genes

    PubMed Central

    Lagesen, Karin; Hallin, Peter; Rødland, Einar Andreas; Stærfeldt, Hans-Henrik; Rognes, Torbjørn; Ussery, David W.

    2007-01-01

    The publication of a complete genome sequence is usually accompanied by annotations of its genes. In contrast to protein coding genes, genes for ribosomal RNA (rRNA) are often poorly or inconsistently annotated. This makes comparative studies based on rRNA genes difficult. We have therefore created computational predictors for the major rRNA species from all kingdoms of life and compiled them into a program called RNAmmer. The program uses hidden Markov models trained on data from the 5S ribosomal RNA database and the European ribosomal RNA database project. A pre-screening step makes the method fast with little loss of sensitivity, enabling the analysis of a complete bacterial genome in less than a minute. Results from running RNAmmer on a large set of genomes indicate that the location of rRNAs can be predicted with a very high level of accuracy. Novel, unannotated rRNAs are also predicted in many genomes. The software as well as the genome analysis results are available at the CBS web server. PMID:17452365

  5. Pegasus: a comprehensive annotation and prediction tool for detection of driver gene fusions in cancer

    PubMed Central

    2014-01-01

    Background The extraordinary success of imatinib in the treatment of BCR-ABL1 associated cancers underscores the need to identify novel functional gene fusions in cancer. RNA sequencing offers a genome-wide view of expressed transcripts, uncovering biologically functional gene fusions. Although several bioinformatics tools are already available for the detection of putative fusion transcripts, candidate event lists are plagued with non-functional read-through events, reverse transcriptase template switching events, incorrect mapping, and other systematic errors. Such lists lack any indication of oncogenic relevance, and they are too large for exhaustive experimental validation. Results We have designed and implemented a pipeline, Pegasus, for the annotation and prediction of biologically functional gene fusion candidates. Pegasus provides a common interface for various gene fusion detection tools, reconstruction of novel fusion proteins, reading-frame-aware annotation of preserved/lost functional domains, and data-driven classification of oncogenic potential. Pegasus dramatically streamlines the search for oncogenic gene fusions, bridging the gap between raw RNA-Seq data and a final, tractable list of candidates for experimental validation. Conclusion We show the effectiveness of Pegasus in predicting new driver fusions in 176 RNA-Seq samples of glioblastoma multiforme (GBM) and 23 cases of anaplastic large cell lymphoma (ALCL). Contact: fa2306@columbia.edu. PMID:25183062

  6. Use of Gene Ontology Annotation to understand the peroxisome proteome in humans.

    PubMed

    Mutowo-Meullenet, Prudence; Huntley, Rachael P; Dimmer, Emily C; Alam-Faruque, Yasmin; Sawford, Tony; Jesus Martin, Maria; O'Donovan, Claire; Apweiler, Rolf

    2013-01-01

    The Gene Ontology (GO) is the de facto standard for the functional description of gene products, providing a consistent, information-rich terminology applicable across species and information repositories. The UniProt Consortium uses both manual and automatic GO annotation approaches to curate UniProt Knowledgebase (UniProtKB) entries. The selection of a protein set prioritized for manual annotation has implications for the characteristics of the information provided to users working in a specific field or interested in particular pathways or processes. In this article, we describe an organelle-focused, manual curation initiative targeting proteins from the human peroxisome. We discuss the steps taken to define the peroxisome proteome and the challenges encountered in defining the boundaries of this protein set. We illustrate with the use of examples how GO annotations now capture cell and tissue type information and the advantages that such an annotation approach provides to users. Database URL: http://www.ebi.ac.uk/GOA/ and http://www.uniprot.org. PMID:23327938

  7. Annotating non-coding transcription using functional genomics strategies

    PubMed Central

    Forrest, Alistair R. R.; Abdelhamid, Rehab F.

    2009-01-01

    Non-coding RNA (ncRNA) transcripts are RNA molecules that do not code for proteins, but elicit function by other mechanisms. The vast majority of RNA produced in a cell is non-coding ribosomal RNA, produced from relatively few loci, however more recently complementary DNA (cDNA) cloning, tag sequencing, and genome tiling array studies suggest that ncRNAs also account for the majority of RNA species produced by a cell. ncRNA based regulation has been referred to as a ‘hidden layer’ of signals or ‘dark matter’ that control gene expression in cellular processes by poorly described mechanisms. These terms have appeared as ncRNAs until recently have been ignored by expression profiling and cDNA annotation projects and their mode of action is diverse (e.g. influencing chromatin structure and epigenetics, translational silencing, transcriptional silencing). Here, we highlight recent functional genomics strategies toward identifying and assigning function to ncRNA transcription. PMID:19833699

  8. Logical Gene Ontology Annotations (GOAL): Exploring gene ontology annotations with OWL

    E-print Network

    2012-04-24

    with the appropriate property described above. For example, for the mitochondrion class in GO we create a new class called mitochondrial gene product as follows: Class: ’mitochondrion gene product’ EquivalentTo: ’gene product’ that is_located_in some ’mitochondrion...

  9. Improved gene ontology annotation for biofilm formation, filamentous growth, and phenotypic switching in Candida albicans.

    PubMed

    Inglis, Diane O; Skrzypek, Marek S; Arnaud, Martha B; Binkley, Jonathan; Shah, Prachi; Wymore, Farrell; Sherlock, Gavin

    2013-01-01

    The opportunistic fungal pathogen Candida albicans is a significant medical threat, especially for immunocompromised patients. Experimental research has focused on specific areas of C. albicans biology, with the goal of understanding the multiple factors that contribute to its pathogenic potential. Some of these factors include cell adhesion, invasive or filamentous growth, and the formation of drug-resistant biofilms. The Gene Ontology (GO) (www.geneontology.org) is a standardized vocabulary that the Candida Genome Database (CGD) (www.candidagenome.org) and other groups use to describe the functions of gene products. To improve the breadth and accuracy of pathogenicity-related gene product descriptions and to facilitate the description of as yet uncharacterized but potentially pathogenicity-related genes in Candida species, CGD undertook a three-part project: first, the addition of terms to the biological process branch of the GO to improve the description of fungus-related processes; second, manual recuration of gene product annotations in CGD to use the improved GO vocabulary; and third, computational ortholog-based transfer of GO annotations from experimentally characterized gene products, using these new terms, to uncharacterized orthologs in other Candida species. Through genome annotation and analysis, we identified candidate pathogenicity genes in seven non-C. albicans Candida species and in one additional C. albicans strain, WO-1. We also defined a set of C. albicans genes at the intersection of biofilm formation, filamentous growth, pathogenesis, and phenotypic switching of this opportunistic fungal pathogen, which provides a compelling list of candidates for further experimentation. PMID:23143685

  10. Biocuration of functional annotation at the European nucleotide archive

    PubMed Central

    Gibson, Richard; Alako, Blaise; Amid, Clara; Cerdeño-Tárraga, Ana; Cleland, Iain; Goodgame, Neil; ten Hoopen, Petra; Jayathilaka, Suran; Kay, Simon; Leinonen, Rasko; Liu, Xin; Pallreddy, Swapna; Pakseresht, Nima; Rajan, Jeena; Rosselló, Marc; Silvester, Nicole; Smirnov, Dmitriy; Toribio, Ana Luisa; Vaughan, Daniel; Zalunin, Vadim; Cochrane, Guy

    2016-01-01

    The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena) is a repository for the submission, maintenance and presentation of nucleotide sequence data and related sample and experimental information. In this article we report on ENA in 2015 regarding general activity, notable published data sets and major achievements. This is followed by a focus on sustainable biocuration of functional annotation, an area which has particularly felt the pressure of sequencing growth. The importance of functional annotation, how it can be submitted and the shifting role of the biocurator in the context of increasing volumes of data are all discussed. PMID:26615190

  11. Gene Model Annotations for Drosophila melanogaster: The Rule-Benders.

    PubMed

    Crosby, Madeline A; Gramates, L Sian; Dos Santos, Gilberto; Matthews, Beverley B; St Pierre, Susan E; Zhou, Pinglei; Schroeder, Andrew J; Falls, Kathleen; Emmert, David B; Russo, Susan M; Gelbart, William M

    2015-08-01

    In the context of the FlyBase annotated gene models in Drosophila melanogaster, we describe the many exceptional cases we have curated from the literature or identified in the course of FlyBase analysis. These range from atypical but common examples such as dicistronic and polycistronic transcripts, noncanonical splices, trans-spliced transcripts, noncanonical translation starts, and stop-codon readthroughs, to single exceptional cases such as ribosomal frameshifting and HAC1-type intron processing. In FlyBase, exceptional genes and transcripts are flagged with Sequence Ontology terms and/or standardized comments. Because some of the rule-benders create problems for handlers of high-throughput data, we discuss plans for flagging these cases in bulk data downloads. PMID:26109356

  12. Gene Model Annotations for Drosophila melanogaster: The Rule-Benders

    PubMed Central

    Crosby, Madeline A.; Gramates, L. Sian; dos Santos, Gilberto; Matthews, Beverley B.; St. Pierre, Susan E.; Zhou, Pinglei; Schroeder, Andrew J.; Falls, Kathleen; Emmert, David B.; Russo, Susan M.; Gelbart, William M.

    2015-01-01

    In the context of the FlyBase annotated gene models in Drosophila melanogaster, we describe the many exceptional cases we have curated from the literature or identified in the course of FlyBase analysis. These range from atypical but common examples such as dicistronic and polycistronic transcripts, noncanonical splices, trans-spliced transcripts, noncanonical translation starts, and stop-codon readthroughs, to single exceptional cases such as ribosomal frameshifting and HAC1-type intron processing. In FlyBase, exceptional genes and transcripts are flagged with Sequence Ontology terms and/or standardized comments. Because some of the rule-benders create problems for handlers of high-throughput data, we discuss plans for flagging these cases in bulk data downloads. PMID:26109356

  13. Integrating information retrieval with distant supervision for Gene Ontology annotation

    PubMed Central

    Zhu, Dongqing; Li, Dingcheng; Carterette, Ben; Liu, Hongfang

    2014-01-01

    This article describes our participation of the Gene Ontology Curation task (GO task) in BioCreative IV where we participated in both subtasks: A) identification of GO evidence sentences (GOESs) for relevant genes in full-text articles and B) prediction of GO terms for relevant genes in full-text articles. For subtask A, we trained a logistic regression model to detect GOES based on annotations in the training data supplemented with more noisy negatives from an external resource. Then, a greedy approach was applied to associate genes with sentences. For subtask B, we designed two types of systems: (i) search-based systems, which predict GO terms based on existing annotations for GOESs that are of different textual granularities (i.e., full-text articles, abstracts, and sentences) using state-of-the-art information retrieval techniques (i.e., a novel application of the idea of distant supervision) and (ii) a similarity-based system, which assigns GO terms based on the distance between words in sentences and GO terms/synonyms. Our best performing system for subtask A achieves an F1 score of 0.27 based on exact match and 0.387 allowing relaxed overlap match. Our best performing system for subtask B, a search-based system, achieves an F1 score of 0.075 based on exact match and 0.301 considering hierarchical matches. Our search-based systems for subtask B significantly outperformed the similarity-based system. Database URL: https://github.com/noname2020/Bioc PMID:25183856

  14. Drosophila Gene Expression Pattern Annotation through Multi-Instance Multi-Label Learning

    E-print Network

    Ji, Shuiwang

    Drosophila Gene Expression Pattern Annotation through Multi-Instance Multi-Label Learning Ying.kumar}@asu.edu Abstract The Berkeley Drosophila Genome Project (BDGP) has produced a large number of gene expression the annotation task. Empirical study shows that the proposed method outperforms the state-of-the-art Drosophila

  15. Comprehensive Functional Annotation of 77 Prostate Cancer Risk Loci

    PubMed Central

    Hazelett, Dennis J.; Rhie, Suhn Kyong; Gaddis, Malaina; Yan, Chunli; Lakeland, Daniel L.; Coetzee, Simon G.; Henderson, Brian E.; Noushmehr, Houtan; Cozen, Wendy; Kote-Jarai, Zsofia; Eeles, Rosalind A.; Easton, Douglas F.; Haiman, Christopher A.; Lu, Wange; Farnham, Peggy J.; Coetzee, Gerhard A.

    2014-01-01

    Genome-wide association studies (GWAS) have revolutionized the field of cancer genetics, but the causal links between increased genetic risk and onset/progression of disease processes remain to be identified. Here we report the first step in such an endeavor for prostate cancer. We provide a comprehensive annotation of the 77 known risk loci, based upon highly correlated variants in biologically relevant chromatin annotations— we identified 727 such potentially functional SNPs. We also provide a detailed account of possible protein disruption, microRNA target sequence disruption and regulatory response element disruption of all correlated SNPs at . 88% of the 727 SNPs fall within putative enhancers, and many alter critical residues in the response elements of transcription factors known to be involved in prostate biology. We define as risk enhancers those regions with enhancer chromatin biofeatures in prostate-derived cell lines with prostate-cancer correlated SNPs. To aid the identification of these enhancers, we performed genomewide ChIP-seq for H3K27-acetylation, a mark of actively engaged enhancers, as well as the transcription factor TCF7L2. We analyzed in depth three variants in risk enhancers, two of which show significantly altered androgen sensitivity in LNCaP cells. This includes rs4907792, that is in linkage disequilibrium () with an eQTL for NUDT11 (on the X chromosome) in prostate tissue, and rs10486567, the index SNP in intron 3 of the JAZF1 gene on chromosome 7. Rs4907792 is within a critical residue of a strong consensus androgen response element that is interrupted in the protective allele, resulting in a 56% decrease in its androgen sensitivity, whereas rs10486567 affects both NKX3-1 and FOXA-AR motifs where the risk allele results in a 39% increase in basal activity and a 28% fold-increase in androgen stimulated enhancer activity. Identification of such enhancer variants and their potential target genes represents a preliminary step in connecting risk to disease process. PMID:24497837

  16. miRDB: an online resource for microRNA target prediction and functional annotations

    PubMed Central

    Wong, Nathan; Wang, Xiaowei

    2015-01-01

    MicroRNAs (miRNAs) are small non-coding RNAs that are extensively involved in many physiological and disease processes. One major challenge in miRNA studies is the identification of genes regulated by miRNAs. To this end, we have developed an online resource, miRDB (http://mirdb.org), for miRNA target prediction and functional annotations. Here, we describe recently updated features of miRDB, including 2.1 million predicted gene targets regulated by 6709 miRNAs. In addition to presenting precompiled prediction data, a new feature is the web server interface that allows submission of user-provided sequences for miRNA target prediction. In this way, users have the flexibility to study any custom miRNAs or target genes of interest. Another major update of miRDB is related to functional miRNA annotations. Although thousands of miRNAs have been identified, many of the reported miRNAs are not likely to play active functional roles or may even have been falsely identified as miRNAs from high-throughput studies. To address this issue, we have performed combined computational analyses and literature mining, and identified 568 and 452 functional miRNAs in humans and mice, respectively. These miRNAs, as well as associated functional annotations, are presented in the FuncMir Collection in miRDB. PMID:25378301

  17. Analysis of the mouse transcriptome based on functional annotation of

    E-print Network

    Gough, Julian

    Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length c ........................................................................................................................................................................................................................... Only a small proportion of the mouse genome is transcribed into mature messenger RNA transcripts. There is an international collaborative effort to identify all full-length mRNA transcripts from the mouse, and to ensure

  18. Annotating Nucleic Acid-Binding Function Based on Protein Structure

    E-print Network

    Mandel-Gutfreund, Yael

    Annotating Nucleic Acid-Binding Function Based on Protein Structure Eric W. Stawiski1 , Lydia M an automated approach to predict nucleic-acid-binding (NA-binding) proteins, specifi- cally DNA positive electrostatic patches on their surfaces, but that do not bind nucleic acids. q 2003 Elsevier

  19. Assessment of protein set coherence using functional annotations

    PubMed Central

    Chagoyen, Monica; Carazo, Jose M; Pascual-Montano, Alberto

    2008-01-01

    Background Analysis of large-scale experimental datasets frequently produces one or more sets of proteins that are subsequently mined for functional interpretation and validation. To this end, a number of computational methods have been devised that rely on the analysis of functional annotations. Although current methods provide valuable information (e.g. significantly enriched annotations, pairwise functional similarities), they do not specifically measure the degree of homogeneity of a protein set. Results In this work we present a method that scores the degree of functional homogeneity, or coherence, of a set of proteins on the basis of the global similarity of their functional annotations. The method uses statistical hypothesis testing to assess the significance of the set in the context of the functional space of a reference set. As such, it can be used as a first step in the validation of sets expected to be homogeneous prior to further functional interpretation. Conclusion We evaluate our method by analysing known biologically relevant sets as well as random ones. The known relevant sets comprise macromolecular complexes, cellular components and pathways described for Saccharomyces cerevisiae, which are mostly significantly coherent. Finally, we illustrate the usefulness of our approach for validating 'functional modules' obtained from computational analysis of protein-protein interaction networks. Matlab code and supplementary data are available at PMID:18937846

  20. Optimizing high performance computing workflow for protein functional annotation.

    PubMed

    Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene

    2014-09-10

    Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optmized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data. PMID:25313296

  1. Optimizing high performance computing workflow for protein functional annotation

    PubMed Central

    Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene

    2014-01-01

    Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optmized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data. PMID:25313296

  2. Function annotation for pseudoknot using structure similarity.

    PubMed

    Chen, Qingfeng; Chen, Yi-Ping Phoebe

    2011-01-01

    Many raw biological sequence data have been generated by the human genome project and related efforts. The understanding of structural information encoded by biological sequences is important to acquire knowledge of their biochemical functions but remains a fundamental challenge. Recent interest in RNA regulation has resulted in a rapid growth of deposited RNA secondary structures in varied databases. However, a functional classification and characterization of the RNA structure have only been partially addressed. This article aims to introduce a novel interval-based distance metric for structure-based RNA function assignment. The characterization of RNA structures relies on distance vectors learned from a collection of predicted structures. The distance measure considers the intersected, disjoint, and inclusion between intervals. A set of RNA pseudoknotted structures with known function are applied and the function of the query structure is determined by measuring structure similarity. This not only offers sequence distance criteria to measure the similarity of secondary structures but also aids the functional classification of RNA structures with pesudoknots. PMID:21383413

  3. GOASVM: PROTEIN SUBCELLULAR LOCALIZATION PREDICTION BASED ON GENE ONTOLOGY ANNOTATION AND SVM

    E-print Network

    Mak, Man-Wai

    GOASVM: PROTEIN SUBCELLULAR LOCALIZATION PREDICTION BASED ON GENE ONTOLOGY ANNOTATION AND SVM. of Electrical Engineering Princeton University New Jersey, USA email: kung@princeton.edu ABSTRACT Protein subcellular localization is an essential step to annotate proteins and to design drugs. This paper proposes

  4. A Memetic Co-Clustering Algorithm for Gene Expression Profiles and Biological Annotation

    E-print Network

    Zell, Andreas

    A Memetic Co-Clustering Algorithm for Gene Expression Profiles and Biological Annotation Nora Speer by computers. The use of this additional annotation information promises to improve biological data analysis this method at least incorporates additional biological knowledge, it still is a sequential data analysis

  5. Anomaly-free Prediction of Gene Ontology Annotations using Bayesian Marco Tagliasacchi and Marco Masseroli

    E-print Network

    Tagliasacchi, Marco

    Anomaly-free Prediction of Gene Ontology Annotations using Bayesian Networks Marco Tagliasacchi through controlled terminologies and ontologies are paramount especially for the aim of inferring new annotations are hence ex- tremely useful. In this paper we leverage a previous work on the automatic

  6. Gene predictions and annotations Roderic Guig (Insitut Municipal d'Investigaci Mdica,

    E-print Network

    1 Chapter 17 Gene predictions and annotations Roderic Guigó (Insitut Municipal d.Q. (Cold Spring Harbor Laboratory, NY, USA) Table of contents 1. Introduction 2. Ab initio gene prediction a. Prediction of signals b. Prediction of exons c. Exon assembly into genes d. Gene Prediction

  7. Integrative structural annotation of de novo RNA-Seq provides an accurate reference gene set of the enormous genome of the onion (Allium cepa L.)

    PubMed Central

    Kim, Seungill; Kim, Myung-Shin; Kim, Yong-Min; Yeom, Seon-In; Cheong, Kyeongchae; Kim, Ki-Tae; Jeon, Jongbum; Kim, Sunggil; Kim, Do-Sun; Sohn, Seong-Han; Lee, Yong-Hwan; Choi, Doil

    2015-01-01

    The onion (Allium cepa L.) is one of the most widely cultivated and consumed vegetable crops in the world. Although a considerable amount of onion transcriptome data has been deposited into public databases, the sequences of the protein-coding genes are not accurate enough to be used, owing to non-coding sequences intermixed with the coding sequences. We generated a high-quality, annotated onion transcriptome from de novo sequence assembly and intensive structural annotation using the integrated structural gene annotation pipeline (ISGAP), which identified 54,165 protein-coding genes among 165,179 assembled transcripts totalling 203.0 Mb by eliminating the intron sequences. ISGAP performed reliable annotation, recognizing accurate gene structures based on reference proteins, and ab initio gene models of the assembled transcripts. Integrative functional annotation and gene-based SNP analysis revealed a whole biological repertoire of genes and transcriptomic variation in the onion. The method developed in this study provides a powerful tool for the construction of reference gene sets for organisms based solely on de novo transcriptome data. Furthermore, the reference genes and their variation described here for the onion represent essential tools for molecular breeding and gene cloning in Allium spp. PMID:25362073

  8. Integrative structural annotation of de novo RNA-Seq provides an accurate reference gene set of the enormous genome of the onion (Allium cepa L.).

    PubMed

    Kim, Seungill; Kim, Myung-Shin; Kim, Yong-Min; Yeom, Seon-In; Cheong, Kyeongchae; Kim, Ki-Tae; Jeon, Jongbum; Kim, Sunggil; Kim, Do-Sun; Sohn, Seong-Han; Lee, Yong-Hwan; Choi, Doil

    2015-02-01

    The onion (Allium cepa L.) is one of the most widely cultivated and consumed vegetable crops in the world. Although a considerable amount of onion transcriptome data has been deposited into public databases, the sequences of the protein-coding genes are not accurate enough to be used, owing to non-coding sequences intermixed with the coding sequences. We generated a high-quality, annotated onion transcriptome from de novo sequence assembly and intensive structural annotation using the integrated structural gene annotation pipeline (ISGAP), which identified 54,165 protein-coding genes among 165,179 assembled transcripts totalling 203.0 Mb by eliminating the intron sequences. ISGAP performed reliable annotation, recognizing accurate gene structures based on reference proteins, and ab initio gene models of the assembled transcripts. Integrative functional annotation and gene-based SNP analysis revealed a whole biological repertoire of genes and transcriptomic variation in the onion. The method developed in this study provides a powerful tool for the construction of reference gene sets for organisms based solely on de novo transcriptome data. Furthermore, the reference genes and their variation described here for the onion represent essential tools for molecular breeding and gene cloning in Allium spp. PMID:25362073

  9. GOblet: annotation of anonymous sequence data with gene ontology and pathway terms.

    PubMed

    Groth, Detlef; Hartmann, Stefanie; Panopoulou, Georgia; Poustka, Albert J; Hennig, Steffen

    2008-01-01

    The functional annotation of genomic data has become a major task for the ever-growing number of sequencing projects. In order to address this challenge, we recently developed GOblet, a free web service for the annotation of anonymous sequences with Gene Ontology (GO) terms. However, to overcome limitations of the GO terminology, and to aid in understanding not only single components but as well systemic interactions between the individual components, we have now extended the GOblet web service to integrate also pathway annotations. Furthermore, we extended and upgraded the data analysis pipeline with improved summaries, and added term enrichment and clustering algorithms. Finally, we are now making GOblet available as a stand-alone application for high-throughput processing on local machines. The advantages of this frequently requested feature is that a) the user can avoid restrictions of our web service for uploading and processing large amounts of data, and that b) confidential data can be analysed without insecure transfer to a public web server. The stand-alone version of the web service has been implemented using platform independent Tcl-scripts, which can be run with just a single runtime file utilizing the Starkit technology. The GOblet web service and the stand-alone application are freely available at http://goblet.molgen.mpg.de. PMID:20134064

  10. RefSeq curation and annotation of antizyme and antizyme inhibitor genes in vertebrates

    PubMed Central

    Rajput, Bhanu; Murphy, Terence D.; Pruitt, Kim D.

    2015-01-01

    Polyamines are ubiquitous cations that are involved in regulating fundamental cellular processes such as cell growth and proliferation; hence, their intracellular concentration is tightly regulated. Antizyme and antizyme inhibitor have a central role in maintaining cellular polyamine levels. Antizyme is unique in that it is expressed via a novel programmed ribosomal frameshifting mechanism. Conventional computational tools are unable to predict a programmed frameshift, resulting in misannotation of antizyme transcripts and proteins on transcript and genomic sequences. Correct annotation of a programmed frameshifting event requires manual evaluation. Our goal was to provide an accurately curated and annotated Reference Sequence (RefSeq) data set of antizyme transcript and protein records across a broad taxonomic scope that would serve as standards for accurate representation of these gene products. As antizyme and antizyme inhibitor proteins are functionally connected, we also curated antizyme inhibitor genes to more fully represent the elegant biology of polyamine regulation. Manual review of genes for three members of the antizyme family and two members of the antizyme inhibitor family in 91 vertebrate organisms resulted in a total of 461 curated RefSeq records. PMID:26170238

  11. Initiating the mollusk genomics annotation community: toward creating the complete curated gene-set of the Japanese Pearl Oyster, Pinctada fucata.

    PubMed

    Kawashima, Takeshi; Takeuchi, Takeshi; Koyanagi, Ryo; Kinoshita, Shigeharu; Endo, Hirotoshi; Endo, Kazuyoshi

    2013-10-01

    The genome sequence of the Japanese pearl oyster, the first draft genome from a mollusk, was published in February 2012. In order to curate the draft genome assemblies and annotate the predicted gene models, two annotation Jamborees were held in Okinawa and Tokyo. To date, 761 genes have been surveyed and curated. A preparatory meeting and a debriefing were held at the Misaki Marine Biological Station before and after the Jamborees. These four events, in conjunction with the sequence-decoding project, have facilitated the first series of gene annotations. Genome annotators among the Jamboree participants added 22 functional categories to the annotation system to date. Of these, 17 are included in Generic Gene Ontology. The other five categories are specific to molluskan biology, such as "Byssus Formation" and "Shell Formation", including Biomineralization and Acidic Proteins. A total of 731 genes from our latest version of gene models are annotated and classified into these 22 categories. The resulting data will serve as a useful reference for future genomic analyses of this species as well as comparative analyses among mollusks. PMID:24125643

  12. Mining GO annotations for improving annotation consistency.

    PubMed

    Faria, Daniel; Schlicker, Andreas; Pesquita, Catia; Bastos, Hugo; Ferreira, António E N; Albrecht, Mario; Falcão, André O

    2012-01-01

    Despite the structure and objectivity provided by the Gene Ontology (GO), the annotation of proteins is a complex task that is subject to errors and inconsistencies. Electronically inferred annotations in particular are widely considered unreliable. However, given that manual curation of all GO annotations is unfeasible, it is imperative to improve the quality of electronically inferred annotations. In this work, we analyze the full GO molecular function annotation of UniProtKB proteins, and discuss some of the issues that affect their quality, focusing particularly on the lack of annotation consistency. Based on our analysis, we estimate that 64% of the UniProtKB proteins are incompletely annotated, and that inconsistent annotations affect 83% of the protein functions and at least 23% of the proteins. Additionally, we present and evaluate a data mining algorithm, based on the association rule learning methodology, for identifying implicit relationships between molecular function terms. The goal of this algorithm is to assist GO curators in updating GO and correcting and preventing inconsistent annotations. Our algorithm predicted 501 relationships with an estimated precision of 94%, whereas the basic association rule learning methodology predicted 12,352 relationships with a precision below 9%. PMID:22848383

  13. Gene annotation and drug target discovery in Candida albicans with a tagged transposon mutant collection.

    PubMed

    Oh, Julia; Fung, Eula; Schlecht, Ulrich; Davis, Ronald W; Giaever, Guri; St Onge, Robert P; Deutschbauer, Adam; Nislow, Corey

    2010-01-01

    Candida albicans is the most common human fungal pathogen, causing infections that can be lethal in immunocompromised patients. Although Saccharomyces cerevisiae has been used as a model for C. albicans, it lacks C. albicans' diverse morphogenic forms and is primarily non-pathogenic. Comprehensive genetic analyses that have been instrumental for determining gene function in S. cerevisiae are hampered in C. albicans, due in part to limited resources to systematically assay phenotypes of loss-of-function alleles. Here, we constructed and screened a library of 3633 tagged heterozygous transposon disruption mutants, using them in a competitive growth assay to examine nutrient- and drug-dependent haploinsufficiency. We identified 269 genes that were haploinsufficient in four growth conditions, the majority of which were condition-specific. These screens identified two new genes necessary for filamentous growth as well as ten genes that function in essential processes. We also screened 57 chemically diverse compounds that more potently inhibited growth of C. albicans versus S. cerevisiae. For four of these compounds, we examined the genetic basis of this differential inhibition. Notably, Sec7p was identified as the target of brefeldin A in C. albicans screens, while S. cerevisiae screens with this compound failed to identify this target. We also uncovered a new C. albicans-specific target, Tfp1p, for the synthetic compound 0136-0228. These results highlight the value of haploinsufficiency screens directly in this pathogen for gene annotation and drug target identification. PMID:20949076

  14. A Syngeneic Variance Library for Functional Annotation of Human Variation: Application to BRCA2

    PubMed Central

    Hucl, Tomas; Rago, Carlo; Gallmeier, Eike; Brody, Jonathan R.; Gorospe, Myriam; Kern, Scott E.

    2008-01-01

    The enormous scope of natural human genetic variation is now becoming defined. To accurately annotate these variants, and to identify those with clinical importance, is often difficult to assess through functional assays. We explored systematic annotation by using homologous recombination to modify a native gene in hemizygous (wt/?exon) human cancer cells, generating a novel syngeneic variance library (SyVaL). We created a SyVaL of BRCA2 variants: nondeleterious, proposed deleterious, deleterious, and of uncertain significance. We found that the null states BRCA2?ex11/?ex11 and BRCA2?ex11/?3308X were deleterious as assessed by a loss of RAD51 focus formation on genotoxic damage and by acquisition of toxic hypersensitivity to mitomycin C and etoposide, whereas BRCA2?ex11/Y3308Y, BRCA2?ex11/P3292L, and BRCA2?ex11/P3280H had wild-type function. A proposed phosphorylation site at codon 3291 affecting function was confirmed by substitution of an acidic residue (glutamate, BRCA2?ex11/S3291E) for the native serine, but in contrast to a prior report, phosphorylation was dispensable (alanine, BRCA2?ex11/S3291A) for BRCA2-governed cellular phenotypes. These results show that SyVaLs offer a means to comprehensively annotate gene function, facilitating numerical and unambiguous readouts. SyVaLs may be especially useful for genes in which functional assays using exogenous expression are toxic or otherwise unreliable. They also offer a stable, distributable cellular resource for further research. PMID:18593900

  15. Genome Wide Re-Annotation of Caldicellulosiruptor saccharolyticus with New Insights into Genes Involved in Biomass Degradation and Hydrogen Production

    PubMed Central

    Chowdhary, Nupoor; Selvaraj, Ashok; KrishnaKumaar, Lakshmi; Kumar, Gopal Ramesh

    2015-01-01

    Caldicellulosiruptor saccharolyticus has proven itself to be an excellent candidate for biological hydrogen (H2) production, but still it has major drawbacks like sensitivity to high osmotic pressure and low volumetric H2 productivity, which should be considered before it can be used industrially. A whole genome re-annotation work has been carried out as an attempt to update the incomplete genome information that causes gap in the knowledge especially in the area of metabolic engineering, to improve the H2 producing capabilities of C. saccharolyticus. Whole genome re-annotation was performed through manual means for 2,682 Coding Sequences (CDSs). Bioinformatics tools based on sequence similarity, motif search, phylogenetic analysis and fold recognition were employed for re-annotation. Our methodology could successfully add functions for 409 hypothetical proteins (HPs), 46 proteins previously annotated as putative and assigned more accurate functions for the known protein sequences. Homology based gene annotation has been used as a standard method for assigning function to novel proteins, but over the past few years many non-homology based methods such as genomic context approaches for protein function prediction have been developed. Using non-homology based functional prediction methods, we were able to assign cellular processes or physical complexes for 249 hypothetical sequences. Our re-annotation pipeline highlights the addition of 231 new CDSs generated from MicroScope Platform, to the original genome with functional prediction for 49 of them. The re-annotation of HPs and new CDSs is stored in the relational database that is available on the MicroScope web-based platform. In parallel, a comparative genome analyses were performed among the members of genus Caldicellulosiruptor to understand the function and evolutionary processes. Further, with results from integrated re-annotation studies (homology and genomic context approach), we strongly suggest that Csac_0437 and Csac_0424 encode for glycoside hydrolases (GH) and are proposed to be involved in the decomposition of recalcitrant plant polysaccharides. Similarly, HPs: Csac_0732, Csac_1862, Csac_1294 and Csac_0668 are suggested to play a significant role in biohydrogen production. Function prediction of these HPs by using our integrated approach will considerably enhance the interpretation of large-scale experiments targeting this industrially important organism. PMID:26196387

  16. Large-scale collection and annotation of gene models for date palm (Phoenix dactylifera, L.).

    PubMed

    Zhang, Guangyu; Pan, Linlin; Yin, Yuxin; Liu, Wanfei; Huang, Dawei; Zhang, Tongwu; Wang, Lei; Xin, Chengqi; Lin, Qiang; Sun, Gaoyuan; Ba Abdullah, Mohammed M; Zhang, Xiaowei; Hu, Songnian; Al-Mssallem, Ibrahim S; Yu, Jun

    2012-08-01

    The date palm (Phoenix dactylifera L.), famed for its sugar-rich fruits (dates) and cultivated by humans since 4,000 B.C., is an economically important crop in the Middle East, Northern Africa, and increasingly other places where climates are suitable. Despite a long history of human cultivation, the understanding of P. dactylifera genetics and molecular biology are rather limited, hindered by lack of basic data in high quality from genomics and transcriptomics. Here we report a large-scale effort in generating gene models (assembled expressed sequence tags or ESTs and mapped to a genome assembly) for P. dactylifera, using the long-read pyrosequencing platform (Roche/454 GS FLX Titanium) in high coverage. We built fourteen cDNA libraries from different P. dactylifera tissues (cultivar Khalas) and acquired 15,778,993 raw sequencing reads-about one million sequencing reads per library-and the pooled sequences were assembled into 67,651 non-redundant contigs and 301,978 singletons. We annotated 52,725 contigs based on the plant databases and 45 contigs based on functional domains referencing to the Pfam database. From the annotated contigs, we assigned GO (Gene Ontology) terms to 36,086 contigs and KEGG pathways to 7,032 contigs. Our comparative analysis showed that 70.6 % (47,930), 69.4 % (47,089), 68.4 % (46,441), and 69.3 % (47,048) of the P. dactylifera gene models are shared with rice, sorghum, Arabidopsis, and grapevine, respectively. We also assigned our gene models into house-keeping and tissue-specific genes based on their tissue specificity. PMID:22736259

  17. High-resolution functional annotation of human transcriptome: predicting isoform functions by a novel multiple instance-based label propagation method

    PubMed Central

    Li, Wenyuan; Kang, Shuli; Liu, Chun-Chi; Zhang, Shihua; Shi, Yi; Liu, Yan; Zhou, Xianghong Jasmine

    2014-01-01

    Alternative transcript processing is an important mechanism for generating functional diversity in genes. However, little is known about the precise functions of individual isoforms. In fact, proteins (translated from transcript isoforms), not genes, are the function carriers. By integrating multiple human RNA-seq data sets, we carried out the first systematic prediction of isoform functions, enabling high-resolution functional annotation of human transcriptome. Unlike gene function prediction, isoform function prediction faces a unique challenge: the lack of the training data—all known functional annotations are at the gene level. To address this challenge, we modelled the gene–isoform relationships as multiple instance data and developed a novel label propagation method to predict functions. Our method achieved an average area under the receiver operating characteristic curve of 0.67 and assigned functions to 15 572 isoforms. Interestingly, we observed that different functions have different sensitivities to alternative isoform processing, and that the function diversity of isoforms from the same gene is positively correlated with their tissue expression diversity. Finally, we surveyed the literature to validate our predictions for a number of apoptotic genes. Strikingly, for the famous ‘TP53’ gene, we not only accurately identified the apoptosis regulation function of its five isoforms, but also correctly predicted the precise direction of the regulation. PMID:24369432

  18. Cloning, analysis and functional annotation of expressed sequence tags from the Earthworm Eisenia fetida

    PubMed Central

    Pirooznia, Mehdi; Gong, Ping; Guan, Xin; Inouye, Laura S; Yang, Kuan; Perkins, Edward J; Deng, Youping

    2007-01-01

    Background Eisenia fetida, commonly known as red wiggler or compost worm, belongs to the Lumbricidae family of the Annelida phylum. Little is known about its genome sequence although it has been extensively used as a test organism in terrestrial ecotoxicology. In order to understand its gene expression response to environmental contaminants, we cloned 4032 cDNAs or expressed sequence tags (ESTs) from two E. fetida libraries enriched with genes responsive to ten ordnance related compounds using suppressive subtractive hybridization-PCR. Results A total of 3144 good quality ESTs (GenBank dbEST accession number EH669363–EH672369 and EL515444–EL515580) were obtained from the raw clone sequences after cleaning. Clustering analysis yielded 2231 unique sequences including 448 contigs (from 1361 ESTs) and 1783 singletons. Comparative genomic analysis showed that 743 or 33% of the unique sequences shared high similarity with existing genes in the GenBank nr database. Provisional function annotation assigned 830 Gene Ontology terms to 517 unique sequences based on their homology with the annotated genomes of four model organisms Drosophila melanogaster, Mus musculus, Saccharomyces cerevisiae, and Caenorhabditis elegans. Seven percent of the unique sequences were further mapped to 99 Kyoto Encyclopedia of Genes and Genomes pathways based on their matching Enzyme Commission numbers. All the information is stored and retrievable at a highly performed, web-based and user-friendly relational database called EST model database or ESTMD version 2. Conclusion The ESTMD containing the sequence and annotation information of 4032 E. fetida ESTs is publicly accessible at . PMID:18047730

  19. Synergistic use of plant-prokaryote comparative genomics for functional annotations

    PubMed Central

    2011-01-01

    Background Identifying functions for all gene products in all sequenced organisms is a central challenge of the post-genomic era. However, at least 30-50% of the proteins encoded by any given genome are of unknown or vaguely known function, and a large number are wrongly annotated. Many of these ‘unknown’ proteins are common to prokaryotes and plants. We set out to predict and experimentally test the functions of such proteins. Our approach to functional prediction integrates comparative genomics based mainly on microbial genomes with functional genomic data from model microorganisms and post-genomic data from plants. This approach bridges the gap between automated homology-based annotations and the classical gene discovery efforts of experimentalists, and is more powerful than purely computational approaches to identifying gene-function associations. Results Among Arabidopsis genes, we focused on those (2,325 in total) that (i) are unique or belong to families with no more than three members, (ii) occur in prokaryotes, and (iii) have unknown or poorly known functions. Computer-assisted selection of promising targets for deeper analysis was based on homology-independent characteristics associated in the SEED database with the prokaryotic members of each family. In-depth comparative genomic analysis was performed for 360 top candidate families. From this pool, 78 families were connected to general areas of metabolism and, of these families, specific functional predictions were made for 41. Twenty-one predicted functions have been experimentally tested or are currently under investigation by our group in at least one prokaryotic organism (nine of them have been validated, four invalidated, and eight are in progress). Ten additional predictions have been independently validated by other groups. Discovering the function of very widespread but hitherto enigmatic proteins such as the YrdC or YgfZ families illustrates the power of our approach. Conclusions Our approach correctly predicted functions for 19 uncharacterized protein families from plants and prokaryotes; none of these functions had previously been correctly predicted by computational methods. The resulting annotations could be propagated with confidence to over six thousand homologous proteins encoded in over 900 bacterial, archaeal, and eukaryotic genomes currently available in public databases. PMID:21810204

  20. Combining heterogeneous data sources for accurate functional annotation of proteins

    PubMed Central

    2013-01-01

    Combining heterogeneous sources of data is essential for accurate prediction of protein function. The task is complicated by the fact that while sequence-based features can be readily compared across species, most other data are species-specific. In this paper, we present a multi-view extension to GOstruct, a structured-output framework for function annotation of proteins. The extended framework can learn from disparate data sources, with each data source provided to the framework in the form of a kernel. Our empirical results demonstrate that the multi-view framework is able to utilize all available information, yielding better performance than sequence-based models trained across species and models trained from collections of data within a given species. This version of GOstruct participated in the recent Critical Assessment of Functional Annotations (CAFA) challenge; since then we have significantly improved the natural language processing component of the method, which now provides performance that is on par with that provided by sequence information. The GOstruct framework is available for download at http://strut.sourceforge.net. PMID:23514123

  1. Functional Annotation of Putative Regulatory Elements at Cancer Susceptibility Loci

    PubMed Central

    Rosse, Stephanie A; Auer, Paul L; Carlson, Christopher S

    2014-01-01

    Most cancer-associated genetic variants identified from genome-wide association studies (GWAS) do not obviously change protein structure, leading to the hypothesis that the associations are attributable to regulatory polymorphisms. Translating genetic associations into mechanistic insights can be facilitated by knowledge of the causal regulatory variant (or variants) responsible for the statistical signal. Experimental validation of candidate functional variants is onerous, making bioinformatic approaches necessary to prioritize candidates for laboratory analysis. Thus, a systematic approach for recognizing functional (and, therefore, likely causal) variants in noncoding regions is an important step toward interpreting cancer risk loci. This review provides a detailed introduction to current regulatory variant annotations, followed by an overview of how to leverage these resources to prioritize candidate functional polymorphisms in regulatory regions. PMID:25288875

  2. Image Annotation Using Multi-label Correlated Green's Function Hua Wang, Heng Huang, Chris Ding

    E-print Network

    Huang, Heng

    Image Annotation Using Multi-label Correlated Green's Function Hua Wang, Heng Huang, Chris Ding frame- work. A novel multi-label correlated Green's function ap- proach is proposed to annotate images over a graph. The correlations among labels are integrated into the objective function which improves

  3. Insect Innate Immunity Database (IIID): An Annotation Tool for Identifying Immune Genes in Insect Genomes

    E-print Network

    Bordenstein, Seth

    Insect Innate Immunity Database (IIID): An Annotation Tool for Identifying Immune Genes in Insect The innate immune system is an ancient component of host defense. Since innate immunity pathways are well conserved throughout many eukaryotes, immune genes in model animals can be used to putatively identify

  4. Re-Annotation of Protein-Coding Genes in 10 Complete Genomes of Neisseriaceae Family by Combining Similarity-Based and Composition-Based Methods

    PubMed Central

    Guo, Feng-Biao; Xiong, Lifeng; Teng, Jade L. L.; Yuen, Kwok-Yung; Lau, Susanna K. P.; Woo, Patrick C. Y.

    2013-01-01

    In this paper, we performed a comprehensive re-annotation of protein-coding genes by a systematic method combining composition- and similarity-based approaches in 10 complete bacterial genomes of the family Neisseriaceae. First, 418 hypothetical genes were predicted as non-coding using the composition-based method and 413 were eliminated from the gene list. Both the scatter plot and cluster of orthologous groups (COG) fraction analyses supported the result. Second, from 20 to 400 hypothetical proteins were assigned with functions in each of the 10 strains based on the homology search. Among newly assigned functions, 397 are so detailed to have definite gene names. Third, 106 genes missed by the original annotations were picked up by an ab initio gene finder combined with similarity alignment. Transcriptional experiments validated the effectiveness of this method in Laribacter hongkongensis and Chromobacterium violaceum. Among the 106 newly found genes, some deserve particular interests. For example, 27 transposases were newly found in Neiserria meningitidis alpha14. In Neiserria gonorrhoeae NCCP11945, four new genes with putative functions and definite names (nusG, rpsN, rpmD and infA) were found and homologues of them usually are essential for survival in bacteria. The updated annotations for the 10 Neisseriaceae genomes provide a more accurate prediction of protein-coding genes and a more detailed functional information of hypothetical proteins. It will benefit research into the lifestyle, metabolism, environmental adaption and pathogenicity of the Neisseriaceae species. The re-annotation procedure could be used directly, or after the adaption of detailed methods, for checking annotations of any other bacterial or archaeal genomes. PMID:23571676

  5. Mining locus tags in PubMed Central to improve microbial gene annotation

    PubMed Central

    2014-01-01

    Background The scientific literature contains millions of microbial gene identifiers within the full text and tables, but these annotations rarely get incorporated into public sequence databases. We propose to utilize the Open Access (OA) subset of PubMed Central (PMC) as a gene annotation database and have developed an R package called pmcXML to automatically mine and extract locus tags from full text, tables and supplements. Results We mined locus tags from 1835 OA publications in ten microbial genomes and extracted tags mentioned in 30,891 sentences in main text and 20,489 rows in tables. We identified locus tag pairs marking the start and end of a region such as an operon or genomic island and expanded these ranges to add another 13,043 tags. We also searched for locus tags in supplementary tables and publications outside the OA subset in Burkholderia pseudomallei K96243 for comparison. There were 168 publications containing 48,470 locus tags and 83% of mentions were from supplementary materials and 9% from publications outside the OA subset. Conclusions B. pseudomallei locus tags within the full text and tables of OA publications represent only a small fraction of the total mentions in the literature. For microbial genomes with very few functionally characterized proteins, the locus tags mentioned in supplementary tables and within ranges like genomic islands contain the majority of locus tags. Significantly, the functions in the R package provide access to additional resources in the OA subset that are not currently indexed or returned by searching PMC. PMID:24499370

  6. PANDA: pathway and annotation explorer for visualizing and interpreting gene-centric data.

    PubMed

    Hart, Steven N; Moore, Raymond M; Zimmermann, Michael T; Oliver, Gavin R; Egan, Jan B; Bryce, Alan H; Kocher, Jean-Pierre A

    2015-01-01

    Objective. Bringing together genomics, transcriptomics, proteomics, and other -omics technologies is an important step towards developing highly personalized medicine. However, instrumentation has advances far beyond expectations and now we are able to generate data faster than it can be interpreted. Materials and Methods. We have developed PANDA (Pathway AND Annotation) Explorer, a visualization tool that integrates gene-level annotation in the context of biological pathways to help interpret complex data from disparate sources. PANDA is a web-based application that displays data in the context of well-studied pathways like KEGG, BioCarta, and PharmGKB. PANDA represents data/annotations as icons in the graph while maintaining the other data elements (i.e., other columns for the table of annotations). Custom pathways from underrepresented diseases can be imported when existing data sources are inadequate. PANDA also allows sharing annotations among collaborators. Results. In our first use case, we show how easy it is to view supplemental data from a manuscript in the context of a user's own data. Another use-case is provided describing how PANDA was leveraged to design a treatment strategy from the somatic variants found in the tumor of a patient with metastatic sarcomatoid renal cell carcinoma. Conclusion. PANDA facilitates the interpretation of gene-centric annotations by visually integrating this information with context of biological pathways. The application can be downloaded or used directly from our website: http://bioinformaticstools.mayo.edu/research/panda-viewer/. PMID:26038725

  7. Rice DB: an Oryza Information Portal linking annotation, subcellular location, function, expression, regulation, and evolutionary information for rice and Arabidopsis

    PubMed Central

    Narsai, Reena; Devenish, James; Castleden, Ian; Narsai, Kabir; Xu, Lin; Shou, Huixia; Whelan, James

    2013-01-01

    Omics research in Oryza sativa (rice) relies on the use of multiple databases to obtain different types of information to define gene function. We present Rice DB, an Oryza information portal that is a functional genomics database, linking gene loci to comprehensive annotations, expression data and the subcellular location of encoded proteins. Rice DB has been designed to integrate the direct comparison of rice with Arabidopsis (Arabidopsis thaliana), based on orthology or ‘expressology’, thus using and combining available information from two pre-eminent plant models. To establish Rice DB, gene identifiers (more than 40 types) and annotations from a variety of sources were compiled, functional information based on large-scale and individual studies was manually collated, hundreds of microarrays were analysed to generate expression annotations, and the occurrences of potential functional regulatory motifs in promoter regions were calculated. A range of computational subcellular localization predictions were also run for all putative proteins encoded in the rice genome, and experimentally confirmed protein localizations have been collated, curated and linked to functional studies in rice. A single search box allows anything from gene identifiers (for rice and/or Arabidopsis), motif sequences, subcellular location, to keyword searches to be entered, with the capability of Boolean searches (such as AND/OR). To demonstrate the utility of Rice DB, several examples are presented including a rice mitochondrial proteome, which draws on a variety of sources for subcellular location data within Rice DB. Comparisons of subcellular location, functional annotations, as well as transcript expression in parallel with Arabidopsis reveals examples of conservation between rice and Arabidopsis, using Rice DB (http://ricedb.plantenergy.uwa.edu.au). PMID:24147765

  8. Improved white spruce (Picea glauca) genome assemblies and annotation of large gene families of conifer terpenoid and phenolic defense metabolism.

    PubMed

    Warren, René L; Keeling, Christopher I; Yuen, Macaire Man Saint; Raymond, Anthony; Taylor, Greg A; Vandervalk, Benjamin P; Mohamadi, Hamid; Paulino, Daniel; Chiu, Readman; Jackman, Shaun D; Robertson, Gordon; Yang, Chen; Boyle, Brian; Hoffmann, Margarete; Weigel, Detlef; Nelson, David R; Ritland, Carol; Isabel, Nathalie; Jaquish, Barry; Yanchuk, Alvin; Bousquet, Jean; Jones, Steven J M; MacKay, John; Birol, Inanc; Bohlmann, Joerg

    2015-07-01

    White spruce (Picea glauca), a gymnosperm tree, has been established as one of the models for conifer genomics. We describe the draft genome assemblies of two white spruce genotypes, PG29 and WS77111, innovative tools for the assembly of very large genomes, and the conifer genomics resources developed in this process. The two white spruce genotypes originate from distant geographic regions of western (PG29) and eastern (WS77111) North America, and represent elite trees in two Canadian tree-breeding programs. We present an update (V3 and V4) for a previously reported PG29 V2 draft genome assembly and introduce a second white spruce genome assembly for genotype WS77111. Assemblies of the PG29 and WS77111 genomes confirm the reconstructed white spruce genome size in the 20 Gbp range, and show broad synteny. Using the PG29 V3 assembly and additional white spruce genomics and transcriptomics resources, we performed MAKER-P annotation and meticulous expert annotation of very large gene families of conifer defense metabolism, the terpene synthases and cytochrome P450s. We also comprehensively annotated the white spruce mevalonate, methylerythritol phosphate and phenylpropanoid pathways. These analyses highlighted the large extent of gene and pseudogene duplications in a conifer genome, in particular for genes of secondary (i.e. specialized) metabolism, and the potential for gain and loss of function for defense and adaptation. PMID:26017574

  9. Identification and computational annotation of genes differentially expressed in pulp development of Cocos nucifera L. by suppression subtractive hybridization

    PubMed Central

    2014-01-01

    Background Coconut (Cocos nucifera L.) is one of the world’s most versatile, economically important tropical crops. Little is known about the physiological and molecular basis of coconut pulp (endosperm) development and only a few coconut genes and gene product sequences are available in public databases. This study identified genes that were differentially expressed during development of coconut pulp and functionally annotated these identified genes using bioinformatics analysis. Results Pulp from three different coconut developmental stages was collected. Four suppression subtractive hybridization (SSH) libraries were constructed (forward and reverse libraries A and B between stages 1 and 2, and C and D between stages 2 and 3), and identified sequences were computationally annotated using Blast2GO software. A total of 1272 clones were obtained for analysis from four SSH libraries with 63% showing similarity to known proteins. Pairwise comparing of stage-specific gene ontology ids from libraries B-D, A-C, B-C and A-D showed that 32 genes were continuously upregulated and seven downregulated; 28 were transiently upregulated and 23 downregulated. KEGG (Kyoto Encyclopedia of Genes and Genomes) analysis showed that 1-acyl-sn-glycerol-3-phosphate acyltransferase (LPAAT), phospholipase D, acetyl-CoA carboxylase carboxyltransferase beta subunit, 3-hydroxyisobutyryl-CoA hydrolase-like and pyruvate dehydrogenase E1 ? subunit were associated with fatty acid biosynthesis or metabolism. Triose phosphate isomerase, cellulose synthase and glucan 1,3-?-glucosidase were related to carbohydrate metabolism, and phosphoenolpyruvate carboxylase was related to both fatty acid and carbohydrate metabolism. Of 737 unigenes, 103 encoded enzymes were involved in fatty acid and carbohydrate biosynthesis and metabolism, and a number of transcription factors and other interesting genes with stage-specific expression were confirmed by real-time PCR, with validation of the SSH results as high as 66.6%. Based on determination of coconut endosperm fatty acids content by gas chromatography–mass spectrometry, a number of candidate genes in fatty acid anabolism were selected for further study. Conclusion Functional annotation of genes differentially expressed in coconut pulp development helped determine the molecular basis of coconut endosperm development. The SSH method identified genes related to fatty acids, carbohydrate and secondary metabolites. The results will be important for understanding gene functions and regulatory networks in coconut fruit. PMID:25084812

  10. Multi-Trait GWAS and New Candidate Genes Annotation for Growth Curve Parameters in Brahman Cattle

    PubMed Central

    Crispim, Aline Camporez; Kelly, Matthew John; Guimarães, Simone Eliza Facioni; e Silva, Fabyano Fonseca; Fortes, Marina Rufino Salinas; Wenceslau, Raphael Rocha; Moore, Stephen

    2015-01-01

    Understanding the genetic architecture of beef cattle growth cannot be limited simply to the genome-wide association study (GWAS) for body weight at any specific ages, but should be extended to a more general purpose by considering the whole growth trajectory over time using a growth curve approach. For such an approach, the parameters that are used to describe growth curves were treated as phenotypes under a GWAS model. Data from 1,255 Brahman cattle that were weighed at birth, 6, 12, 15, 18, and 24 months of age were analyzed. Parameter estimates, such as mature weight (A) and maturity rate (K) from nonlinear models are utilized as substitutes for the original body weights for the GWAS analysis. We chose the best nonlinear model to describe the weight-age data, and the estimated parameters were used as phenotypes in a multi-trait GWAS. Our aims were to identify and characterize associated SNP markers to indicate SNP-derived candidate genes and annotate their function as related to growth processes in beef cattle. The Brody model presented the best goodness of fit, and the heritability values for the parameter estimates for mature weight (A) and maturity rate (K) were 0.23 and 0.32, respectively, proving that these traits can be a feasible alternative when the objective is to change the shape of growth curves within genetic improvement programs. The genetic correlation between A and K was -0.84, indicating that animals with lower mature body weights reached that weight at younger ages. One hundred and sixty seven (167) and two hundred and sixty two (262) significant SNPs were associated with A and K, respectively. The annotated genes closest to the most significant SNPs for A had direct biological functions related to muscle development (RAB28), myogenic induction (BTG1), fetal growth (IL2), and body weights (APEX2); K genes were functionally associated with body weight, body height, average daily gain (TMEM18), and skeletal muscle development (SMN1). Candidate genes emerging from this GWAS may inform the search for causative mutations that could underpin genomic breeding for improved growth rates. PMID:26445451

  11. RNA-Seq Analysis of Quercus pubescens Leaves: De Novo Transcriptome Assembly, Annotation and Functional Markers Development

    PubMed Central

    Torre, Sara; Tattini, Massimiliano; Brunetti, Cecilia; Fineschi, Silvia; Fini, Alessio; Ferrini, Francesco; Sebastiani, Federico

    2014-01-01

    Quercus pubescens Willd., a species distributed from Spain to southwest Asia, ranks high for drought tolerance among European oaks. Q. pubescens performs a role of outstanding significance in most Mediterranean forest ecosystems, but few mechanistic studies have been conducted to explore its response to environmental constrains, due to the lack of genomic resources. In our study, we performed a deep transcriptomic sequencing in Q. pubescens leaves, including de novo assembly, functional annotation and the identification of new molecular markers. Our results are a pre-requisite for undertaking molecular functional studies, and may give support in population and association genetic studies. 254,265,700 clean reads were generated by the Illumina HiSeq 2000 platform, with an average length of 98 bp. De novo assembly, using CLC Genomics, produced 96,006 contigs, having a mean length of 618 bp. Sequence similarity analyses against seven public databases (Uniprot, NR, RefSeq and KOGs at NCBI, Pfam, InterPro and KEGG) resulted in 83,065 transcripts annotated with gene descriptions, conserved protein domains, or gene ontology terms. These annotations and local BLAST allowed identify genes specifically associated with mechanisms of drought avoidance. Finally, 14,202 microsatellite markers and 18,425 single nucleotide polymorphisms (SNPs) were, in silico, discovered in assembled and annotated sequences. We completed a successful global analysis of the Q. pubescens leaf transcriptome using RNA-seq. The assembled and annotated sequences together with newly discovered molecular markers provide genomic information for functional genomic studies in Q. pubescens, with special emphasis to response mechanisms to severe constrain of the Mediterranean climate. Our tools enable comparative genomics studies on other Quercus species taking advantage of large intra-specific ecophysiological differences. PMID:25393112

  12. RNA-seq analysis of Quercus pubescens Leaves: de novo transcriptome assembly, annotation and functional markers development.

    PubMed

    Torre, Sara; Tattini, Massimiliano; Brunetti, Cecilia; Fineschi, Silvia; Fini, Alessio; Ferrini, Francesco; Sebastiani, Federico

    2014-01-01

    Quercus pubescens Willd., a species distributed from Spain to southwest Asia, ranks high for drought tolerance among European oaks. Q. pubescens performs a role of outstanding significance in most Mediterranean forest ecosystems, but few mechanistic studies have been conducted to explore its response to environmental constrains, due to the lack of genomic resources. In our study, we performed a deep transcriptomic sequencing in Q. pubescens leaves, including de novo assembly, functional annotation and the identification of new molecular markers. Our results are a pre-requisite for undertaking molecular functional studies, and may give support in population and association genetic studies. 254,265,700 clean reads were generated by the Illumina HiSeq 2000 platform, with an average length of 98 bp. De novo assembly, using CLC Genomics, produced 96,006 contigs, having a mean length of 618 bp. Sequence similarity analyses against seven public databases (Uniprot, NR, RefSeq and KOGs at NCBI, Pfam, InterPro and KEGG) resulted in 83,065 transcripts annotated with gene descriptions, conserved protein domains, or gene ontology terms. These annotations and local BLAST allowed identify genes specifically associated with mechanisms of drought avoidance. Finally, 14,202 microsatellite markers and 18,425 single nucleotide polymorphisms (SNPs) were, in silico, discovered in assembled and annotated sequences. We completed a successful global analysis of the Q. pubescens leaf transcriptome using RNA-seq. The assembled and annotated sequences together with newly discovered molecular markers provide genomic information for functional genomic studies in Q. pubescens, with special emphasis to response mechanisms to severe constrain of the Mediterranean climate. Our tools enable comparative genomics studies on other Quercus species taking advantage of large intra-specific ecophysiological differences. PMID:25393112

  13. Gene function prediction using labeled and unlabeled data

    PubMed Central

    Zhao, Xing-Ming; Wang, Yong; Chen, Luonan; Aihara, Kazuyuki

    2008-01-01

    Background In general, gene function prediction can be formalized as a classification problem based on machine learning technique. Usually, both labeled positive and negative samples are needed to train the classifier. For the problem of gene function prediction, however, the available information is only about positive samples. In other words, we know which genes have the function of interested, while it is generally unclear which genes do not have the function, i.e. the negative samples. If all the genes outside of the target functional family are seen as negative samples, the imbalanced problem will arise because there are only a relatively small number of genes annotated in each family. Furthermore, the classifier may be degraded by the false negatives in the heuristically generated negative samples. Results In this paper, we present a new technique, namely Annotating Genes with Positive Samples (AGPS), for defining negative samples in gene function prediction. With the defined negative samples, it is straightforward to predict the functions of unknown genes. In addition, the AGPS algorithm is able to integrate various kinds of data sources to predict gene functions in a reliable and accurate manner. With the one-class and two-class Support Vector Machines as the core learning algorithm, the AGPS algorithm shows good performances for function prediction on yeast genes. Conclusion We proposed a new method for defining negative samples in gene function prediction. Experimental results on yeast genes show that AGPS yields good performances on both training and test sets. In addition, the overlapping between prediction results and GO annotations on unknown genes also demonstrates the effectiveness of the proposed method. PMID:18221567

  14. COMBREX: a project to accelerate the functional annotation of prokaryotic genomes

    E-print Network

    Roberts, Richard J.

    COMBREX (http://combrex.bu.edu) is a project to increase the speed of the functional annotation of new bacterial and archaeal genomes. It consists of a database of functional predictions produced by computational biologists ...

  15. The Physalis peruviana leaf transcriptome: assembly, annotation and gene model prediction

    PubMed Central

    2012-01-01

    Background Physalis peruviana commonly known as Cape gooseberry is a member of the Solanaceae family that has an increasing popularity due to its nutritional and medicinal values. A broad range of genomic tools is available for other Solanaceae, including tomato and potato. However, limited genomic resources are currently available for Cape gooseberry. Results We report the generation of a total of 652,614 P. peruviana Expressed Sequence Tags (ESTs), using 454 GS FLX Titanium technology. ESTs, with an average length of 371?bp, were obtained from a normalized leaf cDNA library prepared using a Colombian commercial variety. De novo assembling was performed to generate a collection of 24,014 isotigs and 110,921 singletons, with an average length of 1,638?bp and 354?bp, respectively. Functional annotation was performed using NCBI’s BLAST tools and Blast2GO, which identified putative functions for 21,191 assembled sequences, including gene families involved in all the major biological processes and molecular functions as well as defense response and amino acid metabolism pathways. Gene model predictions in P. peruviana were obtained by using the genomes of Solanum lycopersicum (tomato) and Solanum tuberosum (potato). We predict 9,436 P. peruviana sequences with multiple-exon models and conserved intron positions with respect to the potato and tomato genomes. Additionally, to study species diversity we developed 5,971 SSR markers from assembled ESTs. Conclusions We present the first comprehensive analysis of the Physalis peruviana leaf transcriptome, which will provide valuable resources for development of genetic tools in the species. Assembled transcripts with gene models could serve as potential candidates for marker discovery with a variety of applications including: functional diversity, conservation and improvement to increase productivity and fruit quality. P. peruviana was estimated to be phylogenetically branched out before the divergence of five other Solanaceae family members, S. lycopersicum, S. tuberosum, Capsicum spp, S. melongena and Petunia spp. PMID:22533342

  16. USING REASONING TO GUIDE ANNOTATION WITH GENE ONTOLOGY TERMS IN GOAT

    E-print Network

    Stevens, Robert

    USING REASONING TO GUIDE ANNOTATION WITH GENE ONTOLOGY TERMS IN GOAT MICHAEL BADA 1 , DANIELE TURI with a DAML+OIL version of GO and an instance store of mined GO-term-to-GO-term associations, GOAT aims to aid

  17. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences.

    PubMed

    Huerta-Cepas, Jaime; Szklarczyk, Damian; Forslund, Kristoffer; Cook, Helen; Heller, Davide; Walter, Mathias C; Rattei, Thomas; Mende, Daniel R; Sunagawa, Shinichi; Kuhn, Michael; Jensen, Lars Juhl; von Mering, Christian; Bork, Peer

    2016-01-01

    eggNOG is a public resource that provides Orthologous Groups (OGs) of proteins at different taxonomic levels, each with integrated and summarized functional annotations. Developments since the latest public release include changes to the algorithm for creating OGs across taxonomic levels, making nested groups hierarchically consistent. This allows for a better propagation of functional terms across nested OGs and led to the novel annotation of 95 890 previously uncharacterized OGs, increasing overall annotation coverage from 67% to 72%. The functional annotations of OGs have been expanded to also provide Gene Ontology terms, KEGG pathways and SMART/Pfam domains for each group. Moreover, eggNOG now provides pairwise orthology relationships within OGs based on analysis of phylogenetic trees. We have also incorporated a framework for quickly mapping novel sequences to OGs based on precomputed HMM profiles. Finally, eggNOG version 4.5 incorporates a novel data set spanning 2605 viral OGs, covering 5228 proteins from 352 viral proteomes. All data are accessible for bulk downloading, as a web-service, and through a completely redesigned web interface. The new access points provide faster searches and a number of new browsing and visualization capabilities, facilitating the needs of both experts and less experienced users. eggNOG v4.5 is available at http://eggnog.embl.de. PMID:26582926

  18. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences

    PubMed Central

    Huerta-Cepas, Jaime; Szklarczyk, Damian; Forslund, Kristoffer; Cook, Helen; Heller, Davide; Walter, Mathias C.; Rattei, Thomas; Mende, Daniel R.; Sunagawa, Shinichi; Kuhn, Michael; Jensen, Lars Juhl; von Mering, Christian; Bork, Peer

    2016-01-01

    eggNOG is a public resource that provides Orthologous Groups (OGs) of proteins at different taxonomic levels, each with integrated and summarized functional annotations. Developments since the latest public release include changes to the algorithm for creating OGs across taxonomic levels, making nested groups hierarchically consistent. This allows for a better propagation of functional terms across nested OGs and led to the novel annotation of 95 890 previously uncharacterized OGs, increasing overall annotation coverage from 67% to 72%. The functional annotations of OGs have been expanded to also provide Gene Ontology terms, KEGG pathways and SMART/Pfam domains for each group. Moreover, eggNOG now provides pairwise orthology relationships within OGs based on analysis of phylogenetic trees. We have also incorporated a framework for quickly mapping novel sequences to OGs based on precomputed HMM profiles. Finally, eggNOG version 4.5 incorporates a novel data set spanning 2605 viral OGs, covering 5228 proteins from 352 viral proteomes. All data are accessible for bulk downloading, as a web-service, and through a completely redesigned web interface. The new access points provide faster searches and a number of new browsing and visualization capabilities, facilitating the needs of both experts and less experienced users. eggNOG v4.5 is available at http://eggnog.embl.de. PMID:26582926

  19. In Silico Structural and Functional Annotation of Hypothetical Proteins of Vibrio cholerae O139

    PubMed Central

    Islam, Md. Saiful; Shahik, Shah Md.; Sohel, Md.; Patwary, Noman I. A.

    2015-01-01

    In developing countries threat of cholera is a significant health concern whenever water purification and sewage disposal systems are inadequate. Vibrio cholerae is one of the responsible bacteria involved in cholera disease. The complete genome sequence of V. cholerae deciphers the presence of various genes and hypothetical proteins whose function are not yet understood. Hence analyzing and annotating the structure and function of hypothetical proteins is important for understanding the V. cholerae. V. cholerae O139 is the most common and pathogenic bacterial strain among various V. cholerae strains. In this study sequence of six hypothetical proteins of V. cholerae O139 has been annotated from NCBI. Various computational tools and databases have been used to determine domain family, protein-protein interaction, solubility of protein, ligand binding sites etc. The three dimensional structure of two proteins were modeled and their ligand binding sites were identified. We have found domains and families of only one protein. The analysis revealed that these proteins might have antibiotic resistance activity, DNA breaking-rejoining activity, integrase enzyme activity, restriction endonuclease, etc. Structural prediction of these proteins and detection of binding sites from this study would indicate a potential target aiding docking studies for therapeutic designing against cholera. PMID:26175663

  20. Structural and Functional Annotation of the Porcine Immunome

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The domestic pig is known as an excellent model for human immunology and the two species share many pathogens. Susceptibility to infectious disease is one of the major constraints on swine performance, yet the structure and function of genes comprising the pig immunome are not well-characterized. H...

  1. Likelihood-based gene annotations for gap filling and quality assessment in genome-scale metabolic models

    SciTech Connect

    Benedict, Matthew N.; Mundy, Michael B.; Henry, Christopher S.; Chia, Nicholas; Price, Nathan D.; Maranas, Costas D.

    2014-10-16

    Genome-scale metabolic models provide a powerful means to harness information from genomes to deepen biological insights. With exponentially increasing sequencing capacity, there is an enormous need for automated reconstruction techniques that can provide more accurate models in a short time frame. Current methods for automated metabolic network reconstruction rely on gene and reaction annotations to build draft metabolic networks and algorithms to fill gaps in these networks. However, automated reconstruction is hampered by database inconsistencies, incorrect annotations, and gap filling largely without considering genomic information. Here we develop an approach for applying genomic information to predict alternative functions for genes and estimate their likelihoods from sequence homology. We show that computed likelihood values were significantly higher for annotations found in manually curated metabolic networks than those that were not. We then apply these alternative functional predictions to estimate reaction likelihoods, which are used in a new gap filling approach called likelihood-based gap filling to predict more genomically consistent solutions. To validate the likelihood-based gap filling approach, we applied it to models where essential pathways were removed, finding that likelihood-based gap filling identified more biologically relevant solutions than parsimony-based gap filling approaches. We also demonstrate that models gap filled using likelihood-based gap filling provide greater coverage and genomic consistency with metabolic gene functions compared to parsimony-based approaches. Interestingly, despite these findings, we found that likelihoods did not significantly affect consistency of gap filled models with Biolog and knockout lethality data. This indicates that the phenotype data alone cannot necessarily be used to discriminate between alternative solutions for gap filling and therefore, that the use of other information is necessary to obtain a more accurate network. All described workflows are implemented as part of the DOE Systems Biology Knowledgebase (KBase) and are publicly available via API or command-line web interface.

  2. Comparison of assembly algorithms for improving rate of metatranscriptomic functional annotation

    PubMed Central

    2014-01-01

    Background Microbiome-wide gene expression profiling through high-throughput RNA sequencing (‘metatranscriptomics’) offers a powerful means to functionally interrogate complex microbial communities. Key to successful exploitation of these datasets is the ability to confidently match relatively short sequence reads to known bacterial transcripts. In the absence of reference genomes, such annotation efforts may be enhanced by assembling reads into longer contiguous sequences (‘contigs’), prior to database search strategies. Since reads from homologous transcripts may derive from several species, represented at different abundance levels, it is not clear how well current assembly pipelines perform for metatranscriptomic datasets. Here we evaluate the performance of four currently employed assemblers including de novo transcriptome assemblers - Trinity and Oases; the metagenomic assembler - Metavelvet; and the recently developed metatranscriptomic assembler IDBA-MT. Results We evaluated the performance of the assemblers on a previously published dataset of single-end RNA sequence reads derived from the large intestine of an inbred non-obese diabetic mouse model of type 1 diabetes. We found that Trinity performed best as judged by contigs assembled, reads assigned to contigs, and number of reads that could be annotated to a known bacterial transcript. Only 15.5% of RNA sequence reads could be annotated to a known transcript in contrast to 50.3% with Trinity assembly. Paired-end reads generated from the same mouse samples resulted in modest performance gains. A database search estimated that the assemblies are unlikely to erroneously merge multiple unrelated genes sharing a region of similarity (<2% of contigs). A simulated dataset based on ten species confirmed these findings. A more complex simulated dataset based on 72 species found that greater assembly errors were introduced than is expected by sequencing quality. Through the detailed evaluation of assembly performance, the insights provided by this study will help drive the design of future metatranscriptomic analyses. Conclusion Assembly of metatranscriptome datasets greatly improved read annotation. Of the four assemblers evaluated, Trinity provided the best performance. For more complex datasets, reads generated from transcripts sharing considerable sequence similarity can be a source of significant assembly error, suggesting a need to collate reads on the basis of common taxonomic origin prior to assembly. PMID:25411636

  3. Annotation and comparative analysis of the glycoside hydrolase genes in Brachypodium distachyon

    SciTech Connect

    Tyler, Ludmila; Bragg, Jennifer; Wu, Jiajie; Yang, Xiaohan; Tuskan, Gerald A; Vogel, John

    2010-01-01

    Background Glycoside hydrolases cleave the bond between a carbohydrate and another carbohydrate, a protein, lipid or other moiety. Genes encoding glycoside hydrolases are found in a wide range of organisms, from archea to animals, and are relatively abundant in plant genomes. In plants, these enzymes are involved in diverse processes, including starch metabolism, defense, and cell-wall remodeling. Glycoside hydrolase genes have been previously cataloged for Oryza sativa (rice), the model dicotyledonous plant Arabidopsis thaliana, and the fast-growing tree Populus trichocarpa (poplar). To improve our understanding of glycoside hydrolases in plants generally and in grasses specifically, we annotated the glycoside hydrolase genes in the grasses Brachypodium distachyon (an emerging monocotyledonous model) and Sorghum bicolor (sorghum). We then compared the glycoside hydrolases across species, both at the whole-genome level and at the level of individual glycoside hydrolase families. Results We identified 356 glycoside hydrolase genes in Brachypodium and 404 in sorghum. The corresponding proteins fell into the same 34 families that are represented in rice, Arabidopsis, and poplar, helping to define a glycoside hydrolase family profile which may be common to flowering plants. Examination of individual glycoside hydrolase familes (GH5, GH13, GH18, GH19, GH28, and GH51) revealed both similarities and distinctions between monocots and dicots, as well as between species. Shared evolutionary histories appear to be modified by lineage-specific expansions or deletions. Within families, the Brachypodium and sorghum proteins generally cluster with those from other monocots. Conclusions This work provides the foundation for further comparative and functional analyses of plant glycoside hydrolases. Defining the Brachypodium glycoside hydrolases sets the stage for Brachypodium to be a monocot model for investigations of these enzymes and their diverse roles in planta. Insights gained from Brachypodium will inform translational research studies, with applications for the improvement of cereal crops and bioenergy grasses.

  4. ANNOTATION OF TRIBOLIUM CUTICLE PROTEIN AND PERITROPHIN GENES

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The recently completed genome sequence of the hard-bodied beetle, Tribolium, could reveal new insights into genetic mechanisms for chitin and cuticle production in pest insects. The genome sequence is being "mined" for cuticle genes using a combination of automated and manual gene-finding procedure...

  5. Comparative Analysis of Chloroplast Genomes: Functional Annotation, Genome-Based Phylogeny, and Deduced Evolutionary Patterns

    PubMed Central

    Rivas, Javier De Las; Lozano, Juan Jose; Ortiz, Angel R.

    2002-01-01

    All protein sequences from 19 complete chloroplast genomes (cpDNA) have been studied using a new computational method able to analyze functional correlations among series of protein sequences contained in complete proteomes. First, all open reading frames (ORFs) from the cpDNAs, comprising a total of 2266 protein sequences, were compared against the 3168 proteins from Synechocystis PCC6803 complete genome to find functionally related orthologous proteins. Additionally, all cpDNA genomes were pairwise compared to find orthologous groups not present in cyanobacteria. Annotations in the cluster of othologous proteins database and CyanoBase were used as reference for the functional assignments. Following this protocol, new functional assignments were made for ORFs of unknown function and for ycfs (hypothetical chloroplast frames), which still lack a functional assignment. Using this information, a matrix of functional relationships was derived from profiles of the presence and/or absence of orthologous proteins; the matrix included 1837 proteins in 277 orthologous clusters. A factor analysis study of this matrix, followed by cluster analysis, allowed us to obtain accurate phylogenetic reconstructions and the detection of genes probably involved in speciation as phylogenetic correlates. Finally, by grouping common evolutionary patterns, we show that it is possible to determine functionally linked protein networks. This has allowed us to suggest putative associations for some unknown ORFs. PMID:11932241

  6. DBH2H: vertebrate head-to-head gene pairs annotated at genomic and post-genomic levels

    PubMed Central

    Yu, Hui; Yu, Fu-Dong; Zhang, Guo-Qing; Shen, Xiang; Chen, Yun-Qin; Li, Yuan-Yuan; Li, Yi-Xue

    2009-01-01

    DBH2H collects head-to-head (h2h) gene pairs identified from human, mouse, rat, chicken and fugu genomes, and distinguishes the ortholog mapping relationship among them. The gene pairs in DBH2H are annotated with sequential features including single nucleotide polymorphisms, CpG islands and transcription factor binding sites, as well as functional terms and genetic disorders. In addition, the expression correlation information based on 117 microarray datasets is included. By providing user-friendly access to these data, DBH2H represents a valuable resource for further analyses of this important gene arrangement in terms of transcriptional regulation mechanisms, evolutionary conservation, disease relevance, etc. Database URL: http://lifecenter.sgst.cn/h2h/ PMID:20157479

  7. TriAnnot: A Versatile and High Performance Pipeline for the Automated Annotation of Plant Genomes

    PubMed Central

    Leroy, Philippe; Guilhot, Nicolas; Sakai, Hiroaki; Bernard, Aurélien; Choulet, Frédéric; Theil, Sébastien; Reboux, Sébastien; Amano, Naoki; Flutre, Timothée; Pelegrin, Céline; Ohyanagi, Hajime; Seidel, Michael; Giacomoni, Franck; Reichstadt, Mathieu; Alaux, Michael; Gicquello, Emmanuelle; Legeai, Fabrice; Cerutti, Lorenzo; Numa, Hisataka; Tanaka, Tsuyoshi; Mayer, Klaus; Itoh, Takeshi; Quesneville, Hadi; Feuillet, Catherine

    2012-01-01

    In support of the international effort to obtain a reference sequence of the bread wheat genome and to provide plant communities dealing with large and complex genomes with a versatile, easy-to-use online automated tool for annotation, we have developed the TriAnnot pipeline. Its modular architecture allows for the annotation and masking of transposable elements, the structural, and functional annotation of protein-coding genes with an evidence-based quality indexing, and the identification of conserved non-coding sequences and molecular markers. The TriAnnot pipeline is parallelized on a 712 CPU computing cluster that can run a 1-Gb sequence annotation in less than 5?days. It is accessible through a web interface for small scale analyses or through a server for large scale annotations. The performance of TriAnnot was evaluated in terms of sensitivity, specificity, and general fitness using curated reference sequence sets from rice and wheat. In less than 8?h, TriAnnot was able to predict more than 83% of the 3,748 CDS from rice chromosome 1 with a fitness of 67.4%. On a set of 12 reference Mb-sized contigs from wheat chromosome 3B, TriAnnot predicted and annotated 93.3% of the genes among which 54% were perfectly identified in accordance with the reference annotation. It also allowed the curation of 12 genes based on new biological evidences, increasing the percentage of perfect gene prediction to 63%. TriAnnot systematically showed a higher fitness than other annotation pipelines that are not improved for wheat. As it is easily adaptable to the annotation of other plant genomes, TriAnnot should become a useful resource for the annotation of large and complex genomes in the future. PMID:22645565

  8. Evading the annotation bottleneck: using sequence similarity to search non-sequence gene data

    PubMed Central

    Gilchrist, Michael J; Christensen, Mikkel B; Harland, Richard; Pollet, Nicolas; Smith, James C; Ueno, Naoto; Papalopulu, Nancy

    2008-01-01

    Background Non-sequence gene data (images, literature, etc.) can be found in many different public databases. Access to these data is mostly by text based methods using gene names; however, gene annotation is neither complete, nor fully systematic between organisms, and is also not generally stable over time. This provides some challenges for text based access, especially for cross-species searches. We propose a method for non-sequence data retrieval based on sequence similarity, which removes dependence on annotation and text searches. This work was motivated by the need to provide better access to large numbers of in situ images, and the observation that such image data were usually associated with a specific gene sequence. Sequence similarity searches are found in existing gene oriented databases, but mostly give indirect access to non-sequence data via navigational links. Results Three applications were built to explore the proposed method: accessing image data, literature and gene names. Searches are initiated with the sequence of the user's gene of interest, which is searched against a database of sequences associated with the target data. The matching (non-sequence) target data are returned directly to the user's browser, organised by sequence similarity. The method worked well for the intended application in image data management. Comparison with text based searches of the image data set showed the accuracy of the method. Applied to literature searches it facilitated retrieval of mostly high relevance references. Applied to gene name data it provided a useful analysis of name variation of related genes within and between species. Conclusion This method makes a powerful and useful addition to existing methods for searching gene data based on text retrieval or curated gene lists. In particular the method facilitates cross-species comparisons, and enables the handling of novel or otherwise un-annotated genes. Applications using the method are quick and easy to build, and the data require little maintenance. This approach largely circumvents the need for annotation, which can be a major obstacle to the development of genomic scale data resources. PMID:18928517

  9. Likelihood-based gene annotations for gap filling and quality assessment in genome-scale metabolic models

    DOE PAGESBeta

    Benedict, Matthew N.; Mundy, Michael B.; Henry, Christopher S.; Chia, Nicholas; Price, Nathan D.; Maranas, Costas D.

    2014-10-16

    Genome-scale metabolic models provide a powerful means to harness information from genomes to deepen biological insights. With exponentially increasing sequencing capacity, there is an enormous need for automated reconstruction techniques that can provide more accurate models in a short time frame. Current methods for automated metabolic network reconstruction rely on gene and reaction annotations to build draft metabolic networks and algorithms to fill gaps in these networks. However, automated reconstruction is hampered by database inconsistencies, incorrect annotations, and gap filling largely without considering genomic information. Here we develop an approach for applying genomic information to predict alternative functions for genesmore »and estimate their likelihoods from sequence homology. We show that computed likelihood values were significantly higher for annotations found in manually curated metabolic networks than those that were not. We then apply these alternative functional predictions to estimate reaction likelihoods, which are used in a new gap filling approach called likelihood-based gap filling to predict more genomically consistent solutions. To validate the likelihood-based gap filling approach, we applied it to models where essential pathways were removed, finding that likelihood-based gap filling identified more biologically relevant solutions than parsimony-based gap filling approaches. We also demonstrate that models gap filled using likelihood-based gap filling provide greater coverage and genomic consistency with metabolic gene functions compared to parsimony-based approaches. Interestingly, despite these findings, we found that likelihoods did not significantly affect consistency of gap filled models with Biolog and knockout lethality data. This indicates that the phenotype data alone cannot necessarily be used to discriminate between alternative solutions for gap filling and therefore, that the use of other information is necessary to obtain a more accurate network. All described workflows are implemented as part of the DOE Systems Biology Knowledgebase (KBase) and are publicly available via API or command-line web interface.« less

  10. Functional Gene Networks: R/Bioc package to generate and analyse gene networks derived from functional enrichment and clustering

    PubMed Central

    Aibar, Sara; Fontanillo, Celia; Droste, Conrad; De Las Rivas, Javier

    2015-01-01

    Summary: Functional Gene Networks (FGNet) is an R/Bioconductor package that generates gene networks derived from the results of functional enrichment analysis (FEA) and annotation clustering. The sets of genes enriched with specific biological terms (obtained from a FEA platform) are transformed into a network by establishing links between genes based on common functional annotations and common clusters. The network provides a new view of FEA results revealing gene modules with similar functions and genes that are related to multiple functions. In addition to building the functional network, FGNet analyses the similarity between the groups of genes and provides a distance heatmap and a bipartite network of functionally overlapping genes. The application includes an interface to directly perform FEA queries using different external tools: DAVID, GeneTerm Linker, TopGO or GAGE; and a graphical interface to facilitate the use. Availability and implementation: FGNet is available in Bioconductor, including a tutorial. URL: http://bioconductor.org/packages/release/bioc/html/FGNet.html Contact: jrivas@usal.es Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25600944

  11. New in protein structure and function annotation: hotspots, single nucleotide polymorphisms and the 'Deep Web'.

    PubMed

    Bromberg, Yana; Yachdav, Guy; Ofran, Yanay; Schneider, Reinhard; Rost, Burkhard

    2009-05-01

    The rapidly increasing quantity of protein sequence data continues to widen the gap between available sequences and annotations. Comparative modeling suggests some aspects of the 3D structures of approximately half of all known proteins; homology- and network-based inferences annotate some aspect of function for a similar fraction of the proteome. For most known protein sequences, however, there is detailed knowledge about neither their function nor their structure. Comprehensive efforts towards the expert curation of sequence annotations have failed to meet the demand of the rapidly increasing number of available sequences. Only the automated prediction of protein function in the absence of homology can close the gap between available sequences and annotations in the foreseeable future. This review focuses on two novel methods for automated annotation, and briefly presents an outlook on how modern web software may revolutionize the field of protein sequence annotation. First, predictions of protein binding sites and functional hotspots, and the evolution of these into the most successful type of prediction of protein function from sequence will be discussed. Second, a new tool, comprehensive in silico mutagenesis, which contributes important novel predictions of function and at the same time prepares for the onset of the next sequencing revolution, will be described. While these two new sub-fields of protein prediction represent the breakthroughs that have been achieved methodologically, it will then be argued that a different development might further change the way biomedical researchers benefit from annotations: modern web software can connect the worldwide web in any browser with the 'Deep Web' (ie, proprietary data resources). The availability of this direct connection, and the resulting access to a wealth of data, may impact drug discovery and development more than any existing method that contributes to protein annotation. PMID:19396742

  12. BioGPS: building your own mash-up of gene annotations and expression profiles

    PubMed Central

    Wu, Chunlei; Jin, Xuefeng; Tsueng, Ginger; Afrasiabi, Cyrus; Su, Andrew I.

    2016-01-01

    BioGPS (http://biogps.org) is a centralized gene-annotation portal that enables researchers to access distributed gene annotation resources. This article focuses on the updates to BioGPS since our last paper (2013 database issue). The unique features of BioGPS, compared to those of other gene portals, are its community extensibility and user customizability. Users contribute the gene-specific resources accessible from BioGPS (‘plugins’), which helps ensure that the resource collection is always up-to-date and that it will continue expanding over time (since the 2013 paper, 162 resources have been added, for a 34% increase in the number of resources available). BioGPS users can create their own collections of relevant plugins and save them as customized gene-report pages or ‘layouts’ (since the 2013 paper, 488 user-created layouts have been added, for a 22% increase in the number of layouts). In addition, we recently updated the most popular plugin, the ‘Gene expression/activity chart’, to include ?6000 datasets (from ?2000 datasets) and we enhanced user interactivity. We also added a new ‘gene list’ feature that allows users to save query results for future reference. PMID:26578587

  13. Identification and annotation of small RNA genes using ShortStack.

    PubMed

    Shahid, Saima; Axtell, Michael J

    2014-05-01

    Highly parallel sequencing of cDNA derived from endogenous small RNAs (small RNA-seq) is a key method that has accelerated understanding of regulatory small RNAs in eukaryotes. Eukaryotic regulatory small RNAs, which include microRNAs (miRNAs), short interfering RNAs (siRNAs), and Piwi-associated RNAs (piRNAs), typically derive from the processing of longer precursor RNAs. Alignment of small RNA-seq data to a reference genome allows the inference of the longer precursor and thus the annotation of small RNA producing genes. ShortStack is a program that was developed to comprehensively analyze reference-aligned small RNA-seq data, and output detailed and useful annotations of the causal small RNA-producing genes. Here, we provide a step-by-step tutorial of ShortStack usage with the goal of introducing new users to the software and pointing out some common pitfalls. PMID:24139974

  14. Collaborative Annotation and Visualization of Functional and Discourse Structures

    E-print Network

    Applications of Language Studies, Department of Chinese, Translation and Linguistics, City University of Hong to raw linguistic data for descriptive or analytical purposes. In the tagging of complex Chinese witnessed an increasing need for large-scale high-quality annotated corpora on complex Chinese linguistic

  15. Integrative Annotation of 21,037 Human Genes Validated by Full-Length cDNA Clones

    SciTech Connect

    Imanishi, Tadashi; Itoh, Takeshi; Suzuki, Yutaka; O'Donovan, Claire; Fukuchi, Satoshi; Koyanagi, Kanako O.; Barrero, Roberto A.; Tamura, Takuro; Yamaguchi-Kabata, Yumi; Tanino, Motohiko; Yura, Kei; Miyazaki, Satoru; Ikeo, Kazuho; Homma, Keiichi; Kasprzyk, Arek; Nishikawa, Tetsuo; Hirakawa, Mika; Thierry-Mieg, Jean; Thierry-Mieg, Danielle; Ashurst, Jennifer; Jia, Libin; Nakao, Mitsuteru; Thomas, Michael A.; Mulder, Nicola; Karavidopoulou, Youla; Jin, Lihua; Kim, Sangsoo; Yasuda, Tomohiro; Lenhard, Boris; Eveno, Eric; Suzuki, Yoshiyuki; Yamasaki, Chisato; Takeda, Jun-ichi; Gough, Craig; Hilton, Phillip; Fujii, Yasuyuki; Sakai, Hiroaki; Tanaka, Susumu; Amid, Clara; Bellgard, Matthew; de Fatima Bonaldo, Maria; Bono Hidemasa; Bromberg, Susan K.; Brookes, Anthony J.; Bruford, Elspeth; Carninci Piero; Chelala, Claude; Couillault, Christine; de Souza, Sandro J.; Debily, Marie-Anne; Devignes, Marie-Dominique; Dubchak, Inna; Endo, Toshinori; Estreicher, Anne; Eyras, Eduardo; Fukami-Kobayashi, Kaoru; Gopinath, Gopal R.; Graudens, Esther; Hahn, Yoonsoo; Han, Michael; Han, Ze-Guang; Hanada, Kousuke; Hanaoka, Hideki; Harada, Erimi; Hashimoto, Katsuyuki; Hinz, Ursula; Hirai, Momoki; Hishiki, Teruyoshi; Hopkinson, Ian; Imbeaud, Sandrine; Inoko, Hidetoshi; Kanapin, Alexander; Kaneko, Yayoi; Kasukawa, Takeya; Kelso, Janet; Kersey, Paul; Kikuno Reiko; Kimura, Kouichi; Korn, Bernhard; Kuryshev, Vladimir; Makalowska, Izabela; Makino Takashi; Mano, Shuhei; Mariage-Samson, Regine; Mashima, Jun; Matsuda, Hideo; Mewes, Hans-Werner; Minoshima, Shinsei; Nagai, Keiichi; Nagasaki, Hideki; Nagata, Naoki; Nigam, Rajni; Ogasawara, Osamu; Ohara, Osamu; Ohtsubo, Masafumi; Okada, Norihiro; Okido, Toshihisa; Oota, Satoshi; Ota, Motonori; Ota, Toshio; Otsuki, Tetsuji; Piatier-Tonneau, Dominique; Poustka, Annemarie; Ren, Shuang-Xi; Saitou, Naruya; Sakai, Katsunaga; Sakamoto, Shigetaka; Sakate, Ryuichi; Schupp, Ingo; Servant, Florence; Sherry, Stephen; Shiba Rie; et al.

    2004-01-15

    The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult task, as the varieties of transcripts of a gene may vary to a great extent. We thus performed an exhaustive integrative characterization of 41,118 full-length cDNAs that capture the gene transcripts as complete functional cassettes, providing an unequivocal report of structural and functional diversity at the gene level. Our international collaboration has validated 21,037 human gene candidates by analysis of high-quality full-length cDNA clones through curation using unified criteria. This led to the identification of 5,155 new gene candidates. It also manifested the most reliable way to control the quality of the cDNA clones. We have developed a human gene database, called the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/). It provides the following: integrative annotation of human genes, description of gene structures, details of novel alternative splicing isoforms, non-protein-coding RNAs, functional domains, subcellular localizations, metabolic pathways, predictions of protein three-dimensional structure, mapping of known single nucleotide polymorphisms (SNPs), identification of polymorphic microsatellite repeats within human genes, and comparative results with mouse full-length cDNAs. The H-InvDB analysis has shown that up to 4 percent of the human genome sequence (National Center for Biotechnology Information build 34 assembly) may contain misassembled or missing regions. We found that 6.5 percent of the human gene candidates (1,377 loci) did not have a good protein-coding open reading frame, of which 296 loci are strong candidates for nonprotein-coding RNA genes . In addition, among 72,027 uniquely mapped SNPs and insertions/deletions localized within human genes, 13,215 nonsynonymous SNPs, 315 nonsense SNPs, and 452 indels occurred in coding regions. Together with 25 polymorphic microsatellite repeats present in coding regions, they may alter protein structure, causing phenotypic effects or resulting in disease. The H-InvDB platform represents a substantial contribution to resources needed for the exploration of human biology and pathology.

  16. Integrative Annotation of 21,037 Human Genes Validated by Full-Length cDNA Clones

    PubMed Central

    2004-01-01

    The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult task, as the varieties of transcripts of a gene may vary to a great extent. We thus performed an exhaustive integrative characterization of 41,118 full-length cDNAs that capture the gene transcripts as complete functional cassettes, providing an unequivocal report of structural and functional diversity at the gene level. Our international collaboration has validated 21,037 human gene candidates by analysis of high-quality full-length cDNA clones through curation using unified criteria. This led to the identification of 5,155 new gene candidates. It also manifested the most reliable way to control the quality of the cDNA clones. We have developed a human gene database, called the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/). It provides the following: integrative annotation of human genes, description of gene structures, details of novel alternative splicing isoforms, non-protein-coding RNAs, functional domains, subcellular localizations, metabolic pathways, predictions of protein three-dimensional structure, mapping of known single nucleotide polymorphisms (SNPs), identification of polymorphic microsatellite repeats within human genes, and comparative results with mouse full-length cDNAs. The H-InvDB analysis has shown that up to 4% of the human genome sequence (National Center for Biotechnology Information build 34 assembly) may contain misassembled or missing regions. We found that 6.5% of the human gene candidates (1,377 loci) did not have a good protein-coding open reading frame, of which 296 loci are strong candidates for non-protein-coding RNA genes. In addition, among 72,027 uniquely mapped SNPs and insertions/deletions localized within human genes, 13,215 nonsynonymous SNPs, 315 nonsense SNPs, and 452 indels occurred in coding regions. Together with 25 polymorphic microsatellite repeats present in coding regions, they may alter protein structure, causing phenotypic effects or resulting in disease. The H-InvDB platform represents a substantial contribution to resources needed for the exploration of human biology and pathology. PMID:15103394

  17. PanCoreGen - Profiling, detecting, annotating protein-coding genes in microbial genomes.

    PubMed

    Paul, Sandip; Bhardwaj, Archana; Bag, Sumit K; Sokurenko, Evgeni V; Chattopadhyay, Sujay

    2015-12-01

    A large amount of genomic data, especially from multiple isolates of a single species, has opened new vistas for microbial genomics analysis. Analyzing the pan-genome (i.e. the sum of genetic repertoire) of microbial species is crucial in understanding the dynamics of molecular evolution, where virulence evolution is of major interest. Here we present PanCoreGen - a standalone application for pan- and core-genomic profiling of microbial protein-coding genes. PanCoreGen overcomes key limitations of the existing pan-genomic analysis tools, and develops an integrated annotation-structure for a species-specific pan-genomic profile. It provides important new features for annotating draft genomes/contigs and detecting unidentified genes in annotated genomes. It also generates user-defined group-specific datasets within the pan-genome. Interestingly, analyzing an example-set of Salmonella genomes, we detect potential footprints of adaptive convergence of horizontally transferred genes in two human-restricted pathogenic serovars - Typhi and Paratyphi A. Overall, PanCoreGen represents a state-of-the-art tool for microbial phylogenomics and pathogenomics study. PMID:26456591

  18. Improved systematic tRNA gene annotation allows new insights into the evolution of mitochondrial tRNA structures and into the mechanisms of mitochondrial genome rearrangements

    PubMed Central

    Jühling, Frank; Pütz, Joern; Bernt, Matthias; Donath, Alexander; Middendorf, Martin; Florentz, Catherine; Stadler, Peter F.

    2012-01-01

    Transfer RNAs (tRNAs) are present in all types of cells as well as in organelles. tRNAs of animal mitochondria show a low level of primary sequence conservation and exhibit ‘bizarre’ secondary structures, lacking complete domains of the common cloverleaf. Such sequences are hard to detect and hence frequently missed in computational analyses and mitochondrial genome annotation. Here, we introduce an automatic annotation procedure for mitochondrial tRNA genes in Metazoa based on sequence and structural information in manually curated covariance models. The method, applied to re-annotate 1876 available metazoan mitochondrial RefSeq genomes, allows to distinguish between remaining functional genes and degrading ‘pseudogenes’, even at early stages of divergence. The subsequent analysis of a comprehensive set of mitochondrial tRNA genes gives new insights into the evolution of structures of mitochondrial tRNA sequences as well as into the mechanisms of genome rearrangements. We find frequent losses of tRNA genes concentrated in basal Metazoa, frequent independent losses of individual parts of tRNA genes, particularly in Arthropoda, and wide-spread conserved overlaps of tRNAs in opposite reading direction. Direct evidence for several recent Tandem Duplication-Random Loss events is gained, demonstrating that this mechanism has an impact on the appearance of new mitochondrial gene orders. PMID:22139921

  19. Functional Annotation of Proteomic Data from Chicken Heterophils and Macrophages Induced by Carbon Nanotube Exposure

    PubMed Central

    Li, Yun-Ze; Cheng, Chung-Shi; Chen, Chao-Jung; Li, Zi-Lin; Lin, Yao-Tung; Chen, Shuen-Ei; Huang, San-Yuan

    2014-01-01

    With the expanding applications of carbon nanotubes (CNT) in biomedicine and agriculture, questions about the toxicity and biocompatibility of CNT in humans and domestic animals are becoming matters of serious concern. This study used proteomic methods to profile gene expression in chicken macrophages and heterophils in response to CNT exposure. Two-dimensional gel electrophoresis identified 12 proteins in macrophages and 15 in heterophils, with differential expression patterns in response to CNT co-incubation (0, 1, 10, and 100 ?g/mL of CNT for 6 h) (p < 0.05). Gene ontology analysis showed that most of the differentially expressed proteins are associated with protein interactions, cellular metabolic processes, and cell mobility, suggesting activation of innate immune functions. Western blot analysis with heat shock protein 70, high mobility group protein, and peptidylprolyl isomerase A confirmed the alterations of the profiled proteins. The functional annotations were further confirmed by effective cell migration, promoted interleukin-1? secretion, and more cell death in both macrophages and heterophils exposed to CNT (p < 0.05). In conclusion, results of this study suggest that CNT exposure affects protein expression, leading to activation of macrophages and heterophils, resulting in altered cytoskeleton remodeling, cell migration, and cytokine production, and thereby mediates tissue immune responses. PMID:24823882

  20. Generation, functional annotation and comparative analysis of black spruce (Picea mariana) ESTs: an important conifer genomic resource

    PubMed Central

    2013-01-01

    Background EST (expressed sequence tag) sequences and their annotation provide a highly valuable resource for gene discovery, genome sequence annotation, and other genomics studies that can be applied in genetics, breeding and conservation programs for non-model organisms. Conifers are long-lived plants that are ecologically and economically important globally, and have a large genome size. Black spruce (Picea mariana), is a transcontinental species of the North American boreal and temperate forests. However, there are limited transcriptomic and genomic resources for this species. The primary objective of our study was to develop a black spruce transcriptomic resource to facilitate on-going functional genomics projects related to growth and adaptation to climate change. Results We conducted bidirectional sequencing of cDNA clones from a standard cDNA library constructed from black spruce needle tissues. We obtained 4,594 high quality (2,455 5' end and 2,139 3' end) sequence reads, with an average read-length of 532 bp. Clustering and assembly of ESTs resulted in 2,731 unique sequences, consisting of 2,234 singletons and 497 contigs. Approximately two-thirds (63%) of unique sequences were functionally annotated. Genes involved in 36 molecular functions and 90 biological processes were discovered, including 24 putative transcription factors and 232 genes involved in photosynthesis. Most abundantly expressed transcripts were associated with photosynthesis, growth factors, stress and disease response, and transcription factors. A total of 216 full-length genes were identified. About 18% (493) of the transcripts were novel, representing an important addition to the Genbank EST database (dbEST). Fifty-seven di-, tri-, tetra- and penta-nucleotide simple sequence repeats were identified. Conclusions We have developed the first high quality EST resource for black spruce and identified 493 novel transcripts, which may be species-specific related to life history and ecological traits. We have also identified full-length genes and microsatellite-containing ESTs. Based on EST sequence similarities, black spruce showed close evolutionary relationships with congeneric Picea glauca and Picea sitchensis compared to other Pinaceae members and angiosperms. The EST sequences reported here provide an important resource for genome annotation, functional and comparative genomics, molecular breeding, conservation and management studies and applications in black spruce and related conifer species. PMID:24119028

  1. The power of EST sequence data: Relation to Acyrthosiphon pisum genome annotation and functional genomics initiatives

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Genes important to aphid biology, survival and reproduction were successfully identified by use of a genomics approach. We created and described the Sequencing, compilation, and annotation of the approxiamtely 525Mb nuclear genome of the pea aphid, Acyrthosiphon pisum, which represents an important ...

  2. A Statistical Framework to Predict Functional Non-Coding Regions in the Human Genome Through Integrated Analysis of Annotation Data

    PubMed Central

    Lu, Qiongshi; Hu, Yiming; Sun, Jiehuan; Cheng, Yuwei; Cheung, Kei-Hoi; Zhao, Hongyu

    2015-01-01

    Identifying functional regions in the human genome is a major goal in human genetics. Great efforts have been made to functionally annotate the human genome either through computational predictions, such as genomic conservation, or high-throughput experiments, such as the ENCODE project. These efforts have resulted in a rich collection of functional annotation data of diverse types that need to be jointly analyzed for integrated interpretation and annotation. Here we present GenoCanyon, a whole-genome annotation method that performs unsupervised statistical learning using 22 computational and experimental annotations thereby inferring the functional potential of each position in the human genome. With GenoCanyon, we are able to predict many of the known functional regions. The ability of predicting functional regions as well as its generalizable statistical framework makes GenoCanyon a unique and powerful tool for whole-genome annotation. The GenoCanyon web server is available at http://genocanyon.med.yale.edu PMID:26015273

  3. De Novo Assembly, Gene Annotation, and Marker Discovery in Stored-Product Pest Liposcelis entomophila (Enderlein) Using Transcriptome Sequences

    PubMed Central

    Wei, Dan-Dan; Chen, Er-Hu; Ding, Tian-Bo; Chen, Shi-Chun; Dou, Wei; Wang, Jin-Jun

    2013-01-01

    Background As a major stored-product pest insect, Liposcelis entomophila has developed high levels of resistance to various insecticides in grain storage systems. However, the molecular mechanisms underlying resistance and environmental stress have not been characterized. To date, there is a lack of genomic information for this species. Therefore, studies aimed at profiling the L. entomophila transcriptome would provide a better understanding of the biological functions at the molecular levels. Methodology/Principal Findings We applied Illumina sequencing technology to sequence the transcriptome of L. entomophila. A total of 54,406,328 clean reads were obtained and that de novo assembled into 54,220 unigenes, with an average length of 571 bp. Through a similarity search, 33,404 (61.61%) unigenes were matched to known proteins in the NCBI non-redundant (Nr) protein database. These unigenes were further functionally annotated with gene ontology (GO), cluster of orthologous groups of proteins (COG), and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. A large number of genes potentially involved in insecticide resistance were manually curated, including 68 putative cytochrome P450 genes, 37 putative glutathione S-transferase (GST) genes, 19 putative carboxyl/cholinesterase (CCE) genes, and other 126 transcripts to contain target site sequences or encoding detoxification genes representing eight types of resistance enzymes. Furthermore, to gain insight into the molecular basis of the L. entomophila toward thermal stresses, 25 heat shock protein (Hsp) genes were identified. In addition, 1,100 SSRs and 57,757 SNPs were detected and 231 pairs of SSR primes were designed for investigating the genetic diversity in future. Conclusions/Significance We developed a comprehensive transcriptomic database for L. entomophila. These sequences and putative molecular markers would further promote our understanding of the molecular mechanisms underlying insecticide resistance or environmental stress, and will facilitate studies on population genetics for psocids, as well as providing useful information for functional genomic research in the future. PMID:24244605

  4. Structuring osteosarcoma knowledge: an osteosarcoma-gene association database based on literature mining and manual annotation

    PubMed Central

    Poos, Kathrin; Smida, Jan; Nathrath, Michaela; Maugg, Doris; Baumhoer, Daniel; Neumann, Anna; Korsching, Eberhard

    2014-01-01

    Osteosarcoma (OS) is the most common primary bone cancer exhibiting high genomic instability. This genomic instability affects multiple genes and microRNAs to a varying extent depending on patient and tumor subtype. Massive research is ongoing to identify genes including their gene products and microRNAs that correlate with disease progression and might be used as biomarkers for OS. However, the genomic complexity hampers the identification of reliable biomarkers. Up to now, clinico-pathological factors are the key determinants to guide prognosis and therapeutic treatments. Each day, new studies about OS are published and complicate the acquisition of information to support biomarker discovery and therapeutic improvements. Thus, it is necessary to provide a structured and annotated view on the current OS knowledge that is quick and easily accessible to researchers of the field. Therefore, we developed a publicly available database and Web interface that serves as resource for OS-associated genes and microRNAs. Genes and microRNAs were collected using an automated dictionary-based gene recognition procedure followed by manual review and annotation by experts of the field. In total, 911 genes and 81 microRNAs related to 1331 PubMed abstracts were collected (last update: 29 October 2013). Users can evaluate genes and microRNAs according to their potential prognostic and therapeutic impact, the experimental procedures, the sample types, the biological contexts and microRNA target gene interactions. Additionally, a pathway enrichment analysis of the collected genes highlights different aspects of OS progression. OS requires pathways commonly deregulated in cancer but also features OS-specific alterations like deregulated osteoclast differentiation. To our knowledge, this is the first effort of an OS database containing manual reviewed and annotated up-to-date OS knowledge. It might be a useful resource especially for the bone tumor research community, as specific information about genes or microRNAs is quick and easily accessible. Hence, this platform can support the ongoing OS research and biomarker discovery. Database URL: http://osteosarcoma-db.uni-muenster.de PMID:24865352

  5. Annokey: an annotation tool based on key term search of the NCBI Entrez Gene database

    PubMed Central

    2014-01-01

    Background The NCBI Entrez Gene and PubMed databases contain a wealth of high-quality information about genes for many different organisms. The NCBI Entrez online web-search interface is convenient for simple manual search for a small number of genes but impractical for the kinds of outputs seen in typical genomics projects. Results We have developed an efficient open source tool implemented in Python called Annokey, which annotates gene lists with the results of a keyword search of the NCBI Entrez Gene database and linked Pubmed article information. The user steers the search by specifying a ranked list of keywords (including multi-word phrases and regular expressions) that are correlated with their topic of interest. Rank information of matched terms allows the user to guide further investigation. We applied Annokey to the entire human Entrez Gene database using the key-term “DNA repair” and assessed its performance in identifying the 176 members of a published “gold standard” list of genes established to be involved in this pathway. For this test case we observed a sensitivity and specificity of 97% and 96%, respectively. Conclusions Annokey facilitates the identification of genes related to an area of interest, a task which can be onerous if performed manually on a large number of genes. Annokey provides a way to capitalize on the high quality information provided by the Entrez Gene database allowing both scalability and compatibility with automated analysis pipelines, thus offering the potential to significantly enhance research productivity.

  6. VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment

    PubMed Central

    Habegger, Lukas; Balasubramanian, Suganthi; Chen, David Z.; Khurana, Ekta; Sboner, Andrea; Harmanci, Arif; Rozowsky, Joel; Clarke, Declan; Snyder, Michael; Gerstein, Mark

    2012-01-01

    Summary: The functional annotation of variants obtained through sequencing projects is generally assumed to be a simple intersection of genomic coordinates with genomic features. However, complexities arise for several reasons, including the differential effects of a variant on alternatively spliced transcripts, as well as the difficulty in assessing the impact of small insertions/deletions and large structural variants. Taking these factors into consideration, we developed the Variant Annotation Tool (VAT) to functionally annotate variants from multiple personal genomes at the transcript level as well as obtain summary statistics across genes and individuals. VAT also allows visualization of the effects of different variants, integrates allele frequencies and genotype data from the underlying individuals and facilitates comparative analysis between different groups of individuals. VAT can either be run through a command-line interface or as a web application. Finally, in order to enable on-demand access and to minimize unnecessary transfers of large data files, VAT can be run as a virtual machine in a cloud-computing environment. Availability and Implementation: VAT is implemented in C and PHP. The VAT web service, Amazon Machine Image, source code and detailed documentation are available at vat.gersteinlab.org. Contact: lukas.habegger@yale.edu or mark.gerstein@yale.edu Supplementary Information: Supplementary data are available at Bioinformatics online. PMID:22743228

  7. Annotation and retrieval system of CAD models based on functional semantics

    NASA Astrophysics Data System (ADS)

    Wang, Zhansong; Tian, Ling; Duan, Wenrui

    2014-11-01

    CAD model retrieval based on functional semantics is more significant than content-based 3D model retrieval during the mechanical conceptual design phase. However, relevant research is still not fully discussed. Therefore, a functional semantic-based CAD model annotation and retrieval method is proposed to support mechanical conceptual design and design reuse, inspire designer creativity through existing CAD models, shorten design cycle, and reduce costs. Firstly, the CAD model functional semantic ontology is constructed to formally represent the functional semantics of CAD models and describe the mechanical conceptual design space comprehensively and consistently. Secondly, an approach to represent CAD models as attributed adjacency graphs(AAG) is proposed. In this method, the geometry and topology data are extracted from STEP models. On the basis of AAG, the functional semantics of CAD models are annotated semi-automatically by matching CAD models that contain the partial features of which functional semantics have been annotated manually, thereby constructing CAD Model Repository that supports model retrieval based on functional semantics. Thirdly, a CAD model retrieval algorithm that supports multi-function extended retrieval is proposed to explore more potential creative design knowledge in the semantic level. Finally, a prototype system, called Functional Semantic-based CAD Model Annotation and Retrieval System(FSMARS), is implemented. A case demonstrates that FSMARS can successfully botain multiple potential CAD models that conform to the desired function. The proposed research addresses actual needs and presents a new way to acquire CAD models in the mechanical conceptual design phase.

  8. BioBuilder as a database development and functional annotation platform for proteins

    PubMed Central

    Navarro, J Daniel; Talreja, Naveen; Peri, Suraj; Vrushabendra, BM; Rashmi, BP; Padma, N; Surendranath, Vineeth; Jonnalagadda, Chandra Kiran; Kousthub, PS; Deshpande, Nandan; Shanker, K; Pandey, Akhilesh

    2004-01-01

    Background The explosion in biological information creates the need for databases that are easy to develop, easy to maintain and can be easily manipulated by annotators who are most likely to be biologists. However, deployment of scalable and extensible databases is not an easy task and generally requires substantial expertise in database development. Results BioBuilder is a Zope-based software tool that was developed to facilitate intuitive creation of protein databases. Protein data can be entered and annotated through web forms along with the flexibility to add customized annotation features to protein entries. A built-in review system permits a global team of scientists to coordinate their annotation efforts. We have already used BioBuilder to develop Human Protein Reference Database , a comprehensive annotated repository of the human proteome. The data can be exported in the extensible markup language (XML) format, which is rapidly becoming as the standard format for data exchange. Conclusions As the proteomic data for several organisms begins to accumulate, BioBuilder will prove to be an invaluable platform for functional annotation and development of customizable protein centric databases. BioBuilder is open source and is available under the terms of LGPL. PMID:15099404

  9. High-throughput comparison, functional annotation, and metabolic modeling of plant genomes using the PlantSEED resource

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The increasing number of sequenced plant genomes is placing new demands on the methods applied to analyze, annotate, and model these genomes. Today's annotation pipelines result in inconsistent gene assignments that complicate comparative analyses and prevent efficient construction of metabolic mode...

  10. Inferring mouse gene functions from genomic-scale data using a combined functional network/classification strategy

    PubMed Central

    Kim, Wan Kyu; Krumpelman, Chase; Marcotte, Edward M

    2008-01-01

    The complete set of mouse genes, as with the set of human genes, is still largely uncharacterized, with many pieces of experimental evidence accumulating regarding the activities and expression of the genes, but the majority of genes as yet still of unknown function. Within the context of the MouseFunc competition, we developed and applied two distinct large-scale data mining approaches to infer the functions (Gene Ontology annotations) of mouse genes from experimental observations from available functional genomics, proteomics, comparative genomics, and phenotypic data. The two strategies — the first using classifiers to map features to annotations, the second propagating annotations from characterized genes to uncharacterized genes along edges in a network constructed from the features — offer alternative and possibly complementary approaches to providing functional annotations. Here, we re-implement and evaluate these approaches and their combination for their ability to predict the proper functional annotations of genes in the MouseFunc data set. We show that, when controlling for the same set of input features, the network approach generally outperformed a naïve Bayesian classifier approach, while their combination offers some improvement over either independently. We make our observations of predictive performance on the MouseFunc competition hold-out set, as well as on a ten-fold cross-validation of the MouseFunc data. Across all 1,339 annotated genes in the MouseFunc test set, the median predictive power was quite strong (median area under a receiver operating characteristic plot of 0.865 and average precision of 0.195), indicating that a mining-based strategy with existing data is a promising path towards discovering mammalian gene functions. As one product of this work, a high-confidence subset of the functional mouse gene network was produced — spanning >70% of mouse genes with >1.6 million associations — that is predictive of mouse (and therefore often human) gene function and functional associations. The network should be generally useful for mammalian gene functional analyses, such as for predicting interactions, inferring functional connections between genes and pathways, and prioritizing candidate genes. The network and all predictions are available on the worldwide web. PMID:18613949

  11. Proteomics and transcriptomics of the BABA-induced resistance response in potato using a novel functional annotation approach

    PubMed Central

    2014-01-01

    Background Induced resistance (IR) can be part of a sustainable plant protection strategy against important plant diseases. ?-aminobutyric acid (BABA) can induce resistance in a wide range of plants against several types of pathogens, including potato infected with Phytophthora infestans. However, the molecular mechanisms behind this are unclear and seem to be dependent on the system studied. To elucidate the defence responses activated by BABA in potato, a genome-wide transcript microarray analysis in combination with label-free quantitative proteomics analysis of the apoplast secretome were performed two days after treatment of the leaf canopy with BABA at two concentrations, 1 and 10 mM. Results Over 5000 transcripts were differentially expressed and over 90 secretome proteins changed in abundance indicating a massive activation of defence mechanisms with 10 mM BABA, the concentration effective against late blight disease. To aid analysis, we present a more comprehensive functional annotation of the microarray probes and gene models by retrieving information from orthologous gene families across 26 sequenced plant genomes. The new annotation provided GO terms to 8616 previously un-annotated probes. Conclusions BABA at 10 mM affected several processes related to plant hormones and amino acid metabolism. A major accumulation of PR proteins was also evident, and in the mevalonate pathway, genes involved in sterol biosynthesis were down-regulated, whereas several enzymes involved in the sesquiterpene phytoalexin biosynthesis were up-regulated. Interestingly, abscisic acid (ABA) responsive genes were not as clearly regulated by BABA in potato as previously reported in Arabidopsis. Together these findings provide candidates and markers for improved resistance in potato, one of the most important crops in the world. PMID:24773703

  12. Evaluation of high-throughput functional categorization of human disease genes

    PubMed Central

    Chen, James L; Liu, Yang; Sam, Lee T; Li, Jianrong; Lussier, Yves A

    2007-01-01

    Background Biological data that are well-organized by an ontology, such as Gene Ontology, enables high-throughput availability of the semantic web. It can also be used to facilitate high throughput classification of biomedical information. However, to our knowledge, no evaluation has been published on automating classifications of human diseases genes using Gene Ontology. In this study, we evaluate automated classifications of well-defined human disease genes using their Gene Ontology annotations and compared them to a gold standard. This gold standard was independently conceived by Valle's research group, and contains 923 human disease genes organized in 14 categories of protein function. Results Two automated methods were applied to investigate the classification of human disease genes into independently pre-defined categories of protein function. One method used the structure of Gene Ontology by pre-selecting 74 Gene Ontology terms assigned to 11 protein function categories. The second method was based on the similarity of human disease genes clustered according to the information-theoretic distance of their Gene Ontology annotations. Compared to the categorization of human disease genes found in the gold standard, our automated methods can achieve an overall 56% and 47% precision with 62% and 71% recall respectively. However, approximately 15% of the studied human disease genes remain without GO annotations. Conclusion Automated methods can recapitulate a significant portion of classification of the human disease genes. The method using information-theoretic distance performs slightly better on the precision with some loss in recall. For some protein function categories, such as 'hormone' and 'transcription factor', the automated methods perform particularly well, achieving precision and recall levels above 75%. In summary, this study demonstrates that for semantic webs, methods to automatically classify or analyze a majority of human disease genes require significant progress in both the Gene Ontology annotations and particularly in the utilization of these annotations. PMID:17493290

  13. De Novo Assembly, Characterization and Functional Annotation of Pineapple Fruit Transcriptome through Massively Parallel Sequencing

    PubMed Central

    Ong, Wen Dee; Voo, Lok-Yung Christopher; Kumar, Vijay Subbiah

    2012-01-01

    Background Pineapple (Ananas comosus var. comosus), is an important tropical non-climacteric fruit with high commercial potential. Understanding the mechanism and processes underlying fruit ripening would enable scientists to enhance the improvement of quality traits such as, flavor, texture, appearance and fruit sweetness. Although, the pineapple is an important fruit, there is insufficient transcriptomic or genomic information that is available in public databases. Application of high throughput transcriptome sequencing to profile the pineapple fruit transcripts is therefore needed. Methodology/Principal Findings To facilitate this, we have performed transcriptome sequencing of ripe yellow pineapple fruit flesh using Illumina technology. About 4.7 millions Illumina paired-end reads were generated and assembled using the Velvet de novo assembler. The assembly produced 28,728 unique transcripts with a mean length of approximately 200 bp. Sequence similarity search against non-redundant NCBI database identified a total of 16,932 unique transcripts (58.93%) with significant hits. Out of these, 15,507 unique transcripts were assigned to gene ontology terms. Functional annotation against Kyoto Encyclopedia of Genes and Genomes pathway database identified 13,598 unique transcripts (47.33%) which were mapped to 126 pathways. The assembly revealed many transcripts that were previously unknown. Conclusions The unique transcripts derived from this work have rapidly increased of the number of the pineapple fruit mRNA transcripts as it is now available in public databases. This information can be further utilized in gene expression, genomics and other functional genomics studies in pineapple. PMID:23091603

  14. From Gene Networks to Gene Function

    PubMed Central

    Schlitt, Thomas; Palin, Kimmo; Rung, Johan; Dietmann, Sabine; Lappe, Michael; Ukkonen, Esko; Brazma, Alvis

    2003-01-01

    We propose a novel method to identify functionally related genes based on comparisons of neighborhoods in gene networks. This method does not rely on gene sequence or protein structure homologies, and it can be applied to any organism and a wide variety of experimental data sets. The character of the predicted gene relationships depends on the underlying networks;they concern biological processes rather than the molecular function. We used the method to analyze gene networks derived from genome-wide chromatin immunoprecipitation experiments, a large-scale gene deletion study, and from the genomic positions of consensus binding sites for transcription factors of the yeast Saccharomyces cerevisiae. We identified 816 functional relationships between 159 genes and show that these relationships correspond to protein–protein interactions, co-occurrence in the same protein complexes, and/or co-occurrence in abstracts of scientific articles. Our results suggest functions for seven previously uncharacterized yeast genes: KIN3 and YMR269W may be involved in biological processes related to cell growth and/or maintenance, whereas IES6, YEL008W, YEL033W, YHL029C, YMR010W, and YMR031W-A are likely to have metabolic functions. PMID:14656964

  15. Functional Analysis of the Molecular Interactions of TATA Box-Containing Genes and Essential Genes

    PubMed Central

    Moon, Jisook

    2015-01-01

    Genes can be divided into TATA-containing genes and TATA-less genes according to the presence of TATA box elements at promoter regions. TATA-containing genes tend to be stress-responsive, whereas many TATA-less genes are known to be related to cell growth or “housekeeping” functions. In a previous study, we demonstrated that there are striking differences among four gene sets defined by the presence of TATA box (TATA-containing) and essentiality (TATA-less) with respect to number of associated transcription factors, amino acid usage, and functional annotation. Extending this research in yeast, we identified KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways that are statistically enriched in TATA-containing or TATA-less genes and evaluated the possibility that the enriched pathways are related to stress or growth as reflected by the individual functions of the genes involved. According to their enrichment for either of these two gene sets, we sorted KEGG pathways into TATA-containing-gene-enriched pathways (TEPs) and essential-gene-enriched pathways (EEPs). As expected, genes in TEPs and EEPs exhibited opposite results in terms of functional category, transcriptional regulation, codon adaptation index, and network properties, suggesting the possibility that the bipolar patterns in these pathways also contribute to the regulation of the stress response and to cell survival. Our findings provide the novel insight that significant enrichment of TATA-binding or TATA-less genes defines pathways as stress-responsive or growth-related. PMID:25789484

  16. Comprehensive functional annotation of 18 missense mutations found in suspected hemochromatosis type 4 patients.

    PubMed

    Callebaut, Isabelle; Joubrel, Rozenn; Pissard, Serge; Kannengiesser, Caroline; Gérolami, Victoria; Ged, Cécile; Cadet, Estelle; Cartault, François; Ka, Chandran; Gourlaouen, Isabelle; Gourhant, Lénaick; Oudin, Claire; Goossens, Michel; Grandchamp, Bernard; De Verneuil, Hubert; Rochette, Jacques; Férec, Claude; Le Gac, Gérald

    2014-09-01

    Hemochromatosis type 4 is a rare form of primary iron overload transmitted as an autosomal dominant trait caused by mutations in the gene encoding the iron transport protein ferroportin 1 (SLC40A1). SLC40A1 mutations fall into two functional categories (loss- versus gain-of-function) underlying two distinct clinical entities (hemochromatosis type 4A versus type 4B). However, the vast majority of SLC40A1 mutations are rare missense variations, with only a few showing strong evidence of causality. The present study reports the results of an integrated approach collecting genetic and phenotypic data from 44 suspected hemochromatosis type 4 patients, with comprehensive structural and functional annotations. Causality was demonstrated for 10 missense variants, showing a clear dichotomy between the two hemochromatosis type 4 subtypes. Two subgroups of loss-of-function mutations were distinguished: one impairing cell-surface expression and one altering only iron egress. Additionally, a new gain-of-function mutation was identified, and the degradation of ferroportin on hepcidin binding was shown to probably depend on the integrity of a large extracellular loop outside of the hepcidin-binding domain. Eight further missense variations, on the other hand, were shown to have no discernible effects at either protein or RNA level; these were found in apparently isolated patients and were associated with a less severe phenotype. The present findings illustrate the importance of combining in silico and biochemical approaches to fully distinguish pathogenic SLC40A1 mutations from benign variants. This has profound implications for patient management. PMID:24714983

  17. Transcriptional dynamics of the developing sweet cherry (Prunus avium L.) fruit: sequencing, annotation and expression profiling of exocarp-associated genes

    PubMed Central

    Alkio, Merianne; Jonas, Uwe; Declercq, Myriam; Van Nocker, Steven; Knoche, Moritz

    2014-01-01

    The exocarp, or skin, of fleshy fruit is a specialized tissue that protects the fruit, attracts seed dispersing fruit eaters, and has large economical relevance for fruit quality. Development of the exocarp involves regulated activities of many genes. This research analyzed global gene expression in the exocarp of developing sweet cherry (Prunus avium L., ‘Regina’), a fruit crop species with little public genomic resources. A catalog of transcript models (contigs) representing expressed genes was constructed from de novo assembled short complementary DNA (cDNA) sequences generated from developing fruit between flowering and maturity at 14 time points. Expression levels in each sample were estimated for 34?695 contigs from numbers of reads mapping to each contig. Contigs were annotated functionally based on BLAST, gene ontology and InterProScan analyses. Coregulated genes were detected using partitional clustering of expression patterns. The results are discussed with emphasis on genes putatively involved in cuticle deposition, cell wall metabolism and sugar transport. The high temporal resolution of the expression patterns presented here reveals finely tuned developmental specialization of individual members of gene families. Moreover, the de novo assembled sweet cherry fruit transcriptome with 7760 full-length protein coding sequences and over 20?000 other, annotated cDNA sequences together with their developmental expression patterns is expected to accelerate molecular research on this important tree fruit crop. PMID:26504533

  18. A statistical framework for improving genomic annotations of transposon mutagenesis (TM) assigned essential genes.

    PubMed

    Deng, Jingyuan

    2015-01-01

    Whole-genome transposon mutagenesis (TM) experiment followed by sequence-based identification of insertion sites is the most popular genome-wise experiment to identify essential genes in Prokaryota. However, due to the limitation of high-throughput technique, this approach yields substantial systematic biases resulting in the incorrect assignments of many essential genes. To obtain unbiased and accurate annotations of essential genes from TM experiments, we developed a novel Poisson model based statistical framework to refine these TM assignments. In the model, first we identified and incorporated several potential factors such as gene length and TM insertion information which may cause the TM assignment biases into the basic Poisson model. Then we calculated the conditional probability of an essential gene given the observed TM insertion number. By factorizing this probability through introducing a latent variable the real insertion number, we formalized the statistical framework. Through iteratively updating and optimizing model parameters to maximize the goodness-of-fit of the model to the observed TM insertion data, we finalized the model. Using this model, we are able to assign the probability score of essentiality to each individual gene given its TM assignment, which subsequently correct the experimental biases. To enable our model widely useable, we established a user-friendly Web-server that is accessible to the public: http://research.cchmc.org/essentialgene/. PMID:25636618

  19. proteinsSTRUCTURE O FUNCTION O BIOINFORMATICS Genome-wide enzyme annotation with

    E-print Network

    proteinsSTRUCTURE O FUNCTION O BIOINFORMATICS Genome-wide enzyme annotation with precision control Medical Research and Materiel Command, Fort Detrick, Maryland, USA INTRODUCTION The continual advancements FPR allows for fine precision control of each profile and enables the generation of profile databases

  20. Cross-Population Joint Analysis of eQTLs: Fine Mapping and Functional Annotation

    PubMed Central

    Wen, Xiaoquan; Luca, Francesca; Pique-Regi, Roger

    2015-01-01

    Mapping expression quantitative trait loci (eQTLs) has been shown as a powerful tool to uncover the genetic underpinnings of many complex traits at molecular level. In this paper, we present an integrative analysis approach that leverages eQTL data collected from multiple population groups. In particular, our approach effectively identifies multiple independent cis-eQTL signals that are consistent across populations, accounting for population heterogeneity in allele frequencies and linkage disequilibrium patterns. Furthermore, by integrating genomic annotations, our analysis framework enables high-resolution functional analysis of eQTLs. We applied our statistical approach to analyze the GEUVADIS data consisting of samples from five population groups. From this analysis, we concluded that i) jointly analysis across population groups greatly improves the power of eQTL discovery and the resolution of fine mapping of causal eQTL ii) many genes harbor multiple independent eQTLs in their cis regions iii) genetic variants that disrupt transcription factor binding are significantly enriched in eQTLs (p-value = 4.93 × 10-22). PMID:25906321

  1. Analysis and functional annotation of expressed sequence tags from the Asian longhorned beetle, Anoplophora glabripennis.

    PubMed

    Hunter, Wayne B; Smith, Michael T; Hunnicutt, Laura E

    2009-01-01

    The Asian longhorned beetle, Anoplophora glabripennis (Motschulsky) (Coleoptera: Cerambycidae), is one of the most economically and ecologically devastating forest insects to invade North America in recent years. Despite its substantial impact, limited effort has been expended to define the genetic and molecular make-up of this species. Considering the significant role played by late-stadia larvae in host tree decimation, a small-scale EST sequencing project was done using a cDNA library constructed from 5(th) -instar A. glabripennis. The resultant dataset consisted of 599 high quality ESTs that, upon assembly, yielded 381 potentially unique transcripts. Each of these transcripts was catalogued as to putative molecular function, biological process, and associated cellular component according to the Gene Ontology classification system. Using this annotated dataset, a subset of assembled sequences was identified that are putatively associated with A. glabnpennis development and metamorphosis. This work will contribute to understanding of the diverse molecular mechanisms that underlie coleopteran morphogenesis and enable the future development of novel control strategies for management of this insect pest. PMID:19619025

  2. Analysis and Functional Annotation of Expressed Sequence Tags from the Asian Longhorned Beetle, Anoplophora glabripennis

    PubMed Central

    Hunter, Wayne B.; Smith, Michael T.; Hunnicutt, Laura E.

    2009-01-01

    The Asian longhorned beetle, Anoplophora glabripennis (Motschulsky) (Coleoptera: Cerambycidae), is one of the most economically and ecologically devastating forest insects to invade North America in recent years. Despite its substantial impact, limited effort has been expended to define the genetic and molecular make-up of this species. Considering the significant role played by late-stadia larvae in host tree decimation, a small-scale EST sequencing project was done using a cDNA library constructed from 5th -instar A. glabripennis. The resultant dataset consisted of 599 high quality ESTs that, upon assembly, yielded 381 potentially unique transcripts. Each of these transcripts was catalogued as to putative molecular function, biological process, and associated cellular component according to the Gene Ontology classification system. Using this annotated dataset, a subset of assembled sequences was identified that are putatively associated with A. glabnpennis development and metamorphosis. This work will contribute to understanding of the diverse molecular mechanisms that underlie coleopteran morphogenesis and enable the future development of novel control strategies for management of this insect pest. PMID:19619025

  3. Genome Annotation by Shotgun Inactivation of a Native Gene in Hemizygous Cells: Application to BRCA2 with Implication of Hypomorphic Variants

    PubMed Central

    Ghosh, Soma; Bhunia, Anil K.; Paun, Bogdan C.; Gilbert, Samuel F.; Dhru, Urmil; Patel, Kalpesh; Kern, Scott E.

    2015-01-01

    The greatest interpretive challenge of modern medicine may be to functionally annotate the vast variation of human genomes. Demonstrating a proposed approach, we created a library of BRCA2 exon 27 shotgun-mutant plasmids including solitary and multiplex mutations to generate human knockin clones using homologous recombination. This 55-mutation, 13-clone syngeneic variance library (SyVaL) comprised severely affected clones having early-stop nonsense mutations, functionally hypomorphic clones having multiple missense mutations emphasizing the potential to identify and assess hypomorphic mutations in novel proteomic and epidemiologic studies, and neutral clones having multiple missense mutations. Efficient coverage of nonessential amino acids was provided by mutation multiplexing. Severe mutations were distinguished from hypomorphic or neutral changes by chemosensitivity assays (hypersensitivity to mitomycin C and acetaldehyde), by analysis of RAD51 focus formation, and by mitotic multipolarity. A multiplex unbiased approach of generating all-human SyVaLs in medically important genes, with random mutations in native genes, would provide databases of variants that could be functionally annotated without concerns arising from exogenous cDNA constructs or interspecies interactions, as a basis for subsequent proteomic domain mapping or clinical calibration if desired. Such gene-irrelevant approaches could be scaled up for multiple genes of clinical interest, providing distributable cellular libraries linked to public-shared functional databases. PMID:25451944

  4. tagtog: interactive and text-mining-assisted annotation of gene mentions in PLOS full-text articles

    PubMed Central

    Cejuela, Juan Miguel; McQuilton, Peter; Ponting, Laura; Marygold, Steven J.; Stefancsik, Raymund; Millburn, Gillian H.; Rost, Burkhard

    2014-01-01

    The breadth and depth of biomedical literature are increasing year upon year. To keep abreast of these increases, FlyBase, a database for Drosophila genomic and genetic information, is constantly exploring new ways to mine the published literature to increase the efficiency and accuracy of manual curation and to automate some aspects, such as triaging and entity extraction. Toward this end, we present the ‘tagtog’ system, a web-based annotation framework that can be used to mark up biological entities (such as genes) and concepts (such as Gene Ontology terms) in full-text articles. tagtog leverages manual user annotation in combination with automatic machine-learned annotation to provide accurate identification of gene symbols and gene names. As part of the BioCreative IV Interactive Annotation Task, FlyBase has used tagtog to identify and extract mentions of Drosophila melanogaster gene symbols and names in full-text biomedical articles from the PLOS stable of journals. We show here the results of three experiments with different sized corpora and assess gene recognition performance and curation speed. We conclude that tagtog-named entity recognition improves with a larger corpus and that tagtog-assisted curation is quicker than manual curation. Database URL: www.tagtog.net, www.flybase.org PMID:24715220

  5. tagtog: interactive and text-mining-assisted annotation of gene mentions in PLOS full-text articles.

    PubMed

    Cejuela, Juan Miguel; McQuilton, Peter; Ponting, Laura; Marygold, Steven J; Stefancsik, Raymund; Millburn, Gillian H; Rost, Burkhard

    2014-01-01

    The breadth and depth of biomedical literature are increasing year upon year. To keep abreast of these increases, FlyBase, a database for Drosophila genomic and genetic information, is constantly exploring new ways to mine the published literature to increase the efficiency and accuracy of manual curation and to automate some aspects, such as triaging and entity extraction. Toward this end, we present the 'tagtog' system, a web-based annotation framework that can be used to mark up biological entities (such as genes) and concepts (such as Gene Ontology terms) in full-text articles. tagtog leverages manual user annotation in combination with automatic machine-learned annotation to provide accurate identification of gene symbols and gene names. As part of the BioCreative IV Interactive Annotation Task, FlyBase has used tagtog to identify and extract mentions of Drosophila melanogaster gene symbols and names in full-text biomedical articles from the PLOS stable of journals. We show here the results of three experiments with different sized corpora and assess gene recognition performance and curation speed. We conclude that tagtog-named entity recognition improves with a larger corpus and that tagtog-assisted curation is quicker than manual curation. DATABASE URL: www.tagtog.net, www.flybase.org. PMID:24715220

  6. Coordinated international action to accelerate genome-to-phenome with FAANG, The Functional Annotation of Animal Genomes project

    Technology Transfer Automated Retrieval System (TEKTRAN)

    We describe the organization of a nascent international effort - the "Functional Annotation of ANimal Genomes" project - whose aim is to produce comprehensive maps of functional elements in the genomes of domesticated animal species....

  7. Tiling Assembly: a new tool for reference annotation-independent transcript assembly and novel gene identification by RNA-sequencing.

    PubMed

    Watanabe, Kenneth A; Homayouni, Arielle; Tufano, Tara; Lopez, Jennifer; Ringler, Patricia; Rushton, Paul; Shen, Qingxi J

    2015-10-01

    Annotation of the rice (Oryza sativa) genome has evolved significantly since release of its draft sequence, but it is far from complete. Several published transcript assembly programmes were tested on RNA-sequencing (RNA-seq) data to determine their effectiveness in identifying novel genes to improve the rice genome annotation. Cufflinks, a popular assembly software, did not identify all transcripts suggested by the RNA-seq data. Other assembly software was CPU intensive, lacked documentation, or lacked software updates. To overcome these shortcomings, a heuristic ab initio transcript assembly algorithm, Tiling Assembly, was developed to identify genes based on short read and junction alignment. Tiling Assembly was compared with Cufflinks to evaluate its gene-finding capabilities. Additionally, a pipeline was developed to eliminate false-positive gene identification due to noise or repetitive regions in the genome. By combining Tiling Assembly and Cufflinks, 767 unannotated genes were identified in the rice genome, demonstrating that combining both programmes proved highly efficient for novel gene identification. We also demonstrated that Tiling Assembly can accurately determine transcription start sites by comparing the Tiling Assembly genes with their corresponding full-length cDNA. We applied our pipeline to additional organisms and identified numerous unannotated genes, demonstrating that Tiling Assembly is an organism-independent tool for genome annotation. PMID:26341416

  8. Tiling Assembly: a new tool for reference annotation-independent transcript assembly and novel gene identification by RNA-sequencing

    PubMed Central

    Watanabe, Kenneth A.; Homayouni, Arielle; Tufano, Tara; Lopez, Jennifer; Ringler, Patricia; Rushton, Paul; Shen, Qingxi J.

    2015-01-01

    Annotation of the rice (Oryza sativa) genome has evolved significantly since release of its draft sequence, but it is far from complete. Several published transcript assembly programmes were tested on RNA-sequencing (RNA-seq) data to determine their effectiveness in identifying novel genes to improve the rice genome annotation. Cufflinks, a popular assembly software, did not identify all transcripts suggested by the RNA-seq data. Other assembly software was CPU intensive, lacked documentation, or lacked software updates. To overcome these shortcomings, a heuristic ab initio transcript assembly algorithm, Tiling Assembly, was developed to identify genes based on short read and junction alignment. Tiling Assembly was compared with Cufflinks to evaluate its gene-finding capabilities. Additionally, a pipeline was developed to eliminate false-positive gene identification due to noise or repetitive regions in the genome. By combining Tiling Assembly and Cufflinks, 767 unannotated genes were identified in the rice genome, demonstrating that combining both programmes proved highly efficient for novel gene identification. We also demonstrated that Tiling Assembly can accurately determine transcription start sites by comparing the Tiling Assembly genes with their corresponding full-length cDNA. We applied our pipeline to additional organisms and identified numerous unannotated genes, demonstrating that Tiling Assembly is an organism-independent tool for genome annotation. PMID:26341416

  9. Connectionist Approaches for Predicting Mouse Gene Function from Gene Expression

    E-print Network

    Bonner, Anthony

    Therapy. Identifying gene function based on gene expression data is much easier in prokaryotes than ways, especially in Gene Therapy [5]. Identifying gene function in prokaryotes is much easier thanConnectionist Approaches for Predicting Mouse Gene Function from Gene Expression Emad Andrews

  10. SNPit: a federated data integration system for the purpose of functional SNP annotation.

    PubMed

    Shen, Terry H; Carlson, Christopher S; Tarczy-Hornoch, Peter

    2009-08-01

    Genome wide association studies can potentially identify the genetic causes behind the majority of human diseases. With the advent of more advanced genotyping techniques, there is now an explosion of data gathered on single nucleotide polymorphisms (SNPs). The need exists for an integrated system that can provide up-to-date functional annotation information on SNPs. We have developed the SNP Integration Tool (SNPit) system to address this need. Built upon a federated data integration system, SNPit provides current information on a comprehensive list of SNP data sources. Additional logical inference analysis was included through an inference engine plug in. The SNPit web servlet is available online for use. SNPit allows users to go to one source for up-to-date information on the functional annotation of SNPs. A tool that can help to integrate and analyze the potential functional significance of SNPs is important for understanding the results from genome wide association studies. PMID:19327864

  11. Gene prediction and annotation in Penstemon (Plantaginaceae): A workflow for marker development from extremely low-coverage genome sequencing1

    PubMed Central

    Blischak, Paul D.; Wenzel, Aaron J.; Wolfe, Andrea D.

    2014-01-01

    • Premise of the study: Penstemon (Plantaginaceae) is a large and diverse genus endemic to North America. However, determining the phylogenetic relationships among its 280 species has been difficult due to its recent evolutionary radiation. The development of a large, multilocus data set can help to resolve this challenge. • Methods: Using both previously sequenced genomic libraries and our own low-coverage whole-genome shotgun sequencing libraries, we used the MAKER2 Annotation Pipeline to identify gene regions for the development of sequencing loci from six extremely low-coverage Penstemon genomes (?0.005×–0.007×). We also compared this approach to BLAST searches, and conducted analyses to characterize sequence divergence across the species sequenced. • Results: Annotations and gene predictions were successfully added to more than 10,000 contigs for potential use in downstream primer design. Primers were then designed for chloroplast, mitochondrial, and nuclear loci from these annotated sequences. MAKER2 identified longer gene regions in all six Penstemon genomes when compared with BLASTN and BLASTX searches. The average level of sequence divergence among the six species was 7.14%. • Discussion: Combining bioinformatics tools into a workflow that produces annotations can be useful for creating potential phylogenetic markers from thousands of sequences even when genome coverage is extremely low and reference data are only available from distant relatives. Furthermore, the output from MAKER2 contains information about important gene features, such as exon boundaries, and can be easily integrated with visualization tools to facilitate the process of marker development. PMID:25506519

  12. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes

    DOE PAGESBeta

    Brettin, Thomas; Davis, James J.; Disz, Terry; Edwards, Robert A.; Gerdes, Svetlana; Olsen, Gary J.; Olson, Robert; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D.; et al

    2015-02-10

    The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offersmore »a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception.« less

  13. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes

    SciTech Connect

    Brettin, Thomas; Davis, James J.; Disz, Terry; Edwards, Robert A.; Gerdes, Svetlana; Olsen, Gary J.; Olson, Robert; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D.; Shukla, Maulik; Thomason, III, James A.; Stevens, Rick; Vonstein, Veronika; Wattam, Alice R.; Xia, Fangfang

    2015-02-10

    The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offers a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception.

  14. RASTtk: A modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes

    PubMed Central

    Brettin, Thomas; Davis, James J.; Disz, Terry; Edwards, Robert A.; Gerdes, Svetlana; Olsen, Gary J.; Olson, Robert; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D.; Shukla, Maulik; Thomason, James A.; Stevens, Rick; Vonstein, Veronika; Wattam, Alice R.; Xia, Fangfang

    2015-01-01

    The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offers a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception. PMID:25666585

  15. Ribosome Profiling Reveals Pervasive Translation Outside of Annotated Protein-Coding Genes

    PubMed Central

    Ingolia, Nicholas T.; Brar, Gloria A.; Stern-Ginossar, Noam; Harris, Michael S.; Talhouarne, Gaëlle J. S.; Jackson, Sarah E.; Wills, Mark R.; Weissman, Jonathan S.

    2014-01-01

    SUMMARY Ribosome profiling suggests that ribosomes occupy many regions of the transcriptome thought to be non-coding, including 5? UTRs and lncRNAs. Apparent ribosome footprints outside of protein-coding regions raise the possibility of artifacts unrelated to translation, particularly when they occupy multiple, overlapping open reading frames (ORFs). Here we show hallmarks of translation in these footprints: co-purification with the large ribosomal subunit, response to drugs targeting elongation, trinucleotide periodicity, and initiation at early AUGs. We develop a metric for distinguishing between 80S footprints and nonribosomal sources using footprint size distributions, which validates the vast majority of footprints outside of coding regions. We present evidence for polypeptide production beyond annotated genes, including induction of immune responses following human cytomegalovirus (HCMV) infection. Translation is pervasive on cytosolic transcripts outside of conserved reading frames, and direct detection of this expanded universe of translated products enables efforts to understand how cells manage and exploit its consequences. PMID:25159147

  16. Comparison of lists of genes based on functional profiles

    PubMed Central

    2011-01-01

    Background How to compare studies on the basis of their biological significance is a problem of central importance in high-throughput genomics. Many methods for performing such comparisons are based on the information in databases of functional annotation, such as those that form the Gene Ontology (GO). Typically, they consist of analyzing gene annotation frequencies in some pre-specified GO classes, in a class-by-class way, followed by p-value adjustment for multiple testing. Enrichment analysis, where a list of genes is compared against a wider universe of genes, is the most common example. Results A new global testing procedure and a method incorporating it are presented. Instead of testing separately for each GO class, a single global test for all classes under consideration is performed. The test is based on the distance between the functional profiles, defined as the joint frequencies of annotation in a given set of GO classes. These classes may be chosen at one or more GO levels. The new global test is more powerful and accurate with respect to type I errors than the usual class-by-class approach. When applied to some real datasets, the results suggest that the method may also provide useful information that complements the tests performed using a class-by-class approach if gene counts are sparse in some classes. An R library, goProfiles, implements these methods and is available from Bioconductor, http://bioconductor.org/packages/release/bioc/html/goProfiles.html. Conclusions The method provides an inferential basis for deciding whether two lists are functionally different. For global comparisons it is preferable to the global chi-square test of homogeneity. Furthermore, it may provide additional information if used in conjunction with class-by-class methods. PMID:21999355

  17. Gene Ontology and the annotation of pathogen genomes: the case of Candida albicans.

    PubMed

    Arnaud, Martha B; Costanzo, Maria C; Shah, Prachi; Skrzypek, Marek S; Sherlock, Gavin

    2009-07-01

    The Gene Ontology (GO) is a structured controlled vocabulary developed to describe the roles and locations of gene products in a consistent manner and in a way that can be shared across organisms. The unicellular fungus Candida albicans is similar in many ways to the model organism Saccharomyces cerevisiae but, as both a commensal and a pathogen of humans, differs greatly in its lifestyle. With an expanding at-risk population of immunosuppressed patients, increased use of invasive medical procedures, the increasing prevalence of drug resistance and the emergence of additional Candida species as serious pathogens, it has never been more crucial to improve our understanding of Candida biology to guide the development of better treatments. In this brief review, we examine the importance of GO in the annotation of C. albicans gene products, with a focus on those involved in pathogenesis. We also discuss how sequence information combined with GO facilitates the transfer of knowledge across related species and the challenges and opportunities that such an approach presents. PMID:19577928

  18. PHYLOGENOMICS - GUIDED VALIDATION OF FUNCTION FOR CONSERVED UNKNOWN GENES

    SciTech Connect

    V, DE CRECY-LAGARD; D, HANSON A

    2012-01-03

    Identifying functions for all gene products in all sequenced organisms is a central challenge of the post-genomic era. However, at least 30-50% of the proteins encoded by any given genome are of unknown function, or wrongly or vaguely annotated. Many of these 'unknown' proteins are common to prokaryotes and plants. We accordingly set out to predict and experimentally test the functions of such proteins. Our approach to functional prediction is integrative, coupling the extensive post-genomic resources available for plants with comparative genomics based on hundreds of microbial genomes, and functional genomic datasets from model microorganisms. The early phase is computer-assisted; later phases incorporate intellectual input from expert plant and microbial biochemists. The approach thus bridges the gap between automated homology-based annotations and the classical gene discovery efforts of experimentalists, and is much more powerful than purely computational approaches to identifying gene-function associations. Among Arabidopsis genes, we focused on those (2,325 in total) that (i) are unique or belong to families with no more than three members, (ii) are conserved between plants and prokaryotes, and (iii) have unknown or poorly known functions. Computer-assisted selection of promising targets for deeper analysis was based on homology .. independent characteristics associated in the SEED database with the prokaryotic members of each family, specifically gene clustering and phyletic spread, as well as availability of functional genomics data, and publications that could link candidate families to general metabolic areas, or to specific functions. In-depth comparative genomic analysis was then performed for about 500 top candidate families, which connected ~55 of them to general areas of metabolism and led to specific functional predictions for a subset of ~25 more. Twenty predicted functions were experimentally tested in at least one prokaryotic organism via reverse genetics, metabolic profiling, functional complementation, and recombinant protein biochemistry. Our approach predicted and validated functions for 10 formerly uncharacterized protein families common to plants and prokaryotes; none of these functions had previously been correctly predicted by computational methods. The functions of five more are currently being validated. Experimental testing of diverse representatives of these families combined with in silica analysis allowed accurate projection of the annotations to hundreds more sequenced genomes.

  19. Automated Update, Revision, and Quality Control of the Maize Genome Annotations Using MAKER-P Improves the B73 RefGen_v3 Gene Models and Identifies New Genes1[OPEN

    PubMed Central

    Law, MeiYee; Childs, Kevin L.; Campbell, Michael S.; Stein, Joshua C.; Olson, Andrew J.; Holt, Carson; Panchy, Nicholas; Lei, Jikai; Jiao, Dian; Andorf, Carson M.; Lawrence, Carolyn J.; Ware, Doreen; Shiu, Shin-Han; Sun, Yanni; Jiang, Ning; Yandell, Mark

    2015-01-01

    The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-P to update and revise the maize (Zea mays) B73 RefGen_v3 annotation build (5b+) in less than 3 h using the iPlant Cyberinfrastructure. MAKER-P identified and annotated 4,466 additional, well-supported protein-coding genes not present in the 5b+ annotation build, added additional untranslated regions to 1,393 5b+ gene models, identified 2,647 5b+ gene models that lack any supporting evidence (despite the use of large and diverse evidence data sets), identified 104,215 pseudogene fragments, and created an additional 2,522 noncoding gene annotations. We also describe a method for de novo training of MAKER-P for the annotation of newly sequenced grass genomes. Collectively, these results lead to the 6a maize genome annotation and demonstrate the utility of MAKER-P for rapid annotation, management, and quality control of grasses and other difficult-to-annotate plant genomes. PMID:25384563

  20. Automated update, revision, and quality control of the maize genome annotations using MAKER-P improves the B73 RefGen_v3 gene models and identifies new genes.

    PubMed

    Law, MeiYee; Childs, Kevin L; Campbell, Michael S; Stein, Joshua C; Olson, Andrew J; Holt, Carson; Panchy, Nicholas; Lei, Jikai; Jiao, Dian; Andorf, Carson M; Lawrence, Carolyn J; Ware, Doreen; Shiu, Shin-Han; Sun, Yanni; Jiang, Ning; Yandell, Mark

    2015-01-01

    The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-P to update and revise the maize (Zea mays) B73 RefGen_v3 annotation build (5b+) in less than 3 h using the iPlant Cyberinfrastructure. MAKER-P identified and annotated 4,466 additional, well-supported protein-coding genes not present in the 5b+ annotation build, added additional untranslated regions to 1,393 5b+ gene models, identified 2,647 5b+ gene models that lack any supporting evidence (despite the use of large and diverse evidence data sets), identified 104,215 pseudogene fragments, and created an additional 2,522 noncoding gene annotations. We also describe a method for de novo training of MAKER-P for the annotation of newly sequenced grass genomes. Collectively, these results lead to the 6a maize genome annotation and demonstrate the utility of MAKER-P for rapid annotation, management, and quality control of grasses and other difficult-to-annotate plant genomes. PMID:25384563

  1. Integration of multiethnic fine-mapping and genomic annotation to prioritize candidate functional SNPs at prostate cancer susceptibility regions.

    PubMed

    Han, Ying; Hazelett, Dennis J; Wiklund, Fredrik; Schumacher, Fredrick R; Stram, Daniel O; Berndt, Sonja I; Wang, Zhaoming; Rand, Kristin A; Hoover, Robert N; Machiela, Mitchell J; Yeager, Merideth; Burdette, Laurie; Chung, Charles C; Hutchinson, Amy; Yu, Kai; Xu, Jianfeng; Travis, Ruth C; Key, Timothy J; Siddiq, Afshan; Canzian, Federico; Takahashi, Atsushi; Kubo, Michiaki; Stanford, Janet L; Kolb, Suzanne; Gapstur, Susan M; Diver, W Ryan; Stevens, Victoria L; Strom, Sara S; Pettaway, Curtis A; Al Olama, Ali Amin; Kote-Jarai, Zsofia; Eeles, Rosalind A; Yeboah, Edward D; Tettey, Yao; Biritwum, Richard B; Adjei, Andrew A; Tay, Evelyn; Truelove, Ann; Niwa, Shelley; Chokkalingam, Anand P; Isaacs, William B; Chen, Constance; Lindstrom, Sara; Le Marchand, Loic; Giovannucci, Edward L; Pomerantz, Mark; Long, Henry; Li, Fugen; Ma, Jing; Stampfer, Meir; John, Esther M; Ingles, Sue A; Kittles, Rick A; Murphy, Adam B; Blot, William J; Signorello, Lisa B; Zheng, Wei; Albanes, Demetrius; Virtamo, Jarmo; Weinstein, Stephanie; Nemesure, Barbara; Carpten, John; Leske, M Cristina; Wu, Suh-Yuh; Hennis, Anselm J M; Rybicki, Benjamin A; Neslund-Dudas, Christine; Hsing, Ann W; Chu, Lisa; Goodman, Phyllis J; Klein, Eric A; Zheng, S Lilly; Witte, John S; Casey, Graham; Riboli, Elio; Li, Qiyuan; Freedman, Matthew L; Hunter, David J; Gronberg, Henrik; Cook, Michael B; Nakagawa, Hidewaki; Kraft, Peter; Chanock, Stephen J; Easton, Douglas F; Henderson, Brian E; Coetzee, Gerhard A; Conti, David V; Haiman, Christopher A

    2015-10-01

    Interpretation of biological mechanisms underlying genetic risk associations for prostate cancer is complicated by the relatively large number of risk variants (n = 100) and the thousands of surrogate SNPs in linkage disequilibrium. Here, we combined three distinct approaches: multiethnic fine-mapping, putative functional annotation (based upon epigenetic data and genome-encoded features), and expression quantitative trait loci (eQTL) analyses, in an attempt to reduce this complexity. We examined 67 risk regions using genotyping and imputation-based fine-mapping in populations of European (cases/controls: 8600/6946), African (cases/controls: 5327/5136), Japanese (cases/controls: 2563/4391) and Latino (cases/controls: 1034/1046) ancestry. Markers at 55 regions passed a region-specific significance threshold (P-value cutoff range: 3.9 × 10(-4)-5.6 × 10(-3)) and in 30 regions we identified markers that were more significantly associated with risk than the previously reported variants in the multiethnic sample. Novel secondary signals (P < 5.0 × 10(-6)) were also detected in two regions (rs13062436/3q21 and rs17181170/3p12). Among 666 variants in the 55 regions with P-values within one order of magnitude of the most-associated marker, 193 variants (29%) in 48 regions overlapped with epigenetic or other putative functional marks. In 11 of the 55 regions, cis-eQTLs were detected with nearby genes. For 12 of the 55 regions (22%), the most significant region-specific, prostate-cancer associated variant represented the strongest candidate functional variant based on our annotations; the number of regions increased to 20 (36%) and 27 (49%) when examining the 2 and 3 most significantly associated variants in each region, respectively. These results have prioritized subsets of candidate variants for downstream functional evaluation. PMID:26162851

  2. The Gene Ontology Annotation (GOA) Project: Implementation of GO in SWISS-PROT, TrEMBL, and InterPro

    PubMed Central

    Camon, Evelyn; Magrane, Michele; Barrell, Daniel; Binns, David; Fleischmann, Wolfgang; Kersey, Paul; Mulder, Nicola; Oinn, Tom; Maslen, John; Cox, Anthony; Apweiler, Rolf

    2003-01-01

    Gene Ontology Annotation (GOA) is a project run by the European Bioinformatics Institute (EBI) that aims to provide assignments of terms from the Gene Ontology (GO) resource to gene products in a number of its databases (http://www.ebi.ac.uk/GOA). In the first stage of this project, GO assignments have been applied to a data set representing the complete human proteome by a combination of electronic mappings and manual curation. This vocabulary has also been applied to the nonredundant proteome sets for all other completely sequenced organisms as well as to proteins from a wide range of organisms where the proteome is not yet complete. PMID:12654719

  3. The Zebrafish GenomeWiki: a crowdsourcing approach to connect the long tail for zebrafish gene annotation.

    PubMed

    Singh, Meghna; Bhartiya, Deeksha; Maini, Jayant; Sharma, Meenakshi; Singh, Angom Ramcharan; Kadarkaraisamy, Subburaj; Rana, Rajiv; Sabharwal, Ankit; Nanda, Srishti; Ramachandran, Aravindhakshan; Mittal, Ashish; Kapoor, Shruti; Sehgal, Paras; Asad, Zainab; Kaushik, Kriti; Vellarikkal, Shamsudheen Karuthedath; Jagga, Divya; Muthuswami, Muthulakshmi; Chauhan, Rajendra K; Leonard, Elvin; Priyadarshini, Ruby; Halimani, Mahantappa; Malhotra, Sunny; Patowary, Ashok; Vishwakarma, Harinder; Joshi, Prateek; Bhardwaj, Vivek; Bhaumik, Arijit; Bhatt, Bharat; Jha, Aamod; Kumar, Aalok; Budakoti, Prerna; Lalwani, Mukesh Kumar; Meli, Rajeshwari; Jalali, Saakshi; Joshi, Kandarp; Pal, Koustav; Dhiman, Heena; Laddha, Saurabh V; Jadhav, Vaibhav; Singh, Naresh; Pandey, Vikas; Sachidanandan, Chetana; Ekker, Stephen C; Klee, Eric W; Scaria, Vinod; Sivasubbu, Sridhar

    2014-01-01

    A large repertoire of gene-centric data has been generated in the field of zebrafish biology. Although the bulk of these data are available in the public domain, most of them are not readily accessible or available in nonstandard formats. One major challenge is to unify and integrate these widely scattered data sources. We tested the hypothesis that active community participation could be a viable option to address this challenge. We present here our approach to create standards for assimilation and sharing of information and a system of open standards for database intercommunication. We have attempted to address this challenge by creating a community-centric solution for zebrafish gene annotation. The Zebrafish GenomeWiki is a 'wiki'-based resource, which aims to provide an altruistic shared environment for collective annotation of the zebrafish genes. The Zebrafish GenomeWiki has features that enable users to comment, annotate, edit and rate this gene-centric information. The credits for contributions can be tracked through a transparent microattribution system. In contrast to other wikis, the Zebrafish GenomeWiki is a 'structured wiki' or rather a 'semantic wiki'. The Zebrafish GenomeWiki implements a semantically linked data structure, which in the future would be amenable to semantic search. Database URL: http://genome.igib.res.in/twiki. PMID:24578356

  4. The Zebrafish GenomeWiki: a crowdsourcing approach to connect the long tail for zebrafish gene annotation

    PubMed Central

    Singh, Meghna; Bhartiya, Deeksha; Maini, Jayant; Sharma, Meenakshi; Singh, Angom Ramcharan; Kadarkaraisamy, Subburaj; Rana, Rajiv; Sabharwal, Ankit; Nanda, Srishti; Ramachandran, Aravindhakshan; Mittal, Ashish; Kapoor, Shruti; Sehgal, Paras; Asad, Zainab; Kaushik, Kriti; Vellarikkal, Shamsudheen Karuthedath; Jagga, Divya; Muthuswami, Muthulakshmi; Chauhan, Rajendra K.; Leonard, Elvin; Priyadarshini, Ruby; Halimani, Mahantappa; Malhotra, Sunny; Patowary, Ashok; Vishwakarma, Harinder; Joshi, Prateek; Bhardwaj, Vivek; Bhaumik, Arijit; Bhatt, Bharat; Jha, Aamod; Kumar, Aalok; Budakoti, Prerna; Lalwani, Mukesh Kumar; Meli, Rajeshwari; Jalali, Saakshi; Joshi, Kandarp; Pal, Koustav; Dhiman, Heena; Laddha, Saurabh V.; Jadhav, Vaibhav; Singh, Naresh; Pandey, Vikas; Sachidanandan, Chetana; Ekker, Stephen C.; Klee, Eric W.; Scaria, Vinod; Sivasubbu, Sridhar

    2014-01-01

    A large repertoire of gene-centric data has been generated in the field of zebrafish biology. Although the bulk of these data are available in the public domain, most of them are not readily accessible or available in nonstandard formats. One major challenge is to unify and integrate these widely scattered data sources. We tested the hypothesis that active community participation could be a viable option to address this challenge. We present here our approach to create standards for assimilation and sharing of information and a system of open standards for database intercommunication. We have attempted to address this challenge by creating a community-centric solution for zebrafish gene annotation. The Zebrafish GenomeWiki is a ‘wiki’-based resource, which aims to provide an altruistic shared environment for collective annotation of the zebrafish genes. The Zebrafish GenomeWiki has features that enable users to comment, annotate, edit and rate this gene-centric information. The credits for contributions can be tracked through a transparent microattribution system. In contrast to other wikis, the Zebrafish GenomeWiki is a ‘structured wiki’ or rather a ‘semantic wiki’. The Zebrafish GenomeWiki implements a semantically linked data structure, which in the future would be amenable to semantic search. Database URL: http://genome.igib.res.in/twiki PMID:24578356

  5. De novo cloning and annotation of genes associated with immunity, detoxification and energy metabolism from the fat body of the oriental fruit fly, Bactrocera dorsalis.

    PubMed

    Yang, Wen-Jia; Yuan, Guo-Rui; Cong, Lin; Xie, Yi-Fei; Wang, Jin-Jun

    2014-01-01

    The oriental fruit fly, Bactrocera dorsalis, is a destructive pest in tropical and subtropical areas. In this study, we performed transcriptome-wide analysis of the fat body of B. dorsalis and obtained more than 59 million sequencing reads, which were assembled into 27,787 unigenes with an average length of 591 bp. Among them, 17,442 (62.8%) unigenes matched known proteins in the NCBI database. The assembled sequences were further annotated with gene ontology, cluster of orthologous group terms, and Kyoto encyclopedia of genes and genomes. In depth analysis was performed to identify genes putatively involved in immunity, detoxification, and energy metabolism. Many new genes were identified including serpins, peptidoglycan recognition proteins and defensins, which were potentially linked to immune defense. Many detoxification genes were identified, including cytochrome P450s, glutathione S-transferases and ATP-binding cassette (ABC) transporters. Many new transcripts possibly involved in energy metabolism, including fatty acid desaturases, lipases, alpha amylases, and trehalose-6-phosphate synthases, were identified. Moreover, we randomly selected some genes to examine their expression patterns in different tissues by quantitative real-time PCR, which indicated that some genes exhibited fat body-specific expression in B. dorsalis. The identification of a numerous transcripts in the fat body of B. dorsalis laid the foundation for future studies on the functions of these genes. PMID:24710118

  6. De novo Cloning and Annotation of Genes Associated with Immunity, Detoxification and Energy Metabolism from the Fat Body of the Oriental Fruit Fly, Bactrocera dorsalis

    PubMed Central

    Yang, Wen-Jia; Yuan, Guo-Rui; Cong, Lin; Xie, Yi-Fei; Wang, Jin-Jun

    2014-01-01

    The oriental fruit fly, Bactrocera dorsalis, is a destructive pest in tropical and subtropical areas. In this study, we performed transcriptome-wide analysis of the fat body of B. dorsalis and obtained more than 59 million sequencing reads, which were assembled into 27,787 unigenes with an average length of 591 bp. Among them, 17,442 (62.8%) unigenes matched known proteins in the NCBI database. The assembled sequences were further annotated with gene ontology, cluster of orthologous group terms, and Kyoto encyclopedia of genes and genomes. In depth analysis was performed to identify genes putatively involved in immunity, detoxification, and energy metabolism. Many new genes were identified including serpins, peptidoglycan recognition proteins and defensins, which were potentially linked to immune defense. Many detoxification genes were identified, including cytochrome P450s, glutathione S-transferases and ATP-binding cassette (ABC) transporters. Many new transcripts possibly involved in energy metabolism, including fatty acid desaturases, lipases, alpha amylases, and trehalose-6-phosphate synthases, were identified. Moreover, we randomly selected some genes to examine their expression patterns in different tissues by quantitative real-time PCR, which indicated that some genes exhibited fat body-specific expression in B. dorsalis. The identification of a numerous transcripts in the fat body of B. dorsalis laid the foundation for future studies on the functions of these genes. PMID:24710118

  7. Computing Gene Functional Similarity Using Combined Graphs

    E-print Network

    Al-Mubaid, Hisham

    way to explore the functional relationships between genes. We conducted the evaluation on a dataset gene functions and identifying gene disease associations are also based on GO. Gene ontologyComputing Gene Functional Similarity Using Combined Graphs Anurag Nagar Hisham Al-Mubaid Said

  8. Functionally Enigmatic Genes: A Case Study of the Brain Ignorome

    PubMed Central

    Pandey, Ashutosh K.; Lu, Lu; Wang, Xusheng; Homayouni, Ramin; Williams, Robert W.

    2014-01-01

    What proportion of genes with intense and selective expression in specific tissues, cells, or systems are still almost completely uncharacterized with respect to biological function? In what ways do these functionally enigmatic genes differ from well-studied genes? To address these two questions, we devised a computational approach that defines so-called ignoromes. As proof of principle, we extracted and analyzed a large subset of genes with intense and selective expression in brain. We find that publications associated with this set are highly skewed—the top 5% of genes absorb 70% of the relevant literature. In contrast, approximately 20% of genes have essentially no neuroscience literature. Analysis of the ignorome over the past decade demonstrates that it is stubbornly persistent, and the rapid expansion of the neuroscience literature has not had the expected effect on numbers of these genes. Surprisingly, ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, early discovery being associated with greater research momentum—a genomic bandwagon effect. Finally we ask to what extent massive genomic, imaging, and phenotype data sets can be used to provide high-throughput functional annotation for an entire ignorome. In a majority of cases we have been able to extract and add significant information for these neglected genes. In several cases—ELMOD1, TMEM88B, and DZANK1—we have exploited sequence polymorphisms, large phenome data sets, and reverse genetic methods to evaluate the function of ignorome genes. PMID:24523945

  9. Protein function annotation with Structurally Aligned Local Sites of Activity (SALSAs)

    PubMed Central

    2013-01-01

    Background The prediction of biochemical function from the 3D structure of a protein has proved to be much more difficult than was originally foreseen. A reliable method to test the likelihood of putative annotations and to predict function from structure would add tremendous value to structural genomics data. We report on a new method, Structurally Aligned Local Sites of Activity (SALSA), for the prediction of biochemical function based on a local structural match at the predicted catalytic or binding site. Results Implementation of the SALSA method is described. For the structural genomics protein PY01515 (PDB ID 2aqw) from Plasmodium yoelii, it is shown that the putative annotation, Orotidine 5'-monophosphate decarboxylase (OMPDC), is most likely correct. SALSA analysis of YP_001304206.1 (PDB ID 3h3l), a putative sugar hydrolase from Parabacteroides distasonis, shows that its active site does not bear close resemblance to any previously characterized member of its superfamily, the Concanavalin A-like lectins/glucanases. It is noted that three residues in the active site of the thermophilic beta-1,4-xylanase from Nonomuraea flexuosa (PDB ID 1m4w), Y78, E87, and E176, overlap with POOL-predicted residues of similar type, Y168, D153, and E232, in YP_001304206.1. The substrate recognition regions of the two proteins are rather different, suggesting that YP_001304206.1 is a new functional type within the superfamily. A structural genomics protein from Mycobacterium avium (PDB ID 3q1t) has been reported to be an enoyl-CoA hydratase (ECH), but SALSA analysis shows a poor match between the predicted residues for the SG protein and those of known ECHs. A better local structural match is obtained with Anabaena beta-diketone hydrolase (ABDH), a known ?-diketone hydrolase from Cyanobacterium anabaena (PDB ID 2j5s). This suggests that the reported ECH function of the SG protein is incorrect and that it is more likely a ?-diketone hydrolase. Conclusions A local site match provides a more compelling function prediction than that obtainable from a simple 3D structure match. The present method can confirm putative annotations, identify misannotation, and in some cases suggest a more probable annotation. PMID:23514271

  10. Annotation extension through protein family annotation coherence metrics.

    PubMed

    Bastos, Hugo P; Clarke, Luka A; Couto, Francisco M

    2013-01-01

    Protein functional annotation consists in associating proteins with textual descriptors elucidating their biological roles. The bulk of annotation is done via automated procedures that ultimately rely on annotation transfer. Despite a large number of existing protein annotation procedures the ever growing protein space is never completely annotated. One of the facets of annotation incompleteness derives from annotation uncertainty. Often when protein function cannot be predicted with enough specificity it is instead conservatively annotated with more generic terms. In a scenario of protein families or functionally related (or even dissimilar) sets this leads to a more difficult task of using annotations to compare the extent of functional relatedness among all family or set members. However, we postulate that identifying sub-sets of functionally coherent proteins annotated at a very specific level, can help the annotation extension of other incompletely annotated proteins within the same family or functionally related set. As an example we analyse the status of annotation of a set of CAZy families belonging to the Polysaccharide Lyase class. We show that through the use of visualization methods and semantic similarity based metrics it is possible to identify families and respective annotation terms within them that are suitable for possible annotation extension. Based on our analysis we then propose a semi-automatic methodology leading to the extension of single annotation terms within these partially annotated protein sets or families. PMID:24130572

  11. Parallel-META 2.0: Enhanced Metagenomic Data Analysis with Functional Annotation, High Performance Computing and Advanced Visualization

    PubMed Central

    Song, Baoxing; Xu, Jian; Ning, Kang

    2014-01-01

    The metagenomic method directly sequences and analyses genome information from microbial communities. The main computational tasks for metagenomic analyses include taxonomical and functional structure analysis for all genomes in a microbial community (also referred to as a metagenomic sample). With the advancement of Next Generation Sequencing (NGS) techniques, the number of metagenomic samples and the data size for each sample are increasing rapidly. Current metagenomic analysis is both data- and computation- intensive, especially when there are many species in a metagenomic sample, and each has a large number of sequences. As such, metagenomic analyses require extensive computational power. The increasing analytical requirements further augment the challenges for computation analysis. In this work, we have proposed Parallel-META 2.0, a metagenomic analysis software package, to cope with such needs for efficient and fast analyses of taxonomical and functional structures for microbial communities. Parallel-META 2.0 is an extended and improved version of Parallel-META 1.0, which enhances the taxonomical analysis using multiple databases, improves computation efficiency by optimized parallel computing, and supports interactive visualization of results in multiple views. Furthermore, it enables functional analysis for metagenomic samples including short-reads assembly, gene prediction and functional annotation. Therefore, it could provide accurate taxonomical and functional analyses of the metagenomic samples in high-throughput manner and on large scale. PMID:24595159

  12. De Novo Assembly and Annotation of the Transcriptome of the Agricultural Weed Ipomoea purpurea Uncovers Gene Expression Changes Associated with Herbicide Resistance

    PubMed Central

    Leslie, Trent; Baucom, Regina S.

    2014-01-01

    Human-mediated selection can lead to rapid evolution in very short time scales, and the evolution of herbicide resistance in agricultural weeds is an excellent example of this phenomenon. The common morning glory, Ipomoea purpurea, is resistant to the herbicide glyphosate, but genetic investigations of this trait have been hampered by the lack of genomic resources for this species. Here, we present the annotated transcriptome of the common morning glory, Ipomoea purpurea, along with an examination of whole genome expression profiling to assess potential gene expression differences between three artificially selected herbicide resistant lines and three susceptible lines. The assembled Ipomoea transcriptome reported in this work contains 65,459 assembled transcripts, ~28,000 of which were functionally annotated by assignment to Gene Ontology categories. Our RNA-seq survey using this reference transcriptome identified 19 differentially expressed genes associated with resistance—one of which, a cytochrome P450, belongs to a large plant family of genes involved in xenobiotic detoxification. The differentially expressed genes also broadly implicated receptor-like kinases, which were down-regulated in the resistant lines, and other growth and defense genes, which were up-regulated in resistant lines. Interestingly, the target of glyphosate—EPSP synthase—was not overexpressed in the resistant Ipomoea lines as in other glyphosate resistant weeds. Overall, this work identifies potential candidate resistance loci for future investigations and dramatically increases genomic resources for this species. The assembled transcriptome presented herein will also provide a valuable resource to the Ipomoea community, as well as to those interested in utilizing the close relationship between the Convolvulaceae and the Solanaceae for phylogenetic and comparative genomics examinations. PMID:25155274

  13. An Innovative Plant Genomics and Gene Annotation Program for High School, Community College, and University Faculty

    PubMed Central

    Hilgert, Uwe; Nash, E. Bruce; Micklos, David A.

    2008-01-01

    Today's biology educators face the challenge of training their students in modern molecular biology techniques including genomics and bioinformatics. The Dolan DNA Learning Center (DNALC) of Cold Spring Harbor Laboratory has developed and disseminated a bench- and computer-based plant genomics curriculum for biology faculty. In 2007, a five-day “Plant Genomics and Gene Annotation” workshop was held at Florida A&M University in Tallahassee, FL, to enhance participants' knowledge and understanding of plant molecular genetics and assist them in developing and honing their laboratory and computer skills. Florida A&M University is a historically black university with over 95% African-American student enrollment. Sixteen participants, including high school (56%) and community college faculty (25%), attended the workshop. Participants carried out in vitro and in silico experiments with maize, Arabidopsis, soybean, and food products to determine the genotype of the samples. Benefits of the workshop included increased awareness of plant biology research for high school and college level students. Participants completed pre- and postworkshop evaluations for the measurement of effectiveness. Participants demonstrated an overall improvement in their postworkshop evaluation scores. This article provides a detailed description of workshop activities, as well as assessment and long-term support for broad classroom implementation. PMID:18765753

  14. Functional Grouping of Genes Using Spectral Clustering and Gene Ontology

    E-print Network

    Zell, Andreas

    Functional Grouping of Genes Using Spectral Clustering and Gene Ontology Nora Speer, Holger amounts of biological data. During the analysis of such data the need for a functional grouping of genes arises. In this paper, we propose a new method based on spectral clustering for the partitioning of genes

  15. FungiFun: a web-based application for functional categorization of fungal genes and proteins.

    PubMed

    Priebe, Steffen; Linde, Jörg; Albrecht, Daniela; Guthke, Reinhard; Brakhage, Axel A

    2011-04-01

    FungiFun assigns functional annotations to fungal genes or proteins and performs gene set enrichment analysis. Based on three different classification methods (FunCat, GO and KEGG), FungiFun categorizes genes and proteins for several fungal species on different levels of annotation detail. It is web-based and accessible to users without any programming skills. FungiFun is the first tool offering gene set enrichment analysis including the FunCat categorization. Two biological datasets for Aspergillus fumigatus and Candida albicans were analyzed using FungiFun, providing an overview of the usage and functions of the tool. FungiFun is freely accessible at https://www.omnifung.hki-jena.de/FungiFun/. PMID:21073976

  16. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation

    PubMed Central

    O'Leary, Nuala A.; Wright, Mathew W.; Brister, J. Rodney; Ciufo, Stacy; Haddad, Diana; McVeigh, Rich; Rajput, Bhanu; Robbertse, Barbara; Smith-White, Brian; Ako-Adjei, Danso; Astashyn, Alexander; Badretdin, Azat; Bao, Yiming; Blinkova, Olga; Brover, Vyacheslav; Chetvernin, Vyacheslav; Choi, Jinna; Cox, Eric; Ermolaeva, Olga; Farrell, Catherine M.; Goldfarb, Tamara; Gupta, Tripti; Haft, Daniel; Hatcher, Eneida; Hlavina, Wratko; Joardar, Vinita S.; Kodali, Vamsi K.; Li, Wenjun; Maglott, Donna; Masterson, Patrick; McGarvey, Kelly M.; Murphy, Michael R.; O'Neill, Kathleen; Pujar, Shashikant; Rangwala, Sanjida H.; Rausch, Daniel; Riddick, Lillian D.; Schoch, Conrad; Shkeda, Andrei; Storz, Susan S.; Sun, Hanzhen; Thibaud-Nissen, Francoise; Tolstoy, Igor; Tully, Raymond E.; Vatsan, Anjana R.; Wallin, Craig; Webb, David; Wu, Wendy; Landrum, Melissa J.; Kimchi, Avi; Tatusova, Tatiana; DiCuccio, Michael; Kitts, Paul; Murphy, Terence D.; Pruitt, Kim D.

    2016-01-01

    The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55 000 organisms (>4800 viruses, >40 000 prokaryotes and >10 000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management. PMID:26553804

  17. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation.

    PubMed

    O'Leary, Nuala A; Wright, Mathew W; Brister, J Rodney; Ciufo, Stacy; Haddad, Diana; McVeigh, Rich; Rajput, Bhanu; Robbertse, Barbara; Smith-White, Brian; Ako-Adjei, Danso; Astashyn, Alexander; Badretdin, Azat; Bao, Yiming; Blinkova, Olga; Brover, Vyacheslav; Chetvernin, Vyacheslav; Choi, Jinna; Cox, Eric; Ermolaeva, Olga; Farrell, Catherine M; Goldfarb, Tamara; Gupta, Tripti; Haft, Daniel; Hatcher, Eneida; Hlavina, Wratko; Joardar, Vinita S; Kodali, Vamsi K; Li, Wenjun; Maglott, Donna; Masterson, Patrick; McGarvey, Kelly M; Murphy, Michael R; O'Neill, Kathleen; Pujar, Shashikant; Rangwala, Sanjida H; Rausch, Daniel; Riddick, Lillian D; Schoch, Conrad; Shkeda, Andrei; Storz, Susan S; Sun, Hanzhen; Thibaud-Nissen, Francoise; Tolstoy, Igor; Tully, Raymond E; Vatsan, Anjana R; Wallin, Craig; Webb, David; Wu, Wendy; Landrum, Melissa J; Kimchi, Avi; Tatusova, Tatiana; DiCuccio, Michael; Kitts, Paul; Murphy, Terence D; Pruitt, Kim D

    2016-01-01

    The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55 000 organisms (>4800 viruses, >40 000 prokaryotes and >10 000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management. PMID:26553804

  18. BASys: a web server for automated bacterial genome annotation.

    PubMed

    Van Domselaar, Gary H; Stothard, Paul; Shrivastava, Savita; Cruz, Joseph A; Guo, AnChi; Dong, Xiaoli; Lu, Paul; Szafron, Duane; Greiner, Russ; Wishart, David S

    2005-07-01

    BASys (Bacterial Annotation System) is a web server that supports automated, in-depth annotation of bacterial genomic (chromosomal and plasmid) sequences. It accepts raw DNA sequence data and an optional list of gene identification information and provides extensive textual annotation and hyperlinked image output. BASys uses >30 programs to determine approximately 60 annotation subfields for each gene, including gene/protein name, GO function, COG function, possible paralogues and orthologues, molecular weight, isoelectric point, operon structure, subcellular localization, signal peptides, transmembrane regions, secondary structure, 3D structure, reactions and pathways. The depth and detail of a BASys annotation matches or exceeds that found in a standard SwissProt entry. BASys also generates colorful, clickable and fully zoomable maps of each query chromosome to permit rapid navigation and detailed visual analysis of all resulting gene annotations. The textual annotations and images that are provided by BASys can be generated in approximately 24 h for an average bacterial chromosome (5 Mb). BASys annotations may be viewed and downloaded anonymously or through a password protected access system. The BASys server and databases can also be downloaded and run locally. BASys is accessible at http://wishart.biology.ualberta.ca/basys. PMID:15980511

  19. BASys: a web server for automated bacterial genome annotation

    PubMed Central

    Van Domselaar, Gary H.; Stothard, Paul; Shrivastava, Savita; Cruz, Joseph A.; Guo, AnChi; Dong, Xiaoli; Lu, Paul; Szafron, Duane; Greiner, Russ; Wishart, David S.

    2005-01-01

    BASys (Bacterial Annotation System) is a web server that supports automated, in-depth annotation of bacterial genomic (chromosomal and plasmid) sequences. It accepts raw DNA sequence data and an optional list of gene identification information and provides extensive textual annotation and hyperlinked image output. BASys uses >30 programs to determine ?60 annotation subfields for each gene, including gene/protein name, GO function, COG function, possible paralogues and orthologues, molecular weight, isoelectric point, operon structure, subcellular localization, signal peptides, transmembrane regions, secondary structure, 3D structure, reactions and pathways. The depth and detail of a BASys annotation matches or exceeds that found in a standard SwissProt entry. BASys also generates colorful, clickable and fully zoomable maps of each query chromosome to permit rapid navigation and detailed visual analysis of all resulting gene annotations. The textual annotations and images that are provided by BASys can be generated in ?24 h for an average bacterial chromosome (5 Mb). BASys annotations may be viewed and downloaded anonymously or through a password protected access system. The BASys server and databases can also be downloaded and run locally. BASys is accessible at . PMID:15980511

  20. Automated update, revision, and quality control of the maize genome annotations using MAKER-P improves the B73 RefGen_v3 gene models and identifies new genes

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-...

  1. Transcriptome Analysis of the Emerald Ash Borer (EAB), Agrilus planipennis: De Novo Assembly, Functional Annotation and Comparative Analysis

    PubMed Central

    Duan, Jun; Ladd, Tim; Doucet, Daniel; Cusson, Michel; vanFrankenhuyzen, Kees; Mittapalli, Omprakash; Krell, Peter J.; Quan, Guoxing

    2015-01-01

    Background The Emerald ash borer (EAB), Agrilus planipennis, is an invasive phloem-feeding insect pest of ash trees. Since its initial discovery near the Detroit, US- Windsor, Canada area in 2002, the spread of EAB has had strong negative economic, social and environmental impacts in both countries. Several transcriptomes from specific tissues including midgut, fat body and antenna have recently been generated. However, the relatively low sequence depth, gene coverage and completeness limited the usefulness of these EAB databases. Methodology and Principal Findings High-throughput deep RNA-Sequencing (RNA-Seq) was used to obtain 473.9 million pairs of 100 bp length paired-end reads from various life stages and tissues. These reads were assembled into 88,907 contigs using the Trinity strategy and integrated into 38,160 unigenes after redundant sequences were removed. We annotated 11,229 unigenes by searching against the public nr, Swiss-Prot and COG. The EAB transcriptome assembly was compared with 13 other sequenced insect species, resulting in the prediction of 536 unigenes that are Coleoptera-specific. Differential gene expression revealed that 290 unigenes are expressed during larval molting and 3,911 unigenes during metamorphosis from larvae to pupae, respectively (FDR< 0.01 and log2 FC>2). In addition, 1,167 differentially expressed unigenes were identified from larval and adult midguts, 435 unigenes were up-regulated in larval midgut and 732 unigenes were up-regulated in adult midgut. Most of the genes involved in RNA interference (RNAi) pathways were identified, which implies the existence of a system RNAi in EAB. Conclusions and Significance This study provides one of the most fundamental and comprehensive transcriptome resources available for EAB to date. Identification of the tissue- stage- or species- specific unigenes will benefit the further study of gene functions during growth and metamorphosis processes in EAB and other pest insects. PMID:26244979

  2. A semantic analysis of the annotations of the human genome

    PubMed Central

    Khatri, Purvesh; Done, Bogdan; Rao, Archana; Done, Arina

    2008-01-01

    The correct interpretation of any biological experiment depends in an essential way on the accuracy and consistency of the existing annotation databases. Such databases are ubiquitous and used by all life scientists in most experiments. However, it is well known that such databases are incomplete and many annotations may also be incorrect. In this paper we describe a technique that can be used to analyze the semantic content of such annotation databases. Our approach is able to extract implicit semantic relationships between genes and functions. This ability allows us to discover novel functions for known genes. This approach is able to identify missing and inaccurate annotations in existing annotation databases, and thus help improve their accuracy. We used our technique to analyze the current annotations of the human genome. From this body of annotations, we were able to predict 212 additional gene–function assignments. A subsequent literature search found that 138 of these gene–functions assignments are supported by existing peer-reviewed papers. An additional 23 assignments have been confirmed in the meantime by the addition of the respective annotations in later releases of the Gene Ontology database. Overall, the 161 confirmed assignments represent 75.95% of the proposed gene–function assignments. Only one of our predictions (0.4%) was contradicted by the existing literature. We could not find any relevant articles for 50 of our predictions (23.58%). The method is independent of the organism and can be used to analyze and improve the quality of the data of any public or private annotation database. Availability http://vortex.cs.wayne.edu/papers/semantic_analysis_bioinfo.pdf Contact sod@cs.wayne.edu PMID:15955782

  3. Data-poor categorization and passage retrieval for Gene Ontology Annotation in Swiss-Prot

    PubMed Central

    Ehrler, Frédéric; Geissbühler, Antoine; Jimeno, Antonio; Ruch, Patrick

    2005-01-01

    Background In the context of the BioCreative competition, where training data were very sparse, we investigated two complementary tasks: 1) given a Swiss-Prot triplet, containing a protein, a GO (Gene Ontology) term and a relevant article, extraction of a short passage that justifies the GO category assignement; 2) given a Swiss-Prot pair, containing a protein and a relevant article, automatic assignement of a set of categories. Methods Sentence is the basic retrieval unit. Our classifier computes a distance between each sentence and the GO category provided with the Swiss-Prot entry. The Text Categorizer computes a distance between each GO term and the text of the article. Evaluations are reported both based on annotator judgements as established by the competition and based on mean average precision measures computed using a curated sample of Swiss-Prot. Results Our system achieved the best recall and precision combination both for passage retrieval and text categorization as evaluated by official evaluators. However, text categorization results were far below those in other data-poor text categorization experiments The top proposed term is relevant in less that 20% of cases, while categorization with other biomedical controlled vocabulary, such as the Medical Subject Headings, we achieved more than 90% precision. We also observe that the scoring methods used in our experiments, based on the retrieval status value of our engines, exhibits effective confidence estimation capabilities. Conclusion From a comparative perspective, the combination of retrieval and natural language processing methods we designed, achieved very competitive performances. Largely data-independent, our systems were no less effective that data-intensive approaches. These results suggests that the overall strategy could benefit a large class of information extraction tasks, especially when training data are missing. However, from a user perspective, results were disappointing. Further investigations are needed to design applicable end-user text mining tools for biologists. PMID:15960836

  4. A comprehensive evaluation of collapsing methods using simulated and real data: excellent annotation of functionality and large sample sizes required

    PubMed Central

    Dering, Carmen; König, Inke R.; Ramsey, Laura B.; Relling, Mary V.; Yang, Wenjian; Ziegler, Andreas

    2014-01-01

    The advent of next generation sequencing (NGS) technologies enabled the investigation of the rare variant-common disease hypothesis in unrelated individuals, even on the genome-wide level. Analysis of this hypothesis requires tailored statistical methods as single marker tests fail on rare variants. An entire class of statistical methods collapses rare variants from a genomic region of interest (ROI), thereby aggregating rare variants. In an extensive simulation study using data from the Genetic Analysis Workshop 17 we compared the performance of 15 collapsing methods by means of a variety of pre-defined ROIs regarding minor allele frequency thresholds and functionality. Findings of the simulation study were additionally confirmed by a real data set investigating the association between methotrexate clearance and the SLCO1B1 gene in patients with acute lymphoblastic leukemia. Our analyses showed substantially inflated type I error levels for many of the proposed collapsing methods. Only four approaches yielded valid type I errors in all considered scenarios. None of the statistical tests was able to detect true associations over a substantial proportion of replicates in the simulated data. Detailed annotation of functionality of variants is crucial to detect true associations. These findings were confirmed in the analysis of the real data. Recent theoretical work showed that large power is achieved in gene-based analyses only if large sample sizes are available and a substantial proportion of causing rare variants is present in the gene-based analysis. Many of the investigated statistical approaches use permutation requiring high computational cost. There is a clear need for valid, powerful and fast to calculate test statistics for studies investigating rare variants. PMID:25309579

  5. A comprehensive evaluation of collapsing methods using simulated and real data: excellent annotation of functionality and large sample sizes required.

    PubMed

    Dering, Carmen; König, Inke R; Ramsey, Laura B; Relling, Mary V; Yang, Wenjian; Ziegler, Andreas

    2014-01-01

    The advent of next generation sequencing (NGS) technologies enabled the investigation of the rare variant-common disease hypothesis in unrelated individuals, even on the genome-wide level. Analysis of this hypothesis requires tailored statistical methods as single marker tests fail on rare variants. An entire class of statistical methods collapses rare variants from a genomic region of interest (ROI), thereby aggregating rare variants. In an extensive simulation study using data from the Genetic Analysis Workshop 17 we compared the performance of 15 collapsing methods by means of a variety of pre-defined ROIs regarding minor allele frequency thresholds and functionality. Findings of the simulation study were additionally confirmed by a real data set investigating the association between methotrexate clearance and the SLCO1B1 gene in patients with acute lymphoblastic leukemia. Our analyses showed substantially inflated type I error levels for many of the proposed collapsing methods. Only four approaches yielded valid type I errors in all considered scenarios. None of the statistical tests was able to detect true associations over a substantial proportion of replicates in the simulated data. Detailed annotation of functionality of variants is crucial to detect true associations. These findings were confirmed in the analysis of the real data. Recent theoretical work showed that large power is achieved in gene-based analyses only if large sample sizes are available and a substantial proportion of causing rare variants is present in the gene-based analysis. Many of the investigated statistical approaches use permutation requiring high computational cost. There is a clear need for valid, powerful and fast to calculate test statistics for studies investigating rare variants. PMID:25309579

  6. The DOE-JGI Standard Operating Procedure for the Annotations of the Microbial Genomes

    SciTech Connect

    Mavromatis, Konstantinos; Ivanova, Natalia; Chen, I-Min A.; Szeto, Ernest; Markowitz, Victor; Kyrpides, Nikos C.

    2009-05-20

    The DOE-JGI Microbial Annotation Pipeline (DOE-JGI MAP) supports gene prediction and/or functional annotation of microbial genomes towards comparative analysis with the Integrated Microbial Genome (IMG) system. DOE-JGI MAP annotation is applied on nucleotide sequence datasets included in the IMG-ER (Expert Review) version of IMG via the IMG ER submission site. Users can submit the sequence datasets consisting of one or more contigs in a multi-fasta file. DOE-JGI MAP annotation includes prediction of protein coding and RNA genes, as well as repeats and assignment of product names to these genes.

  7. Improving the Annotation of Arabidopsis lyrata Using RNA-Seq Data

    PubMed Central

    Rawat, Vimal; Abdelsamad, Ahmed; Pietzenuk, Björn; Seymour, Danelle K.; Koenig, Daniel; Weigel, Detlef; Pecinka, Ales; Schneeberger, Korbinian

    2015-01-01

    Gene model annotations are important community resources that ensure comparability and reproducibility of analyses and are typically the first step for functional annotation of genomic regions. Without up-to-date genome annotations, genome sequences cannot be used to maximum advantage. It is therefore essential to regularly update gene annotations by integrating the latest information to guarantee that reference annotations can remain a common basis for various types of analyses. Here, we report an improvement of the Arabidopsis lyrata gene annotation using extensive RNA-seq data. This new annotation consists of 31,132 protein coding gene models in addition to 2,089 genes with high similarity to transposable elements. Overall, ~87% of the gene models are corroborated by evidence of expression and 2,235 of these models feature multiple transcripts. Our updated gene annotation corrects hundreds of incorrectly split or merged gene models in the original annotation, and as a result the identification of alternative splicing events and differential isoform usage are vastly improved. PMID:26382944

  8. Re-Annotator: Annotation Pipeline for Microarray Probe Sequences

    PubMed Central

    Röh, Simone; Altmann, Andre

    2015-01-01

    Microarray technologies are established approaches for high throughput gene expression, methylation and genotyping analysis. An accurate mapping of the array probes is essential to generate reliable biological findings. However, manufacturers of the microarray platforms typically provide incomplete and outdated annotation tables, which often rely on older genome and transcriptome versions that differ substantially from up-to-date sequence databases. Here, we present the Re-Annotator, a re-annotation pipeline for microarray probe sequences. It is primarily designed for gene expression microarrays but can also be adapted to other types of microarrays. The Re-Annotator uses a custom-built mRNA reference database to identify the positions of gene expression array probe sequences. We applied Re-Annotator to the Illumina Human-HT12 v4 microarray platform and found that about one quarter (25%) of the probes differed from the manufacturer’s annotation. In further computational experiments on experimental gene expression data, we compared Re-Annotator to another probe re-annotation tool, ReMOAT, and found that Re-Annotator provided an improved re-annotation of microarray probes. A thorough re-annotation of probe information is crucial to any microarray analysis. The Re-Annotator pipeline is freely available at http://sourceforge.net/projects/reannotator along with re-annotated files for Illumina microarrays HumanHT-12 v3/v4 and MouseRef-8 v2. PMID:26426330

  9. Gene identification and annotation We principally used two strategies for gene prediction and combined the results (see

    E-print Network

    Sperling, George

    motifs were identified respectively by the BLASTP program with GenBank nr database, or a HMMER program with a Pfam database. A functional classification was performed based on the NCBI KOG. The tRNA genes were of ultrastructures between the ultra-small eukaryote Cyanidioschyzon merolae and Cyanidium caldarium. Cytologia

  10. Functional annotation of proteomic sequences based on consensus of sequence and structural analysis.

    PubMed

    Kitson, David H; Badretdinov, Azat; Zhu, Zhan-yang; Velikanov, Mikhail; Edwards, David J; Olszewski, Krzysztof; Szalma, Sándor; Yan, Lisa

    2002-03-01

    To maximise the assignment of function of the proteins encoded by a genome and to aid the search for novel drug targets, there is an emerging need for sensitive methods of predicting protein function on a genome-wide basis. GeneAtlas is an automated, high-throughput pipeline for the prediction of protein structure and function using sequence similarity detection, homology modelling and fold recognition methods. GeneAtlas is described in detail here. To test GeneAtlas, a 'virtual' genome was used, a subset of PDB structures from the SCOP database, in which the functional relationships are known. GeneAtlas detects additional relationships by building 3D models in comparison with the sequence searching method PSI-BLAST. Functionally related proteins with sequence identity below the twilight zone can be recognised correctly. PMID:12002222

  11. Functional Annotation of Conserved Hypothetical Proteins from Haemophilus influenzae Rd KW20

    PubMed Central

    Shahbaaz, Mohd; Md. ImtaiyazHassan; Ahmad, Faizan

    2013-01-01

    Haemophilus influenzae is a Gram negative bacterium that belongs to the family Pasteurellaceae, causes bacteremia, pneumonia and acute bacterial meningitis in infants. The emergence of multi-drug resistance H. influenzae strain in clinical isolates demands the development of better/new drugs against this pathogen. Our study combines a number of bioinformatics tools for function predictions of previously not assigned proteins in the genome of H. influenzae. This genome was extensively analyzed and found 1,657 functional proteins in which function of 429 proteins are unknown, termed as hypothetical proteins (HPs). Amino acid sequences of all 429 HPs were extensively annotated and we successfully assigned the function to 296 HPs with high confidence. We also characterized the function of 124 HPs precisely, but with less confidence. We believed that sequence of a protein can be used as a framework to explain known functional properties. Here we have combined the latest versions of protein family databases, protein motifs, intrinsic features from the amino acid sequence, pathway and genome context methods to assign a precise function to hypothetical proteins for which no experimental information is available. We found these HPs belong to various classes of proteins such as enzymes, transporters, carriers, receptors, signal transducers, binding proteins, virulence and other proteins. The outcome of this work will be helpful for a better understanding of the mechanism of pathogenesis and in finding novel therapeutic targets for H. influenzae. PMID:24391926

  12. Spectral Clustering Gene Ontology Terms to Group Genes by Function

    E-print Network

    Zell, Andreas

    Spectral Clustering Gene Ontology Terms to Group Genes by Function Nora Speer, Christian Spieth throughput me- thods like DNA microarrays, biologists are capable of producing huge amounts of data. During the analysis of such data the need for a group- ing of the genes according to their biological function arises

  13. Rotavirus gene structure and function.

    PubMed Central

    Estes, M K; Cohen, J

    1989-01-01

    Knowledge of the structure and function of the genes and proteins of the rotaviruses has expanded rapidly. Information obtained in the last 5 years has revealed unexpected and unique molecular properties of rotavirus proteins of general interest to virologists, biochemists, and cell biologists. Rotaviruses share some features of replication with reoviruses, yet antigenic and molecular properties of the outer capsid proteins, VP4 (a protein whose cleavage is required for infectivity, possibly by mediating fusion with the cell membrane) and VP7 (a glycoprotein), show more similarities with those of other viruses such as the orthomyxoviruses, paramyxoviruses, and alphaviruses. Rotavirus morphogenesis is a unique process, during which immature subviral particles bud through the membrane of the endoplasmic reticulum (ER). During this process, transiently enveloped particles form, the outer capsid proteins are assembled onto particles, and mature particles accumulate in the lumen of the ER. Two ER-specific viral glycoproteins are involved in virus maturation, and these glycoproteins have been shown to be useful models for studying protein targeting and retention in the ER and for studying mechanisms of virus budding. New ideas and approaches to understanding how each gene functions to replicate and assemble the segmented viral genome have emerged from knowledge of the primary structure of rotavirus genes and their proteins and from knowledge of the properties of domains on individual proteins. Localization of type-specific and cross-reactive neutralizing epitopes on the outer capsid proteins is becoming increasingly useful in dissecting the protective immune response, including evaluation of vaccine trials, with the practical possibility of enhancing the production of new, more effective vaccines. Finally, future analyses with recently characterized immunologic and gene probes and new animal models can be expected to provide a basic understanding of what regulates the primary interactions of these viruses with the gastrointestinal tract and the subsequent responses of infected hosts. Images PMID:2556635

  14. The institute for genomic research Osa1 rice genome annotation database.

    PubMed

    Yuan, Qiaoping; Ouyang, Shu; Wang, Aihui; Zhu, Wei; Maiti, Rama; Lin, Haining; Hamilton, John; Haas, Brian; Sultana, Razvan; Cheung, Foo; Wortman, Jennifer; Buell, C Robin

    2005-05-01

    We have developed a rice (Oryza sativa) genome annotation database (Osa1) that provides structural and functional annotation for this emerging model species. Using the sequence of O. sativa subsp. japonica cv Nipponbare from the International Rice Genome Sequencing Project, pseudomolecules, or virtual contigs, of the 12 rice chromosomes were constructed. Our most recent release, version 3, represents our third build of the pseudomolecules and is composed of 98% finished sequence. Genes were identified using a series of computational methods developed for Arabidopsis (Arabidopsis thaliana) that were modified for use with the rice genome. In release 3 of our annotation, we identified 57,915 genes, of which 14,196 are related to transposable elements. Of these 43,719 non-transposable element-related genes, 18,545 (42.4%) were annotated with a putative function, 5,777 (13.2%) were annotated as encoding an expressed protein with no known function, and the remaining 19,397 (44.4%) were annotated as encoding a hypothetical protein. Multiple splice forms (5,873) were detected for 2,538 genes, resulting in a total of 61,250 gene models in the rice genome. We incorporated experimental evidence into 18,252 gene models to improve the quality of the structural annotation. A series of functional data types has been annotated for the rice genome that includes alignment with genetic markers, assignment of gene ontologies, identification of flanking sequence tags, alignment with homologs from related species, and syntenic mapping with other cereal species. All structural and functional annotation data are available through interactive search and display windows as well as through download of flat files. To integrate the data with other genome projects, the annotation data are available through a Distributed Annotation System and a Genome Browser. All data can be obtained through the project Web pages at http://rice.tigr.org. PMID:15888674

  15. Elucidating gene function and function evolution through comparison of co-expression networks of plants

    PubMed Central

    Hansen, Bjoern O.; Vaid, Neha; Musialak-Lange, Magdalena; Janowski, Marcin; Mutwil, Marek

    2014-01-01

    The analysis of gene expression data has shown that transcriptionally coordinated (co-expressed) genes are often functionally related, enabling scientists to use expression data in gene function prediction. This Focused Review discusses our original paper (Large-scale co-expression approach to dissect secondary cell wall formation across plant species, Frontiers in Plant Science 2:23). In this paper we applied cross-species analysis to co-expression networks of genes involved in cellulose biosynthesis. We showed that the co-expression networks from different species are highly similar, indicating that whole biological pathways are conserved across species. This finding has two important implications. First, the analysis can transfer gene function annotation from well-studied plants, such as Arabidopsis, to other, uncharacterized plant species. As the analysis finds genes that have similar sequence and similar expression pattern across different organisms, functionally equivalent genes can be identified. Second, since co-expression analyses are often noisy, a comparative analysis should have higher performance, as parts of co-expression networks that are conserved are more likely to be functionally relevant. In this Focused Review, we outline the comparative analysis done in the original paper and comment on the recent advances and approaches that allow comparative analyses of co-function networks. We hypothesize that in comparison to simple co-expression analysis, comparative analysis would yield more accurate gene function predictions. Finally, by combining comparative analysis with genomic information of green plants, we propose a possible composition of cellulose biosynthesis machinery during earlier stages of plant evolution. PMID:25191328

  16. Transcriptome sequencing and annotation of the microalgae Dunaliella tertiolecta: Pathway description and gene discovery for production of next-generation biofuels

    PubMed Central

    2011-01-01

    Background Biodiesel or ethanol derived from lipids or starch produced by microalgae may overcome many of the sustainability challenges previously ascribed to petroleum-based fuels and first generation plant-based biofuels. The paucity of microalgae genome sequences, however, limits gene-based biofuel feedstock optimization studies. Here we describe the sequencing and de novo transcriptome assembly for the non-model microalgae species, Dunaliella tertiolecta, and identify pathways and genes of importance related to biofuel production. Results Next generation DNA pyrosequencing technology applied to D. tertiolecta transcripts produced 1,363,336 high quality reads with an average length of 400 bases. Following quality and size trimming, ~ 45% of the high quality reads were assembled into 33,307 isotigs with a 31-fold coverage and 376,482 singletons. Assembled sequences and singletons were subjected to BLAST similarity searches and annotated with Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology (KO) identifiers. These analyses identified the majority of lipid and starch biosynthesis and catabolism pathways in D. tertiolecta. Conclusions The construction of metabolic pathways involved in the biosynthesis and catabolism of fatty acids, triacylglycrols, and starch in D. tertiolecta as well as the assembled transcriptome provide a foundation for the molecular genetics and functional genomics required to direct metabolic engineering efforts that seek to enhance the quantity and character of microalgae-based biofuel feedstock. PMID:21401935

  17. Annotation of Differential Gene Expression in Small Yellow Follicles of a Broiler-Type Strain of Taiwan Country Chickens in Response to Acute Heat Stress

    PubMed Central

    Wang, Shih-Han; Tang, Pin-Chi; Chen, Chih-Feng; Chen, Hsin-Hsin; Lee, Yen-Pai; Chen, Shuen-Ei; Huang, San-Yuan

    2015-01-01

    This study investigated global gene expression in the small yellow follicles (6–8 mm diameter) of broiler-type B strain Taiwan country chickens (TCCs) in response to acute heat stress. Twelve 30-wk-old TCC hens were divided into four groups: control hens maintained at 25°C and hens subjected to 38°C acute heat stress for 2 h without recovery (H2R0), with 2-h recovery (H2R2), and with 6-h recovery (H2R6). Small yellow follicles were collected for RNA isolation and microarray analysis at the end of each time point. Results showed that 69, 51, and 76 genes were upregulated and 58, 15, 56 genes were downregulated after heat treatment of H2R0, H2R2, and H2R6, respectively, using a cutoff value of two-fold or higher. Gene ontology analysis revealed that these differentially expressed genes are associated with the biological processes of cell communication, developmental process, protein metabolic process, immune system process, and response to stimuli. Upregulation of heat shock protein 25, interleukin 6, metallopeptidase 1, and metalloproteinase 13, and downregulation of type II alpha 1 collagen, discoidin domain receptor tyrosine kinase 2, and Kruppel-like factor 2 suggested that acute heat stress induces proteolytic disintegration of the structural matrix and inflamed damage and adaptive responses of gene expression in the follicle cells. These suggestions were validated through gene expression, using quantitative real-time polymerase chain reaction. Functional annotation clarified that interleukin 6-related pathways play a critical role in regulating acute heat stress responses in the small yellow follicles of TCC hens. PMID:26587838

  18. Discovering Functions of Unannotated Genes from a Transcriptome Survey of Wild Fungal Isolates

    PubMed Central

    Ellison, Christopher E.; Kowbel, David; Glass, N. Louise; Taylor, John W.

    2014-01-01

    ABSTRACT Most fungal genomes are poorly annotated, and many fungal traits of industrial and biomedical relevance are not well suited to classical genetic screens. Assigning genes to phenotypes on a genomic scale thus remains an urgent need in the field. We developed an approach to infer gene function from expression profiles of wild fungal isolates, and we applied our strategy to the filamentous fungus Neurospora crassa. Using transcriptome measurements in 70 strains from two well-defined clades of this microbe, we first identified 2,247 cases in which the expression of an unannotated gene rose and fell across N. crassa strains in parallel with the expression of well-characterized genes. We then used image analysis of hyphal morphologies, quantitative growth assays, and expression profiling to test the functions of four genes predicted from our population analyses. The results revealed two factors that influenced regulation of metabolism of nonpreferred carbon and nitrogen sources, a gene that governed hyphal architecture, and a gene that mediated amino acid starvation resistance. These findings validate the power of our population-transcriptomic approach for inference of novel gene function, and we suggest that this strategy will be of broad utility for genome-scale annotation in many fungal systems. PMID:24692637

  19. Gene3D: merging structure and function for a Thousand genomes.

    PubMed

    Lees, Jonathan; Yeats, Corin; Redfern, Oliver; Clegg, Andrew; Orengo, Christine

    2010-01-01

    Over the last 2 years the Gene3D resource has been significantly improved, and is now more accurate and with a much richer interactive display via the Gene3D website (http://gene3d.biochem.ucl.ac.uk/). Gene3D provides accurate structural domain family assignments for over 1100 genomes and nearly 10,000,000 proteins. A hidden Markov model library, constructed from the manually curated CATH structural domain hierarchy, is used to search UniProt, RefSeq and Ensembl protein sequences. The resulting matches are refined into simple multi-domain architectures using a recently developed in-house algorithm, DomainFinder 3 (available at: ftp://ftp.biochem.ucl.ac.uk/pub/gene3d_data/DomainFinder3/). The domain assignments are integrated with multiple external protein function descriptions (e.g. Gene Ontology and KEGG), structural annotations (e.g. coiled coils, disordered regions and sequence polymorphisms) and family resources (e.g. Pfam and eggNog) and displayed on the Gene3D website. The website allows users to view descriptions for both single proteins and genes and large protein sets, such as superfamilies or genomes. Subsets can then be selected for detailed investigation or associated functions and interactions can be used to expand explorations to new proteins. Gene3D also provides a set of services, including an interactive genome coverage graph visualizer, DAS annotation resources, sequence search facilities and SOAP services. PMID:19906693

  20. Expressed Sequence Tags With cDNA Termini: Previously Overlooked Resources for Gene Annotation and Transcriptome Exploration in Chlamydomonas reinhardtii

    PubMed Central

    Liang, Chun; Liu, Yuansheng; Liu, Lin; Davis, Adam C.; Shen, Yingjia; Li, Qingshun Quinn

    2008-01-01

    Many of Chlamydomonas reinhardtii expressed sequence tags (ESTs) in GenBank dbEST and community EST assemblies were either over- or undertrimmed in terms of their cDNA termini, which are defined as the diagnostic sequence elements that delineate 3?/5? ends of mRNA transcripts. Overtrimming represents a loss of directional, positional, and structural information of transcript ends whereas undertrimming causes unclean spurious sequences retained in ESTs that exert deleterious impacts on downstream EST-based applications. We examined 309,278 raw EST sequencing trace files of C. reinhardtii and found that only 57% had cDNA termini that matched the expected structures specified in their cDNA library constructions while satisfying our minimum length requirement for their final clean sequences. Using GMAP, 156,963 individual ESTs were mapped to the genome successfully, with their in silico-verified cDNA termini anchored to the genome. Our data analysis suggested strong macro- and microheterogeneity of 3?/5? end positions of individual transcripts derived from the same genes in C. reinhardtii. This work annotating differential ends of individual transcripts in the draft genome presents the research community with a new stream of data that will facilitate accurate determination of gene structures, genome annotation, and exploration of the transcriptome and mRNA metabolism in C. reinhardtii. PMID:18493042

  1. On Anomalies in Annotation Systems

    E-print Network

    Brust, Matthias R

    2007-01-01

    Today's computer-based annotation systems implement a wide range of functionalities that often go beyond those available in traditional paper-and-pencil annotations. Conceptually, annotation systems are based on thoroughly investigated psycho-sociological and pedagogical learning theories. They offer a huge diversity of annotation types that can be placed in textual as well as in multimedia format. Additionally, annotations can be published or shared with a group of interested parties via well-organized repositories. Although highly sophisticated annotation systems exist both conceptually as well as technologically, we still observe that their acceptance is somewhat limited. In this paper, we argue that nowadays annotation systems suffer from several fundamental problems that are inherent in the traditional paper-and-pencil annotation paradigm. As a solution, we propose to shift the annotation paradigm for the implementation of annotation system.

  2. Gene Ontology annotation highlights shared and divergent pathogenic strategies of type III effector proteins deployed by the plant pathogen Pseudomonas syringae pv tomato DC3000 and animal pathogenic Escherichia coli strains.

    PubMed

    Lindeberg, Magdalen; Biehl, Bryan S; Glasner, Jeremy D; Perna, Nicole T; Collmer, Alan; Collmer, Candace W

    2009-01-01

    Genome-informed identification and characterization of Type III effector repertoires in various bacterial strains and species is revealing important insights into the critical roles that these proteins play in the pathogenic strategies of diverse bacteria. However, non-systematic discipline-specific approaches to their annotation impede analysis of the accumulating wealth of data and inhibit easy communication of findings among researchers working on different experimental systems. The development of Gene Ontology (GO) terms to capture biological processes occurring during the interaction between organisms creates a common language that facilitates cross-genome analyses. The application of these terms to annotate type III effector genes in different bacterial species - the plant pathogen Pseudomonas syringae pv tomato DC3000 and animal pathogenic strains of Escherichia coli - illustrates how GO can effectively describe fundamental similarities and differences among different gene products deployed as part of diverse pathogenic strategies. In depth descriptions of the GO annotations for P. syringae pv tomato DC3000 effector AvrPtoB and the E. coli effector Tir are described, with special emphasis given to GO capability for capturing information about interacting proteins and taxa. GO-highlighted similarities in biological process and molecular function for effectors from additional pathosystems are also discussed. PMID:19278552

  3. Leveraging Functional-Annotation Data in Trans-ethnic Fine-Mapping Studies.

    PubMed

    Kichaev, Gleb; Pasaniuc, Bogdan

    2015-08-01

    Localization of causal variants underlying known risk loci is one of the main research challenges following genome-wide association studies. Risk loci are typically dissected through fine-mapping experiments in trans-ethnic cohorts for leveraging the variability in the local genetic structure across populations. More recent works have shown that genomic functional annotations (i.e., localization of tissue-specific regulatory marks) can be integrated for increasing fine-mapping performance within single-population studies. Here, we introduce methods that integrate the strength of association between genotype and phenotype, the variability in the genetic backgrounds across populations, and the genomic map of tissue-specific functional elements to increase trans-ethnic fine-mapping accuracy. Through extensive simulations and empirical data, we have demonstrated that our approach increases fine-mapping resolution over existing methods. We analyzed empirical data from a large-scale trans-ethnic rheumatoid arthritis (RA) study and showed that the functional genetic architecture of RA is consistent across European and Asian ancestries. In these data, we used our proposed methods to reduce the average size of the 90% credible set from 29 variants per locus for standard non-integrative approaches to 22 variants. PMID:26189819

  4. Heterologous expression of plasmodial proteins for structural studies and functional annotation

    PubMed Central

    Birkholtz, Lyn-Marie; Blatch, Gregory; Coetzer, Theresa L; Hoppe, Heinrich C; Human, Esmaré; Morris, Elizabeth J; Ngcete, Zoleka; Oldfield, Lyndon; Roth, Robyn; Shonhai, Addmore; Stephens, Linda; Louw, Abraham I

    2008-01-01

    Malaria remains the world's most devastating tropical infectious disease with as many as 40% of the world population living in risk areas. The widespread resistance of Plasmodium parasites to the cost-effective chloroquine and antifolates has forced the introduction of more costly drug combinations, such as Coartem®. In the absence of a vaccine in the foreseeable future, one strategy to address the growing malaria problem is to identify and characterize new and durable antimalarial drug targets, the majority of which are parasite proteins. Biochemical and structure-activity analysis of these proteins is ultimately essential in the characterization of such targets but requires large amounts of functional protein. Even though heterologous protein production has now become a relatively routine endeavour for most proteins of diverse origins, the functional expression of soluble plasmodial proteins is highly problematic and slows the progress of antimalarial drug target discovery. Here the status quo of heterologous production of plasmodial proteins is presented, constraints are highlighted and alternative strategies and hosts for functional expression and annotation of plasmodial proteins are reviewed. PMID:18828893

  5. An Introduction to Genome Annotation.

    PubMed

    Campbell, Michael S; Yandell, Mark

    2015-01-01

    Genome projects have evolved from large international undertakings to tractable endeavors for a single lab. Accurate genome annotation is critical for successful genomic, genetic, and molecular biology experiments. These annotations can be generated using a number of approaches and available software tools. This unit describes methods for genome annotation and a number of software tools commonly used in gene annotation. © 2015 by John Wiley & Sons, Inc. PMID:26678385

  6. PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees

    PubMed Central

    Mi, Huaiyu; Muruganujan, Anushya; Thomas, Paul D.

    2013-01-01

    The data and tools in PANTHER—a comprehensive, curated database of protein families, trees, subfamilies and functions available at http://pantherdb.org—have undergone continual, extensive improvement for over a decade. Here, we describe the current PANTHER process as a whole, as well as the website tools for analysis of user-uploaded data. The main goals of PANTHER remain essentially unchanged: the accurate inference (and practical application) of gene and protein function over large sequence databases, using phylogenetic trees to extrapolate from the relatively sparse experimental information from a few model organisms. Yet the focus of PANTHER has continually shifted toward more accurate and detailed representations of evolutionary events in gene family histories. The trees are now designed to represent gene family evolution, including inference of evolutionary events, such as speciation and gene duplication. Subfamilies are still curated and used to define HMMs, but gene ontology functional annotations can now be made at any node in the tree, and are designed to represent gain and loss of function by ancestral genes during evolution. Finally, PANTHER now includes stable database identifiers for inferred ancestral genes, which are used to associate inferred gene attributes with particular genes in the common ancestral genomes of extant species. PMID:23193289

  7. FunGene: the functional gene pipeline and repository

    PubMed Central

    Fish, Jordan A.; Chai, Benli; Wang, Qiong; Sun, Yanni; Brown, C. Titus; Tiedje, James M.; Cole, James R.

    2013-01-01

    Ribosomal RNA genes have become the standard molecular markers for microbial community analysis for good reasons, including universal occurrence in cellular organisms, availability of large databases, and ease of rRNA gene region amplification and analysis. As markers, however, rRNA genes have some significant limitations. The rRNA genes are often present in multiple copies, unlike most protein-coding genes. The slow rate of change in rRNA genes means that multiple species sometimes share identical 16S rRNA gene sequences, while many more species share identical sequences in the short 16S rRNA regions commonly analyzed. In addition, the genes involved in many important processes are not distributed in a phylogenetically coherent manner, potentially due to gene loss or horizontal gene transfer. While rRNA genes remain the most commonly used markers, key genes in ecologically important pathways, e.g., those involved in carbon and nitrogen cycling, can provide important insights into community composition and function not obtainable through rRNA analysis. However, working with ecofunctional gene data requires some tools beyond those required for rRNA analysis. To address this, our Functional Gene Pipeline and Repository (FunGene; http://fungene.cme.msu.edu/) offers databases of many common ecofunctional genes and proteins, as well as integrated tools that allow researchers to browse these collections and choose subsets for further analysis, build phylogenetic trees, test primers and probes for coverage, and download aligned sequences. Additional FunGene tools are specialized to process coding gene amplicon data. For example, FrameBot produces frameshift-corrected protein and DNA sequences from raw reads while finding the most closely related protein reference sequence. These tools can help provide better insight into microbial communities by directly studying key genes involved in important ecological processes. PMID:24101916

  8. Annotation of Protein Domains Reveals Remarkable Conservation in the Functional Make up of Proteomes Across Superkingdoms

    PubMed Central

    Nasir, Arshan; Naeem, Aisha; Khan, Muhammad Jawad; Lopez-Nicora, Horacio D.; Caetano-Anollés, Gustavo

    2011-01-01

    The functional repertoire of a cell is largely embodied in its proteome, the collection of proteins encoded in the genome of an organism. The molecular functions of proteins are the direct consequence of their structure and structure can be inferred from sequence using hidden Markov models of structural recognition. Here we analyze the functional annotation of protein domain structures in almost a thousand sequenced genomes, exploring the functional and structural diversity of proteomes. We find there is a remarkable conservation in the distribution of domains with respect to the molecular functions they perform in the three superkingdoms of life. In general, most of the protein repertoire is spent in functions related to metabolic processes but there are significant differences in the usage of domains for regulatory and extra-cellular processes both within and between superkingdoms. Our results support the hypotheses that the proteomes of superkingdom Eukarya evolved via genome expansion mechanisms that were directed towards innovating new domain architectures for regulatory and extra/intracellular process functions needed for example to maintain the integrity of multicellular structure or to interact with environmental biotic and abiotic factors (e.g., cell signaling and adhesion, immune responses, and toxin production). Proteomes of microbial superkingdoms Archaea and Bacteria retained fewer numbers of domains and maintained simple and smaller protein repertoires. Viruses appear to play an important role in the evolution of superkingdoms. We finally identify few genomic outliers that deviate significantly from the conserved functional design. These include Nanoarchaeum equitans, proteobacterial symbionts of insects with extremely reduced genomes, Tenericutes and Guillardia theta. These organisms spend most of their domains on information functions, including translation and transcription, rather than on metabolism and harbor a domain repertoire characteristic of parasitic organisms. In contrast, the functional repertoire of the proteomes of the Planctomycetes-Verrucomicrobia-Chlamydiae superphylum was no different than the rest of bacteria, failing to support claims of them representing a separate superkingdom. In turn, Protista and Bacteria shared similar functional distribution patterns suggesting an ancestral evolutionary link between these groups. PMID:24710297

  9. RNA-seq-Based Gene Annotation and Comparative Genomics of Four Fungal Grass Pathogens in the Genus Zymoseptoria Identify Novel Orphan Genes and Species-Specific Invasions of Transposable Elements

    PubMed Central

    Grandaubert, Jonathan; Bhattacharyya, Amitava; Stukenbrock, Eva H.

    2015-01-01

    The fungal pathogen Zymoseptoria tritici (synonym Mycosphaerella graminicola) is a prominent pathogen of wheat. The reference genome of the isolate IPO323 is one of the best-assembled eukaryotic genomes and encodes more than 10,000 predicted genes. However, a large proportion of the previously annotated gene models are incomplete, with either no start or no stop codons. The availability of RNA-seq data allows better predictions of gene structure. We here used two different RNA-seq datasets, de novo transcriptome assemblies, homology-based comparisons, and trained ab initio gene callers to generate a new gene annotation of Z. tritici IPO323. The annotation pipeline was also applied to re-sequenced genomes of three closely related species of Z. tritici: Z. pseudotritici, Z. ardabiliae, and Z. brevis. Comparative analyses of the predicted gene models using the four Zymoseptoria species revealed sets of species-specific orphan genes enriched with putative pathogenicity-related genes encoding small secreted proteins that may play essential roles in virulence and host specificity. De novo repeat identification allowed us to show that few families of transposable elements are shared between Zymoseptoria species while we observe many species-specific invasions and expansions. The annotation data presented here provide a high-quality resource for future studies of Z. tritici and its sister species and provide detailed insight into gene and genome evolution of fungal plant pathogens. PMID:25917918

  10. Genomewide Structural Annotation and Evolutionary Analysis of the Type I MADS-Box Genes in Plants

    E-print Network

    Gent, Universiteit

    -box genes by the absence of the keratin-like box. In this in silico study, we have structurally an- notated and type II genes are also referred to as MADS SRF-like and MADS MEF2- like genes, respectively (Alvarez conserved MADS domain, animal type I (SRF-like) and type II (MEF2-like) genes contain an additionally

  11. The functional diversity of essential genes required for mammalian cardiac development

    PubMed Central

    Clowes, Christopher; Boylan, Michael GS; Ridge, Liam A; Barnes, Emma; Wright, Jayne A; Hentges, Kathryn E

    2014-01-01

    Genes required for an organism to develop to maturity (for which no other gene can compensate) are considered essential. The continuing functional annotation of the mouse genome has enabled the identification of many essential genes required for specific developmental processes including cardiac development. Patterns are now emerging regarding the functional nature of genes required at specific points throughout gestation. Essential genes required for development beyond cardiac progenitor cell migration and induction include a small and functionally homogenous group encoding transcription factors, ligands and receptors. Actions of core cardiogenic transcription factors from the Gata, Nkx, Mef, Hand, and Tbx families trigger a marked expansion in the functional diversity of essential genes from midgestation onwards. As the embryo grows in size and complexity, genes required to maintain a functional heartbeat and to provide muscular strength and regulate blood flow are well represented. These essential genes regulate further specialization and polarization of cell types along with proliferative, migratory, adhesive, contractile, and structural processes. The identification of patterns regarding the functional nature of essential genes across numerous developmental systems may aid prediction of further essential genes and those important to development and/or progression of disease. genesis 52:713–737, 2014. PMID:24866031

  12. The UniProt-GO Annotation database in 2011.

    PubMed

    Dimmer, Emily C; Huntley, Rachael P; Alam-Faruque, Yasmin; Sawford, Tony; O'Donovan, Claire; Martin, Maria J; Bely, Benoit; Browne, Paul; Mun Chan, Wei; Eberhardt, Ruth; Gardner, Michael; Laiho, Kati; Legge, Duncan; Magrane, Michele; Pichler, Klemens; Poggioli, Diego; Sehra, Harminder; Auchincloss, Andrea; Axelsen, Kristian; Blatter, Marie-Claude; Boutet, Emmanuel; Braconi-Quintaje, Silvia; Breuza, Lionel; Bridge, Alan; Coudert, Elizabeth; Estreicher, Anne; Famiglietti, Livia; Ferro-Rojas, Serenella; Feuermann, Marc; Gos, Arnaud; Gruaz-Gumowski, Nadine; Hinz, Ursula; Hulo, Chantal; James, Janet; Jimenez, Silvia; Jungo, Florence; Keller, Guillaume; Lemercier, Phillippe; Lieberherr, Damien; Masson, Patrick; Moinat, Madelaine; Pedruzzi, Ivo; Poux, Sylvain; Rivoire, Catherine; Roechert, Bernd; Schneider, Michael; Stutz, Andre; Sundaram, Shyamala; Tognolli, Michael; Bougueleret, Lydie; Argoud-Puy, Ghislaine; Cusin, Isabelle; Duek-Roggli, Paula; Xenarios, Ioannis; Apweiler, Rolf

    2012-01-01

    The GO annotation dataset provided by the UniProt Consortium (GOA: http://www.ebi.ac.uk/GOA) is a comprehensive set of evidenced-based associations between terms from the Gene Ontology resource and UniProtKB proteins. Currently supplying over 100 million annotations to 11 million proteins in more than 360,000 taxa, this resource has increased 2-fold over the last 2 years and has benefited from a wealth of checks to improve annotation correctness and consistency as well as now supplying a greater information content enabled by GO Consortium annotation format developments. Detailed, manual GO annotations obtained from the curation of peer-reviewed papers are directly contributed by all UniProt curators and supplemented with manual and electronic annotations from 36 model organism and domain-focused scientific resources. The inclusion of high-quality, automatic annotation predictions ensures the UniProt GO annotation dataset supplies functional information to a wide range of proteins, including those from poorly characterized, non-model organism species. UniProt GO annotations are freely available in a range of formats accessible by both file downloads and web-based views. In addition, the introduction of a new, normalized file format in 2010 has made for easier handling of the complete UniProt-GOA data set. PMID:22123736

  13. De Novo Assembly and Functional Annotation of the Olive (Olea europaea) Transcriptome

    PubMed Central

    Muñoz-Mérida, Antonio; González-Plaza, Juan José; Cañada, Andrés; Blanco, Ana María; García-López, Maria del Carmen; Rodríguez, José Manuel; Pedrola, Laia; Sicardo, M. Dolores; Hernández, M. Luisa; De la Rosa, Raúl; Belaj, Angjelina; Gil-Borja, Mayte; Luque, Francisco; Martínez-Rivas, José Manuel; Pisano, David G.; Trelles, Oswaldo; Valpuesta, Victoriano; Beuzón, Carmen R.

    2013-01-01

    Olive breeding programmes are focused on selecting for traits as short juvenile period, plant architecture suited for mechanical harvest, or oil characteristics, including fatty acid composition, phenolic, and volatile compounds to suit new markets. Understanding the molecular basis of these characteristics and improving the efficiency of such breeding programmes require the development of genomic information and tools. However, despite its economic relevance, genomic information on olive or closely related species is still scarce. We have applied Sanger and 454 pyrosequencing technologies to generate close to 2 million reads from 12 cDNA libraries obtained from the Picual, Arbequina, and Lechin de Sevilla cultivars and seedlings from a segregating progeny of a Picual × Arbequina cross. The libraries include fruit mesocarp and seeds at three relevant developmental stages, young stems and leaves, active juvenile and adult buds as well as dormant buds, and juvenile and adult roots. The reads were assembled by library or tissue and then assembled together into 81 020 unigenes with an average size of 496 bases. Here, we report their assembly and their functional annotation. PMID:23297299

  14. GenoQuery: a new querying module for functional annotation in a genomic warehouse

    PubMed Central

    Lemoine, Frédéric; Labedan, Bernard; Froidevaux, Christine

    2008-01-01

    Motivation: We have to cope with both a deluge of new genome sequences and a huge amount of data produced by high-throughput approaches used to exploit these genomic features. Crossing and comparing such heterogeneous and disparate data will help improving functional annotation of genomes. This requires designing elaborate integration systems such as warehouses for storing and querying these data. Results: We have designed a relational genomic warehouse with an original multi-layer architecture made of a databases layer and an entities layer. We describe a new querying module, GenoQuery, which is based on this architecture. We use the entities layer to define mixed queries. These mixed queries allow searching for instances of biological entities and their properties in the different databases, without specifying in which database they should be found. Accordingly, we further introduce the central notion of alternative queries. Such queries have the same meaning as the original mixed queries, while exploiting complementarities yielded by the various integrated databases of the warehouse. We explain how GenoQuery computes all the alternative queries of a given mixed query. We illustrate how useful this querying module is by means of a thorough example. Availability: http://www.lri.fr/~lemoine/GenoQuery/ Contact: chris@lri.fr, lemoine@lri.fr PMID:18586731

  15. Discovery of Tumor Suppressor Gene Function.

    ERIC Educational Resources Information Center

    Oppenheimer, Steven B.

    1995-01-01

    This is an update of a 1991 review on tumor suppressor genes written at a time when understanding of how the genes work was limited. A recent major breakthrough in the understanding of the function of tumor suppressor genes is discussed. (LZ)

  16. Annotation-Free Estimates of Gene-Expression from mRNA-Seq Elizabeth Purdom

    E-print Network

    McAuliffe, Jon

    alternative splicing, correctly estimating the standard measures of gene ex- pression can be a complex problem gene estimates in the presence of moderate alternative splicing. We compare all of these methods on two of primary interest. When the sequenced organism displays no alternative splicing, estimates of gene

  17. A framework for annotating human genome in disease context.

    PubMed

    Xu, Wei; Wang, Huisong; Cheng, Wenqing; Fu, Dong; Xia, Tian; Kibbe, Warren A; Lin, Simon M

    2012-01-01

    Identification of gene-disease association is crucial to understanding disease mechanism. A rapid increase in biomedical literatures, led by advances of genome-scale technologies, poses challenge for manually-curated-based annotation databases to characterize gene-disease associations effectively and timely. We propose an automatic method-The Disease Ontology Annotation Framework (DOAF) to provide a comprehensive annotation of the human genome using the computable Disease Ontology (DO), the NCBO Annotator service and NCBI Gene Reference Into Function (GeneRIF). DOAF can keep the resulting knowledgebase current by periodically executing automatic pipeline to re-annotate the human genome using the latest DO and GeneRIF releases at any frequency such as daily or monthly. Further, DOAF provides a computable and programmable environment which enables large-scale and integrative analysis by working with external analytic software or online service platforms. A user-friendly web interface (doa.nubic.northwestern.edu) is implemented to allow users to efficiently query, download, and view disease annotations and the underlying evidences. PMID:23251346

  18. A Framework for Annotating Human Genome in Disease Context

    PubMed Central

    Cheng, Wenqing; Fu, Dong; Xia, Tian; Kibbe, Warren A.; Lin, Simon M.

    2012-01-01

    Identification of gene-disease association is crucial to understanding disease mechanism. A rapid increase in biomedical literatures, led by advances of genome-scale technologies, poses challenge for manually-curated-based annotation databases to characterize gene-disease associations effectively and timely. We propose an automatic method-The Disease Ontology Annotation Framework (DOAF) to provide a comprehensive annotation of the human genome using the computable Disease Ontology (DO), the NCBO Annotator service and NCBI Gene Reference Into Function (GeneRIF). DOAF can keep the resulting knowledgebase current by periodically executing automatic pipeline to re-annotate the human genome using the latest DO and GeneRIF releases at any frequency such as daily or monthly. Further, DOAF provides a computable and programmable environment which enables large-scale and integrative analysis by working with external analytic software or online service platforms. A user-friendly web interface (doa.nubic.northwestern.edu) is implemented to allow users to efficiently query, download, and view disease annotations and the underlying evidences. PMID:23251346

  19. Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases.

    PubMed

    Gobeill, Julien; Pasche, Emilie; Vishnyakova, Dina; Ruch, Patrick

    2013-01-01

    The available curated data lag behind current biological knowledge contained in the literature. Text mining can assist biologists and curators to locate and access this knowledge, for instance by characterizing the functional profile of publications. Gene Ontology (GO) category assignment in free text already supports various applications, such as powering ontology-based search engines, finding curation-relevant articles (triage) or helping the curator to identify and encode functions. Popular text mining tools for GO classification are based on so called thesaurus-based--or dictionary-based--approaches, which exploit similarities between the input text and GO terms themselves. But their effectiveness remains limited owing to the complex nature of GO terms, which rarely occur in text. In contrast, machine learning approaches exploit similarities between the input text and already curated instances contained in a knowledge base to infer a functional profile. GO Annotations (GOA) and MEDLINE make possible to exploit a growing amount of curated abstracts (97 000 in November 2012) for populating this knowledge base. Our study compares a state-of-the-art thesaurus-based system with a machine learning system (based on a k-Nearest Neighbours algorithm) for the task of proposing a functional profile for unseen MEDLINE abstracts, and shows how resources and performances have evolved. Systems are evaluated on their ability to propose for a given abstract the GO terms (2.8 on average) used for curation in GOA. We show that since 2006, although a massive effort was put into adding synonyms in GO (+300%), our thesaurus-based system effectiveness is rather constant, reaching from 0.28 to 0.31 for Recall at 20 (R20). In contrast, thanks to its knowledge base growth, our machine learning system has steadily improved, reaching from 0.38 in 2006 to 0.56 for R20 in 2012. Integrated in semi-automatic workflows or in fully automatic pipelines, such systems are more and more efficient to provide assistance to biologists. DATABASE URL: http://eagl.unige.ch/GOCat/ PMID:23842461

  20. Annotated embryonic CNS expression patterns of 5000 GMR GAL4 lines: a resource for manipulating gene expression and analyzing cis-regulatory modules

    PubMed Central

    Manning, Laurina; Heckscher, Ellie S.; Purice, Maria D.; Roberts, Jourdain; Bennett, Alysha L.; Kroll, Jason R.; Pollard, Jill L.; Strader, Marie E.; Lupton, Josh R.; Dyukareva, Anna V.; Doan, Phuong Nam; Bauer, David M.; Wilbur, Allison N.; Tanner, Stephanie; Kelly, Jimmy J.; Lai, Sen-Lin; Tran, Khoa D.; Kohwi, Minoree; Laverty, Todd R.; Pearson, Joseph C.; Crews, Stephen T.; Rubin, Gerald M.; Doe, Chris Q.

    2012-01-01

    Here we describe the embryonic CNS expression of 5,000 GAL4 lines made using molecularly defined cis-regulatory DNA inserted into a single attP genomic location. We document and annotate the patterns in early embryos when neurogenesis is at its peak, and in older embryos where there is maximal neuronal diversity and the first neural circuits are established. We note expression in other tissues such as the lateral body wall (muscle, sensory neurons, trachea) and viscera. Companion papers report on the adult brain and larval imaginal discs, and the integrated datasets are available online (www.janelia.org/flylight/gal4-gen1). This collection of embryonically-expressed GAL4 lines will be valuable for determining neuronal morphology and function; the 1862 lines expressed in small subsets of neurons (<20/segment) will be especially valuable for characterizing interneuronal diversity and function, as interneurons comprise the majority of all CNS neurons, yet their gene expression profile and function remain virtually unexplored. PMID:23063363

  1. Augmented annotation and orthologue analysis for Oryctolagus cuniculus: Better Bunny

    PubMed Central

    2012-01-01

    Background The rabbit is an important model organism used in a wide range of biomedical research. However, the rabbit genome is still sparsely annotated, thus prohibiting extensive functional analysis of gene sets derived from whole-genome experiments. We developed a web-based application that provides augmented annotation and orthologue analysis for rabbit genes. Importantly, the application allows comprehensive functional analysis through the use of orthologous relationships. Results Using data extracted from several public bioinformatics repositories we created Better Bunny, a database and query tool that extensively augments the available functional annotation for rabbit genes. Using the complete set of target genes from a commercial rabbit gene expression microarray as our benchmark, we are able to obtain functional information for 88 % of the genes on the microarray. Previously, functional information was available for fewer than 10 % of the rabbit genes. Conclusions We have developed a freely available, web-accessible bioinformatics tool that enables investigators to quickly and easily perform extensive functional analysis of rabbit genes (http://cptweb.cpt.wayne.edu). The software application fills a critical void for a wide range of biomedical research that relies on the rabbit model and requires characterization of biological function for large sets of genes. PMID:22568790

  2. Use of Modern Chemical Protein Synthesis and Advanced Fluorescent Assay Techniques to Experimentally Validate the Functional Annotation of Microbial Genomes

    SciTech Connect

    Kent, Stephen

    2012-07-20

    The objective of this research program was to prototype methods for the chemical synthesis of predicted protein molecules in annotated microbial genomes. High throughput chemical methods were to be used to make large numbers of predicted proteins and protein domains, based on microbial genome sequences. Microscale chemical synthesis methods for the parallel preparation of peptide-thioester building blocks were developed; these peptide segments are used for the parallel chemical synthesis of proteins and protein domains. Ultimately, it is envisaged that these synthetic molecules would be ‘printed’ in spatially addressable arrays. The unique ability of total synthesis to precision label protein molecules with dyes and with chemical or biochemical ‘tags’ can be used to facilitate novel assay technologies adapted from state-of-the art single molecule fluorescence detection techniques. In the future, in conjunction with modern laboratory automation this integrated set of techniques will enable high throughput experimental validation of the functional annotation of microbial genomes.

  3. Novel cardiovascular gene functions revealed via systematic phenotype prediction in zebrafish

    PubMed Central

    Musso, Gabriel; Tasan, Murat; Mosimann, Christian; Beaver, John E.; Plovie, Eva; Carr, Logan A.; Chua, Hon Nian; Dunham, Julie; Zuberi, Khalid; Rodriguez, Harold; Morris, Quaid; Zon, Leonard; Roth, Frederick P.; MacRae, Calum A.

    2014-01-01

    Comprehensive functional annotation of vertebrate genomes is fundamental to biological discovery. Reverse genetic screening has been highly useful for determination of gene function, but is untenable as a systematic approach in vertebrate model organisms given the number of surveyable genes and observable phenotypes. Unbiased prediction of gene-phenotype relationships offers a strategy to direct finite experimental resources towards likely phenotypes, thus maximizing de novo discovery of gene functions. Here we prioritized genes for phenotypic assay in zebrafish through machine learning, predicting the effect of loss of function of each of 15,106 zebrafish genes on 338 distinct embryonic anatomical processes. Focusing on cardiovascular phenotypes, the learning procedure predicted known knockdown and mutant phenotypes with high precision. In proof-of-concept studies we validated 16 high-confidence cardiac predictions using targeted morpholino knockdown and initial blinded phenotyping in embryonic zebrafish, confirming a significant enrichment for cardiac phenotypes as compared with morpholino controls. Subsequent detailed analyses of cardiac function confirmed these results, identifying novel physiological defects for 11 tested genes. Among these we identified tmem88a, a recently described attenuator of Wnt signaling, as a discrete regulator of the patterning of intercellular coupling in the zebrafish cardiac epithelium. Thus, we show that systematic prioritization in zebrafish can accelerate the pace of developmental gene function discovery. PMID:24346703

  4. Incorporating gene functions into regression analysis of DNA-protein binding data and gene expression data to construct transcriptional networks.

    PubMed

    Wei, Peng; Pan, Wei

    2008-01-01

    Useful information on transcriptional networks has been extracted by regression analyses of gene expression data and DNA-protein binding data. However, a potential limitation of these approaches is their assumption on the common and constant activity level of a transcription factor (TF) on all the genes in any given experimental condition; for example, any TF is assumed to be either an activator or a repressor, but not both, while it is known that some TFs can be dual regulators. Rather than assuming a common linear regression model for all the genes, we propose using separate regression models for various gene groups; the genes can be grouped based on their functions or some clustering results. Furthermore, to take advantage of the hierarchical structure of many existing gene function annotation systems, such as Gene Ontology (GO), we propose a shrinkage method that borrows information from relevant gene groups. Applications to a yeast dataset and simulations lend support for our proposed methods. In particular, we find that the shrinkage method consistently works well under various scenarios. We recommend the use of the shrinkage method as a useful alternative to the existing methods. PMID:18670043

  5. Evolution and Biochemistry of Family 4 Glycosidases: Implications for Assigning Enzyme Function in Sequence Annotations

    PubMed Central

    Pikis, Andreas; Thompson, John

    2009-01-01

    Glycosyl hydrolase Family 4 (GH4) is exceptional among the 114 families in this enzyme superfamily. Members of GH4 exhibit unusual cofactor requirements for activity, and an essential cysteine residue is present at the active site. Of greatest significance is the fact that members of GH4 employ a unique catalytic mechanism for cleavage of the glycosidic bond. By phylogenetic analysis, and from available substrate specificities, we have assigned a majority of the enzymes of GH4 to five subgroups. Our classification revealed an unexpected relationship between substrate specificity and the presence, in each subgroup, of a motif of four amino acids that includes the active-site Cys residue: ?-glucosidase, CHE(I/V); ?-galactosidase, CHSV; ?-glucuronidase, CHGx; 6-phospho-?-glucosidase, CDMP; and 6-phospho-?-glucosidase, CN(V/I)P. The question arises: Does the presence of a particular motif sufficiently predict the catalytic function of an unassigned GH4 protein? To test this hypothesis, we have purified and characterized the ?-glucoside–specific GH4 enzyme (PalH) from the phytopathogen, Erwinia rhapontici. The CHEI motif in this protein has been changed by site-directed mutagenesis, and the effects upon substrate specificity have been determined. The change to CHSV caused the loss of all ?-glucosidase activity, but the mutant protein exhibited none of the anticipated ?-galactosidase activity. The Cys-containing motif may be suggestive of enzyme specificity, but phylogenetic placement is required for confidence in that specificity. The Acholeplasma laidlawii GH4 protein is phylogenetically a phospho-?-glucosidase but has a unique SSSP motif. Lacking the initial Cys in that motif it cannot hydrolyze glycosides by the normal GH4 mechanism because the Cys is required to position the metal ion for hydrolysis, nor can it use the more common single or double-displacement mechanism of Koshland. Several considerations suggest that the protein has acquired a new function as the consequence of positive selection. This study emphasizes the importance of automatic annotation systems that by integrating phylogenetic analysis, functional motifs, and bioinformatics data, may lead to innovative experiments that further our understanding of biological systems. PMID:19625389

  6. Antagonistic functional duality of cancer genes.

    PubMed

    Stepanenko, A A; Vassetzky, Y S; Kavsan, V M

    2013-10-25

    Cancer evolution is a stochastic process both at the genome and gene levels. Most of tumors contain multiple genetic subclones, evolving in either succession or in parallel, either in a linear or branching manner, with heterogeneous genome and gene alterations, extensively rewired signaling networks, and addicted to multiple oncogenes easily switching with each other during cancer progression and medical intervention. Hundreds of discovered cancer genes are classified according to whether they function in a dominant (oncogenes) or recessive (tumor suppressor genes) manner in a cancer cell. However, there are many cancer "gene-chameleons", which behave distinctly in opposite way in the different experimental settings showing antagonistic duality. In contrast to the widely accepted view that mutant NADP(+)-dependent isocitrate dehydrogenases 1/2 (IDH1/2) and associated metabolite 2-hydroxyglutarate (R)-enantiomer are intrinsically "the drivers" of tumourigenesis, mutant IDH1/2 inhibited, promoted or had no effect on cell proliferation, growth and tumorigenicity in diverse experiments. Similar behavior was evidenced for dozens of cancer genes. Gene function is dependent on genetic network, which is defined by the genome context. The overall changes in karyotype can result in alterations of the role and function of the same genes and pathways. The diverse cell lines and tumor samples have been used in experiments for proving gene tumor promoting/suppressive activity. They all display heterogeneous individual karyotypes and disturbed signaling networks. Consequently, the effect and function of gene under investigation can be opposite and versatile in cells with different genomes that may explain antagonistic duality of cancer genes and the cell type- or the cellular genetic/context-dependent response to the same protein. Antagonistic duality of cancer genes might contribute to failure of chemotherapy. Instructive examples of unexpected activity of cancer genes and "paradoxical" effects of different anticancer drugs depending on the cellular genetic context/signaling network are discussed. PMID:23933273

  7. Annotation of metabolic and biosynthesis genes from Hessian fly (Diptera: Cecidomyiidae)

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The Hessian fly is the major insect pest of wheat in the southeastern United States and has traditionally been controlled through the utilization of Hessian fly resistance (R) genes in wheat. Such R genes are a limited resource, and once deployed lose their field effectiveness with time. Using 21 ...

  8. Automation of Drosophila gene expression pattern image annotation : development of web-based image annotation tool and application of machine learning methods

    E-print Network

    Ayuso, Anna Maria E

    2011-01-01

    Large-scale in situ hybridization screens are providing an abundance of spatio-temporal patterns of gene expression data that is valuable for understanding the mechanisms of gene regulation. Drosophila gene expression ...

  9. Evaluating the significance of protein functional similarity based on gene ontology.

    PubMed

    Konopka, Bogumil M; Golda, Tomasz; Kotulska, Malgorzata

    2014-11-01

    Gene ontology is among the most successful ontologies in the biomedical domain. It is used to describe, unambiguously, protein molecular functions, cellular localizations, and processes in which proteins participate. The hierarchical structure of gene ontology allows quantifying protein functional similarity by application of algorithms that calculate semantic similarities. The scores, however, are meaningless without a given context. Here, we propose how to evaluate the significance of protein function semantic similarity scores by comparing them to reference distributions calculated for randomly chosen proteins. In the study, thresholds for significant functional semantic similarity, in four representative annotation corpuses, were estimated. We also show that the score significance is influenced by the number and specificity of gene ontology terms that are annotated to compared proteins. While proteins with a greater number of terms tend to yield higher similarity scores, proteins with more specific terms produce lower scores. The estimated significance thresholds were validated using protein sequence-function and structure-function relationships. Taking into account the term number and term specificity improves the distinction between significant and insignificant semantic similarity comparisons. PMID:25188814

  10. Interestingness measures and strategies for mining multi-ontology multi-level association rules from gene ontology annotations for the discovery of new GO relationships.

    PubMed

    Manda, Prashanti; McCarthy, Fiona; Bridges, Susan M

    2013-10-01

    The Gene Ontology (GO), a set of three sub-ontologies, is one of the most popular bio-ontologies used for describing gene product characteristics. GO annotation data containing terms from multiple sub-ontologies and at different levels in the ontologies is an important source of implicit relationships between terms from the three sub-ontologies. Data mining techniques such as association rule mining that are tailored to mine from multiple ontologies at multiple levels of abstraction are required for effective knowledge discovery from GO annotation data. We present a data mining approach, Multi-ontology data mining at All Levels (MOAL) that uses the structure and relationships of the GO to mine multi-ontology multi-level association rules. We introduce two interestingness measures: Multi-ontology Support (MOSupport) and Multi-ontology Confidence (MOConfidence) customized to evaluate multi-ontology multi-level association rules. We also describe a variety of post-processing strategies for pruning uninteresting rules. We use publicly available GO annotation data to demonstrate our methods with respect to two applications (1) the discovery of co-annotation suggestions and (2) the discovery of new cross-ontology relationships. PMID:23850840

  11. Smoking Gun or Circumstantial Evidence? Comparison of Statistical Learning Methods using Functional Annotations for Prioritizing Risk Variants.

    PubMed

    Gagliano, Sarah A; Ravji, Reena; Barnes, Michael R; Weale, Michael E; Knight, Jo

    2015-01-01

    Although technology has triumphed in facilitating routine genome sequencing, new challenges have been created for the data-analyst. Genome-scale surveys of human variation generate volumes of data that far exceed capabilities for laboratory characterization. By incorporating functional annotations as predictors, statistical learning has been widely investigated for prioritizing genetic variants likely to be associated with complex disease. We compared three published prioritization procedures, which use different statistical learning algorithms and different predictors with regard to the quantity, type and coding. We also explored different combinations of algorithm and annotation set. As an application, we tested which methodology performed best for prioritizing variants using data from a large schizophrenia meta-analysis by the Psychiatric Genomics Consortium. Results suggest that all methods have considerable (and similar) predictive accuracies (AUCs 0.64-0.71) in test set data, but there is more variability in the application to the schizophrenia GWAS. In conclusion, a variety of algorithms and annotations seem to have a similar potential to effectively enrich true risk variants in genome-scale datasets, however none offer more than incremental improvement in prediction. We discuss how methods might be evolved for risk variant prediction to address the impending bottleneck of the new generation of genome re-sequencing studies. PMID:26300220

  12. Smoking Gun or Circumstantial Evidence? Comparison of Statistical Learning Methods using Functional Annotations for Prioritizing Risk Variants

    PubMed Central

    Gagliano, Sarah A.; Ravji, Reena; Barnes, Michael R.; Weale, Michael E.; Knight, Jo

    2015-01-01

    Although technology has triumphed in facilitating routine genome sequencing, new challenges have been created for the data-analyst. Genome-scale surveys of human variation generate volumes of data that far exceed capabilities for laboratory characterization. By incorporating functional annotations as predictors, statistical learning has been widely investigated for prioritizing genetic variants likely to be associated with complex disease. We compared three published prioritization procedures, which use different statistical learning algorithms and different predictors with regard to the quantity, type and coding. We also explored different combinations of algorithm and annotation set. As an application, we tested which methodology performed best for prioritizing variants using data from a large schizophrenia meta-analysis by the Psychiatric Genomics Consortium. Results suggest that all methods have considerable (and similar) predictive accuracies (AUCs 0.64–0.71) in test set data, but there is more variability in the application to the schizophrenia GWAS. In conclusion, a variety of algorithms and annotations seem to have a similar potential to effectively enrich true risk variants in genome-scale datasets, however none offer more than incremental improvement in prediction. We discuss how methods might be evolved for risk variant prediction to address the impending bottleneck of the new generation of genome re-sequencing studies. PMID:26300220

  13. IMG ER: A System for Microbial Genome Annotation Expert Review and Curation

    SciTech Connect

    Markowitz, Victor M.; Mavromatis, Konstantinos; Ivanova, Natalia N.; Chen, I-Min A.; Chu, Ken; Kyrpides, Nikos C.

    2009-05-25

    A rapidly increasing number of microbial genomes are sequenced by organizations worldwide and are eventually included into various public genome data resources. The quality of the annotations depends largely on the original dataset providers, with erroneous or incomplete annotations often carried over into the public resources and difficult to correct. We have developed an Expert Review (ER) version of the Integrated Microbial Genomes (IMG) system, with the goal of supporting systematic and efficient revision of microbial genome annotations. IMG ER provides tools for the review and curation of annotations of both new and publicly available microbial genomes within IMG's rich integrated genome framework. New genome datasets are included into IMG ER prior to their public release either with their native annotations or with annotations generated by IMG ER's annotation pipeline. IMG ER tools allow addressing annotation problems detected with IMG's comparative analysis tools, such as genes missed by gene prediction pipelines or genes without an associated function. Over the past year, IMG ER was used for improving the annotations of about 150 microbial genomes.

  14. Combined QTL and Selective Sweep Mappings with Coding SNP Annotation and cis-eQTL Analysis Revealed PARK2 and JAG2 as New Candidate Genes for Adiposity Regulation

    PubMed Central

    Roux, Pierre-François; Boitard, Simon; Blum, Yuna; Parks, Brian; Montagner, Alexandra; Mouisel, Etienne; Djari, Anis; Esquerré, Diane; Désert, Colette; Boutin, Morgane; Leroux, Sophie; Lecerf, Frédéric; Le Bihan-Duval, Elisabeth; Klopp, Christophe; Servin, Bertrand; Pitel, Frédérique; Duclos, Michel Jean; Guillou, Hervé; Lusis, Aldons J.; Demeure, Olivier; Lagarrigue, Sandrine

    2015-01-01

    Very few causal genes have been identified by quantitative trait loci (QTL) mapping because of the large size of QTL, and most of them were identified thanks to functional links already known with the targeted phenotype. Here, we propose to combine selection signature detection, coding SNP annotation, and cis-expression QTL analyses to identify potential causal genes underlying QTL identified in divergent line designs. As a model, we chose experimental chicken lines divergently selected for only one trait, the abdominal fat weight, in which several QTL were previously mapped. Using new haplotype-based statistics exploiting the very high SNP density generated through whole-genome resequencing, we found 129 significant selective sweeps. Most of the QTL colocalized with at least one sweep, which markedly narrowed candidate region size. Some of those sweeps contained only one gene, therefore making them strong positional causal candidates with no presupposed function. We then focused on two of these QTL/sweeps. The absence of nonsynonymous SNPs in their coding regions strongly suggests the existence of causal mutations acting in cis on their expression, confirmed by cis-eQTL identification using either allele-specific expression or genetic mapping analyses. Additional expression analyses of those two genes in the chicken and mice contrasted for adiposity reinforces their link with this phenotype. This study shows for the first time the interest of combining selective sweeps mapping, coding SNP annotation and cis-eQTL analyses for identifying causative genes for a complex trait, in the context of divergent lines selected for this specific trait. Moreover, it highlights two genes, JAG2 and PARK2, as new potential negative and positive key regulators of adiposity in chicken and mice. PMID:25653314

  15. Combined QTL and selective sweep mappings with coding SNP annotation and cis-eQTL analysis revealed PARK2 and JAG2 as new candidate genes for adiposity regulation.

    PubMed

    Roux, Pierre-François; Boitard, Simon; Blum, Yuna; Parks, Brian; Montagner, Alexandra; Mouisel, Etienne; Djari, Anis; Esquerré, Diane; Désert, Colette; Boutin, Morgane; Leroux, Sophie; Lecerf, Frédéric; Le Bihan-Duval, Elisabeth; Klopp, Christophe; Servin, Bertrand; Pitel, Frédérique; Duclos, Michel Jean; Guillou, Hervé; Lusis, Aldons J; Demeure, Olivier; Lagarrigue, Sandrine

    2015-04-01

    Very few causal genes have been identified by quantitative trait loci (QTL) mapping because of the large size of QTL, and most of them were identified thanks to functional links already known with the targeted phenotype. Here, we propose to combine selection signature detection, coding SNP annotation, and cis-expression QTL analyses to identify potential causal genes underlying QTL identified in divergent line designs. As a model, we chose experimental chicken lines divergently selected for only one trait, the abdominal fat weight, in which several QTL were previously mapped. Using new haplotype-based statistics exploiting the very high SNP density generated through whole-genome resequencing, we found 129 significant selective sweeps. Most of the QTL colocalized with at least one sweep, which markedly narrowed candidate region size. Some of those sweeps contained only one gene, therefore making them strong positional causal candidates with no presupposed function. We then focused on two of these QTL/sweeps. The absence of nonsynonymous SNPs in their coding regions strongly suggests the existence of causal mutations acting in cis on their expression, confirmed by cis-eQTL identification using either allele-specific expression or genetic mapping analyses. Additional expression analyses of those two genes in the chicken and mice contrasted for adiposity reinforces their link with this phenotype. This study shows for the first time the interest of combining selective sweeps mapping, coding SNP annotation and cis-eQTL analyses for identifying causative genes for a complex trait, in the context of divergent lines selected for this specific trait. Moreover, it highlights two genes, JAG2 and PARK2, as new potential negative and positive key regulators of adiposity in chicken and mice. PMID:25653314

  16. DaGO-Fun: tool for Gene Ontology-based functional analysis using term information content measures

    PubMed Central

    2013-01-01

    Background The use of Gene Ontology (GO) data in protein analyses have largely contributed to the improved outcomes of these analyses. Several GO semantic similarity measures have been proposed in recent years and provide tools that allow the integration of biological knowledge embedded in the GO structure into different biological analyses. There is a need for a unified tool that provides the scientific community with the opportunity to explore these different GO similarity measure approaches and their biological applications. Results We have developed DaGO-Fun, an online tool available at http://web.cbio.uct.ac.za/ITGOM, which incorporates many different GO similarity measures for exploring, analyzing and comparing GO terms and proteins within the context of GO. It uses GO data and UniProt proteins with their GO annotations as provided by the Gene Ontology Annotation (GOA) project to precompute GO term information content (IC), enabling rapid response to user queries. Conclusions The DaGO-Fun online tool presents the advantage of integrating all the relevant IC-based GO similarity measures, including topology- and annotation-based approaches to facilitate effective exploration of these measures, thus enabling users to choose the most relevant approach for their application. Furthermore, this tool includes several biological applications related to GO semantic similarity scores, including the retrieval of genes based on their GO annotations, the clustering of functionally related genes within a set, and term enrichment analysis. PMID:24067102

  17. Automatic Annotation Techniques for Gene Expression Images of the Fruit Fly Embryo

    E-print Network

    Kumar, Sudhir

    such as view, orientation, and stage of development of the embryos, among other biological features biological images depicting gene expression patterns in developing embryos of fruit fly (Drosophila knowledge in the developmental biological community. Keywords: view, orientation, stage, curvature analysis

  18. Functional Annotation of the Ophiostoma novo-ulmi Genome: Insights into the Phytopathogenicity of the Fungal Agent of Dutch Elm Disease

    PubMed Central

    Comeau, André M.; Dufour, Josée; Bouvet, Guillaume F.; Jacobi, Volker; Nigg, Martha; Henrissat, Bernard; Laroche, Jérôme; Levesque, Roger C.; Bernier, Louis

    2015-01-01

    The ascomycete fungus Ophiostoma novo-ulmi is responsible for the pandemic of Dutch elm disease that has been ravaging Europe and North America for 50 years. We proceeded to annotate the genome of the O. novo-ulmi strain H327 that was sequenced in 2012. The 31.784-Mb nuclear genome (50.1% GC) is organized into 8 chromosomes containing a total of 8,640 protein-coding genes that we validated with RNA sequencing analysis. Approximately 53% of these genes have their closest match to Grosmannia clavigera kw1407, followed by 36% in other close Sordariomycetes, 5% in other Pezizomycotina, and surprisingly few (5%) orphans. A relatively small portion (?3.4%) of the genome is occupied by repeat sequences; however, the mechanism of repeat-induced point mutation appears active in this genome. Approximately 76% of the proteins could be assigned functions using Gene Ontology analysis; we identified 311 carbohydrate-active enzymes, 48 cytochrome P450s, and 1,731 proteins potentially involved in pathogen–host interaction, along with 7 clusters of fungal secondary metabolites. Complementary mating-type locus sequencing, mating tests, and culturing in the presence of elm terpenes were conducted. Our analysis identified a specific genetic arsenal impacting the sexual and vegetative growth, phytopathogenicity, and signaling/plant–defense–degradation relationship between O. novo-ulmi and its elm host and insect vectors. PMID:25539722

  19. RNA Interference for Wheat Functional Gene Analysis

    Technology Transfer Automated Retrieval System (TEKTRAN)

    RNA interference (RNAi) refers to a common mechanism of RNA-based post-transcriptional gene silencing in eukaryotic cells. In model plant species such as Arabidopsis and rice, RNAi has been routinely used to characterize gene function and to engineer novel phenotypes. In polyploid species, this appr...

  20. Visualizing the Gene Ontology-Annotated Clusters of Co-expressed Genes: A Two-Design Study

    E-print Network

    Hong,Seokhee

    ] provide visualization on the parent-child relationship within the GO hierarchy, biologists are often more Visual Analysis [5] provide visualizations of the gene-to-GO relationships while hiding the parent-child concerned with the visualization of gene-to-GO relationships. This is because the cooperative regulation

  1. Predicting gene function from images of cells

    E-print Network

    Jones, Thouis Raymond, 1971-

    2007-01-01

    This dissertation shows that biologically meaningful predictions can be made by analyzing images of cells. In particular, groups of related genes and their biological functions can be predicted using images from large ...

  2. Matrix factorization-based data fusion for gene function prediction in baker's yeast and slime mold.

    PubMed

    Zitnik, Marinka; Zupan, Blaž

    2014-01-01

    The development of effective methods for the characterization of gene functions that are able to combine diverse data sources in a sound and easily-extendible way is an important goal in computational biology. We have previously developed a general matrix factorization-based data fusion approach for gene function prediction. In this manuscript, we show that this data fusion approach can be applied to gene function prediction and that it can fuse various heterogeneous data sources, such as gene expression profiles, known protein annotations, interaction and literature data. The fusion is achieved by simultaneous matrix tri-factorization that shares matrix factors between sources. We demonstrate the effectiveness of the approach by evaluating its performance on predicting ontological annotations in slime mold D. discoideum and on recognizing proteins of baker's yeast S. cerevisiae that participate in the ribosome or are located in the cell membrane. Our approach achieves predictive performance comparable to that of the state-of-the-art kernel-based data fusion, but requires fewer data preprocessing steps. PMID:24297565

  3. A Manual Curation Strategy to Improve Genome Annotation: Application to a Set of Haloarchael Genomes

    PubMed Central

    Pfeiffer, Friedhelm; Oesterhelt, Dieter

    2015-01-01

    Genome annotation errors are a persistent problem that impede research in the biosciences. A manual curation effort is described that attempts to produce high-quality genome annotations for a set of haloarchaeal genomes (Halobacterium salinarum and Hbt. hubeiense, Haloferax volcanii and Hfx. mediterranei, Natronomonas pharaonis and Nmn. moolapensis, Haloquadratum walsbyi strains HBSQ001 and C23, Natrialba magadii, Haloarcula marismortui and Har. hispanica, and Halohasta litchfieldiae). Genomes are checked for missing genes, start codon misassignments, and disrupted genes. Assignments of a specific function are preferably based on experimentally characterized homologs (Gold Standard Proteins). To avoid overannotation, which is a major source of database errors, we restrict annotation to only general function assignments when support for a specific substrate assignment is insufficient. This strategy results in annotations that are resistant to the plethora of errors that compromise public databases. Annotation consistency is rigorously validated for ortholog pairs from the genomes surveyed. The annotation is regularly crosschecked against the UniProt database to further improve annotations and increase the level of standardization. Enhanced genome annotations are submitted to public databases (EMBL/GenBank, UniProt), to the benefit of the scientific community. The enhanced annotations are also publically available via HaloLex. PMID:26042526

  4. Functional characterization of drought-responsive modules and genes in Oryza sativa: a network-based approach

    PubMed Central

    Sircar, Sanchari; Parekh, Nita

    2015-01-01

    Drought is one of the major environmental stress conditions affecting the yield of rice across the globe. Unraveling the functional roles of the drought-responsive genes and their underlying molecular mechanisms will provide important leads to improve the yield of rice. Co-expression relationships derived from condition-dependent gene expression data is an effective way to identify the functional associations between genes that are part of the same biological process and may be under similar transcriptional control. For this purpose, vast amount of freely available transcriptomic data may be used. In this study, we consider gene expression data for different tissues and developmental stages in response to drought stress. We analyze the network of co-expressed genes to identify drought-responsive genes modules in a tissue and stage-specific manner based on differential expression and gene enrichment analysis. Taking cues from the systems-level behavior of these modules, we propose two approaches to identify clusters of tightly co-expressed/co-regulated genes. Using graph-centrality measures and differential gene expression, we identify biologically informative genes that lack any functional annotation. We show that using orthologous information from other plant species, the conserved co-expression patterns of the uncharacterized genes can be identified. Presence of a conserved neighborhood enables us to extrapolate functional annotation. Alternatively, we show that single ‘guide-gene’ approach can help in understanding tissue-specific transcriptional regulation of uncharacterized genes. Finally, we confirm the predicted roles of uncharacterized genes by the analysis of conserved cis-elements and explain the possible roles of these genes toward drought tolerance. PMID:26284112

  5. Functional Study of Genes Essential for Autogamy and Nuclear Reorganization in Paramecium?§

    PubMed Central

    Nowak, Jacek K.; Gromadka, Robert; Juszczuk, Marek; Jerka-Dziadosz, Maria; Maliszewska, Kamila; Mucchielli, Marie-Hélène; Gout, Jean-François; Arnaiz, Olivier; Agier, Nicolas; Tang, Thomas; Aggerbeck, Lawrence P.; Cohen, Jean; Delacroix, Hervé; Sperling, Linda; Herbert, Christopher J.; Zagulski, Marek; Bétermier, Mireille

    2011-01-01

    Like all ciliates, Paramecium tetraurelia is a unicellular eukaryote that harbors two kinds of nuclei within its cytoplasm. At each sexual cycle, a new somatic macronucleus (MAC) develops from the germ line micronucleus (MIC) through a sequence of complex events, which includes meiosis, karyogamy, and assembly of the MAC genome from MIC sequences. The latter process involves developmentally programmed genome rearrangements controlled by noncoding RNAs and a specialized RNA interference machinery. We describe our first attempts to identify genes and biological processes that contribute to the progression of the sexual cycle. Given the high percentage of unknown genes annotated in the P. tetraurelia genome, we applied a global strategy to monitor gene expression profiles during autogamy, a self-fertilization process. We focused this pilot study on the genes carried by the largest somatic chromosome and designed dedicated DNA arrays covering 484 genes from this chromosome (1.2% of all genes annotated in the genome). Transcriptome analysis revealed four major patterns of gene expression, including two successive waves of gene induction. Functional analysis of 15 upregulated genes revealed four that are essential for vegetative growth, one of which is involved in the maintenance of MAC integrity and another in cell division or membrane trafficking. Two additional genes, encoding a MIC-specific protein and a putative RNA helicase localizing to the old and then to the new MAC, are specifically required during sexual processes. Our work provides a proof of principle that genes essential for meiosis and nuclear reorganization can be uncovered following genome-wide transcriptome analysis. PMID:21257794

  6. GeneCards Version 3: the human gene integrator

    PubMed Central

    Safran, Marilyn; Dalah, Irina; Alexander, Justin; Rosen, Naomi; Iny Stein, Tsippi; Shmoish, Michael; Nativ, Noam; Bahir, Iris; Doniger, Tirza; Krug, Hagit; Sirota-Madi, Alexandra; Olender, Tsviya; Golan, Yaron; Stelzer, Gil; Harel, Arye; Lancet, Doron

    2010-01-01

    GeneCards (www.genecards.org) is a comprehensive, authoritative compendium of annotative information about human genes, widely used for nearly 15 years. Its gene-centric content is automatically mined and integrated from over 80 digital sources, resulting in a web-based deep-linked card for each of >73 000 human gene entries, encompassing the following categories: protein coding, pseudogene, RNA gene, genetic locus, cluster and uncategorized. We now introduce GeneCards Version 3, featuring a speedy and sophisticated search engine and a revamped, technologically enabling infrastructure, catering to the expanding needs of biomedical researchers. A key focus is on gene-set analyses, which leverage GeneCards’ unique wealth of combinatorial annotations. These include the GeneALaCart batch query facility, which tabulates user-selected annotations for multiple genes and GeneDecks, which identifies similar genes with shared annotations, and finds set-shared annotations by descriptor enrichment analysis. Such set-centric features address a host of applications, including microarray data analysis, cross-database annotation mapping and gene-disorder associations for drug targeting. We highlight the new Version 3 database architecture, its multi-faceted search engine, and its semi-automated quality assurance system. Data enhancements include an expanded visualization of gene expression patterns in normal and cancer tissues, an integrated alternative splicing pattern display, and augmented multi-source SNPs and pathways sections. GeneCards now provides direct links to gene-related research reagents such as antibodies, recombinant proteins, DNA clones and inhibitory RNAs and features gene-related drugs and compounds lists. We also portray the GeneCards Inferred Functionality Score annotation landscape tool for scoring a gene’s functional information status. Finally, we delineate examples of applications and collaborations that have benefited from the GeneCards suite. Database URL: www.genecards.org PMID:20689021

  7. Mining a database of single amplified genomes from Red Sea brine pool extremophiles—improving reliability of gene function prediction using a profile and pattern matching algorithm (PPMA)

    PubMed Central

    Grötzinger, Stefan W.; Alam, Intikhab; Ba Alawi, Wail; Bajic, Vladimir B.; Stingl, Ulrich; Eppinger, Jörg

    2014-01-01

    Reliable functional annotation of genomic data is the key-step in the discovery of novel enzymes. Intrinsic sequencing data quality problems of single amplified genomes (SAGs) and poor homology of novel extremophile's genomes pose significant challenges for the attribution of functions to the coding sequences identified. The anoxic deep-sea brine pools of the Red Sea are a promising source of novel enzymes with unique evolutionary adaptation. Sequencing data from Red Sea brine pool cultures and SAGs are annotated and stored in the Integrated Data Warehouse of Microbial Genomes (INDIGO) data warehouse. Low sequence homology of annotated genes (no similarity for 35% of these genes) may translate into false positives when searching for specific functions. The Profile and Pattern Matching (PPM) strategy described here was developed to eliminate false positive annotations of enzyme function before progressing to labor-intensive hyper-saline gene expression and characterization. It utilizes InterPro-derived Gene Ontology (GO)-terms (which represent enzyme function profiles) and annotated relevant PROSITE IDs (which are linked to an amino acid consensus pattern). The PPM algorithm was tested on 15 protein families, which were selected based on scientific and commercial potential. An initial list of 2577 enzyme commission (E.C.) numbers was translated into 171 GO-terms and 49 consensus patterns. A subset of INDIGO-sequences consisting of 58 SAGs from six different taxons of bacteria and archaea were selected from six different brine pool environments. Those SAGs code for 74,516 genes, which were independently scanned for the GO-terms (profile filter) and PROSITE IDs (pattern filter). Following stringent reliability filtering, the non-redundant hits (106 profile hits and 147 pattern hits) are classified as reliable, if at least two relevant descriptors (GO-terms and/or consensus patterns) are present. Scripts for annotation, as well as for the PPM algorithm, are available through the INDIGO website. PMID:24778629

  8. Bioinformatic approaches for functional annotation and pathway inference in metagenomics data

    PubMed Central

    De Filippo, Carlotta; Ramazzotti, Matteo; Fontana, Paolo; Cavalieri, Duccio

    2012-01-01

    Metagenomic approaches are increasingly recognized as a baseline for understanding the ecology and evolution of microbial ecosystems. The development of methods for pathway inference from metagenomics data is of paramount importance to link a phenotype to a cascade of events stemming from a series of connected sets of genes or proteins. Biochemical and regulatory pathways have until recently been thought and modelled within one cell type, one organism, one species. This vision is being dramatically changed by the advent of whole microbiome sequencing studies, revealing the role of symbiotic microbial populations in fundamental biochemical functions. The new landscape we face requires a clear picture of the potentialities of existing tools and development of new tools to characterize, reconstruct and model biochemical and regulatory pathways as the result of integration of function in complex symbiotic interactions of ontologically and evolutionary distinct cell types. PMID:23175748

  9. The JCVI standard operating procedure for annotating prokaryotic metagenomic shotgun sequencing data.

    PubMed

    Tanenbaum, David M; Goll, Johannes; Murphy, Sean; Kumar, Prateek; Zafar, Nikhat; Thiagarajan, Mathangi; Madupu, Ramana; Davidsen, Tanja; Kagan, Leonid; Kravitz, Saul; Rusch, Douglas B; Yooseph, Shibu

    2010-01-01

    The JCVI metagenomics analysis pipeline provides for the efficient and consistent annotation of shotgun metagenomics sequencing data for sampling communities of prokaryotic organisms. The process can be equally applied to individual sequence reads from traditional Sanger capillary electrophoresis sequences, newer technologies such as 454 pyrosequencing, or sequence assemblies derived from one or more of these data types. It includes the analysis of both coding and non-coding genes, whether full-length or, as is often the case for shotgun metagenomics, fragmentary. The system is designed to provide the best-supported conservative functional annotation based on a combination of trusted homology-based scientific evidence and computational assertions and an annotation value hierarchy established through extensive manual curation. The functional annotation attributes assigned by this system include gene name, gene symbol, GO terms, EC numbers, and JCVI functional role categories. PMID:21304707

  10. Automatic annotation of organellar genomes with DOGMA

    SciTech Connect

    Wyman, Stacia; Jansen, Robert K.; Boore, Jeffrey L.

    2004-06-01

    Dual Organellar GenoMe Annotator (DOGMA) automates the annotation of extra-nuclear organellar (chloroplast and animal mitochondrial) genomes. It is a web-based package that allows the use of comparative BLAST searches to identify and annotate genes in a genome. DOGMA presents a list of putative genes to the user in a graphical format for viewing and editing. Annotations are stored on our password-protected server. Complete annotations can be extracted for direct submission to GenBank. Furthermore, intergenic regions of specified length can be extracted, as well the nucleotide sequences and amino acid sequences of the genes.

  11. Identification and functional analysis of ‘hypothetical’ genes expressed in Haemophilus influenzae

    PubMed Central

    Kolker, Eugene; Makarova, Kira S.; Shabalina, Svetlana; Picone, Alex F.; Purvine, Samuel; Holzman, Ted; Cherny, Tim; Armbruster, David; Munson, Robert S.; Kolesov, Grigory; Frishman, Dmitrij; Galperin, Michael Y.

    2004-01-01

    The progress in genome sequencing has led to a rapid accumulation in GenBank submissions of uncharacterized ‘hypothetical’ genes. These genes, which have not been experimentally characterized and whose functions cannot be deduced from simple sequence comparisons alone, now comprise a significant fraction of the public databases. Expression analyses of Haemophilus influenzae cells using a combination of transcriptomic and proteomic approaches resulted in confident identification of 54 ‘hypothetical’ genes that were expressed in cells under normal growth conditions. In an attempt to understand the functions of these proteins, we used a variety of publicly available analysis tools. Close homologs in other species were detected for each of the 54 ‘hypothetical’ genes. For 16 of them, exact functional assignments could be found in one or more public databases. Additionally, we were able to suggest general functional characterization for 27 more genes (comprising ?80% total). Findings from this analysis include the identification of a pyruvate-formate lyase-like operon, likely to be expressed not only in H.influenzae but also in several other bacteria. Further, we also observed three genes that are likely to participate in the transport and/or metabolism of sialic acid, an important component of the H.influenzae lipo-oligosaccharide. Accurate functional annotation of uncharacterized genes calls for an integrative approach, combining expression studies with extensive computational analysis and curation, followed by eventual experimental verification of the computational predictions. PMID:15121896

  12. GeneCards Version 3: the human gene integrator.

    PubMed

    Safran, Marilyn; Dalah, Irina; Alexander, Justin; Rosen, Naomi; Iny Stein, Tsippi; Shmoish, Michael; Nativ, Noam; Bahir, Iris; Doniger, Tirza; Krug, Hagit; Sirota-Madi, Alexandra; Olender, Tsviya; Golan, Yaron; Stelzer, Gil; Harel, Arye; Lancet, Doron

    2010-01-01

    GeneCards (www.genecards.org) is a comprehensive, authoritative compendium of annotative information about human genes, widely used for nearly 15 years. Its gene-centric content is automatically mined and integrated from over 80 digital sources, resulting in a web-based deep-linked card for each of >73,000 human gene entries, encompassing the following categories: protein coding, pseudogene, RNA gene, genetic locus, cluster and uncategorized. We now introduce GeneCards Version 3, featuring a speedy and sophisticated search engine and a revamped, technologically enabling infrastructure, catering to the expanding needs of biomedical researchers. A key focus is on gene-set analyses, which leverage GeneCards' unique wealth of combinatorial annotations. These include the GeneALaCart batch query facility, which tabulates user-selected annotations for multiple genes and GeneDecks, which identifies similar genes with shared annotations, and finds set-shared annotations by descriptor enrichment analysis. Such set-centric features address a host of applications, including microarray data analysis, cross-database annotation mapping and gene-disorder associations for drug targeting. We highlight the new Version 3 database architecture, its multi-faceted search engine, and its semi-automated quality assurance system. Data enhancements include an expanded visualization of gene expression patterns in normal and cancer tissues, an integrated alternative splicing pattern display, and augmented multi-source SNPs and pathways sections. GeneCards now provides direct links to gene-related research reagents such as antibodies, recombinant proteins, DNA clones and inhibitory RNAs and features gene-related drugs and compounds lists. We also portray the GeneCards Inferred Functionality Score annotation landscape tool for scoring a gene's functional information status. Finally, we delineate examples of applications and collaborations that have benefited from the GeneCards suite. Database URL: www.genecards.org. PMID:20689021

  13. Neural networks approaches for discovering the learnable correlation between gene function and gene expression in mouse

    E-print Network

    Morris, Quaid

    , especially in gene therapy [18]. Identifying gene function in prokaryotes is much easier than eukaryotes dueNeural networks approaches for discovering the learnable correlation between gene function and gene Keywords: Gene function prediction Self organizing maps (SOM) Multilayer perceptrons (MLP) Gene expression

  14. Rapid High Resolution Genotyping of Francisella tularensis by Whole Genome Sequence Comparison of Annotated Genes (“MLST+”)

    PubMed Central

    Mellmann, Alexander; Höppner, Sebastian; Splettstoesser, Wolf D.; Harmsen, Dag

    2015-01-01

    The zoonotic disease tularemia is caused by the bacterium Francisella tularensis. This pathogen is considered as a category A select agent with potential to be misused in bioterrorism. Molecular typing based on DNA-sequence like canSNP-typing or MLVA has become the accepted standard for this organism. Due to the organism’s highly clonal nature, the current typing methods have reached their limit of discrimination for classifying closely related subpopulations within the subspecies F. tularensis ssp. holarctica. We introduce a new gene-by-gene approach, MLST+, based on whole genome data of 15 sequenced F. tularensis ssp. holarctica strains and apply this approach to investigate an epidemic of lethal tularemia among non-human primates in two animal facilities in Germany. Due to the high resolution of MLST+ we are able to demonstrate that three independent clones of this highly infectious pathogen were responsible for these spatially and temporally restricted outbreaks. PMID:25856198

  15. Rapid high resolution genotyping of Francisella tularensis by whole genome sequence comparison of annotated genes ("MLST+").

    PubMed

    Antwerpen, Markus H; Prior, Karola; Mellmann, Alexander; Höppner, Sebastian; Splettstoesser, Wolf D; Harmsen, Dag

    2015-01-01

    The zoonotic disease tularemia is caused by the bacterium Francisella tularensis. This pathogen is considered as a category A select agent with potential to be misused in bioterrorism. Molecular typing based on DNA-sequence like canSNP-typing or MLVA has become the accepted standard for this organism. Due to the organism's highly clonal nature, the current typing methods have reached their limit of discrimination for classifying closely related subpopulations within the subspecies F. tularensis ssp. holarctica. We introduce a new gene-by-gene approach, MLST+, based on whole genome data of 15 sequenced F. tularensis ssp. holarctica strains and apply this approach to investigate an epidemic of lethal tularemia among non-human primates in two animal facilities in Germany. Due to the high resolution of MLST+ we are able to demonstrate that three independent clones of this highly infectious pathogen were responsible for these spatially and temporally restricted outbreaks. PMID:25856198

  16. Morpholinos: studying gene function in the chick

    PubMed Central

    Norris, Anneliese; Streit, Andrea

    2014-01-01

    The use of morpholinos for perturbing gene function in the chick, Gallus gallus, has led to many important discoveries in developmental biology. This technology makes use of in vivo electroporation, which allows gain and loss of function in a temporally, and spatially controlled manner. Using this method, morpholinos can be transfected into embryonic tissues from early to late developmental stages. In this article, we describe the methods currently used in our laboratory to knock down gene function using morpholinos in vivo. We also detail how morpholinos are used to provide consistency of the results, and describe two protocols to visualise the morpholino after electroporation. In addition, we provide guidance on avoiding potential pitfalls, and suggestions for troubleshooting solutions. These revised techniques provide a practical starting point for investigating gene function in the chick. PMID:24184187

  17. Large-scale prediction of long non-coding RNA functions in a coding–non-coding gene co-expression network

    PubMed Central

    Liao, Qi; Liu, Changning; Yuan, Xiongying; Kang, Shuli; Miao, Ruoyu; Xiao, Hui; Zhao, Guoguang; Luo, Haitao; Bu, Dechao; Zhao, Haitao; Skogerbø, Geir; Wu, Zhongdao; Zhao, Yi

    2011-01-01

    Although accumulating evidence has provided insight into the various functions of long-non-coding RNAs (lncRNAs), the exact functions of the majority of such transcripts are still unknown. Here, we report the first computational annotation of lncRNA functions based on public microarray expression profiles. A coding–non-coding gene co-expression (CNC) network was constructed from re-annotated Affymetrix Mouse Genome Array data. Probable functions for altogether 340 lncRNAs were predicted based on topological or other network characteristics, such as module sharing, association with network hubs and combinations of co-expression and genomic adjacency. The functions annotated to the lncRNAs mainly involve organ or tissue development (e.g. neuron, eye and muscle development), cellular transport (e.g. neuronal transport and sodium ion, acid or lipid transport) or metabolic processes (e.g. involving macromolecules, phosphocreatine and tyrosine). PMID:21247874

  18. GRYFUN: a web application for GO term annotation visualization and analysis in protein sets.

    PubMed

    Bastos, Hugo P; Sousa, Lisete; Clarke, Luka A; Couto, Francisco M

    2015-01-01

    Functional context for biological sequence is provided in the form of annotations. However, within a group of similar sequences there can be annotation heterogeneity in terms of coverage and specificity. This in turn can introduce issues regarding the interpretation of actual functional similarity and overall functional coherence of such a group. One way to mitigate such issues is through the use of visualization and statistical techniques. Therefore, in order to help interpret this annotation heterogeneity we created a web application that generates Gene Ontology annotation graphs for protein sets and their associated statistics from simple frequencies to enrichment values and Information Content based metrics. The publicly accessible website http://xldb.di.fc.ul.pt/gryfun/ currently accepts lists of UniProt accession numbers in order to create user-defined protein sets for subsequent annotation visualization and statistical assessment. GRYFUN is a freely available web application that allows GO annotation visualization of protein sets and which can be used for annotation coherence and cohesiveness analysis and annotation extension assessments within under-annotated protein sets. PMID:25794277

  19. GRYFUN: A Web Application for GO Term Annotation Visualization and Analysis in Protein Sets

    PubMed Central

    Bastos, Hugo P.; Sousa, Lisete; Clarke, Luka A.; Couto, Francisco M.

    2015-01-01

    Functional context for biological sequence is provided in the form of annotations. However, within a group of similar sequences there can be annotation heterogeneity in terms of coverage and specificity. This in turn can introduce issues regarding the interpretation of actual functional similarity and overall functional coherence of such a group. One way to mitigate such issues is through the use of visualization and statistical techniques. Therefore, in order to help interpret this annotation heterogeneity we created a web application that generates Gene Ontology annotation graphs for protein sets and their associated statistics from simple frequencies to enrichment values and Information Content based metrics. The publicly accessible website http://xldb.di.fc.ul.pt/gryfun/ currently accepts lists of UniProt accession numbers in order to create user-defined protein sets for subsequent annotation visualization and statistical assessment. GRYFUN is a freely available web application that allows GO annotation visualization of protein sets and which can be used for annotation coherence and cohesiveness analysis and annotation extension assessments within under-annotated protein sets. PMID:25794277

  20. Genome Annotation and Curation Using MAKER and MAKER-P

    PubMed Central

    Campbell, Michael S.; Holt, Carson; Moore, Barry; Yandell, Mark

    2014-01-01

    This unit describes how to use the genome annotation and curation tools MAKER and MAKER-P to annotate protein coding and non-coding RNA genes in newly assembled genomes, update/combine legacy annotations in light of new evidence, add quality metrics to annotations from other pipelines, and map existing annotations to a new assembly. MAKER and MAKER-P can rapidly annotate genomes of any size, and scale to match available computational resources. PMID:25501943

  1. The fate of duplicated genes: loss or new function?

    E-print Network

    Wagner, Andreas

    mutations. Thus, as long as one duplicate gene functions normally, the other may go the more likely routeThe fate of duplicated genes: loss or new function? Andreas Wagner Summary Gene duplication events are important sources of novel gene functions. However, more often than not, a duplicate gene may lose its

  2. Automatic annotation of organellar genomes with DOGMA.

    PubMed

    Wyman, Stacia K; Jansen, Robert K; Boore, Jeffrey L

    2004-11-22

    The Dual Organellar GenoMe Annotator (DOGMA) automates the annotation of organellar (plant chloroplast and animal mitochondrial) genomes. It is a Web-based package that allows the use of BLAST searches against a custom database, and conservation of basepairing in the secondary structure of animal mitochondrial tRNAs to identify and annotate genes. DOGMA provides a graphical user interface for viewing and editing annotations. Annotations are stored on our password-protected server to enable repeated sessions of working on the same genome. Finished annotations can be extracted for direct submission to GenBank. PMID:15180927

  3. Studying Functions of All Yeast Genes Simultaneously

    NASA Technical Reports Server (NTRS)

    Stolc, Viktor; Eason, Robert G.; Poumand, Nader; Herman, Zelek S.; Davis, Ronald W.; Anthony Kevin; Jejelowo, Olufisayo

    2006-01-01

    A method of studying the functions of all the genes of a given species of microorganism simultaneously has been developed in experiments on Saccharomyces cerevisiae (commonly known as baker's or brewer's yeast). It is already known that many yeast genes perform functions similar to those of corresponding human genes; therefore, by facilitating understanding of yeast genes, the method may ultimately also contribute to the knowledge needed to treat some diseases in humans. Because of the complexity of the method and the highly specialized nature of the underlying knowledge, it is possible to give only a brief and sketchy summary here. The method involves the use of unique synthetic deoxyribonucleic acid (DNA) sequences that are denoted as DNA bar codes because of their utility as molecular labels. The method also involves the disruption of gene functions through deletion of genes. Saccharomyces cerevisiae is a particularly powerful experimental system in that multiple deletion strains easily can be pooled for parallel growth assays. Individual deletion strains recently have been created for 5,918 open reading frames, representing nearly all of the estimated 6,000 genetic loci of Saccharomyces cerevisiae. Tagging of each deletion strain with one or two unique 20-nucleotide sequences enables identification of genes affected by specific growth conditions, without prior knowledge of gene functions. Hybridization of bar-code DNA to oligonucleotide arrays can be used to measure the growth rate of each strain over several cell-division generations. The growth rate thus measured serves as an index of the fitness of the strain.

  4. Large gene overlaps in prokaryotic genomes: result of functional constraints or mispredictions?

    PubMed Central

    Pallejà, Albert; Harrington, Eoghan D; Bork, Peer

    2008-01-01

    Background Across the fully sequenced microbial genomes there are thousands of examples of overlapping genes. Many of these are only a few nucleotides long and are thought to function by permitting the coordinated regulation of gene expression. However, there should also be selective pressure against long overlaps, as the existence of overlapping reading frames increases the risk of deleterious mutations. Here we examine the longest overlaps and assess whether they are the product of special functional constraints or of erroneous annotation. Results We analysed the genes that overlap by 60 bps or more among 338 fully-sequenced prokaryotic genomes. The likely functional significance of an overlap was determined by comparing each of the genes to its respective orthologs. If a gene showed a significantly different length from its orthologs it was considered unlikely to be functional and therefore the result of an error either in sequencing or gene prediction. Focusing on 715 co-directional overlaps longer than 60 bps, we classified the erroneous ones into five categories: i) 5'-end extension of the downstream gene due to either a mispredicted start codon or a frameshift at 5'-end of the gene (409 overlaps), ii) fragmentation of a gene caused by a frameshift (163), iii) 3'-end extension of the upstream gene due to either a frameshift at 3'-end of a gene or point mutation at the stop codon (68), iv) Redundant gene predictions (4), v) 5' & 3'-end extension which is a combination of i) and iii) (71). We also studied 75 divergent overlaps that could be classified as misannotations of group i). Nevertheless we found some convergent long overlaps (54) that might be true overlaps, although an important part of convergent overlaps could be classified as group iii) (124). Conclusion Among the 968 overlaps larger than 60 bps which we analysed, we did not find a single real one among the co-directional and divergent orientations and concluded that there had been an excessive number of misannotations. Only convergent orientation seems to permit some long overlaps, although convergent overlaps are also hampered by misannotations. We propose a simple rule to flag these erroneous gene length predictions to facilitate automatic annotation. PMID:18627618

  5. LegumeIP 2.0—a platform for the study of gene function and genome evolution in legumes

    PubMed Central

    Li, Jun; Dai, Xinbin; Zhuang, Zhaohong; Zhao, Patrick X.

    2016-01-01

    The LegumeIP 2.0 database hosts large-scale genomics and transcriptomics data and provides integrative bioinformatics tools for the study of gene function and evolution in legumes. Our recent updates in LegumeIP 2.0 include gene and protein sequences, gene models and annotations, syntenic regions, protein families and phylogenetic trees for six legume species: Medicago truncatula, Glycine max (soybean), Lotus japonicus, Phaseolus vulgaris (common bean), Cicer arietinum (chickpea) and Cajanus cajan (pigeon pea) and two outgroup reference species: Arabidopsis thaliana and Poplar trichocarpa. Moreover, the LegumeIP 2.0 features the following new data resources and bioinformatics tools: (i) an integrative gene expression atlas for four model legumes that include 550 array hybridizations from M. truncatula, 962 gene expression profiles of G. max, 276 array hybridizations from L. japonicas and 56 RNA-Seq-based gene expression profiles for C. arietinum. These datasets were manually curated and hierarchically organized based on Experimental Ontology and Plant Ontology so that users can browse, search, and retrieve data for their selected experiments. (ii) New functions/analytical tools to query, mine and visualize large-scale gene sequences, annotations and transcriptome profiles. Users may select a subset of expression experiments and visualize and compare expression profiles for multiple genes. The LegumeIP 2.0 database is freely available to the public at http://plantgrn.noble.org/LegumeIP/. PMID:26578557

  6. LegumeIP 2.0-a platform for the study of gene function and genome evolution in legumes.

    PubMed

    Li, Jun; Dai, Xinbin; Zhuang, Zhaohong; Zhao, Patrick X

    2016-01-01

    The LegumeIP 2.0 database hosts large-scale genomics and transcriptomics data and provides integrative bioinformatics tools for the study of gene function and evolution in legumes. Our recent updates in LegumeIP 2.0 include gene and protein sequences, gene models and annotations, syntenic regions, protein families and phylogenetic trees for six legume species: Medicago truncatula, Glycine max (soybean), Lotus japonicus, Phaseolus vulgaris (common bean), Cicer arietinum (chickpea) and Cajanus cajan (pigeon pea) and two outgroup reference species: Arabidopsis thaliana and Poplar trichocarpa. Moreover, the LegumeIP 2.0 features the following new data resources and bioinformatics tools: (i) an integrative gene expression atlas for four model legumes that include 550 array hybridizations from M. truncatula, 962 gene expression profiles of G. max, 276 array hybridizations from L. japonicas and 56 RNA-Seq-based gene expression profiles for C. arietinum. These datasets were manually curated and hierarchically organized based on Experimental Ontology and Plant Ontology so that users can browse, search, and retrieve data for their selected experiments. (ii) New functions/analytical tools to query, mine and visualize large-scale gene sequences, annotations and transcriptome profiles. Users may select a subset of expression experiments and visualize and compare expression profiles for multiple genes. The LegumeIP 2.0 database is freely available to the public at http://plantgrn.noble.org/LegumeIP/. PMID:26578557

  7. Representative proteomes: a stable, scalable and unbiased proteome set for sequence analysis and functional annotation.

    PubMed

    Chen, Chuming; Natale, Darren A; Finn, Robert D; Huang, Hongzhan; Zhang, Jian; Wu, Cathy H; Mazumder, Raja

    2011-01-01

    The accelerating growth in the number of protein sequences taxes both the computational and manual resources needed to analyze them. One approach to dealing with this problem is to minimize the number of proteins subjected to such analysis in a way that minimizes loss of information. To this end we have developed a set of Representative Proteomes (RPs), each selected from a Representative Proteome Group (RPG) containing similar proteomes calculated based on co-membership in UniRef50 clusters. A Representative Proteome is the proteome that can best represent all the proteomes in its group in terms of the majority of the sequence space and information. RPs at 75%, 55%, 35% and 15% co-membership threshold (CMT) are provided to allow users to decrease or increase the granularity of the sequence space based on their requirements. We find that a CMT of 55% (RP55) most closely follows standard taxonomic classifications. Further analysis of this set reveals that sequence space is reduced by more than 80% relative to UniProtKB, while retaining both sequence diversity (over 95% of InterPro domains) and annotation information (93% of experimentally characterized proteins). All sets can be browsed and are available for sequence similarity searches and download at http://www.proteininformationresource.org/rps, while the set of 637 RPs determined using a 55% CMT are also available for text searches. Potential applications include sequence similarity searches, protein classification and targeted protein annotation and characterization. PMID:21556138

  8. Experimental Strategies for Functional Annotation and Metabolism Discovery: Targeted Screening of Solute Binding Proteins and Unbiased Panning of Metabolomes

    PubMed Central

    2015-01-01

    The rate at which genome sequencing data is accruing demands enhanced methods for functional annotation and metabolism discovery. Solute binding proteins (SBPs) facilitate the transport of the first reactant in a metabolic pathway, thereby constraining the regions of chemical space and the chemistries that must be considered for pathway reconstruction. We describe high-throughput protein production and differential scanning fluorimetry platforms, which enabled the screening of 158 SBPs against a 189 component library specifically tailored for this class of proteins. Like all screening efforts, this approach is limited by the practical constraints imposed by construction of the library, i.e., we can study only those metabolites that are known to exist and which can be made in sufficient quantities for experimentation. To move beyond these inherent limitations, we illustrate the promise of crystallographic- and mass spectrometric-based approaches for the unbiased use of entire metabolomes as screening libraries. Together, our approaches identified 40 new SBP ligands, generated experiment-based annotations for 2084 SBPs in 71 isofunctional clusters, and defined numerous metabolic pathways, including novel catabolic pathways for the utilization of ethanolamine as sole nitrogen source and the use of d-Ala-d-Ala as sole carbon source. These efforts begin to define an integrated strategy for realizing the full value of amassing genome sequence data. PMID:25540822

  9. Neural Networks Approaches for Discovering the Learnable Correlation between Gene Function and Gene

    E-print Network

    Bonner, Anthony

    in many ways, especially in Gene Therapy [18]. Identifying gene function in prokaryotes is much easierNeural Networks Approaches for Discovering the Learnable Correlation between Gene Function and Gene University of Toronto Toronto, ON. emad@cs.toronto.edu Abstract. Identifying gene function has many useful

  10. Comprehensive Screening of Gene Function and Networks by DNA Microarray Analysis in Japanese Patients with Idiopathic Portal Hypertension

    PubMed Central

    Kotani, Kohei; Kawabe, Joji; Morikawa, Hiroyasu; Akahoshi, Tomohiko; Hashizume, Makoto; Shiomi, Susumu

    2015-01-01

    The functions of genes involved in idiopathic portal hypertension (IPH) remain unidentified. The present study was undertaken to identify the functions of genes expressed in blood samples from patients with IPH through comprehensive analysis of gene expression using DNA microarrays. The data were compared with data from healthy individuals to explore the functions of genes showing increased or decreased expression in patients with IPH. In cluster analysis, no dominant probe group was shown to differ between patients with IPH and healthy controls. In functional annotation analysis using the Database for Annotation Visualization and Integrated Discovery tool, clusters showing dysfunction in patients with IPH involved gene terms related to the immune system. Analysis using network-based pathways revealed decreased expression of adenosine deaminase, ectonucleoside triphosphate diphosphohydrolase 4, ATP-binding cassette, subfamily C, member 1, transforming growth factor-?, and prostaglandin E receptor 2; increased expression of cytochrome P450, family 4, subfamily F, polypeptide 3, and glutathione peroxidase 3; and abnormalities in the immune system, nucleic acid metabolism, arachidonic acid/leukotriene pathways, and biological processes. These results suggested that IPH involved compromised function of immunocompetent cells and that such dysfunction may be associated with abnormalities in nucleic acid metabolism and arachidonic acid/leukotriene-related synthesis/metabolism. PMID:26549939

  11. A Uniform System for the Annotation of Vertebrate microRNA Genes and the Evolution of the Human microRNAome.

    PubMed

    Fromm, Bastian; Billipp, Tyler; Peck, Liam E; Johansen, Morten; Tarver, James E; King, Benjamin L; Newcomb, James M; Sempere, Lorenzo F; Flatmark, Kjersti; Hovig, Eivind; Peterson, Kevin J

    2015-11-23

    Although microRNAs (miRNAs) are among the most intensively studied molecules of the past 20 years, determining what is and what is not a miRNA has not been straightforward. Here, we present a uniform system for the annotation and nomenclature of miRNA genes. We show that less than a third of the 1,881 human miRBase entries, and only approximately 16% of the 7,095 metazoan miRBase entries, are robustly supported as miRNA genes. Furthermore, we show that the human repertoire of miRNAs has been shaped by periods of intense miRNA innovation and that mature gene products show a very different tempo and mode of sequence evolution than star products. We establish a new open access database-MirGeneDB ( http://mirgenedb.org )-to catalog this set of miRNAs, which complements the efforts of miRBase but differs from it by annotating the mature versus star products and by imposing an evolutionary hierarchy upon this curated and consistently named repertoire. PMID:26473382

  12. Rice functionality, starch structure and the genes

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Through collaborative efforts among USDA scientists at Beaumont, Texas, we have gained in-depth knowledge of how rice functionality, i.e. the texture of the cooked rice, rice processing properties, and starch gelatinization temperature, are associated with starch-synthesis genes and starch structure...

  13. Structure and Function of Fusion Gene Products in

    E-print Network

    Spang, Rainer

    Structure and Function of Fusion Gene Products in Childhood Acute Leukemia #12;Chromosomal;Transcriptional Activation RNA Polymerase II Gene Transcription Histon Acetylation Complex Chromatin Remodeling

  14. The DNA sequence and biological annotation of human chromosome 1.

    PubMed

    Gregory, S G; Barlow, K F; McLay, K E; Kaul, R; Swarbreck, D; Dunham, A; Scott, C E; Howe, K L; Woodfine, K; Spencer, C C A; Jones, M C; Gillson, C; Searle, S; Zhou, Y; Kokocinski, F; McDonald, L; Evans, R; Phillips, K; Atkinson, A; Cooper, R; Jones, C; Hall, R E; Andrews, T D; Lloyd, C; Ainscough, R; Almeida, J P; Ambrose, K D; Anderson, F; Andrew, R W; Ashwell, R I S; Aubin, K; Babbage, A K; Bagguley, C L; Bailey, J; Beasley, H; Bethel, G; Bird, C P; Bray-Allen, S; Brown, J Y; Brown, A J; Buckley, D; Burton, J; Bye, J; Carder, C; Chapman, J C; Clark, S Y; Clarke, G; Clee, C; Cobley, V; Collier, R E; Corby, N; Coville, G J; Davies, J; Deadman, R; Dunn, M; Earthrowl, M; Ellington, A G; Errington, H; Frankish, A; Frankland, J; French, L; Garner, P; Garnett, J; Gay, L; Ghori, M R J; Gibson, R; Gilby, L M; Gillett, W; Glithero, R J; Grafham, D V; Griffiths, C; Griffiths-Jones, S; Grocock, R; Hammond, S; Harrison, E S I; Hart, E; Haugen, E; Heath, P D; Holmes, S; Holt, K; Howden, P J; Hunt, A R; Hunt, S E; Hunter, G; Isherwood, J; James, R; Johnson, C; Johnson, D; Joy, A; Kay, M; Kershaw, J K; Kibukawa, M; Kimberley, A M; King, A; Knights, A J; Lad, H; Laird, G; Lawlor, S; Leongamornlert, D A; Lloyd, D M; Loveland, J; Lovell, J; Lush, M J; Lyne, R; Martin, S; Mashreghi-Mohammadi, M; Matthews, L; Matthews, N S W; McLaren, S; Milne, S; Mistry, S; Moore, M J F; Nickerson, T; O'Dell, C N; Oliver, K; Palmeiri, A; Palmer, S A; Parker, A; Patel, D; Pearce, A V; Peck, A I; Pelan, S; Phelps, K; Phillimore, B J; Plumb, R; Rajan, J; Raymond, C; Rouse, G; Saenphimmachak, C; Sehra, H K; Sheridan, E; Shownkeen, R; Sims, S; Skuce, C D; Smith, M; Steward, C; Subramanian, S; Sycamore, N; Tracey, A; Tromans, A; Van Helmond, Z; Wall, M; Wallis, J M; White, S; Whitehead, S L; Wilkinson, J E; Willey, D L; Williams, H; Wilming, L; Wray, P W; Wu, Z; Coulson, A; Vaudin, M; Sulston, J E; Durbin, R; Hubbard, T; Wooster, R; Dunham, I; Carter, N P; McVean, G; Ross, M T; Harrow, J; Olson, M V; Beck, S; Rogers, J; Bentley, D R; Banerjee, R; Bryant, S P; Burford, D C; Burrill, W D H; Clegg, S M; Dhami, P; Dovey, O; Faulkner, L M; Gribble, S M; Langford, C F; Pandian, R D; Porter, K M; Prigmore, E

    2006-05-18

    The reference sequence for each human chromosome provides the framework for understanding genome function, variation and evolution. Here we report the finished sequence and biological annotation of human chromosome 1. Chromosome 1 is gene-dense, with 3,141 genes and 991 pseudogenes, and many coding sequences overlap. Rearrangements and mutations of chromosome 1 are prevalent in cancer and many other diseases. Patterns of sequence variation reveal signals of recent selection in specific genes that may contribute to human fitness, and also in regions where no function is evident. Fine-scale recombination occurs in hotspots of varying intensity along the sequence, and is enriched near genes. These and other studies of human biology and disease encoded within chromosome 1 are made possible with the highly accurate annotated sequence, as part of the completed set of chromosome sequences that comprise the reference human genome. PMID:16710414

  15. PSSP-RFE: Accurate Prediction of Protein Structural Class by Recursive Feature Extraction from PSI-BLAST Profile, Physical-Chemical Property and Functional Annotations

    PubMed Central

    Yu, Sanjiu; Zhang, Yuan; Luo, Zhong; Yang, Hua; Zhou, Yue; Zheng, Xiaoqi

    2014-01-01

    Protein structure prediction is critical to functional annotation of the massively accumulated biological sequences, which prompts an imperative need for the development of high-throughput technologies. As a first and key step in protein structure prediction, protein structural class prediction becomes an increasingly challenging task. Amongst most homological-based approaches, the accuracies of protein structural class prediction are sufficiently high for high similarity datasets, but still far from being satisfactory for low similarity datasets, i.e., below 40% in pairwise sequence similarity. Therefore, we present a novel method for accurate and reliable protein structural class prediction for both high and low similarity datasets. This method is based on Support Vector Machine (SVM) in conjunction with integrated features from position-specific score matrix (PSSM), PROFEAT and Gene Ontology (GO). A feature selection approach, SVM-RFE, is also used to rank the integrated feature vectors through recursively removing the feature with the lowest ranking score. The definitive top features selected by SVM-RFE are input into the SVM engines to predict the structural class of a query protein. To validate our method, jackknife tests were applied to seven widely used benchmark datasets, reaching overall accuracies between 84.61% and 99.79%, which are significantly higher than those achieved by state-of-the-art tools. These results suggest that our method could serve as an accurate and cost-effective alternative to existing methods in protein structural classification, especially for low similarity datasets. PMID:24675610

  16. PSSP-RFE: accurate prediction of protein structural class by recursive feature extraction from PSI-BLAST profile, physical-chemical property and functional annotations.

    PubMed

    Li, Liqi; Cui, Xiang; Yu, Sanjiu; Zhang, Yuan; Luo, Zhong; Yang, Hua; Zhou, Yue; Zheng, Xiaoqi

    2014-01-01

    Protein structure prediction is critical to functional annotation of the massively accumulated biological sequences, which prompts an imperative need for the development of high-throughput technologies. As a first and key step in protein structure prediction, protein structural class prediction becomes an increasingly challenging task. Amongst most homological-based approaches, the accuracies of protein structural class prediction are sufficiently high for high similarity datasets, but still far from being satisfactory for low similarity datasets, i.e., below 40% in pairwise sequence similarity. Therefore, we present a novel method for accurate and reliable protein structural class prediction for both high and low similarity datasets. This method is based on Support Vector Machine (SVM) in conjunction with integrated features from position-specific score matrix (PSSM), PROFEAT and Gene Ontology (GO). A feature selection approach, SVM-RFE, is also used to rank the integrated feature vectors through recursively removing the feature with the lowest ranking score. The definitive top features selected by SVM-RFE are input into the SVM engines to predict the structural class of a query protein. To validate our method, jackknife tests were applied to seven widely used benchmark datasets, reaching overall accuracies between 84.61% and 99.79%, which are significantly higher than those achieved by state-of-the-art tools. These results suggest that our method could serve as an accurate and cost-effective alternative to existing methods in protein structural classification, especially for low similarity datasets. PMID:24675610

  17. Human Genome Annotation

    NASA Astrophysics Data System (ADS)

    Gerstein, Mark

    A central problem for 21st century science is annotating the human genome and making this annotation useful for the interpretation of personal genomes. My talk will focus on annotating the 99% of the genome that does not code for canonical genes, concentrating on intergenic features such as structural variants (SVs), pseudogenes (protein fossils), binding sites, and novel transcribed RNAs (ncRNAs). In particular, I will describe how we identify regulatory sites and variable blocks (SVs) based on processing next-generation sequencing experiments. I will further explain how we cluster together groups of sites to create larger annotations. Next, I will discuss a comprehensive pseudogene identification pipeline, which has enabled us to identify >10K pseudogenes in the genome and analyze their distribution with respect to age, protein family, and chromosomal location. Throughout, I will try to introduce some of the computational algorithms and approaches that are required for genome annotation. Much of this work has been carried out in the framework of the ENCODE, modENCODE, and 1000 genomes projects.

  18. Improvements to cardiovascular gene ontology.

    PubMed

    Lovering, Ruth C; Dimmer, Emily C; Talmud, Philippa J

    2009-07-01

    Gene Ontology (GO) provides a controlled vocabulary to describe the attributes of genes and gene products in any organism. Although one might initially wonder what relevance a 'controlled vocabulary' might have for cardiovascular science, such a resource is proving highly useful for researchers investigating complex cardiovascular disease phenotypes as well as those interpreting results from high-throughput methodologies. GO enables the current functional knowledge of individual genes to be used to annotate genomic or proteomic datasets. In this way, the GO data provides a very effective way of linking biological knowledge with the analysis of the large datasets of post-genomics research. Consequently, users of high-throughput methodologies such as expression arrays or proteomics will be the main beneficiaries of such annotation sets. However, as GO annotations increase in quality and quantity, groups using small-scale approaches will gradually begin to benefit too. For example, genome wide association scans for coronary heart disease are identifying novel genes, with previously unknown connections to cardiovascular processes, and the comprehensive annotation of these novel genes might provide clues to their cardiovascular link. At least 4000 genes, to date, have been implicated in cardiovascular processes and an initiative is underway to focus on annotating these genes for the benefit of the cardiovascular community. In this article we review the current uses of Gene Ontology annotation to highlight why Gene Ontology should be of interest to all those involved in cardiovascular research. PMID:19046747

  19. Next generation models for storage and representation of microbial biological annotation

    PubMed Central

    2010-01-01

    Background Traditional genome annotation systems were developed in a very different computing era, one where the World Wide Web was just emerging. Consequently, these systems are built as centralized black boxes focused on generating high quality annotation submissions to GenBank/EMBL supported by expert manual curation. The exponential growth of sequence data drives a growing need for increasingly higher quality and automatically generated annotation. Typical annotation pipelines utilize traditional database technologies, clustered computing resources, Perl, C, and UNIX file systems to process raw sequence data, identify genes, and predict and categorize gene function. These technologies tightly couple the annotation software system to hardware and third party software (e.g. relational database systems and schemas). This makes annotation systems hard to reproduce, inflexible to modification over time, difficult to assess, difficult to partition across multiple geographic sites, and difficult to understand for those who are not domain experts. These systems are not readily open to scrutiny and therefore not scientifically tractable. The advent of Semantic Web standards such as Resource Description Framework (RDF) and OWL Web Ontology Language (OWL) enables us to construct systems that address these challenges in a new comprehensive way. Results Here, we develop a framework for linking traditional data to OWL-based ontologies in genome annotation. We show how data standards can decouple hardware and third party software tools from annotation pipelines, thereby making annotation pipelines easier to reproduce and assess. An illustrative example shows how TURTLE (Terse RDF Triple Language) can be used as a human readable, but also semantically-aware, equivalent to GenBank/EMBL files. Conclusions The power of this approach lies in its ability to assemble annotation data from multiple databases across multiple locations into a representation that is understandable to researchers. In this way, all researchers, experimental and computational, will more easily understand the informatics processes constructing genome annotation and ultimately be able to help improve the systems that produce them. PMID:20946598

  20. Yeast chromosome III: new gene functions.

    PubMed Central

    Koonin, E V; Bork, P; Sander, C

    1994-01-01

    One year after the release of the sequence of yeast chromosome III, we have re-examined its open reading frames (ORFs) by computer methods. More than 61% of the 171 probable gene products have significant sequence similarities in the current databases; as many as 54% have already known functions or are related to functionally characterized proteins, allowing partial prediction of protein function, 11 percentage points more than reported a year ago; 19% are similar to proteins of known three-dimensional structure, allowing model building by homology. The most interesting new identifications include a sugar kinase distantly related to ribokinases, a phosphatidyl serine synthetase, a putative transcription regulator, a flavodoxin-like protein, and a zinc finger protein belonging to a distinct subfamily. Several ORFs have similarities to uncharacterized proteins, resulting in new families in search of a function'. About 54% of ORFs match sequences from other phyla, including numerous fragments in the database of expressed sequence tags (ESTs). Most significant similarities to ESTs are with proteins in conserved families widely represented in the databases. About 30% of ORFs contain one or more predicted transmembrane segments. The increase in the power of functional and structural prediction comes from improvements in sequence analysis and from richer databases and is expected to facilitate substantially the experimental effort in characterizing the function of new gene products. Images PMID:8313894

  1. Comparative validation of the D. melanogaster modENCODE transcriptome annotation.

    PubMed

    Chen, Zhen-Xia; Sturgill, David; Qu, Jiaxin; Jiang, Huaiyang; Park, Soo; Boley, Nathan; Suzuki, Ana Maria; Fletcher, Anthony R; Plachetzki, David C; FitzGerald, Peter C; Artieri, Carlo G; Atallah, Joel; Barmina, Olga; Brown, James B; Blankenburg, Kerstin P; Clough, Emily; Dasgupta, Abhijit; Gubbala, Sai; Han, Yi; Jayaseelan, Joy C; Kalra, Divya; Kim, Yoo-Ah; Kovar, Christie L; Lee, Sandra L; Li, Mingmei; Malley, James D; Malone, John H; Mathew, Tittu; Mattiuzzo, Nicolas R; Munidasa, Mala; Muzny, Donna M; Ongeri, Fiona; Perales, Lora; Przytycka, Teresa M; Pu, Ling-Ling; Robinson, Garrett; Thornton, Rebecca L; Saada, Nehad; Scherer, Steven E; Smith, Harold E; Vinson, Charles; Warner, Crystal B; Worley, Kim C; Wu, Yuan-Qing; Zou, Xiaoyan; Cherbas, Peter; Kellis, Manolis; Eisen, Michael B; Piano, Fabio; Kionte, Karin; Fitch, David H; Sternberg, Paul W; Cutter, Asher D; Duff, Michael O; Hoskins, Roger A; Graveley, Brenton R; Gibbs, Richard A; Bickel, Peter J; Kopp, Artyom; Carninci, Piero; Celniker, Susan E; Oliver, Brian; Richards, Stephen

    2014-07-01

    Accurate gene model annotation of reference genomes is critical for making them useful. The modENCODE project has improved the D. melanogaster genome annotation by using deep and diverse high-throughput data. Since transcriptional activity that has been evolutionarily conserved is likely to have an advantageous function, we have performed large-scale interspecific comparisons to increase confidence in predicted annotations. To support comparative genomics, we filled in divergence gaps in the Drosophila phylogeny by generating draft genomes for eight new species. For comparative transcriptome analysis, we generated mRNA expression profiles on 81 samples from multiple tissues and developmental stages of 15 Drosophila species, and we performed cap analysis of gene expression in D. melanogaster and D. pseudoobscura. We also describe conservation of four distinct core promoter structures composed of combinations of elements at three positions. Overall, each type of genomic feature shows a characteristic divergence rate relative to neutral models, highlighting the value of multispecies alignment in annotating a target genome that should prove useful in the annotation of other high priority genomes, especially human and other mammalian genomes that are rich in noncoding sequences. We report that the vast majority of elements in the annotation are evolutionarily conserved, indicating that the annotation will be an important springboard for functional genetic testing by the Drosophila community. PMID:24985915

  2. Re-annotation of the Saccharopolyspora erythraea genome using a systems biology approach

    PubMed Central

    2013-01-01

    Background Accurate bacterial genome annotations provide a framework to understanding cellular functions, behavior and pathogenicity and are essential for metabolic engineering. Annotations based only on in silico predictions are inaccurate, particularly for large, high G?+?C content genomes due to the lack of similarities in gene length and gene organization to model organisms. Results Here we describe a 2D systems biology driven re-annotation of the Saccharopolyspora erythraea genome using proteogenomics, a genome-scale metabolic reconstruction, RNA-sequencing and small-RNA-sequencing. We observed transcription of more than 300 intergenic regions, detected 59 peptides in intergenic regions, confirmed 164 open reading frames previously annotated as hypothetical proteins and reassigned function to open reading frames using the genome-scale metabolic reconstruction. Finally, we present a novel way of mapping ribosomal binding sites across the genome by sequencing small RNAs. Conclusions The work presented here describes a novel framework for annotation of the Saccharopolyspora erythraea genome. Based on experimental observations, the 2D annotation framework greatly reduces errors that are commonly made when annotating large-high G?+?C content genomes using computational prediction algorithms. PMID:24118942

  3. Using co-expression to redefine functional gene sets for gene set enrichment analysis

    E-print Network

    Kodysh, Yuliya

    2007-01-01

    Manually curated gene sets related to a biological function often contain genes that are not tightly co-regulated transcriptionally. which obscures the evidence of coordinated differential expression of these gene sets in ...

  4. COMBREX: a project to accelerate the functional annotation of prokaryotic genomes

    E-print Network

    Segrè, Daniel

    , 9 Department of Computer Science, University of Maryland, College Park, MD, 20742 and 10 Children's Hospital Informatics Program, Harvard-MIT Division of Health Sciences and Technology (CHIP@HST), Cambridge of DNA sequencing continuing to drop, this has led to an explosion in the number of genes that are pre

  5. Comparative Omics-Driven Genome Annotation Refinement: Application across Yersiniae

    SciTech Connect

    Rutledge, Alexandra C.; Jones, Marcus B.; Chauhan, Sadhana; Purvine, Samuel O.; Sanford, James; Monroe, Matthew E.; Brewer, Heather M.; Payne, Samuel H.; Ansong, Charles; Frank, Bryan C.; Smith, Richard D.; Peterson, Scott; Motin, Vladimir L.; Adkins, Joshua N.

    2012-03-27

    Genome sequencing continues to be a rapidly evolving technology, yet most downstream aspects of genome annotation pipelines remain relatively stable or are even being abandoned. To date, the perceived value of manual curation for genome annotations is not offset by the real cost and time associated with the process. In order to balance the large number of sequences generated, the annotation process is now performed almost exclusively in an automated fashion for most genome sequencing projects. One possible way to reduce errors inherent to automated computational annotations is to apply data from 'omics' measurements (i.e. transcriptional and proteomic) to the un-annotated genome with a proteogenomic-based approach. This approach does require additional experimental and bioinformatics methods to include omics technologies; however, the approach is readily automatable and can benefit from rapid developments occurring in those research domains as well. The annotation process can be improved by experimental validation of transcription and translation and aid in the discovery of annotation errors. Here the concept of annotation refinement has been extended to include a comparative assessment of genomes across closely related species, as is becoming common in sequencing efforts. Transcriptomic and proteomic data derived from three highly similar pathogenic Yersiniae (Y. pestis CO92, Y. pestis pestoides F, and Y. pseudotuberculosis PB1/+) was used to demonstrate a comprehensive comparative omic-based annotation methodology. Peptide and oligo measurements experimentally validated the expression of nearly 40% of each strain's predicted proteome and revealed the identification of 28 novel and 68 previously incorrect protein-coding sequences (e.g., observed frameshifts, extended start sites, and translated pseudogenes) within the three current Yersinia genome annotations. Gene loss is presumed to play a major role in Y. pestis acquiring its niche as a virulent pathogen, thus the discovery of many translated pseudogenes underscores a need for functional analyses to investigate hypotheses related to divergence. Refinements included the discovery of a seemingly essential ribosomal protein, several virulence-associated factors, and a transcriptional regulator, among other proteins, most of which are annotated as hypothetical, that were missed during annotation.

  6. Experimental annotation of post-translational features and translated coding regions in the pathogen Salmonella Typhimurium

    SciTech Connect

    Ansong, Charles; Tolic, Nikola; Purvine, Samuel O.; Porwollik, Steffen; Jones, Marcus B.; Yoon, Hyunjin; Payne, Samuel H.; Martin, Jessica L.; Burnet, Meagan C.; Monroe, Matthew E.; Venepally, Pratap; Smith, Richard D.; Peterson, Scott; Heffron, Fred; Mcclelland, Michael; Adkins, Joshua N.

    2011-08-25

    Complete and accurate genome annotation is crucial for comprehensive and systematic studies of biological systems. For example systems biology-oriented genome scale modeling efforts greatly benefit from accurate annotation of protein-coding genes to develop proper functioning models. However, determining protein-coding genes for most new genomes is almost completely performed by inference, using computational predictions with significant documented error rates (> 15%). Furthermore, gene prediction programs provide no information on biologically important post-translational processing events critical for protein function. With the ability to directly measure peptides arising from expressed proteins, mass spectrometry-based proteomics approaches can be used to augment and verify coding regions of a genomic sequence and importantly detect post-translational processing events. In this study we utilized “shotgun” proteomics to guide accurate primary genome annotation of the bacterial pathogen Salmonella Typhimurium 14028 to facilitate a systems-level understanding of Salmonella biology. The data provides protein-level experimental confirmation for 44% of predicted protein-coding genes, suggests revisions to 48 genes assigned incorrect translational start sites, and uncovers 13 non-annotated genes missed by gene prediction programs. We also present a comprehensive analysis of post-translational processing events in Salmonella, revealing a wide range of complex chemical modifications (70 distinct modifications) and confirming more than 130 signal peptide and N-terminal methionine cleavage events in Salmonella. This study highlights several ways in which proteomics data applied during the primary stages of annotation can improve the quality of genome annotations, especially with regards to the annotation of mature protein products.

  7. PA-GOSUB: a searchable database of model organism protein sequences with their predicted Gene Ontology molecular function and subcellular localization

    PubMed Central

    Lu, Paul; Szafron, Duane; Greiner, Russell; Wishart, David S.; Fyshe, Alona; Pearcy, Brandon; Poulin, Brett; Eisner, Roman; Ngo, Danny; Lamb, Nicholas

    2005-01-01

    PA-GOSUB (Proteome Analyst: Gene Ontology Molecular Function and Subcellular Localization) is a publicly available, web-based, searchable and downloadable database that contains the sequences, predicted GO molecular functions and predicted subcellular localizations of more than 107?000 proteins from 10 model organisms (and growing), covering the major kingdoms and phyla for which annotated proteomes exist (http://www.cs.ualberta.ca/~bioinfo/PA/GOSUB). The PA-GOSUB database effectively expands the coverage of subcellular localization and GO function annotations by a significant factor (already over five for subcellular localization, compared with Swiss-Prot v42.7), and more model organisms are being added to PA-GOSUB as their sequenced proteomes become available. PA-GOSUB can be used in three main ways. First, a researcher can browse the pre-computed PA-GOSUB annotations on a per-organism and per-protein basis using annotation-based and text-based filters. Second, a user can perform BLAST searches against the PA-GOSUB database and use the annotations from the homologs as simple predictors for the new sequences. Third, the whole of PA-GOSUB can be downloaded in either FASTA or comma-separated values (CSV) formats. PMID:15608166

  8. PA-GOSUB: a searchable database of model organism protein sequences with their predicted Gene Ontology molecular function and subcellular localization.

    PubMed

    Lu, Paul; Szafron, Duane; Greiner, Russell; Wishart, David S; Fyshe, Alona; Pearcy, Brandon; Poulin, Brett; Eisner, Roman; Ngo, Danny; Lamb, Nicholas

    2005-01-01

    PA-GOSUB (Proteome Analyst: Gene Ontology Molecular Function and Subcellular Localization) is a publicly available, web-based, searchable and downloadable database that contains the sequences, predicted GO molecular functions and predicted subcellular localizations of more than 107,000 proteins from 10 model organisms (and growing), covering the major kingdoms and phyla for which annotated proteomes exist (http://www.cs.ualberta.ca/~bioinfo/PA/GOSUB). The PA-GOSUB database effectively expands the coverage of subcellular localization and GO function annotations by a significant factor (already over five for subcellular localization, compared with Swiss-Prot v42.7), and more model organisms are being added to PA-GOSUB as their sequenced proteomes become available. PA-GOSUB can be used in three main ways. First, a researcher can browse the pre-computed PA-GOSUB annotations on a per-organism and per-protein basis using annotation-based and text-based filters. Second, a user can perform BLAST searches against the PA-GOSUB database and use the annotations from the homologs as simple predictors for the new sequences. Third, the whole of PA-GOSUB can be downloaded in either FASTA or comma-separated values (CSV) formats. PMID:15608166

  9. Gene polymorphisms associated with functional dyspepsia

    PubMed Central

    Kourikou, Anastasia; Karamanolis, George P; Dimitriadis, George D; Triantafyllou, Konstantinos

    2015-01-01

    Functional dyspepsia (FD) is a constellation of functional upper abdominal complaints with poorly elucidated pathophysiology. However, there is increasing evidence that susceptibility to FD is influenced by hereditary factors. Genetic association studies in FD have examined genotypes related to gastrointestinal motility or sensation, as well as those related to inflammation or immune response. G-protein b3 subunit gene polymorphisms were first reported as being associated with FD. Thereafter, several gene polymorphisms including serotonin transporter promoter, interlukin-17F, migration inhibitory factor, cholecystocynine-1 intron 1, cyclooxygenase-1, catechol-o-methyltransferase, transient receptor potential vanilloid 1 receptor, regulated upon activation normal T cell expressed and secreted, p22PHOX, Toll like receptor 2, SCN10A, CD14 and adrenoreceptors have been investigated in relation to FD; however, the results are contradictory. Several limitations underscore the value of current studies. Among others, inconsistencies in the definitions of FD and controls, subject composition differences regarding FD subtypes, inadequate samples, geographical and ethnical differences, as well as unadjusted environmental factors. Further well-designed studies are necessary to determine how targeted genes polymorphisms, influence the clinical manifestations and potentially the therapeutic response in FD. PMID:26167069

  10. Genome Analysis Maintenance of duplicate genes and their functional

    E-print Network

    Zhang, Jianzhi

    Genome Analysis Maintenance of duplicate genes and their functional redundancy by reduced- gence between duplicate genes, many old duplicates still maintain a high degree of functional similarity- tion of their ancestral functions. Consistent with this hypothesis, gene expression data from both

  11. Mining Association Rules among Gene Functions in Clusters of Similar Gene Expression Maps

    E-print Network

    Obradovic, Zoran

    and conditions have been used to extract interesting patterns from gene expression datasets. One goal of gene, what genes are expressed in diseased cells that are not expressed in healthy cells. Related work hasMining Association Rules among Gene Functions in Clusters of Similar Gene Expression Maps Li An1

  12. Proteogenomics: the needs and roles to be filled by proteomics in genome annotation

    SciTech Connect

    Ansong, Charles; Purvine, Samuel O.; Adkins, Joshua N.; Lipton, Mary S.; Smith, Richard D.

    2008-01-01

    While genome sequencing efforts reveal the basic building blocks of life, a genome sequence alone is insufficient for elucidating biological function. Genome annotation – the process of identifying genes and assigning function to each gene in a genome sequence – provides the means to elucidate biological function from sequence. Current state-of-the-art high throughput genome annotation uses a combination of comparative (sequence similarity data) and non-comparative (ab initio gene prediction algorithms) methods to identify open reading frames in genome sequences. Because approaches used to validate the presence of these open reading frames are typically based on the information derived from the annotated genomes, they cannot independently and unequivocally determine whether a predicted open reading frame is translated into a protein. With the ability to directly measure peptides arising from expressed proteins, high throughput liquid chromatography-tandem mass spectrometry-based proteomics, approaches can be used to verify coding regions of a genomic sequence. Here, we highlight several ways in which high throughput tandem mass spectrometry-based proteomics can improve the quality of genome annotations and suggest that it could be efficiently applied during the initial gene calling process so that the improvements are propagated through the subsequent functional annotation process.

  13. Functional annotation of native enhancers with a Cas9 -histone demethylase fusion

    PubMed Central

    Tabak, Barbara; Genga, Ryan M; Silverstein, Noah J; Garber, Manuel; Maehr, René

    2015-01-01

    Understanding of mammalian enhancer function is limited by the lack of a technology to rapidly and thoroughly test their cell type-specific function. Here, we use a nuclease-deficient (d)Cas9 histone demethylase fusion to functionally characterize previously described and novel enhancer elements for their roles in the embryonic stem cell state. Further, we distinguish the mechanism of action of dCas9-LSD1 at enhancers from previous dCas9-effectors. PMID:25775043

  14. GENCODE: The reference human genome annotation for The ENCODE Project

    E-print Network

    Lin, Michael

    The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation ...

  15. Function Annotation of Hepatic Retinoid x Receptor ? Based on Genome-Wide DNA Binding and Transcriptome Profiling

    PubMed Central

    Zhan, Qi; Fang, Yaping; He, Yuqi; Liu, Hui-Xin; Fang, Jianwen; Wan, Yu-Jui Yvonne

    2012-01-01

    Background Retinoid x receptor ? (RXR?) is abundantly expressed in the liver and is essential for the function of other nuclear receptors. Using chromatin immunoprecipitation sequencing and mRNA profiling data generated from wild type and RXR?-null mouse livers, the current study identifies the bona-fide hepatic RXR? targets and biological pathways. In addition, based on binding and motif analysis, the molecular mechanism by which RXR? regulates hepatic genes is elucidated in a high-throughput manner. Principal Findings Close to 80% of hepatic expressed genes were bound by RXR?, while 16% were expressed in an RXR?-dependent manner. Motif analysis predicted direct repeat with a spacer of one nucleotide as the most prevalent RXR? binding site. Many of the 500 strongest binding motifs overlapped with the binding motif of specific protein 1. Biological functional analysis of RXR?-dependent genes revealed that hepatic RXR? deficiency mainly resulted in up-regulation of steroid and cholesterol biosynthesis-related genes and down-regulation of translation- as well as anti-apoptosis-related genes. Furthermore, RXR? bound to many genes that encode nuclear receptors and their cofactors suggesting the central role of RXR? in regulating nuclear receptor-mediated pathways. Conclusions This study establishes the relationship between RXR? DNA binding and hepatic gene expression. RXR? binds extensively to the mouse genome. However, DNA binding does not necessarily affect the basal mRNA level. In addition to metabolism, RXR? dictates the expression of genes that regulate RNA processing, translation, and protein folding illustrating the novel roles of hepatic RXR? in post-transcriptional regulation. PMID:23166811

  16. Integrative bioinformatics for functional genome annotation: trawling for G protein-coupled receptors.

    PubMed

    Flower, Darren R; Attwood, Teresa K

    2004-12-01

    G protein-coupled receptors (GPCR) are amongst the best studied and most functionally diverse types of cell-surface protein. The importance of GPCRs as mediates or cell function and organismal developmental underlies their involvement in key physiological roles and their prominence as targets for pharmacological therapeutics. In this review, we highlight the requirement for integrated protocols which underline the different perspectives offered by different sequence analysis methods. BLAST and FastA offer broad brush strokes. Motif-based search methods add the fine detail. Structural modelling offers another perspective which allows us to elucidate the physicochemical properties that underlie ligand binding. Together, these different views provide a more informative and a more detailed picture of GPCR structure and function. Many GPCRs remain orphan receptors with no identified ligand, yet as computer-driven functional genomics starts to elaborate their functions, a new understanding of their roles in cell and developmental biology will follow. PMID:15561589

  17. An annotated cDNA library and microarray for large-scale gene-expression studies in the ant Solenopsis invicta.

    PubMed

    Wang, John; Jemielity, Stephanie; Uva, Paolo; Wurm, Yannick; Gräff, Johannes; Keller, Laurent

    2007-01-01

    Ants display a range of fascinating behaviors, a remarkable level of intra-species phenotypic plasticity and many other interesting characteristics. Here we present a new tool to study the molecular mechanisms underlying these traits: a tentatively annotated expressed sequence tag (EST) resource for the fire ant Solenopsis invicta. From a normalized cDNA library we obtained 21,715 ESTs, which represent 11,864 putatively different transcripts with very diverse molecular functions. All ESTs were used to construct a cDNA microarray. PMID:17224046

  18. Structural relationships among proteins with different global topologies and their implications for function annotation strategies.

    PubMed

    Petrey, Donald; Fischer, Markus; Honig, Barry

    2009-10-13

    It has become increasingly apparent that geometric relationships often exist between regions of two proteins that have quite different global topologies or folds. In this article, we examine whether such relationships can be used to infer a functional connection between the two proteins in question. We find, by considering a number of examples involving metal and cation binding, sugar binding, and aromatic group binding, that geometrically similar protein fragments can share related functions, even if they have been classified as belonging to different folds and topologies. Thus, the use of classifications inevitably limits the number of functional inferences that can be obtained from the comparative analysis of protein structures. In contrast, the development of interactive computational tools that recognize the "continuous" nature of protein structure/function space, by increasing the number of potentially meaningful relationships that are considered, may offer a dramatic enhancement in the ability to extract information from protein structure databases. We introduce the MarkUs server, that embodies this strategy and that is designed for a user interested in developing and validating specific functional hypotheses. PMID:19805138

  19. Validation of a novel expressed sequence tag (EST) clustering method and development of a phylogenetic annotation pipeline for livestock gene families 

    E-print Network

    Venkatraman, Anand

    2009-05-15

    Prediction of functions of genes in a genome is a key step in all genome sequencing projects. Sequences that carry out important functions are likely to be conserved between evolutionarily distant species and can be ...

  20. Assembly, Gene Annotation and Marker Development Using 454 Floral Transcriptome Sequences in Ziziphus Celata (Rhamnaceae), a Highly Endangered, Florida Endemic Plant

    PubMed Central

    Edwards, Christine E.; Parchman, Thomas L.; Weekley, Carl W.

    2012-01-01

    Large-scale DNA sequence data may enable development of genetic resources in endangered species, thereby facilitating conservation efforts. Ziziphus celata, a federally endangered, self-incompatible plant species occurring in Florida, USA, is one species for which genetic resources are necessary to facilitate new introductions and augmentations essential for recovery of the species. We used 454 pyrosequencing of a Z. celata normalized floral cDNA library to create a genomic resource for gene and marker discovery. A half-plate GS-FLX Titanium run yielded 655 337 reads averaging 250 bp. A total of 474 025 reads were assembled de novo into 84 645 contigs averaging 408 bp, while 181 312 reads remained unassembled. Forty-seven and 43% of contig consensus sequences had BLAST matches to known proteins in the Uniref50 and TAIR9 annotated protein databases, respectively; many contigs fully represented orthologous proteins in TAIR9. A total of 22 707 unique genes were sequenced, indicating substantial coverage of the Z. celata transcriptome. We detected single-nucleotide polymorphisms and simple sequence repeats (SSRs) and developed thousands of SSR primers for use in future genetic studies. As a first step towards understanding self-incompatibility in Z. celata, we identified sequences belonging to the gene family encoding self-incompatibility. This study demonstrates the efficacy of 454 transcriptome sequencing for rapid gene and marker discovery in an endangered plant. PMID:22039173

  1. INVESTIGATION Gene Functional Trade-Offs and the Evolution

    E-print Network

    Otto, Sarah

    INVESTIGATION Gene Functional Trade-Offs and the Evolution of Pleiotropy Frédéric Guillaume*,1, Vancouver, British Columbia, Canada V6T 1Z4 ABSTRACT Pleiotropy is the property of genes affecting multiple functions or characters of an organism. Genes vary widely in their degree of pleiotropy, but this variation

  2. A draft annotation and overview of the human genome

    PubMed Central

    Wright, Fred A; Lemon, William J; Zhao, Wei D; Sears, Russell; Zhuo, Degen; Wang, Jian-Ping; Yang, Hee-Yung; Baer, Troy; Stredney, Don; Spitzner, Joe; Stutz, Al; Krahe, Ralf; Yuan, Bo

    2001-01-01

    Background The recent draft assembly of the human genome provides a unified basis for describing genomic structure and function. The draft is sufficiently accurate to provide useful annotation, enabling direct observations of previously inferred biological phenomena. Results We report here a functionally annotated human gene index placed directly on the genome. The index is based on the integration of public transcript, protein, and mapping information, supplemented with computational prediction. We describe numerous global features of the genome and examine the relationship of various genetic maps with the assembly. In addition, initial sequence analysis reveals highly ordered chromosomal landscapes associated with paralogous gene clusters and distinct functional compartments. Finally, these annotation data were synthesized to produce observations of gene density and number that accord well with historical estimates. Such a global approach had previously been described only for chromosomes 21 and 22, which together account for 2.2% of the genome. Conclusions We estimate that the genome contains 65,000-75,000 transcriptional units, with exon sequences comprising 4%. The creation of a comprehensive gene index requires the synthesis of all available computational and experimental evidence. PMID:11516338

  3. A highly conserved gene island of three genes on chromosome 3B of hexaploid wheat: diverse gene function and genomic structure maintained in a tightly linked block

    PubMed Central

    2010-01-01

    Background The complexity of the wheat genome has resulted from waves of retrotransposable element insertions. Gene deletions and disruptions generated by the fast replacement of repetitive elements in wheat have resulted in disruption of colinearity at a micro (sub-megabase) level among the cereals. In view of genomic changes that are possible within a given time span, conservation of genes between species tends to imply an important functional or regional constraint that does not permit a change in genomic structure. The ctg1034 contig completed in this paper was initially studied because it was assigned to the Sr2 resistance locus region, but detailed mapping studies subsequently assigned it to the long arm of 3B and revealed its unusual features. Results BAC shotgun sequencing of the hexaploid wheat (Triticum aestivum cv. Chinese Spring) genome has been used to assemble a group of 15 wheat BACs from the chromosome 3B physical map FPC contig ctg1034 into a 783,553 bp genomic sequence. This ctg1034 sequence was annotated for biological features such as genes and transposable elements. A three-gene island was identified among >80% repetitive DNA sequence. Using bioinformatics analysis there were no observable similarity in their gene functions. The ctg1034 gene island also displayed complete conservation of gene order and orientation with syntenic gene islands found in publicly available genome sequences of Brachypodium distachyon, Oryza sativa, Sorghum bicolor and Zea mays, even though the intergenic space and introns were divergent. Conclusion We propose that ctg1034 is located within the heterochromatic C-band region of deletion bin 3BL7 based on the identification of heterochromatic tandem repeats and presence of significant matches to chromodomain-containing gypsy LTR retrotransposable elements. We also speculate that this location, among other highly repetitive sequences, may account for the relative stability in gene order and orientation within the gene island. Sequence data from this article have been deposited with the GenBank Data Libraries under accession no. GQ422824 PMID:20507561

  4. Functional analysis of the mospd gene family 

    E-print Network

    Buerger, Katrin

    2010-01-01

    Mospd3, a gene located on mouse chromosome 5, was identified in a gene trap screen in ES cells. The gene trap vector integration in multiple copies into the putative promoter of the gene, resulted in a loss of expression of Mospd3 at the trapped...

  5. TreeQ-VISTA: An Interactive Tree Visualization Tool withFunctional Annotation Query Capabilities

    SciTech Connect

    Gu, Shengyin; Anderson, Iain; Kunin, Victor; Cipriano, Michael; Minovitsky, Simon; Weber, Gunther; Amenta, Nina; Hamann, Bernd; Dubchak,Inna

    2007-05-07

    Summary: We describe a general multiplatform exploratorytool called TreeQ-Vista, designed for presenting functional annotationsin a phylogenetic context. Traits, such as phenotypic and genomicproperties, are interactively queried from a relational database with auser-friendly interface which provides a set of tools for users with orwithout SQL knowledge. The query results are projected onto aphylogenetic tree and can be displayed in multiple color groups. A richset of browsing, grouping and query tools are provided to facilitatetrait exploration, comparison and analysis.Availability: The program,detailed tutorial and examples are available online athttp://genome-test.lbl.gov/vista/TreeQVista.

  6. Annotation of a hybrid partial genome of the coffee rust (Hemileia vastatrix) contributes to the gene repertoire catalog of the Pucciniales

    PubMed Central

    Cristancho, Marco A.; Botero-Rozo, David Octavio; Giraldo, William; Tabima, Javier; Riaño-Pachón, Diego Mauricio; Escobar, Carolina; Rozo, Yomara; Rivera, Luis F.; Durán, Andrés; Restrepo, Silvia; Eilam, Tamar; Anikster, Yehoshua; Gaitán, Alvaro L.

    2014-01-01

    Coffee leaf rust caused by the fungus Hemileia vastatrix is the most damaging disease to coffee worldwide. The pathogen has recently appeared in multiple outbreaks in coffee producing countries resulting in significant yield losses and increases in costs related to its control. New races/isolates are constantly emerging as evidenced by the presence of the fungus in plants that were previously resistant. Genomic studies are opening new avenues for the study of the evolution of pathogens, the detailed description of plant-pathogen interactions and the development of molecular techniques for the identification of individual isolates. For this purpose we sequenced 8 different H. vastatrix isolates using NGS technologies and gathered partial genome assemblies due to the large repetitive content in the coffee rust hybrid genome; 74.4% of the assembled contigs harbor repetitive sequences. A hybrid assembly of 333 Mb was built based on the 8 isolates; this assembly was used for subsequent analyses. Analysis of the conserved gene space showed that the hybrid H. vastatrix genome, though highly fragmented, had a satisfactory level of completion with 91.94% of core protein-coding orthologous genes present. RNA-Seq from urediniospores was used to guide the de novo annotation of the H. vastatrix gene complement. In total, 14,445 genes organized in 3921 families were uncovered; a considerable proportion of the predicted proteins (73.8%) were homologous to other Pucciniales species genomes. Several gene families related to the fungal lifestyle were identified, particularly 483 predicted secreted proteins that represent candidate effector genes and will provide interesting hints to decipher virulence in the coffee rust fungus. The genome sequence of Hva will serve as a template to understand the molecular mechanisms used by this fungus to attack the coffee plant, to study the diversity of this species and for the development of molecular markers to distinguish races/isolates. PMID:25400655

  7. Clock gene evolution and functional divergence.

    PubMed

    Tauber, Eran; Last, Kim S; Olive, Peter J W; Kyriacou, C P

    2004-10-01

    In considering the impact of the earth's changing geophysical conditions during the history of life, it is surprising to learn that the earth's rotational period may have been as short as 4 h, as recently as 1900 million years ago (or 1.9 billion years ago). The implications of such figures for the origin and evolution of clocks are considerable, and the authors speculate on how this short rotational period might have influenced the development of the "protoclock" in early microorganisms, such as the Cyanobacteria, during the geological periodsin which they arose and flourished. They then discuss the subsequent duplication of clock genes that took place around and after the Cambrian period, 543 million years ago, and its consequences. They compare the relative divergences of the canonical clock genes, which reveal the Per family to be the most rapidly evolving. In addition, the authors use a statistical test to predict which residues within the PER and CRY families may have undergone functional specialization. PMID:15534324

  8. Partitioning heritability by functional annotation using genome-wide association summary statistics.

    PubMed

    Finucane, Hilary K; Bulik-Sullivan, Brendan; Gusev, Alexander; Trynka, Gosia; Reshef, Yakir; Loh, Po-Ru; Anttila, Verneri; Xu, Han; Zang, Chongzhi; Farh, Kyle; Ripke, Stephan; Day, Felix R; Purcell, Shaun; Stahl, Eli; Lindstrom, Sara; Perry, John R B; Okada, Yukinori; Raychaudhuri, Soumya; Daly, Mark J; Patterson, Nick; Neale, Benjamin M; Price, Alkes L

    2015-11-01

    Recent work has demonstrated that some functional categories of the genome contribute disproportionately to the heritability of complex diseases. Here we analyze a broad set of functional elements, including cell type-specific elements, to estimate their polygenic contributions to heritability in genome-wide association studies (GWAS) of 17 complex diseases and traits with an average sample size of 73,599. To enable this analysis, we introduce a new method, stratified LD score regression, for partitioning heritability from GWAS summary statistics while accounting for linked markers. This new method is computationally tractable at very large sample sizes and leverages genome-wide information. Our findings include a large enrichment of heritability in conserved regions across many traits, a very large immunological disease-specific enrichment of heritability in FANTOM5 enhancers and many cell type-specific enrichments, including significant enrichment of central nervous system cell types in the heritability of body mass index, age at menarche, educational attainment and smoking behavior. PMID:26414678

  9. Homology modeling, functional annotation and comparative genomics of outer membrane protein H of Pasteurella multocida.

    PubMed

    Ganguly, Bhaskar; Tewari, Kamal; Singh, Rashmi

    2015-12-01

    Pasteurella multocida is an important pathogen of animals and humans. Outer Membrane Protein (Omp) H is a major conserved protein in the envelope of P. multocida and has been commonly targeted as a protective antigen. However, not much is known about its structure and function due to the difficulties that are typically associated with obtaining sufficient amounts of purified prokaryotic transmembrane proteins. The present work is aimed at studying the OmpH using an in silico approach and consolidate the findings in light of existing experimental evidences. Our study describes the first 3D model of the P. multocida OmpH obtained through a combination of several in silico modeling approaches. From our results, OmpH of P. multocida could be classified as a homotrimeric, 16 stranded, ?-barrel porin involved in the non-specific transport of small, hydrophilic molecules, serving essential osmoregulatory function. Moreover, very small homologous sequences could be identified in the host proteome, strengthening the probability of a successful OmpH-based vaccine against the pathogen with remote chances of cross-reaction to host proteins. PMID:26362105

  10. Comparative genome analysis of PHB gene family reveals deep evolutionary origins and diverse gene function

    PubMed Central

    2010-01-01

    Background PHB (Prohibitin) gene family is involved in a variety of functions important for different biological processes. PHB genes are ubiquitously present in divergent species from prokaryotes to eukaryotes. Human PHB genes have been found to be associated with various diseases. Recent studies by our group and others have shown diverse function of PHB genes in plants for development, senescence, defence, and others. Despite the importance of the PHB gene family, no comprehensive gene family analysis has been carried to evaluate the relatedness of PHB genes across different species. In order to better guide the gene function analysis and understand the evolution of the PHB gene family, we therefore carried out the comparative genome analysis of the PHB genes across different kingdoms. Results The relatedness, motif distribution, and intron/exon distribution all indicated that PHB genes is a relatively conserved gene family. The PHB genes can be classified into 5 classes and each class have a very deep evolutionary origin. The PHB genes within the class maintained the same motif patterns during the evolution. With Arabidopsis as the model species, we found that PHB gene intron/exon structure and domains are also conserved during the evolution. Despite being a conserved gene family, various gene duplication events led to the expansion of the PHB genes. Both segmental and tandem gene duplication were involved in Arabidopsis PHB gene family expansion. However, segmental duplication is predominant in Arabidopsis. Moreover, most of the duplicated genes experienced neofunctionalization. The results highlighted that PHB genes might be involved in important functions so that the duplicated genes are under the evolutionary pressure to derive new function. Conclusion PHB gene family is a conserved gene family and accounts for diverse but important biological functions based on the similar molecular mechanisms. The highly diverse biological function indicated that more research needs to be carried out to dissect the PHB gene function. The conserved gene evolution indicated that the study in the model species can be translated to human and mammalian studies. PMID:20946606

  11. 9. Bioconductor: annotation databases Thomas Lumley

    E-print Network

    Rice, Ken

    ). 9.3 #12;Annotate Bioconductor distributes annotation packages for a wide range of gene expression Ontology and one of the Affymetrix human microarray chips. 9.4 #12;Lookups The databases are queried includes cyclic nucleotides (nucleoside cyclic phosphates). Synonym: nucleotide metabolism 9.6 #12;Bio

  12. 9. Bioconductor: annotation databases Thomas Lumley

    E-print Network

    Rice, Ken

    (biological process). 9.3 #12;Annotate Bioconductor distributes annotation packages for a wide range of gene Ontology and one of the Affymetrix human microarray chips. 9.4 #12;Lookups The databases are queried includes cyclic nucleotides (nucleoside cyclic phosphates). Synonym: nucleotide metabolism 9.6 #12;Bio

  13. ANNOTATION OF THE AFFYMETRIX PORCINE GENOME MICROARRAY

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The Affymetrix Porcine Genome Microarray is minimally annotated. Less than 10% of the probe sets on this array are described with gene names, posing a challenge to biological interpretation of data. Lack of annotation is likely due to limited availability of full-length porcine cDNA sequence. Pr...

  14. Towards an Informative Mutant Phenotype for Every Bacterial Gene

    PubMed Central

    Deutschbauer, Adam; Price, Morgan N.; Wetmore, Kelly M.; Tarjan, Daniel R.; Xu, Zhuchen; Shao, Wenjun; Leon, Dacia

    2014-01-01

    Mutant phenotypes provide strong clues to the functions of the underlying genes and could allow annotation of the millions of sequenced yet uncharacterized bacterial genes. However, it is not known how many genes have a phenotype under laboratory conditions, how many phenotypes are biologically interpretable for predicting gene function, and what experimental conditions are optimal to maximize the number of genes with a phenotype. To address these issues, we measured the mutant fitness of 1,586 genes of the ethanol-producing bacterium Zymomonas mobilis ZM4 across 492 diverse experiments and found statistically significant phenotypes for 89% of all assayed genes. Thus, in Z. mobilis, most genes have a functional consequence under laboratory conditions. We demonstrate that 41% of Z. mobilis genes have both a strong phenotype and a similar fitness pattern (cofitness) to another gene, and are therefore good candidates for functional annotation using mutant fitness. Among 502 poorly characterized Z. mobilis genes, we identified a significant cofitness relationship for 174. For 57 of these genes without a specific functional annotation, we found additional evidence to support the biological significance of these gene-gene associations, and in 33 instances, we were able to predict specific physiological or biochemical roles for the poorly characterized genes. Last, we identified a set of 79 diverse mutant fitness experiments in Z. mobilis that are nearly as biologically informative as the entire set of 492 experiments. Therefore, our work provides a blueprint for the functional annotation of diverse bacteria using mutant fitness. PMID:25112473

  15. Towards an informative mutant phenotype for every bacterial gene

    DOE PAGESBeta

    Deutschbauer, Adam; Price, Morgan N.; Wetmore, Kelly M.; Tarjan, Daniel R.; Xu, Zhuchen; Shao, Wenjen; Leon, Dacia; Arkin, Adam P.; Skerker, Jeffrey M.

    2014-08-11

    Mutant phenotypes provide strong clues to the functions of the underlying genes and could allow annotation of the millions of sequenced yet uncharacterized bacterial genes. However, it is not known how many genes have a phenotype under laboratory conditions, how many phenotypes are biologically interpretable for predicting gene function, and what experimental conditions are optimal to maximize the number of genes with a phenotype. To address these issues, we measured the mutant fitness of 1,586 genes of the ethanol-producing bacterium Zymomonas mobilis ZM4 across 492 diverse experiments and found statistically significant phenotypes for 89% of all assayed genes. Thus, inmore »Z. mobilis, most genes have a functional consequence under laboratory conditions. We demonstrate that 41% of Z. mobilis genes have both a strong phenotype and a similar fitness pattern (cofitness) to another gene, and are therefore good candidates for functional annotation using mutant fitness. Among 502 poorly characterized Z. mobilis genes, we identified a significant cofitness relationship for 174. For 57 of these genes without a specific functional annotation, we found additional evidence to support the biological significance of these gene-gene associations, and in 33 instances, we were able to predict specific physiological or biochemical roles for the poorly characterized genes. Last, we identified a set of 79 diverse mutant fitness experiments in Z. mobilis that are nearly as biologically informative as the entire set of 492 experiments. Therefore, our work provides a blueprint for the functional annotation of diverse bacteria using mutant fitness.« less

  16. In silico cloning and annotation of genes involved in the digestion, detoxification and RNA interference mechanism in the midgut of Bactrocera dorsalis [Hendel (Diptera: Tephritidae)].

    PubMed

    Shen, G-M; Dou, W; Huang, Y; Jiang, X-Z; Smagghe, G; Wang, J-J

    2013-08-01

    As the second largest organ in insects, the insect midgut is the major tissue involved in the digestion of food and detoxification of xenobiotics, such as insecticides, and the first barrier and target for oral RNA interference (RNAi). In this study, we performed a midgut-specific transcriptome analysis in the oriental fruit fly, Bactrocera dorsalis, an economically important worldwide pest, with many populations showing high levels of insecticide resistance. Using high-throughput sequencing, 52?838?060 short reads were generated and assembled to 25?236 unigenes with a mean length of 758?bp. Interestingly, 34 unique sequences encoding digestion enzymes were newly described and these included aminopeptidase and trypsin, genes associated with Bacillus thuringiensis resistance and fitness cost. Second, 41 transcripts were annotated to particular detoxification genes such as glutathione S-transferases, carboxylesterases and cytochrome P450s, and the subsequent phylogenetic analysis indicated homology with tissue-specific and insecticide resistance-related genes of Drosophila melanogaster. Third, we identified the genes involved in the mechanism of RNAi and the uptake of double-stranded RNA. The sequences encoding Dicer-2, R2D2, AGO2, and Eater were confirmed, but SID and SR-CI were absent in the midgut transcriptome. In conclusion, the results provide basic molecular information to better understand the mechanisms of food digestion, insecticide resistance and oral RNAi in this important pest insect in agriculture. Specific genes in these systems can be used in the future as potential targets for pest control, for instance, with RNAi technology. PMID:23577657

  17. Ranking Biomedical Annotations with Annotator's Semantic Relevancy

    PubMed Central

    2014-01-01

    Biomedical annotation is a common and affective artifact for researchers to discuss, show opinion, and share discoveries. It becomes increasing popular in many online research communities, and implies much useful information. Ranking biomedical annotations is a critical problem for data user to efficiently get information. As the annotator's knowledge about the annotated entity normally determines quality of the annotations, we evaluate the knowledge, that is, semantic relationship between them, in two ways. The first is extracting relational information from credible websites by mining association rules between an annotator and a biomedical entity. The second way is frequent pattern mining from historical annotations, which reveals common features of biomedical entities that an annotator can annotate with high quality. We propose a weighted and concept-extended RDF model to represent an annotator, a biomedical entity, and their background attributes and merge information from the two ways as the context of an annotator. Based on that, we present a method to rank the annotations by evaluating their correctness according to user's vote and the semantic relevancy between the annotator and the annotated entity. The experimental results show that the approach is applicable and efficient even when data set is large. PMID:24899918

  18. High-resolution chemical dissection of a model eukaryote reveals targets, pathways and gene functions.

    PubMed

    Hoepfner, Dominic; Helliwell, Stephen B; Sadlish, Heather; Schuierer, Sven; Filipuzzi, Ireos; Brachat, Sophie; Bhullar, Bhupinder; Plikat, Uwe; Abraham, Yann; Altorfer, Marc; Aust, Thomas; Baeriswyl, Lukas; Cerino, Raffaele; Chang, Lena; Estoppey, David; Eichenberger, Juerg; Frederiksen, Mathias; Hartmann, Nicole; Hohendahl, Annika; Knapp, Britta; Krastel, Philipp; Melin, Nicolas; Nigsch, Florian; Oakeley, Edward J; Petitjean, Virginie; Petersen, Frank; Riedl, Ralph; Schmitt, Esther K; Staedtler, Frank; Studer, Christian; Tallarico, John A; Wetzel, Stefan; Fishman, Mark C; Porter, Jeffrey A; Movva, N Rao

    2014-01-01

    Due to evolutionary conservation of biology, experimental knowledge captured from genetic studies in eukaryotic model organisms provides insight into human cellular pathways and ultimately physiology. Yeast chemogenomic profiling is a powerful approach for annotating cellular responses to small molecules. Using an optimized platform, we provide the relative sensitivities of the heterozygous and homozygous deletion collections for nearly 1800 biologically active compounds. The data quality enables unique insights into pathways that are sensitive and resistant to a given perturbation, as demonstrated with both known and novel compounds. We present examples of novel compounds that inhibit the therapeutically relevant fatty acid synthase and desaturase (Fas1p and Ole1p), and demonstrate how the individual profiles facilitate hypothesis-driven experiments to delineate compound mechanism of action. Importantly, the scale and diversity of tested compounds yields a dataset where the number of modulated pathways approaches saturation. This resource can be used to map novel biological connections, and also identify functions for unannotated genes. We validated hypotheses generated by global two-way hierarchical clustering of profiles for (i) novel compounds with a similar mechanism of action acting upon microtubules or vacuolar ATPases, and (ii) an un-annotated ORF, YIL060w, that plays a role in respiration in the mitochondria. Finally, we identify and characterize background mutations in the widely used yeast deletion collection which should improve the interpretation of past and future screens throughout the community. This comprehensive resource of cellular responses enables the expansion of our understanding of eukaryotic pathway biology. PMID:24360837

  19. Functional Annotation of Two New Carboxypeptidases from the Amidohydrolase Superfamily of Enzymes

    SciTech Connect

    Xiang, D.; Xu, C; Kumaran, D; Brown, A; Sauder, M; Burley, S; Swaminathan, S; Raushel, F

    2009-01-01

    Two proteins from the amidohydrolase superfamily of enzymes were cloned, expressed, and purified to homogeneity. The first protein, Cc0300, was from Caulobacter crescentus CB-15 (Cc0300), while the second one (Sgx9355e) was derived from an environmental DNA sequence originally isolated from the Sargasso Sea (gi|44371129). The catalytic functions and the substrate profiles for the two enzymes were determined with the aid of combinatorial dipeptide libraries. Both enzymes were shown to catalyze the hydrolysis of l-Xaa-l-Xaa dipeptides in which the amino acid at the N-terminus was relatively unimportant. These enzymes were specific for hydrophobic amino acids at the C-terminus. With Cc0300, substrates terminating in isoleucine, leucine, phenylalanine, tyrosine, valine, methionine, and tryptophan were hydrolyzed. The same specificity was observed with Sgx9355e, but this protein was also able to hydrolyze peptides terminating in threonine. Both enzymes were able to hydrolyze N-acetyl and N-formyl derivatives of the hydrophobic amino acids and tripeptides. The best substrates identified for Cc0300 were l-Ala-l-Leu with kcat and kcat/Km values of 37 s-1 and 1.1 x 105 M-1 s-1, respectively, and N-formyl-l-Tyr with kcat and kcat/Km values of 33 s-1 and 3.9 x 105 M-1 s-1, respectively. The best substrate identified for Sgx9355e was l-Ala-l-Phe with kcat and kcat/Km values of 0.41 s-1 and 5.8 x 103 M-1 s-1. The three-dimensional structure of Sgx9355e was determined to a resolution of 2.33 Angstroms with l-methionine bound in the active site. The a-carboxylate of the methionine is ion-paired to His-237 and also hydrogen bonded to the backbone amide groups of Val-201 and Leu-202. The a-amino group of the bound methionine interacts with Asp-328. The structural determinants for substrate recognition were identified and compared with other enzymes in this superfamily that hydrolyze dipeptides with different specificities.

  20. Functional Annotation of Two New Carboxypeptidases from the Amidohydrolase Superfamily of Enzymes†

    PubMed Central

    Xiang, Dao Feng; Xu, Chengfu; Kumaran, Desigan; Brown, Ann C.; Sauder, J. Michael; Burley, Stephen K.; Swaminathan, Subramanyam; Raushel, Frank M.

    2009-01-01

    Two proteins from the amidohydrolase superfamily of enzymes were cloned, expressed and purified to homogeneity. The first protein, Cc0300, was from Caulobacter crescentus CB-15 (Cc0300) while the second one (Sgx9355e) was derived from an environmental DNA sequence originally isolated from the Sargasso Sea (gi| 44371129). The catalytic functions and the substrate profiles for the two enzymes were determined with the aid of combinatorial dipeptide libraries. Both enzymes were shown to catalyze the hydrolysis of L-Xaa-L-Xaa dipeptides where the amino acid at the N-terminus was relatively unimportant. These enzymes were specific for hydrophobic amino acids at the C-terminus. With Cc0300, substrates terminating in isoleucine, leucine, phenylalanine, tyrosine, valine, methionine, and tryptophan were hydrolyzed. The same specificity was observed with Sgx9355e but this protein was also able to hydrolyze peptides terminating in threonine. Both enzymes were able to hydrolyze N-acetyl and N-formyl derivatives of the hydrophobic amino acids and tripeptides. The best substrates identified for Cc0300 were L-Ala-L-Leu with values of kcat and kcat/Km of 37 s?1 and 1.1 × 105 M?1 s?1, respectively, and N-formyl-L-Tyr with values of kcat and kcat/Km of 33 s?1 and 3.9 × 105 M?1 s?1, respectively. The best substrate identified for Sgx9355e was L-Ala-L-Phe will values of kcat and kcat/Km of 0.41 s?1 and 5.8 × 103 M?1 s?1. The three-dimensional structure of Sgx9355e was determined to a resolution of 2.33 Å with L-methionine bound in the active site. The ?-carboxylate of the methionine is ion-paired to His-237 and also hydrogen bonded to the backbone amide groups of Val-201 and Leu-202. The ?-amino group of the bound methionine interacts with Asp-328. The structural determinants for substrate recognition were identified and compared with other enzymes in this superfamily that hydrolyze dipeptides with different specificities. PMID:19358546

  1. Reconstruction of a Functional Human Gene Network, with an Application for Prioritizing Positional Candidate Genes

    PubMed Central

    Franke, Lude; Bakel, Harm van; Fokkens, Like; de Jong, Edwin D.; Egmont-Petersen, Michael; Wijmenga, Cisca

    2006-01-01

    Most common genetic disorders have a complex inheritance and may result from variants in many genes, each contributing only weak effects to the disease. Pinpointing these disease genes within the myriad of susceptibility loci identified in linkage studies is difficult because these loci may contain hundreds of genes. However, in any disorder, most of the disease genes will be involved in only a few different molecular pathways. If we know something about the relationships between the genes, we can assess whether some genes (which may reside in different loci) functionally interact with each other, indicating a joint basis for the disease etiology. There are various repositories of information on pathway relationships. To consolidate this information, we developed a functional human gene network that integrates information on genes and the functional relationships between genes, based on data from the Kyoto Encyclopedia of Genes and Genomes, the Biomolecular Interaction Network Database, Reactome, the Human Protein Reference Database, the Gene Ontology database, predicted protein-protein interactions, human yeast two-hybrid interactions, and microarray coexpressions. We applied this network to interrelate positional candidate genes from different disease loci and then tested 96 heritable disorders for which the Online Mendelian Inheritance in Man database reported at least three disease genes. Artificial susceptibility loci, each containing 100 genes, were constructed around each disease gene, and we used the network to rank these genes on the basis of their functional interactions. By following up the top five genes per artificial locus, we were able to detect at least one known disease gene in 54% of the loci studied, representing a 2.8-fold increase over random selection. This suggests that our method can significantly reduce the cost and effort of pinpointing true disease genes in analyses of disorders for which numerous loci have been reported but for which most of the genes are unknown. PMID:16685651

  2. Gene function classification using Bayesian models with hierarchy-based priors

    PubMed Central

    Shahbaba, Babak; Neal, Radford M

    2006-01-01

    Background We investigate whether annotation of gene function can be improved using a classification scheme that is aware that functional classes are organized in a hierarchy. The classifiers look at phylogenic descriptors, sequence based attributes, and predicted secondary structure. We discuss three Bayesian models and compare their performance in terms of predictive accuracy. These models are the ordinary multinomial logit (MNL) model, a hierarchical model based on a set of nested MNL models, and an MNL model with a prior that introduces correlations between the parameters for classes that are nearby in the hierarchy. We also provide a new scheme for combining different sources of information. We use these models to predict the functional class of Open Reading Frames (ORFs) from the E. coli genome. Results The results from all three models show substantial improvement over previous methods, which were based on the C5 decision tree algorithm. The MNL model using a prior based on the hierarchy outperforms both the non-hierarchical MNL model and the nested MNL model. In contrast to previous attempts at combining the three sources of information in this dataset, our new approach to combining data sources produces a higher accuracy rate than applying our models to each data source alone. Conclusion Together, these results show that gene function can be predicted with higher accuracy than previously achieved, using Bayesian models that incorporate suitable prior information. PMID:17038174

  3. Speeding disease gene discovery by sequence based candidate prioritization 

    E-print Network

    Adie, Euan A; Adams, Richard R; Evans, Kathryn L; Porteous, David; Pickard, Ben S

    2005-03-14

    Background: Regions of interest identified through genetic linkage studies regularly exceed 30 centimorgans in size and can contain hundreds of genes. Traditionally this number is reduced by matching functional annotation ...

  4. DAnCER: Disease-Annotated Chromatin Epigenetics Resource

    PubMed Central

    Turinsky, Andrei L.; Turner, Brian; Borja, Rosanne C.; Gleeson, James A.; Heath, Michael; Pu, Shuye; Switzer, Thomas; Dong, Dong; Gong, Yunchen; On, Tuan; Xiong, Xuejian; Emili, Andrew; Greenblatt, Jack; Parkinson, John; Zhang, Zhaolei; Wodak, Shoshana J.

    2011-01-01

    Chromatin modification (CM) is a set of epigenetic processes that govern many aspects of DNA replication, transcription and repair. CM is carried out by groups of physically interacting proteins, and their disruption has been linked to a number of complex human diseases. CM remains largely unexplored, however, especially in higher eukaryotes such as human. Here we present the DAnCER resource, which integrates information on genes with CM function from five model organisms, including human. Currently integrated are gene functional annotations, Pfam domain architecture, protein interaction networks and associated human diseases. Additional supporting evidence includes orthology relationships across organisms, membership in protein complexes, and information on protein 3D structure. These data are available for 962 experimentally confirmed and manually curated CM genes and for over 5000 genes with predicted CM function on the basis of orthology and domain composition. DAnCER allows visual explorations of the integrated data and flexible query capabilities using a variety of data filters. In particular, disease information and functional annotations are mapped onto the protein interaction networks, enabling the user to formulate new hypotheses on the function and disease associations of a given gene based on those of its interaction partners. DAnCER is freely available at http://wodaklab.org/dancer/. PMID:20876685

  5. Evolution of new functions de novo and from preexisting genes.

    PubMed

    Andersson, Dan I; Jerlström-Hultqvist, Jon; Näsvall, Joakim

    2015-06-01

    How the enormous structural and functional diversity of new genes and proteins was generated (estimated to be 10(10)-10(12) different proteins in all organisms on earth [Choi I-G, Kim S-H. 2006. Evolution of protein structural classes and protein sequence families. Proc Natl Acad Sci 103: 14056-14061] is a central biological question that has a long and rich history. Extensive work during the last 80 years have shown that new genes that play important roles in lineage-specific phenotypes and adaptation can originate through a multitude of different mechanisms, including duplication, lateral gene transfer, gene fusion/fission, and de novo origination. In this review, we focus on two main processes as generators of new functions: evolution of new genes by duplication and divergence of pre-existing genes and de novo gene origination in which a whole protein-coding gene evolves from a noncoding sequence. PMID:26032716

  6. An atlas of bovine gene expression reveals novel distinctive tissue characteristics and evidence for improving genome annotation

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Background A comprehensive transcriptome survey, or gene atlas, provides information essential for a complete understanding of the genomic biology of an organism. We present an atlas of RNA abundance for 92 adult, juvenile and fetal cattle tissues and three cattle cell lines. Results The Bovine Gene...

  7. Network analysis of gene essentiality in functional genomics experiments.

    PubMed

    Jiang, Peng; Wang, Hongfang; Li, Wei; Zang, Chongzhi; Li, Bo; Wong, Yinling J; Meyer, Cliff; Liu, Jun S; Aster, Jon C; Liu, X Shirley

    2015-01-01

    Many genomic techniques have been developed to study gene essentiality genome-wide, such as CRISPR and shRNA screens. Our analyses of public CRISPR screens suggest protein interaction networks, when integrated with gene expression or histone marks, are highly predictive of gene essentiality. Meanwhile, the quality of CRISPR and shRNA screen results can be significantly enhanced through network neighbor information. We also found network neighbor information to be very informative on prioritizing ChIP-seq target genes and survival indicator genes from tumor profiling. Thus, our study provides a general method for gene essentiality analysis in functional genomic experiments ( http://nest.dfci.harvard.edu ). PMID:26518695

  8. 8. Bioconductor Intro and Annotation Thomas Lumley

    E-print Network

    Rice, Ken

    8. Bioconductor Intro and Annotation Ken Rice Thomas Lumley Universities of Washington and Auckland for a microarray or SNP chip. 8.13 #12;Types of database Translations of names: Affy probe 32972_at is the gene

  9. Introduction GO FLNs GAIN Results Other Algorithms Gene Function Prediction

    E-print Network

    Zhang, Liqing

    Introduction GO FLNs GAIN Results Other Algorithms Gene Function Prediction T. M. Murali Department, 6, 11, 2011 T. M. Murali CS 3824 Gene Function Prediction #12;Introduction GO FLNs GAIN Results;Introduction GO FLNs GAIN Results Other Algorithms to the System Keith Haring, Untited, 1986 Urs Wehrli

  10. Detecting Mutual Functional Gene Clusters from Multiple Related Diseases

    E-print Network

    Buffalo, State University of New York

    Detecting Mutual Functional Gene Clusters from Multiple Related Diseases Nan Du, Xiaoyi Li, Yuan opportunity for understanding the functional genomics of a specific disease. Due to its strong power the gene clusters for various diseases. However, more and more evidence suggest that human diseases

  11. Gene expression and function involved in polyol biosynthesis of Trichosporonoides megachiliensis under hyper-osmotic stress.

    PubMed

    Kobayashi, Yosuke; Yoshida, Junjiro; Iwata, Hisashi; Koyama, Yoshiyuki; Kato, Jun; Ogihara, Jun; Kasumi, Takafumi

    2013-06-01

    Among three erythritol reductase isogenes (er1, er2, and er3) in Trichosporonoides megachiliensis SN-124A, er1 and er2 each had one stress response element (STRE) approximately 2 kbp upstream of their respective initiator codon; in contrast, er3 had two STREs, 148 and 40 bp upstream from the initiator codon. Based on intracellular erythritol accumulation and gene expression profiles, er3 seemed to be highly responsive to stress than er1 or er2. Under hyper-osmotic conditions, intracellular glycerol production, increased significantly within 1.5 h together with glycerol-3-phosphate dehydrogenase gene (gpd1) expression; in contrast, neither er gene expression nor the corresponding production of intracellular erythritol increased significantly within the first 1.5 h of hyper-osmotic culture. However, within 24 h of hyper-osmotic culture, erythritol production and er3 gene expression increased significantly and in parallel. Thus, we concluded that, as an initial response to hyper-osmotic growth conditions, T. megachiliensis produces glycerol as an osmoregulatory compatible solute via GPD; however, within 24 h, it begins to produce erythritol, mainly via ER3, as the preferred compatible solute. Heterologous expression of ers in a Saccharomyces cerevisiae mutant indicated that any of three ers might not function in S. cerevisiae for erythritol biosynthesis in spite of ers and corresponding ERs expression. Hence, although er is annotated as a galactose-inducible crystalline-like yeast protein gene (gcy1) homolog, er may be functionally different from gcy1 in glycolytic metabolism. Otherwise, S. cerevisiae is not likely to produce erythrose, the substrate of erythrose reductase due to metabolic characteristics. PMID:23294575

  12. Chemical genomics for studying parasite gene function and interaction

    PubMed Central

    Li, Jian; Yuan, Jing; Chen, Chin-chien; Inglese, James; Su, Xin-zhuan

    2013-01-01

    With the development of new technologies in genome sequencing, gene expression profiling, genotyping, and high-throughput screening of chemical compound libraries, small molecules are playing increasingly important roles in studying gene expression regulation, gene-gene interaction, and gene function. Here we briefly review and discuss some recent advancements in drug target identification and phenotype characterization using combinations of high-throughput screening of small-molecule libraries and various genome-wide methods such as whole genome sequencing, genome-wide association studies, and genome-wide expressional analysis. These approaches can be used to search for new drugs against parasitic infections, to identify drug targets or drug-resistance genes, and to infer gene function. PMID:24215777

  13. Function does not follow form in gene regulatory circuits

    PubMed Central

    Payne, Joshua L.; Wagner, Andreas

    2015-01-01

    Gene regulatory circuits are to the cell what arithmetic logic units are to the chip: fundamental components of information processing that map an input onto an output. Gene regulatory circuits come in many different forms, distinct structural configurations that determine who regulates whom. Studies that have focused on the gene expression patterns (functions) of circuits with a given structure (form) have examined just a few structures or gene expression patterns. Here, we use a computational model to exhaustively characterize the gene expression patterns of nearly 17 million three-gene circuits in order to systematically explore the relationship between circuit form and function. Three main conclusions emerge. First, function does not follow form. A circuit of any one structure can have between twelve and nearly thirty thousand distinct gene expression patterns. Second, and conversely, form does not follow function. Most gene expression patterns can be realized by more than one circuit structure. And third, multifunctionality severely constrains circuit form. The number of circuit structures able to drive multiple gene expression patterns decreases rapidly with the number of these patterns. These results indicate that it is generally not possible to infer circuit function from circuit form, or vice versa. PMID:26290154

  14. Using CATH-Gene3D to Analyze the Sequence, Structure, and Function of Proteins.

    PubMed

    Sillitoe, Ian; Lewis, Tony; Orengo, Christine

    2015-01-01

    The CATH database is a classification of protein structures found in the Protein Data Bank (PDB). Protein structures are chopped into individual units of structural domains, and these domains are grouped together into superfamilies if there is sufficient evidence that they have diverged from a common ancestor during the process of evolution. A sister resource, Gene3D, extends this information by scanning sequence profiles of these CATH domain superfamilies against many millions of known proteins to identify related sequences. Thus the combined CATH-Gene3D resource provides confident predictions of the likely structural fold, domain organisation, and evolutionary relatives of these proteins. In addition, this resource incorporates annotations from a large number of external databases such as known enzyme active sites, GO molecular functions, physical interactions, and mutations. This unit details how to access and understand the information contained within the CATH-Gene3D Web pages, the downloadable data files, and the remotely accessible Web services. © 2015 by John Wiley & Sons, Inc. PMID:26087950

  15. Bioinformatic prediction of gene functions regulated by quorum sensing in the bioleaching bacterium Acidithiobacillus ferrooxidans.

    PubMed

    Banderas, Alvaro; Guiliani, Nicolas

    2013-01-01

    The biomining bacterium Acidithiobacillus ferrooxidans oxidizes sulfide ores and promotes metal solubilization. The efficiency of this process depends on the attachment of cells to surfaces, a process regulated by quorum sensing (QS) cell-to-cell signalling in many Gram-negative bacteria. At. ferrooxidans has a functional QS system and the presence of AHLs enhances its attachment to pyrite. However, direct targets of the QS transcription factor AfeR remain unknown. In this study, a bioinformatic approach was used to infer possible AfeR direct targets based on the particular palindromic features of the AfeR binding site. A set of Hidden Markov Models designed to maintain palindromic regions and vary non-palindromic regions was used to screen for putative binding sites. By annotating the context of each predicted binding site (PBS), we classified them according to their positional coherence relative to other putative genomic structures such as start codons, RNA polymerase promoter elements and intergenic regions. We further used the Multiple EM for Motif Elicitation algorithm (MEME) to further filter out low homology PBSs. In summary, 75 target-genes were identified, 34 of which have a higher confidence level. Among the identified genes, we found afeR itself, zwf, genes encoding glycosyltransferase activities, metallo-beta lactamases, and active transport-related proteins. Glycosyltransferases and Zwf (Glucose 6-phosphate-1-dehydrogenase) might be directly involved in polysaccharide biosynthesis and attachment to minerals by At. ferrooxidans cells during the bioleaching process. PMID:23959118

  16. Bioinformatic Prediction of Gene Functions Regulated by Quorum Sensing in the Bioleaching Bacterium Acidithiobacillus ferrooxidans

    PubMed Central

    Banderas, Alvaro; Guiliani, Nicolas

    2013-01-01

    The biomining bacterium Acidithiobacillus ferrooxidans oxidizes sulfide ores and promotes metal solubilization. The efficiency of this process depends on the attachment of cells to surfaces, a process regulated by quorum sensing (QS) cell-to-cell signalling in many Gram-negative bacteria. At. ferrooxidans has a functional QS system and the presence of AHLs enhances its attachment to pyrite. However, direct targets of the QS transcription factor AfeR remain unknown. In this study, a bioinformatic approach was used to infer possible AfeR direct targets based on the particular palindromic features of the AfeR binding site. A set of Hidden Markov Models designed to maintain palindromic regions and vary non-palindromic regions was used to screen for putative binding sites. By annotating the context of each predicted binding site (PBS), we classified them according to their positional coherence relative to other putative genomic structures such as start codons, RNA polymerase promoter elements and intergenic regions. We further used the Multiple EM for Motif Elicitation algorithm (MEME) to further filter out low homology PBSs. In summary, 75 target-genes were identified, 34 of which have a higher confidence level. Among the identified genes, we found afeR itself, zwf, genes encoding glycosyltransferase activities, metallo-beta lactamases, and active transport-related proteins. Glycosyltransferases and Zwf (Glucose 6-phosphate-1-dehydrogenase) might be directly involved in polysaccharide biosynthesis and attachment to minerals by At. ferrooxidans cells during the bioleaching process. PMID:23959118

  17. Shotgun Metagenomic Sequencing Reveals Functional Genes and Microbiome Associated with Bovine Digital Dermatitis

    PubMed Central

    Zinicola, Martin; Higgins, Hazel; Lima, Svetlana; Machado, Vinicius; Guard, Charles; Bicalho, Rodrigo

    2015-01-01

    Metagenomic methods amplifying 16S ribosomal RNA genes have been used to describe the microbial diversity of healthy skin and lesion stages of bovine digital dermatitis (DD) and to detect critical pathogens involved with disease pathogenesis. In this study, we characterized the microbiome and for the first time, the composition of functional genes of healthy skin (HS), active (ADD) and inactive (IDD) lesion stages using a whole-genome shotgun approach. Metagenomic sequences were annotated using MG-RAST pipeline. Six phyla were identified as the most abundant. Firmicutes and Actinobacteria were the predominant bacterial phyla in the microbiome of HS, while Spirochetes, Bacteroidetes and Proteobacteria were highly abundant in ADD and IDD. T. denticola-like, T. vincentii-like and T. phagedenis-like constituted the most abundant species in ADD and IDD. Recruitment plots comparing sequences from HS, ADD and IDD samples to the genomes of specific Treponema spp., supported the presence of T. denticola and T. vincentii in ADD and IDD. Comparison of the functional composition of HS to ADD and IDD identified a significant difference in genes associated with motility/chemotaxis and iron acquisition/metabolism. We also provide evidence that the microbiome of ADD and IDD compared to that of HS had significantly higher abundance of genes associated with resistance to copper and zinc, which are commonly used in footbaths to prevent and control DD. In conclusion, the results from this study provide new insights into the HS, ADD and IDD microbiomes, improve our understanding of the disease pathogenesis and generate unprecedented knowledge regarding the functional genetic composition of the digital dermatitis microbiome. PMID:26193110

  18. FUNCTIONAL NANOPARTICLES FOR MOLECULAR IMAGING GUIDED GENE DELIVERY

    PubMed Central

    Liu, Gang; Swierczewska, Magdalena; Lee, Seulki; Chen, Xiaoyuan

    2010-01-01

    Gene therapy has great potential to bring tremendous changes in treatment of various diseases and disorders. However, one of the impediments to successful gene therapy is the inefficient delivery of genes to target tissues and the inability to monitor delivery of genes and therapeutic responses at the targeted site. The emergence of molecular imaging strategies has been pivotal in optimizing gene therapy; since it can allow us to evaluate the effectiveness of gene delivery noninvasively and spatiotemporally. Due to the unique physiochemical properties of nanomaterials, numerous functional nanoparticles show promise in accomplishing gene delivery with the necessary feature of visualizing the delivery. In this review, recent developments of nanoparticles for molecular imaging guided gene delivery are summarized. PMID:22473061

  19. Corpus annotation methodology for citation classification in scientific literature

    E-print Network

    1 Corpus annotation methodology for citation classification in scientific literature Myriam generation and testing of automatic systems for classifying the purpose or function of a citation referenced a methodology for corpus annotation for citation classification in scientific literature that facilitate

  20. When natural selection gives gene function the cold shoulder.

    PubMed

    Cutter, Asher D; Jovelin, Richard

    2015-11-01

    It is tempting to invoke organismal selection as perpetually optimizing the function of any given gene. However, natural selection can drive genic functional change without improvement of biochemical activity, even to the extinction of gene activity. Detrimental mutations can creep in owing to linkage with other selectively favored loci. Selection can promote functional degradation, irrespective of genetic drift, when adaptation occurs by loss of gene function. Even stabilizing selection on a trait can lead to divergence of the underlying molecular constituents. Selfish genetic elements can also proliferate independent of any functional benefits to the host genome. Here we review the logic and evidence for these diverse processes acting in genome evolution. This collection of distinct evolutionary phenomena - while operating through easily understandable mechanisms - all contribute to the seemingly counterintuitive notion that maintenance or improvement of a gene's biochemical function sometimes do not determine its evolutionary fate. PMID:26411745

  1. Fine-scale genetic mapping of a hybrid sterility factor between Drosophila simulans and D. mauritiana: the varied and elusive functions of "speciation genes"

    PubMed Central

    2010-01-01

    Background Hybrid male sterility (HMS) is a usual outcome of hybridization between closely related animal species. It arises because interactions between alleles that are functional within one species may be disrupted in hybrids. The identification of genes leading to hybrid sterility is of great interest for understanding the evolutionary process of speciation. In the current work we used marked P-element insertions as dominant markers to efficiently locate one genetic factor causing a severe reduction in fertility in hybrid males of Drosophila simulans and D. mauritiana. Results Our mapping effort identified a region of 9 kb on chromosome 3, containing three complete and one partial coding sequences. Within this region, two annotated genes are suggested as candidates for the HMS factor, based on the comparative molecular characterization and public-source information. Gene Taf1 is partially contained in the region, but yet shows high polymorphism with four fixed non-synonymous substitutions between the two species. Its molecular functions involve sequence-specific DNA binding and transcription factor activity. Gene agt is a small, intronless gene, whose molecular function is annotated as methylated-DNA-protein-cysteine S-methyltransferase activity. High polymorphism and one fixed non-synonymous substitution suggest this is a fast evolving gene. The gene trees of both genes perfectly separate D. simulans and D. mauritiana into monophyletic groups. Analysis of gene expression using microarray revealed trends that were similar to those previously found in comparisons between whole-genome hybrids and parental species. Conclusions The identification following confirmation of the HMS candidate gene will add another case study leading to understanding the evolutionary process of hybrid incompatibility. PMID:21144061

  2. Horizontal functional gene transfer from bacteria to fishes.

    PubMed

    Sun, Bao-Fa; Li, Tong; Xiao, Jin-Hua; Jia, Ling-Yi; Liu, Li; Zhang, Peng; Murphy, Robert W; He, Shun-Min; Huang, Da-Wei

    2015-01-01

    Invertebrates can acquire functional genes via horizontal gene transfer (HGT) from bacteria but fishes are not known to do so. We provide the first reliable evidence of one HGT event from marine bacteria to fishes. The HGT appears to have occurred after emergence of the teleosts. The transferred gene is expressed and regulated developmentally. Its successful integration and expression may change the genetic and metabolic repertoire of fishes. In addition, this gene contains conserved domains and similar tertiary structures in fishes and their putative donor bacteria. Thus, it may function similarly in both groups. Evolutionary analyses indicate that it evolved under purifying selection, further indicating its conserved function. We document the first likely case of HGT of functional gene from prokaryote to fishes. This discovery certifies that HGT can influence vertebrate evolution. PMID:26691285

  3. Horizontal functional gene transfer from bacteria to fishes

    PubMed Central

    Sun, Bao-Fa; Li, Tong; Xiao, Jin-Hua; Jia, Ling-Yi; Liu, Li; Zhang, Peng; Murphy, Robert W.; He, Shun-Min; Huang, Da-Wei

    2015-01-01

    Invertebrates can acquire functional genes via horizontal gene transfer (HGT) from bacteria but fishes are not known to do so. We provide the first reliable evidence of one HGT event from marine bacteria to fishes. The HGT appears to have occurred after emergence of the teleosts. The transferred gene is expressed and regulated developmentally. Its successful integration and expression may change the genetic and metabolic repertoire of fishes. In addition, this gene contains conserved domains and similar tertiary structures in fishes and their putative donor bacteria. Thus, it may function similarly in both groups. Evolutionary analyses indicate that it evolved under purifying selection, further indicating its conserved function. We document the first likely case of HGT of functional gene from prokaryote to fishes. This discovery certifies that HGT can influence vertebrate evolution. PMID:26691285

  4. Virus Induced Gene Silencing for Functional Characterization of Genes Expressing in Various Wheat Tissues

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Virus induced gene silencing (VIGS) has a great potential as a functional genomics tool. Methodological simplicity, robustness and speedy results make VIGS an ideal technique for high-throughput functional analysis of genes. In monocots, like wheat and barley, this technique has been shown to work...

  5. Improved annotation through genome-scale metabolic modeling of Aspergillus oryzae

    PubMed Central

    Vongsangnak, Wanwipa; Olsen, Peter; Hansen, Kim; Krogsgaard, Steen; Nielsen, Jens

    2008-01-01

    Background Since ancient times the filamentous fungus Aspergillus oryzae has been used in the fermentation industry for the production of fermented sauces and the production of industrial enzymes. Recently, the genome sequence of A. oryzae with 12,074 annotated genes was released but the number of hypothetical proteins accounted for more than 50% of the annotated genes. Considering the industrial importance of this fungus, it is therefore valuable to improve the annotation and further integrate genomic information with biochemical and physiological information available for this microorganism and other related fungi. Here we proposed the gene prediction by construction of an A. oryzae Expressed Sequence Tag (EST) library, sequencing and assembly. We enhanced the function assignment by our developed annotation strategy. The resulting better annotation was used to reconstruct the metabolic network leading to a genome scale metabolic model of A. oryzae. Results Our assembled EST sequences we identified 1,046 newly predicted genes in the A. oryzae genome. Furthermore, it was possible to assign putative protein functions to 398 of the newly predicted genes. Noteworthy, our annotation strategy resulted in assignment of new putative functions to 1,469 hypothetical proteins already present in the A. oryzae genome database. Using the substantially improved annotated genome we reconstructed the metabolic network of A. oryzae. This network contains 729 enzymes, 1,314 enzyme-encoding genes, 1,073 metabolites and 1,846 (1,053 unique) biochemical reactions. The metabolic reactions are compartmentalized into the cytosol, the mitochondria, the peroxisome and the extracellular space. Transport steps between the compartments and the extracellular space represent 281 reactions, of which 161 are unique. The metabolic model was validated and shown to correctly describe the phenotypic behavior of A. oryzae grown on different carbon sources. Conclusion A much enhanced annotation of the A. oryzae genome was performed and a genome-scale metabolic model of A. oryzae was reconstructed. The model accurately predicted the growth and biomass yield on different carbon sources. The model serves as an important resource for gaining further insight into our understanding of A. oryzae physiology. PMID:18500999

  6. Target Gene and Function Prediction of Differentially Expressed MicroRNAs in Lactating Mammary Glands of Dairy Goats

    PubMed Central

    Ji, Zhi-Bin; Chen, Cun-Xian; Wang, Gui-Zhi; Wang, Jian-Min

    2013-01-01

    MicroRNAs are small noncoding RNAs that can regulate gene expression, and they can be involved in the regulation of mammary gland development. The differential expression of miRNAs during mammary gland development is expected to provide insight into their roles in regulating the homeostasis of mammary gland tissues. To screen out miRNAs that should have important regulatory function in the development of mammary gland from miRNA expression profiles and to predict their function, in this study, the target genes of differentially expressed miRNAs in the lactating mammary glands of Laoshan dairy goats are predicted, and then the functions of these miRNAs are analyzed via bioinformatics. First, we screen the expression patterns of 25 miRNAs that had shown significant differences during the different lactation stages in the mammary gland. Then, these miRNAs are clustered according to their expression patterns. Computational methods were used to obtain 215 target genes for 22 of these miRNAs. Combining gene ontology annotation, Fisher's exact test, and KEGG analysis with the target prediction for these miRNAs, the regulatory functions of miRNAs belonging to different clusters are predicted. PMID:24195063

  7. Functional requirements driving the gene duplication in 12 Drosophila species

    PubMed Central

    2013-01-01

    Background Gene duplication supplies the raw materials for novel gene functions and many gene families arisen from duplication experience adaptive evolution. Most studies of young duplicates have focused on mammals, especially humans, whereas reports describing their genome-wide evolutionary patterns across the closely related Drosophila species are rare. The sequenced 12 Drosophila genomes provide the opportunity to address this issue. Results In our study, 3,647 young duplicate gene families were identified across the 12 Drosophila species and three types of expansions, species-specific, lineage-specific and complex expansions, were detected in these gene families. Our data showed that the species-specific young duplicate genes predominated (86.6%) over the other two types. Interestingly, many independent species-specific expansions in the same gene family have been observed in many species, even including 11 or 12 Drosophila species. Our data also showed that the functional bias observed in these young duplicate genes was mainly related to responses to environmental stimuli and biotic stresses. Conclusions This study reveals the evolutionary patterns of young duplicates across 12 Drosophila species on a genomic scale. Our results suggest that convergent evolution acts on young duplicate genes after the species differentiation and adaptive evolution may play an important role in duplicate genes for adaption to ecological factors and environmental changes in Drosophila. PMID:23945147

  8. Introns within Ribosomal Protein Genes Regulate the Production and Function

    E-print Network

    Zhang, Jianzhi

    -specific phenotypic effects. Together, our results indicate that splicing in yeast RP genes mediates intergeneIntrons within Ribosomal Protein Genes Regulate the Production and Function of Yeast Ribosomes In budding yeast, the most abundantly spliced pre- mRNAs encode ribosomal proteins (RPs). To investi- gate

  9. Drosophila duplicate genes evolve new functions on the fly

    PubMed Central

    Assis, Raquel

    2014-01-01

    Gene duplication is thought to play a key role in phenotypic innovation. While several processes have been hypothesized to drive the retention and functional evolution of duplicate genes, their genomic contributions have never been determined. We recently developed the first genome-wide method to classify these processes by comparing distances between expression profiles of duplicate genes and their ancestral single-copy orthologs. Application of our approach to spatial gene expression profiles in two Drosophila species revealed that a majority of young duplicate genes possess new functions, and that new functions are acquired rapidly—often within a few million years. Surprisingly, new functions tend to arise in younger copies of duplicate gene pairs. Moreover, we found that young duplicates are often specifically expressed in testes, whereas old duplicates are broadly expressed across several tissues, providing strong support for the hypothetical “out-of-testes” origin of new genes. In this Extra View, I discuss our findings in the context of theoretical predictions about gene duplication, with a particular emphasis on the importance of natural selection in the evolution of novel phenotypes. PMID:25483247

  10. The ubiquilin gene family: evolutionary patterns and functional insights

    PubMed Central

    2014-01-01

    Background Ubiquilins are proteins that function as ubiquitin receptors in eukaryotes. Mutations in two ubiquilin-encoding genes have been linked to the genesis of neurodegenerative diseases. However, ubiquilin functions are still poorly understood. Results In this study, evolutionary and functional data are combined to determine the origin and diversification of the ubiquilin gene family and to characterize novel potential roles of ubiquilins in mammalian species, including humans. The analysis of more than six hundred sequences allowed characterizing ubiquilin diversity in all the main eukaryotic groups. Many organisms (e. g. fungi, many animals) have single ubiquilin genes, but duplications in animal, plant, alveolate and excavate species are described. Seven different ubiquilins have been detected in vertebrates. Two of them, here called UBQLN5 and UBQLN6, had not been hitherto described. Significantly, marsupial and eutherian mammals have the most complex ubiquilin gene families, composed of up to 6 genes. This exceptional mammalian-specific expansion is the result of the recent emergence of four new genes, three of them (UBQLN3, UBQLN5 and UBQLNL) with precise testis-specific expression patterns that indicate roles in the postmeiotic stages of spermatogenesis. A gene with related features has independently arisen in species of the Drosophila genus. Positive selection acting on some mammalian ubiquilins has been detected. Conclusions The ubiquilin gene family is highly conserved in eukaryotes. The infrequent lineage-specific amplifications observed may be linked to the emergence of novel functions in particular tissues. PMID:24674348

  11. Concept annotation in the CRAFT corpus

    PubMed Central

    2012-01-01

    Background Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. Results This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement. Conclusions As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml. PMID:22776079

  12. GEneSTATION 1.0: a synthetic resource of diverse evolutionary and functional genomic data for studying the evolution of pregnancy-associated tissues and phenotypes

    PubMed Central

    Kim, Mara; Cooper, Brian A.; Venkat, Rohit; Phillips, Julie B.; Eidem, Haley R.; Hirbo, Jibril; Nutakki, Sashank; Williams, Scott M.; Muglia, Louis J.; Capra, J. Anthony; Petren, Kenneth; Abbot, Patrick; Rokas, Antonis; McGary, Kriston L.

    2016-01-01

    Mammalian gestation and pregnancy are fast evolving processes that involve the interaction of the fetal, maternal and paternal genomes. Version 1.0 of the GEneSTATION database (http://genestation.org) integrates diverse types of omics data across mammals to advance understanding of the genetic basis of gestation and pregnancy-associated phenotypes and to accelerate the translation of discoveries from model organisms to humans. GEneSTATION is built using tools from the Generic Model Organism Database project, including the biology-aware database CHADO, new tools for rapid data integration, and algorithms that streamline synthesis and user access. GEneSTATION contains curated life history information on pregnancy and reproduction from 23 high-quality mammalian genomes. For every human gene, GEneSTATION contains diverse evolutionary (e.g. gene age, population genetic and molecular evolutionary statistics), organismal (e.g. tissue-specific gene and protein expression, differential gene expression, disease phenotype), and molecular data types (e.g. Gene Ontology Annotation, protein interactions), as well as links to many general (e.g. Entrez, PubMed) and pregnancy disease-specific (e.g. PTBgene, dbPTB) databases. By facilitating the synthesis of diverse functional and evolutionary data in pregnancy-associated tissues and phenotypes and enabling their quick, intuitive, accurate and customized meta-analysis, GEneSTATION provides a novel platform for comprehensive investigation of the function and evolution of mammalian pregnancy. PMID:26567549

  13. Biosynthesis of Akaeolide and Lorneic Acids and Annotation of Type I Polyketide Synthase Gene Clusters in the Genome of Streptomyces sp. NPS554

    PubMed Central

    Zhou, Tao; Komaki, Hisayuki; Ichikawa, Natsuko; Hosoyama, Akira; Sato, Seizo; Igarashi, Yasuhiro

    2015-01-01

    The incorporation pattern of biosynthetic precursors into two structurally unique polyketides, akaeolide and lorneic acid A, was elucidated by feeding experiments with 13C-labeled precursors. In addition, the draft genome sequence of the producer, Streptomyces sp. NPS554, was performed and the biosynthetic gene clusters for these polyketides were identified. The putative gene clusters contain all the polyketide synthase (PKS) domains necessary for assembly of the carbon skeletons. Combined with the 13C-labeling results, gene function prediction enabled us to propose biosynthetic pathways involving unusual carbon-carbon bond formation reactions. Genome analysis also indicated the presence of at least ten orphan type I PKS gene clusters that might be responsible for the production of new polyketides. PMID:25603349

  14. Functions of the gene products of Escherichia coli.

    PubMed Central

    Riley, M

    1993-01-01

    A list of currently identified gene products of Escherichia coli is given, together with a bibliography that provides pointers to the literature on each gene product. A scheme to categorize cellular functions is used to classify the gene products of E. coli so far identified. A count shows that the numbers of genes concerned with small-molecule metabolism are on the same order as the numbers concerned with macromolecule biosynthesis and degradation. One large category is the category of tRNAs and their synthetases. Another is the category of transport elements. The categories of cell structure and cellular processes other than metabolism are smaller. Other subjects discussed are the occurrence in the E. coli genome of redundant pairs and groups of genes of identical or closely similar function, as well as variation in the degree of density of genetic information in different parts of the genome. PMID:7508076

  15. A sentence sliding window approach to extract protein annotations from biomedical articles

    PubMed Central

    Krallinger, Martin; Padron, Maria; Valencia, Alfonso

    2005-01-01

    Background Within the emerging field of text mining and statistical natural language processing (NLP) applied to biomedical articles, a broad variety of techniques have been developed during the past years. Nevertheless, there is still a great ned of comparative assessment of the performance of the proposed methods and the development of common evaluation criteria. This issue was addressed by the Critical Assessment of Text Mining Methods in Molecular Biology (BioCreative) contest. The aim of this contest was to assess the performance of text mining systems applied to biomedical texts including tools which recognize named entities such as genes and proteins, and tools which automatically extract protein annotations. Results The "sentence sliding window" approach proposed here was found to efficiently extract text fragments from full text articles containing annotations on proteins, providing the highest number of correctly predicted annotations. Moreover, the number of correct extractions of individual entities (i.e. proteins and GO terms) involved in the relationships used for the annotations was significantly higher than the correct extractions of the complete annotations (protein-function relations). Conclusion We explored the use of averaging sentence sliding windows for information extraction, especially in a context where conventional training data is unavailable. The combination of our approach with more refined statistical estimators and machine learning techniques might be a way to improve annotation extraction for future biomedical text mining applications. PMID:15960831

  16. Ranking, selecting, and prioritising genes with desirability functions

    PubMed Central

    2015-01-01

    In functional genomics experiments, researchers often select genes to follow-up or validate from a long list of differentially expressed genes. Typically, sharp thresholds are used to bin genes into groups such as significant/non-significant or fold change above/below a cut-off value, and ad hoc criteria are also used such as favouring well-known genes. Binning, however, is inefficient and does not take the uncertainty of the measurements into account. Furthermore, p-values, fold-changes, and other outcomes are treated as equally important, and relevant genes may be overlooked with such an approach. Desirability functions are proposed as a way to integrate multiple selection criteria for ranking, selecting, and prioritising genes. These functions map any variable to a continuous 0–1 scale, where one is maximally desirable and zero is unacceptable. Multiple selection criteria are then combined to provide an overall desirability that is used to rank genes. In addition to p-values and fold-changes, further experimental results and information contained in databases can be easily included as criteria. The approach is demonstrated with a breast cancer microarray data set. The functions and an example data set can be found in the desiR package on CRAN (https://cran.r-project.org/web/packages/desiR/) and the development version is available on GitHub (https://github.com/stanlazic/desiR). PMID:26644980

  17. Annotation and retrieval in protein interaction databases

    NASA Astrophysics Data System (ADS)

    Cannataro, Mario; Hiram Guzzi, Pietro; Veltri, Pierangelo

    2014-06-01

    Biological databases have been developed with a special focus on the efficient retrieval of single records or the efficient computation of specialized bioinformatics algorithms against the overall database, such as in sequence alignment. The continuos production of biological knowledge spread on several biological databases and ontologies, such as Gene Ontology, and the availability of efficient techniques to handle such knowledge, such as annotation and semantic similarity measures, enable the development on novel bioinformatics applications that explicitly use and integrate such knowledge. After introducing the annotation process and the main semantic similarity measures, this paper shows how annotations and semantic similarity can be exploited to improve the extraction and analysis of biologically relevant data from protein interaction databases. As case studies, the paper presents two novel software tools, OntoPIN and CytoSeVis, both based on the use of Gene Ontology annotations, for the advanced querying of protein interaction databases and for the enhanced visualization of protein interaction networks.

  18. Using deep RNA sequencing for the structural annotation of the laccaria bicolor mycorrhizal transcriptome.

    SciTech Connect

    Larsen, P. E.; Trivedi, G.; Sreedasyam, A.; Lu, V.; Podila, G. K.; Collart, F. R.; Biosciences Division; Univ. of Alabama

    2010-07-06

    Accurate structural annotation is important for prediction of function and required for in vitro approaches to characterize or validate the gene expression products. Despite significant efforts in the field, determination of the gene structure from genomic data alone is a challenging and inaccurate process. The ease of acquisition of transcriptomic sequence provides a direct route to identify expressed sequences and determine the correct gene structure. We developed methods to utilize RNA-seq data to correct errors in the structural annotation and extend the boundaries of current gene models using assembly approaches. The methods were validated with a transcriptomic data set derived from the fungus Laccaria bicolor, which develops a mycorrhizal symbiotic association with the roots of many tree species. Our analysis focused on the subset of 1501 gene models that are differentially expressed in the free living vs. mycorrhizal transcriptome and are expected to be important elements related to carbon metabolism, membrane permeability and transport, and intracellular signaling. Of the set of 1501 gene models, 1439 (96%) successfully generated modified gene models in which all error flags were successfully resolved and the sequences aligned to the genomic sequence. The remaining 4% (62 gene models) either had deviations from transcriptomic data that could not be spanned or generated sequence that did not align to genomic sequence. The outcome of this process is a set of high confidence gene models that can be reliably used for experimental characterization of protein function. 69% of expressed mycorrhizal JGI 'best' gene models deviated from the transcript sequence derived by this method. The transcriptomic sequence enabled correction of a majority of the structural inconsistencies and resulted in a set of validated models for 96% of the mycorrhizal genes. The method described here can be applied to improve gene structural annotation in other species, provided that there is a sequenced genome and a set of gene models.

  19. Transcriptator: An Automated Computational Pipeline to Annotate Assembled Reads and Identify Non Coding RNA

    PubMed Central

    Zuccaro, Antonio; Guarracino, Mario Rosario

    2015-01-01

    RNA-seq is a new tool to measure RNA transcript counts, using high-throughput sequencing at an extraordinary accuracy. It provides quantitative means to explore the transcriptome of an organism of interest. However, interpreting this extremely large data into biological knowledge is a problem, and biologist-friendly tools are lacking. In our lab, we developed Transcriptator, a web application based on a computational Python pipeline with a user-friendly Java interface. This pipeline uses the web services available for BLAST (Basis Local Search Alignment Tool), QuickGO and DAVID (Database for Annotation, Visualization and Integrated Discovery) tools. It offers a report on statistical analysis of functional and Gene Ontology (GO) annotation’s enrichment. It helps users to identify enriched biological themes, particularly GO terms, pathways, domains, gene/proteins features and protein—protein interactions related informations. It clusters the transcripts based on functional annotations and generates a tabular report for functional and gene ontology annotations for each submitted transcript to the web server. The implementation of QuickGo web-services in our pipeline enable the users to carry out GO-Slim analysis, whereas the integration of PORTRAIT (Prediction of transcriptomic non coding RNA (ncRNA) by ab initio methods) helps to identify the non coding RNAs and their regulatory role in transcriptome. In summary, Transcriptator is a useful software for both NGS and array data. It helps the users to characterize the de-novo assembled reads, obtained from NGS experiments for non-referenced organisms, while it also performs the functional enrichment analysis of differentially expressed transcripts/genes for both RNA-seq and micro-array experiments. It generates easy to read tables and interactive charts for better understanding of the data. The pipeline is modular in nature, and provides an opportunity to add new plugins in the future. Web application is freely available at: http://www-labgtp.na.icar.cnr.it/Transcriptator PMID:26581084

  20. Cost benefit theory and optimal design of gene regulation functions

    NASA Astrophysics Data System (ADS)

    Kalisky, Tomer; Dekel, Erez; Alon, Uri

    2007-12-01

    Cells respond to the environment by regulating the expression of genes according to environmental signals. The relation between the input signal level and the expression of the gene is called the gene regulation function. It is of interest to understand the shape of a gene regulation function in terms of the environment in which it has evolved and the basic constraints of biological systems. Here we address this by presenting a cost-benefit theory for gene regulation functions that takes into account temporally varying inputs in the environment and stochastic noise in the biological components. We apply this theory to the well-studied lac operon of E. coli. The present theory explains the shape of this regulation function in terms of temporal variation of the input signals, and of minimizing the deleterious effect of cell-cell variability in regulatory protein levels. We also apply the theory to understand the evolutionary tradeoffs in setting the number of regulatory proteins and for selection of feed-forward loops in genetic circuits. The present cost-benefit theory can be used to understand the shape of other gene regulatory functions in terms of environment and noise constraints.

  1. Global Regulatory Functions of the Staphylococcus aureus Endoribonuclease III in Gene Expression

    PubMed Central

    Lioliou, Efthimia; Sharma, Cynthia M.; Caldelari, Isabelle; Helfer, Anne-Catherine; Fechter, Pierre; Vandenesch, François; Vogel, Jörg; Romby, Pascale

    2012-01-01

    RNA turnover plays an important role in both virulence and adaptation to stress in the Gram-positive human pathogen Staphylococcus aureus. However, the molecular players and mechanisms involved in these processes are poorly understood. Here, we explored the functions of S. aureus endoribonuclease III (RNase III), a member of the ubiquitous family of double-strand-specific endoribonucleases. To define genomic transcripts that are bound and processed by RNase III, we performed deep sequencing on cDNA libraries generated from RNAs that were co-immunoprecipitated with wild-type RNase III or two different cleavage-defective mutant variants in vivo. Several newly identified RNase III targets were validated by independent experimental methods. We identified various classes of structured RNAs as RNase III substrates and demonstrated that this enzyme is involved in the maturation of rRNAs and tRNAs, regulates the turnover of mRNAs and non-coding RNAs, and autoregulates its synthesis by cleaving within the coding region of its own mRNA. Moreover, we identified a positive effect of RNase III on protein synthesis based on novel mechanisms. RNase III–mediated cleavage in the 5? untranslated region (5?UTR) enhanced the stability and translation of cspA mRNA, which encodes the major cold-shock protein. Furthermore, RNase III cleaved overlapping 5?UTRs of divergently transcribed genes to generate leaderless mRNAs, which constitutes a novel way to co-regulate neighboring genes. In agreement with recent findings, low abundance antisense RNAs covering 44% of the annotated genes were captured by co-immunoprecipitation with RNase III mutant proteins. Thus, in addition to gene regulation, RNase III is associated with RNA quality control of pervasive transcription. Overall, this study illustrates the complexity of post-transcriptional regulation mediated by RNase III. PMID:22761586

  2. Gene3D: structural assignment for whole genes and genomes using the CATH domain structure database.

    PubMed

    Buchan, Daniel W A; Shepherd, Adrian J; Lee, David; Pearl, Frances M G; Rison, Stuart C G; Thornton, Janet M; Orengo, Christine A

    2002-03-01

    We present a novel web-based resource, Gene3D, of precalculated structural assignments to gene sequences and whole genomes. This resource assigns structural domains from the CATH database to whole genes and links these to their curated functional and structural annotations within the CATH domain structure database, the functional Dictionary of Homologous Superfamilies (DHS) and PDBsum. Currently Gene3D provides annotation for 36 complete genomes (two eukaryotes, six archaea, and 28 bacteria). On average, between 30% and 40% of the genes of a given genome can be structurally annotated. Matches to structural domains are found using the profile-based method (PSI-BLAST). and a novel protocol, DRange, is used to resolve conflicts in matches involving different homologous superfamilies. PMID:11875040

  3. The language of gene ontology: a Zipf’s law analysis

    PubMed Central

    2012-01-01

    Background Most major genome projects and sequence databases provide a GO annotation of their data, either automatically or through human annotators, creating a large corpus of data written in the language of GO. Texts written in natural language show a statistical power law behaviour, Zipf’s law, the exponent of which can provide useful information on the nature of the language being used. We have therefore explored the hypothesis that collections of GO annotations will show similar statistical behaviours to natural language. Results Annotations from the Gene Ontology Annotation project were found to follow Zipf’s law. Surprisingly, the measured power law exponents were consistently different between annotation captured using the three GO sub-ontologies in the corpora (function, process and component). On filtering the corpora using GO evidence codes we found that the value of the measured power law exponent responded in a predictable way as a function of the evidence codes used to support the annotation. Conclusions Techniques from computational linguistics can provide new insights into the annotation process. GO annotations show similar statistical behaviours to those seen in natural language with measured exponents that provide a signal which correlates with the nature of the evidence codes used to support the annotations, suggesting that the measured exponent might provide a signal regarding the information content of the annotation. PMID:22676436

  4. Analysis and functional annotation of expressed sequence tags from in vitro cell lines of elasmobranchs: spiny dogfish shark (Squalus acanthias) and little skate (Leucoraja erinacea)

    PubMed Central

    Parton, Angela; Bayne, Christopher J.; Barnes, David W.

    2010-01-01

    Elasmobranchs are the most commonly used experimental models among the jawed, cartilaginous fish (Chondrichthyes). Previously we developed cell lines from embryos of two elasmobranchs, Squalus acanthias the spiny dogfish shark (SAE line), and Leucoraja erinacea the little skate (LEE-1 line). From these lines cDNA libraries were derived and expressed sequence tags (ESTs) generated. From the SAE cell line 4303 unique transcripts were identified, with 1848 of these representing unknown sequences (showing no BLASTX identification). From the LEE-1 cell line, 3660 unique transcripts were identified, and unknown, unique sequences totaled 1333. Gene Ontology (GO) annotation showed that GO assignments for the two cell lines were in general similar. These results suggest that the procedures used to derive the cell lines led to isolation of cell types of the same general embryonic origin from both species. The LEE-1 transcripts included GO categories “envelope” and “oxidoreductase activity” but the SAE transcripts did not. GO analysis of SAE transcripts identified the category “anatomical structure formation” that was not present in LEE-1 cells. Increased organelle compartments may exist within LEE-1 cells compared to SAE cells, and the higher oxidoreductase activity in LEE-1 cells may indicate a role for these cells in responses associated with innate immunity or in steroidogenesis. These EST libraries from elasmobranch cell lines provide information for assembly of genomic sequences and are useful in revealing gene diversity, new genes and molecular markers, as well as in providing means for elucidation of full-length cDNAs and probes for gene array analyses. This is the first study of this type with members of the Chondrichthyes. PMID:20471924

  5. Sucrose metabolism gene families and their biological functions

    PubMed Central

    Jiang, Shu-Ye; Chi, Yun-Hua; Wang, Ji-Zhou; Zhou, Jun-Xia; Cheng, Yan-Song; Zhang, Bao-Lan; Ma, Ali; Vanitha, Jeevanandam; Ramachandran, Srinivasan

    2015-01-01

    Sucrose, as the main product of photosynthesis, plays crucial roles in plant development. Although studies on general metabolism pathway were well documented, less information is available on the genome-wide identification of these genes, their expansion and evolutionary history as well as their biological functions. We focused on four sucrose metabolism related gene families including sucrose synthase, sucrose phosphate synthase, sucrose phosphate phosphatase and UDP-glucose pyrophosphorylase. These gene families exhibited different expansion and evolutionary history as their host genomes experienced differentiated rates of the whole genome duplication, tandem and segmental duplication, or mobile element mediated gene gain and loss. They were evolutionarily conserved under purifying selection among species and expression divergence played important roles for gene survival after expansion. However, we have detected recent positive selection during intra-species divergence. Overexpression of 15 sorghum genes in Arabidopsis revealed their roles in biomass accumulation, flowering time control, seed germination and response to high salinity and sugar stresses. Our studies uncovered the molecular mechanisms of gene expansion and evolution and also provided new insight into the role of positive selection in intra-species divergence. Overexpression data revealed novel biological functions of these genes in flowering time control and seed germination under normal and stress conditions. PMID:26616172

  6. Sucrose metabolism gene families and their biological functions.

    PubMed

    Jiang, Shu-Ye; Chi, Yun-Hua; Wang, Ji-Zhou; Zhou, Jun-Xia; Cheng, Yan-Song; Zhang, Bao-Lan; Ma, Ali; Vanitha, Jeevanandam; Ramachandran, Srinivasan

    2015-01-01

    Sucrose, as the main product of photosynthesis, plays crucial roles in plant development. Although studies on general metabolism pathway were well documented, less information is available on the genome-wide identification of these genes, their expansion and evolutionary history as well as their biological functions. We focused on four sucrose metabolism related gene families including sucrose synthase, sucrose phosphate synthase, sucrose phosphate phosphatase and UDP-glucose pyrophosphorylase. These gene families exhibited different expansion and evolutionary history as their host genomes experienced differentiated rates of the whole genome duplication, tandem and segmental duplication, or mobile element mediated gene gain and loss. They were evolutionarily conserved under purifying selection among species and expression divergence played important roles for gene survival after expansion. However, we have detected recent positive selection during intra-species divergence. Overexpression of 15 sorghum genes in Arabidopsis revealed their roles in biomass accumulation, flowering time control, seed germination and response to high salinity and sugar stresses. Our studies uncovered the molecular mechanisms of gene expansion and evolution and also provided new insight into the role of positive selection in intra-species divergence. Overexpression data revealed novel biological functions of these genes in flowering time control and seed germination under normal and stress conditions. PMID:26616172

  7. Transferred interbacterial antagonism genes augment eukaryotic innate immune function.

    PubMed

    Chou, Seemay; Daugherty, Matthew D; Peterson, S Brook; Biboy, Jacob; Yang, Youyun; Jutras, Brandon L; Fritz-Laylin, Lillian K; Ferrin, Michael A; Harding, Brittany N; Jacobs-Wagner, Christine; Yang, X Frank; Vollmer, Waldemar; Malik, Harmit S; Mougous, Joseph D

    2015-02-01

    Horizontal gene transfer allows organisms to rapidly acquire adaptive traits. Although documented instances of horizontal gene transfer from bacteria to eukaryotes remain rare, bacteria represent a rich source of new functions potentially available for co-option. One benefit that genes of bacterial origin could provide to eukaryotes is the capacity to produce antibacterials, which have evolved in prokaryotes as the result of eons of interbacterial competition. The type VI secretion amidase effector (Tae) proteins are potent bacteriocidal enzymes that degrade the cell wall when delivered into competing bacterial cells by the type VI secretion system. Here we show that tae genes have been transferred to eukaryotes on at least six occasions, and that the resulting domesticated amidase effector (dae) genes have been preserved for hundreds of millions of years through purifying selection. We show that the dae genes acquired eukaryotic secretion signals, are expressed within recipient organisms, and encode active antibacterial toxins that possess substrate specificity matching extant Tae proteins of the same lineage. Finally, we show that a dae gene in the deer tick Ixodes scapularis limits proliferation of Borrelia burgdorferi, the aetiologic agent of Lyme disease. Our work demonstrates that a family of horizontally acquired toxins honed to mediate interbacterial antagonism confers previously undescribed antibacterial capacity to eukaryotes. We speculate that the selective pressure imposed by competition between bacteria has produced a reservoir of genes encoding diverse antimicrobial functions that are tailored for co-option by eukaryotic innate immune systems. PMID:25470067

  8. Annotation of the modular polyketide synthase and nonribosomal peptide synthetase gene clusters in the genome of Streptomyces tsukubaensis NRRL18488.

    PubMed

    Blazic, Marko; Starcevic, Antonio; Lisfi, Mohamed; Baranasic, Damir; Goranovic, Dusan; Fujs, Stefan; Kuscer, Enej; Kosec, Gregor; Petkovic, Hrvoje; Cullum, John; Hranueli, Daslav; Zucko, Jurica

    2012-12-01

    The high G+C content and large genome size make the sequencing and assembly of Streptomyces genomes more difficult than for other bacteria. Many pharmaceutically important natural products are synthesized by modular polyketide synthases (PKSs) and nonribosomal peptide synthetases (NRPSs). The analysis of such gene clusters is difficult if the genome sequence is not of the highest quality, because clusters can be distributed over several contigs, and sequencing errors can introduce apparent frameshifts into the large PKS and NRPS proteins. An additional problem is that the modular nature of the clusters results in the presence of imperfect repeats, which may cause assembly errors. The genome sequence of Streptomyces tsukubaensis NRRL18488 was scanned for potential PKS and NRPS modular clusters. A phylogenetic approach was used to identify multiple contigs belonging to the same cluster. Four PKS clusters and six NRPS clusters were identified. Contigs containing cluster sequences were analyzed in detail by using the ClustScan program, which suggested the order and orientation of the contigs. The sequencing of the appropriate PCR products confirmed the ordering and allowed the correction of apparent frameshifts resulting from sequencing errors. The product chemistry of such correctly assembled clusters could also be predicted. The analysis of one PKS cluster showed that it should produce a bafilomycin-like compound, and reverse transcription (RT)-PCR was used to show that the cluster was transcribed. PMID:22983969

  9. Annotation of the Modular Polyketide Synthase and Nonribosomal Peptide Synthetase Gene Clusters in the Genome of Streptomyces tsukubaensis NRRL18488

    PubMed Central

    Blaži?, Marko; Starcevic, Antonio; Lisfi, Mohamed; Baranasic, Damir; Goranovi?, Dušan; Fujs, Štefan; Kuš?er, Enej; Kosec, Gregor; Petkovi?, Hrvoje; Cullum, John; Hranueli, Daslav

    2012-01-01

    The high G+C content and large genome size make the sequencing and assembly of Streptomyces genomes more difficult than for other bacteria. Many pharmaceutically important natural products are synthesized by modular polyketide synthases (PKSs) and nonribosomal peptide synthetases (NRPSs). The analysis of such gene clusters is difficult if the genome sequence is not of the highest quality, because clusters can be distributed over several contigs, and sequencing errors can introduce apparent frameshifts into the large PKS and NRPS proteins. An additional problem is that the modular nature of the clusters results in the presence of imperfect repeats, which may cause assembly errors. The genome sequence of Streptomyces tsukubaensis NRRL18488 was scanned for potential PKS and NRPS modular clusters. A phylogenetic approach was used to identify multiple contigs belonging to the same cluster. Four PKS clusters and six NRPS clusters were identified. Contigs containing cluster sequences were analyzed in detail by using the ClustScan program, which suggested the order and orientation of the contigs. The sequencing of the appropriate PCR products confirmed the ordering and allowed the correction of apparent frameshifts resulting from sequencing errors. The product chemistry of such correctly assembled clusters could also be predicted. The analysis of one PKS cluster showed that it should produce a bafilomycin-like compound, and reverse transcription (RT)-PCR was used to show that the cluster was transcribed. PMID:22983969

  10. Video annotation tools 

    E-print Network

    Chaudhary, Ahmed

    2008-10-10

    general purpose annotation toolkits that can be used to create domain specific applications. A video annotation toolkit along with toolkits for searching, retrieving, analyzing and presenting videos can help achieve the broader goal of creating integrated...

  11. Functional gene diversity of oolitic sands from Great Bahama Bank.

    PubMed

    Diaz, M R; Van Norstrand, J D; Eberli, G P; Piggot, A M; Zhou, J; Klaus, J S

    2014-05-01

    Despite the importance of oolitic depositional systems as indicators of climate and reservoirs of inorganic C, little is known about the microbial functional diversity, structure, composition, and potential metabolic processes leading to precipitation of carbonates. To fill this gap, we assess the metabolic gene carriage and extracellular polymeric substance (EPS) development in microbial communities associated with oolitic carbonate sediments from the Bahamas Archipelago. Oolitic sediments ranging from high-energy 'active' to lower energy 'non-active' and 'microbially stabilized' environments were examined as they represent contrasting depositional settings, mostly influenced by tidal flows and wave-generated currents. Functional gene analysis, which employed a microarray-based gene technology, detected a total of 12,432 of 95,847 distinct gene probes, including a large number of metabolic processes previously linked to mineral precipitation. Among these, gene-encoding enzymes for denitrification, sulfate reduction, ammonification, and oxygenic/anoxygenic photosynthesis were abundant. In addition, a broad diversity of genes was related to organic carbon degradation, and N2 fixation implying these communities has metabolic plasticity that enables survival under oligotrophic conditions. Differences in functional genes were detected among the environments, with higher diversity associated with non-active and microbially stabilized environments in comparison with the active environment. EPS showed a gradient increase from active to microbially stabilized communities, and when combined with functional gene analysis, which revealed genes encoding EPS-degrading enzymes (chitinases, glucoamylase, amylases), supports a putative role of EPS-mediated microbial calcium carbonate precipitation. We propose that carbonate precipitation in marine oolitic biofilms is spatially and temporally controlled by a complex consortium of microbes with diverse physiologies, including photosynthesizers, heterotrophs, denitrifiers, sulfate reducers, and ammonifiers. PMID:24612324

  12. Assembly, Annotation, and Integration of UNIGENE Clusters into the Human Genome Draft

    PubMed Central

    Zhuo, Degen; Zhao, Wei D.; Wright, Fred A.; Yang, Hee-Yung; Wang, Jian-Ping; Sears, Russell; Baer, Troy; Kwon, Do-Hun; Gordon, David; Gibbs, Solomon; Dai, Dean; Yang, Qing; Spitzner, Joe; Krahe, Ralf; Stredney, Don; Stutz, Al; Yuan, Bo

    2001-01-01

    The recent release of the first draft of the human genome provides an unprecedented opportunity to integrate human genes and their functions in a complete positional context. However, at least three significant technical hurdles remain: first, to assemble a complete and nonredundant human transcript index; second, to accurately place the individual transcript indices on the human genome; and third, to functionally annotate all human genes. Here, we report the extension of the UNIGENE database through the assembly of its sequence clusters into nonredundant sequence contigs. Each resulting consensus was aligned to the human genome draft. A unique location for each transcript within the human genome was determined by the integration of the restriction fingerprint, assembled genomic contig, and radiation hybrid (RH) maps. A total of 59,500 UNIGENE clusters were mapped on the basis of at least three independent criteria as compared with the 30,000 human genes/ESTs currently mapped in Genemap'99. Finally, the extension of the human transcript consensus in this study enabled a greater number of putative functional assignments than the 11,000 annotated entries in UNIGENE. This study reports a draft physical map with annotations for a majority of the human transcripts, called the Human Index of Nonredundant Transcripts (HINT). Such information can be immediately applied to the discovery of new genes and the identification of candidate genes for positional cloning. PMID:11337484

  13. XomAnnotate: Analysis of Heterogeneous and Complex Exome- A Step towards Translational Medicine

    PubMed Central

    Talukder, Asoke K.; Ravishankar, Shashidhar; Sasmal, Krittika; Gandham, Santhosh; Prabhukumar, Jyothsna; Achutharao, Prahalad H.; Barh, Debmalya; Blasi, Francesco

    2015-01-01

    In translational cancer medicine, implicated pathways and the relevant master genes are of focus. Exome's specificity, processing-time, and cost advantage makes it a compelling tool for this purpose. However, analysis of exome lacks reliable combinatory analysis tools and techniques. In this paper we present XomAnnotate – a meta- and functional-analysis software for exome. We compared UnifiedGenotyper, Freebayes, Delly, and Lumpy algorithms that were designed for whole-genome and combined their strengths in XomAnnotate for exome data through meta-analysis to identify comprehensive mutation profile (SNPs/SNVs, short inserts/deletes, and SVs) of patients. The mutation profile is annotated followed by functional analysis through pathway enrichment and network analysis to identify most critical genes and pathways implicated in the disease genesis. The efficacy of the software is verified through MDS and clustering and tested with available 11 familial non-BRCA1/BRCA2 breast cancer exome data. The results showed that the most significantly affected pathways across all samples are cell communication and antigen processing and presentation. ESCO1, HYAL1, RAF1 and PRKCA emerged as the key genes. Network analysis further showed the purine and propanotate metabolism pathways along with RAF1 and PRKCA genes to be master regulators in these patients. Therefore, XomAnnotate is able to use exome data to identify entire mutation landscape, pathways, and the master genes accurately with wide concordance from earlier microarray and whole-genome studies -- making it a suitable biomedical software for using exome in next-generation translational medicine. Availability http://www.iomics.in/research/XomAnnotate PMID:25905921

  14. EST Express: PHP/MySQL based automated annotation of ESTs from expression libraries

    PubMed Central

    Smith, Robin P; Buchser, William J; Lemmon, Marcus B; Pardinas, Jose R; Bixby, John L; Lemmon, Vance P

    2008-01-01

    Background Several biological techniques result in the acquisition of functional sets of cDNAs that must be sequenced and analyzed. The emergence of redundant databases such as UniGene and centralized annotation engines such as Entrez Gene has allowed the development of software that can analyze a great number of sequences in a matter of seconds. Results We have developed "EST Express", a suite of analytical tools that identify and annotate ESTs originating from specific mRNA populations. The software consists of a user-friendly GUI powered by PHP and MySQL that allows for online collaboration between researchers and continuity with UniGene, Entrez Gene and RefSeq. Two key features of the software include a novel, simplified Entrez Gene parser and tools to manage cDNA library sequencing projects. We have tested the software on a large data set (2,016 samples) produced by subtractive hybridization. Conclusion EST Express is an open-source, cross-platform web server application that imports sequences from cDNA libraries, such as those generated through subtractive hybridization or yeast two-hybrid screens. It then provides several layers of annotation based on Entrez Gene and RefSeq to allow the user to highlight useful genes and manage cDNA library projects. PMID:18402700

  15. Gene Function Prediction from Functional Association Networks Using Kernel Partial Least Squares Regression

    PubMed Central

    Lehtinen, Sonja; Lees, Jon; Bähler, Jürg; Shawe-Taylor, John; Orengo, Christine

    2015-01-01

    With the growing availability of large-scale biological datasets, automated methods of extracting functionally meaningful information from this data are becoming increasingly important. Data relating to functional association between genes or proteins, such as co-expression or functional association, is often represented in terms of gene or protein networks. Several methods of predicting gene function from these networks have been proposed. However, evaluating the relative performance of these algorithms may not be trivial: concerns have been raised over biases in different benchmarking methods and datasets, particularly relating to non-independence of functional association data and test data. In this paper we propose a new network-based gene function prediction algorithm using a commute-time kernel and partial least squares regression (Compass). We compare Compass to GeneMANIA, a leading network-based prediction algorithm, using a number of different benchmarks, and find that Compass outperforms GeneMANIA on these benchmarks. We also explicitly explore problems associated with the non-independence of functional association data and test data. We find that a benchmark based on the Gene Ontology database, which, directly or indirectly, incorporates information from other databases, may considerably overestimate the performance of algorithms exploiting functional association data for prediction. PMID:26288239

  16. Correction of the Caulobacter crescentus NA1000 Genome Annotation

    PubMed Central

    Ely, Bert; Scott, LaTia Etheredge

    2014-01-01

    Bacterial genome annotations are accumulating rapidly in the GenBank database and the use of automated annotation technologies to create these annotations has become the norm. However, these automated methods commonly result in a small, but significant percentage of genome annotation errors. To improve accuracy and reliability, we analyzed the Caulobacter crescentus NA1000 genome utilizing computer programs Artemis and MICheck to manually examine the third codon position GC content, alignment to a third codon position GC frame plot peak, and matches in the GenBank database. We identified 11 new genes, modified the start site of 113 genes, and changed the reading frame of 38 genes that had been incorrectly annotated. Furthermore, our manual method of identifying protein-coding genes allowed us to remove 112 non-coding regions that had been designated as coding regions. The improved NA1000 genome annotation resulted in a reduction in the use of rare codons since noncoding regions with atypical codon usage were removed from the annotation and 49 new coding regions were added to the annotation. Thus, a more accurate codon usage table was generated as well. These results demonstrate that a comparison of the location of peaks third codon position GC content to the location of protein coding regions could be used to verify the annotation of any genome that has a GC content that is greater than 60%. PMID:24621776

  17. Functional genes of non-model arthropods

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Technology and bioinformatics facilitate studies of non-model organisms, including pest insects and insect biological control agents. A cDNA library prepared from a laboratory-reared colony of Lygus lineolaris male nymphs identified sequences that appeared to have known functions or close homologues...

  18. The systematic functional characterisation of Xq28 genes prioritises candidate disease genes

    PubMed Central

    Kolb-Kokocinski, Anja; Mehrle, Alexander; Bechtel, Stephanie; Simpson, Jeremy C; Kioschis, Petra; Wiemann, Stefan; Wellenreuther, Ruth; Poustka, Annemarie

    2006-01-01

    Background Well known for its gene density and the large number of mapped diseases, the human sub-chromosomal region Xq28 has long been a focus of genome research. Over 40 of approximately 300 X-linked diseases map to this region, and systematic mapping, transcript identification, and mutation analysis has led to the identification of causative genes for 26 of these diseases, leaving another 17 diseases mapped to Xq28, where the causative gene is still unknown. To expedite disease gene identification, we have initiated the functional characterisation of all known Xq28 genes. Results By using a systematic approach, we describe the Xq28 genes by RNA in situ hybridisation and Northern blotting of the mouse orthologs, as well as subcellular localisation and data mining of the human genes. We have developed a relational web-accessible database with comprehensive query options integrating all experimental data. Using this database, we matched gene expression patterns with affected tissues for 16 of the 17 remaining Xq28 linked diseases, where the causative gene is unknown. Conclusion By using this systematic approach, we have prioritised genes in linkage regions of Xq28-mapped diseases to an amenable number for mutational screens. Our database can be queried by any researcher performing highly specified searches including diseases not listed in OMIM or diseases that might be linked to Xq28 in the future. PMID:16503986

  19. Norrie disease gene: Characterization of deletions and possible function

    SciTech Connect

    Chen, Z.Y.; Battinelli, E.M.; Hendriks, R.W.; Craig, I.W.; Powell, J.F.; Middleton-Price, H.; Sims, K.B.; Breakefield, X.O.

    1993-05-01

    Positional cloning experiments have resulted recently in the isolation of a candidate gene for Norrie disease (pseudoglioma; NDP), a severe X-linked neuro-developmental disorder. Here the authors report the isolation and analysis of human genomic DNA clones encompassing the NDP gene. The gene spans 28 kb and consists of 3 exons, the first of which is entirely contained within the 5{prime} untranslated region. Detailed analysis of genomic deletions in Norrie patients shows that they are heterogeneous, both in size and in position. By PCR analysis, they found that expression of the NDP gene was not confined to the eye or to the brain. An extensive DNA and protein sequence comparison between the human NDP gene and related genes from the database revealed homology with cysteine-rich protein-binding domains of immediate--early genes implicated in the regulation of cell proliferation. They propose that NDP is a molecule related in function to these genes and may be involved in a pathway that regulates neural cell differentiation and proliferation. 19 refs., 2 figs.

  20. Loss of gene function and evolution of human phenotypes.

    PubMed

    Oh, Hye Ji; Choi, Dongjin; Goh, Chul Jun; Hahn, Yoonsoo

    2015-07-01

    Humans have acquired many distinct evolutionary traits after the human-chimpanzee divergence. These phenotypes have resulted from genetic changes that occurred in the human genome and were retained by natural selection. Comparative primate genome analyses reveal that loss-of-function mutations are common in the human genome. Some of these gene inactivation events were revealed to be associated with the emergence of advantageous phenotypes and were therefore positively selected and fixed in modern humans (the "less-ismore" hypothesis). Representative cases of human gene inactivation and their functional implications are presented in this review. Functional studies of additional inactive genes will provide insight into the molecular mechanisms underlying acquisition of various human-specific traits. PMID:25887751

  1. Loss of gene function and evolution of human phenotypes

    PubMed Central

    Oh, Hye Ji; Choi, Dongjin; Goh, Chul Jun; Hahn, Yoonsoo

    2015-01-01

    Humans have acquired many distinct evolutionary traits after the human-chimpanzee divergence. These phenotypes have resulted from genetic changes that occurred in the human genome and were retained by natural selection. Comparative primate genome analyses reveal that loss-of-function mutations are common in the human genome. Some of these gene inactivation events were revealed to be associated with the emergence of advantageous phenotypes and were therefore positively selected and fixed in modern humans (the “less-ismore” hypothesis). Representative cases of human gene inactivation and their functional implications are presented in this review. Functional studies of additional inactive genes will provide insight into the molecular mechanisms underlying acquisition of various human-specific traits. [BMB Reports 2015; 48(7): 373-379] PMID:25887751

  2. Core Promoter Functions in the Regulation of Gene Expression of Drosophila Dorsal Target Genes*

    PubMed Central

    Zehavi, Yonathan; Kuznetsov, Olga; Ovadia-Shochat, Avital; Juven-Gershon, Tamar

    2014-01-01

    Developmental processes are highly dependent on transcriptional regulation by RNA polymerase II. The RNA polymerase II core promoter is the ultimate target of a multitude of transcription factors that control transcription initiation. Core promoters consist of core promoter motifs, e.g. the initiator, TATA box, and the downstream core promoter element (DPE), which confer specific properties to the core promoter. Here, we explored the importance of core promoter functions in the dorsal-ventral developmental gene regulatory network. This network includes multiple genes that are activated by different nuclear concentrations of Dorsal, an NF?B homolog transcription factor, along the dorsal-ventral axis. We show that over two-thirds of Dorsal target genes contain DPE sequence motifs, which is significantly higher than the proportion of DPE-containing promoters in Drosophila genes. We demonstrate that multiple Dorsal target genes are evolutionarily conserved and functionally dependent on the DPE. Furthermore, we have analyzed the activation of key Dorsal target genes by Dorsal, as well as by another Rel family transcription factor, Relish, and the dependence of their activation on the DPE motif. Using hybrid enhancer-promoter constructs in Drosophila cells and embryo extracts, we have demonstrated that the core promoter composition is an important determinant of transcriptional activity of Dorsal target genes. Taken together, our results provide evidence for the importance of core promoter composition in the regulation of Dorsal target genes. PMID:24634215

  3. Screening of Target Genes and Regulatory Function of miRNAs as Prognostic Indicators for Prostate Cancer

    PubMed Central

    Xiaoli, Zhang; Yawei, Wei; Lianna, Liu; Haifeng, Li; Hui, Zhang

    2015-01-01

    Background MicroRNAs expression profiling of prostate cancer is becoming increasingly used due to its usefulness in diagnosis, staging, prognosis, and response to treatment. The aim of this study was to screen differentially expressed miRNAs in prostate cancer and analyze the functions and signal pathways of their target genes. Material/Methods High-throughput data of miRNAs were downloaded from The Cancer Genome Atlas (TCGA) database. A total of 551 samples (52 normal and 499 prostate cancer cases) and 1046 miRNAs expression values were selected for further analysis. Differentially expressed miRNAs between normal and prostate cancer tissues were identified using SAMR. StarBase and TargetScan software were used to predict the miRNAs’ target group and target genes, respectively. GO functional and KEGG pathway analysis was conducted on up/down-regulated expressed miRNA with DAVID. Finally, survival analysis was performed to evaluate the association of differently expressed miRNAs signature and overall survival of prostate cancer patients. Results A total of 162 miRNAs were differentially expressed between normal and prostate cancer samples, including 128 up-regulated and 38 down-regulated ones; hsa-mir-153-2, hsa-mir-92a-1, and hsa-mir-182 (up-regulated); and hsa-mir-29a, hsa-mir-10a, and hsa-mir-221 (down-regulated) were identified as good biomarkers. In GO and KEGG analysis, target genes of down-regulated miRNAs were significantly enriched in positive ion combination and JAK-STAT pathway annotation, respectively; the ones with up-regulated miRNAs were significantly enriched in the function of plasma membrane and MARK signaling pathway annotation, respectively. Patients were categorized into low- or high-score groups according to their risk scores from each miRNA. The patients in the low-score group had better overall survival compared with those in high-score group. Conclusions The 6 differentially expressed miRNAs and their target genes were used to define important molecular targets that could serve as prognostic and predictive markers in the treatment of prostate cancer. Further research on the function of the target genes in the MAPK signal pathway could provide references for treatment of prostate cancer. PMID:26628405

  4. Screening of Target Genes and Regulatory Function of miRNAs as Prognostic Indicators for Prostate Cancer.

    PubMed

    Xiaoli, Zhang; Yawei, Wei; Lianna, Liu; Haifeng, Li; Hui, Zhang

    2015-01-01

    BACKGROUND MicroRNAs expression profiling of prostate cancer is becoming increasingly used due to its usefulness in diagnosis, staging, prognosis, and response to treatment. The aim of this study was to screen differentially expressed miRNAs in prostate cancer and analyze the functions and signal pathways of their target genes. MATERIAL AND METHODS High-throughput data of miRNAs were downloaded from The Cancer Genome Atlas (TCGA) database. A total of 551 samples (52 normal and 499 prostate cancer cases) and 1046 miRNAs expression values were selected for further analysis. Differentially expressed miRNAs between normal and prostate cancer tissues were identified using SAMR. StarBase and TargetSc