Sample records for functional gene annotation

  1. A Resource of Quantitative Functional Annotation for Homo sapiens Genes

    PubMed Central

    Taşan, Murat; Drabkin, Harold J.; Beaver, John E.; Chua, Hon Nian; Dunham, Julie; Tian, Weidong; Blake, Judith A.; Roth, Frederick P.

    2012-01-01

    The body of human genomic and proteomic evidence continues to grow at ever-increasing rates, while annotation efforts struggle to keep pace. A surprisingly small fraction of human genes have clear, documented associations with specific functions, and new functions continue to be found for characterized genes. Here we assembled an integrated collection of diverse genomic and proteomic data for 21,341 human genes and make quantitative associations of each to 4333 Gene Ontology terms. We combined guilt-by-profiling and guilt-by-association approaches to exploit features unique to the data types. Performance was evaluated by cross-validation, prospective validation, and by manual evaluation with the biological literature. Functional-linkage networks were also constructed, and their utility was demonstrated by identifying candidate genes related to a glioma FLN using a seed network from genome-wide association studies. Our annotations are presented—alongside existing validated annotations—in a publicly accessible and searchable web interface. PMID:22384401

  2. Measuring semantic similarities by combining gene ontology annotations and gene co-function networks

    DOE PAGES

    Peng, Jiajie; Uygun, Sahra; Kim, Taehyong; Wang, Yadong; Rhee, Seung Y; Chen, Jin

    2015-12-01

    Background: Gene Ontology (GO) has been used widely to study functional relationships between genes. The current semantic similarity measures rely only on GO annotations and GO structure. This limits the power of GO-based similarity because of the limited proportion of genes that are annotated to GO in most organisms. Results: We introduce a novel approach called NETSIM (network-based similarity measure) that incorporates information from gene co-function networks in addition to using the GO structure and annotations. Using metabolic reaction maps of yeast, Arabidopsis, and human, we demonstrate that NETSIM can improve the accuracy of GO term similarities. We also demonstrate that NETSIM works well even for genomes with sparser gene annotation data. We applied NETSIM on large Arabidopsis gene families such as cytochrome P450 monooxygenases to group the members functionally and show that this grouping could facilitate functional characterization of genes in these families. Conclusions: Using NETSIM as an example, we demonstrated that the performance of a semantic similarity measure could be significantly improved after incorporating genome-specific information. NETSIM incorporates both GO annotations and gene co-function network data as a priori knowledge in the model. Therefore, functional similarities of GO terms that are not explicitly encoded in GO but are relevant in a taxon-specific manner become measurable when GO annotations are limited.
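    For context, the kind of measure that relies only on GO annotations and GO structure can be sketched as follows. This is Resnik similarity (the information content of the most informative common ancestor), offered as one well-known example of that class; it is not NETSIM and does not use any co-function network. The toy ontology, term names and gene identifiers are hypothetical.

```python
from math import log


def ancestors(term, parents):
    """Return the term together with all of its ancestors in the ontology."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(annotations, parents):
    """IC(t) = -log(fraction of genes annotated to t or any descendant of t)."""
    genes_per_term = {}
    for gene, terms in annotations.items():
        for term in terms:
            # Propagate each annotation up to every ancestor term.
            for anc in ancestors(term, parents):
                genes_per_term.setdefault(anc, set()).add(gene)
    total = len(annotations)
    return {t: -log(len(g) / total) for t, g in genes_per_term.items()}


def resnik(term_a, term_b, parents, ic):
    """Information content of the most informative common ancestor."""
    common = ancestors(term_a, parents) & ancestors(term_b, parents)
    return max((ic.get(t, 0.0) for t in common), default=0.0)


# Hypothetical toy ontology (child -> list of parents) and annotations.
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "transport": ["root"],
    "lipid metabolism": ["metabolism"],
    "sugar metabolism": ["metabolism"],
}
ANNOTATIONS = {
    "gene1": {"lipid metabolism"},
    "gene2": {"sugar metabolism"},
    "gene3": {"transport"},
    "gene4": {"transport"},
}
IC = information_content(ANNOTATIONS, PARENTS)

if __name__ == "__main__":
    print(resnik("lipid metabolism", "sugar metabolism", PARENTS, IC))
    print(resnik("lipid metabolism", "transport", PARENTS, IC))
```

    Terms that share only the root score zero here, and a term that no gene is annotated to contributes nothing. That sparsity is the limitation the abstract says network information is meant to address.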

  3. Functional annotation of human cytomegalovirus gene products: an update

    PubMed Central

    Van Damme, Ellen; Van Loock, Marnix

    2014-01-01

    Human cytomegalovirus is an opportunistic double-stranded DNA virus with one of the largest viral genomes known. The 235 kb genome is divided into a unique long (UL) and a unique short (US) region which are flanked by terminal and internal repeats. The expression of HCMV genes is highly complex and involves the production of protein coding transcripts, polyadenylated long non-coding RNAs, polyadenylated anti-sense transcripts and a variety of non-polyadenylated RNAs such as microRNAs. Although the function of many of these transcripts is unknown, they are suggested to play a direct or regulatory role in the delicately orchestrated processes that ensure HCMV replication and life-long persistence. This review focuses on annotating the complete viral genome based on three sources of information. First, previous reviews were used as a template for the functional keywords to ensure continuity; second, the UniProt database was used to further enrich the functional database; and finally, the literature was manually curated for novel functions of HCMV gene products. Novel discoveries were discussed in light of the viral life cycle. This functional annotation highlights still poorly understood regions of the genome but more importantly it can give insight into functional clusters and/or may be helpful in the analysis of future transcriptomics and proteomics studies. PMID:24904534

  4. Combining evidence, biomedical literature and statistical dependence: new insights for functional annotation of gene sets

    Microsoft Academic Search

    Marc Aubry; Annabelle Monnier; Celine Chicault; Marie De Tayrac; Marie-dominique Galibert; Anita Burgun; Jean Mosser

    2006-01-01

    BACKGROUND: Large-scale genomic studies based on transcriptome technologies provide clusters of genes that need to be functionally annotated. The Gene Ontology (GO) implements a controlled vocabulary organised into three hierarchies: cellular components, molecular functions and biological processes. This terminology allows a coherent and consistent description of the knowledge about gene functions. The GO terms related to genes come primarily from

  5. Combining evidence, biomedical literature and statistical dependence: new insights for functional annotation of gene sets

    PubMed Central

    Aubry, Marc; Monnier, Annabelle; Chicault, Celine; de Tayrac, Marie; Galibert, Marie-Dominique; Burgun, Anita; Mosser, Jean

    2006-01-01

    Background Large-scale genomic studies based on transcriptome technologies provide clusters of genes that need to be functionally annotated. The Gene Ontology (GO) implements a controlled vocabulary organised into three hierarchies: cellular components, molecular functions and biological processes. This terminology allows a coherent and consistent description of the knowledge about gene functions. The GO terms related to genes come primarily from semi-automatic annotations made by trained biologists (annotation based on evidence) or text-mining of the published scientific literature (literature profiling). Results We report an original functional annotation method based on a combination of evidence and literature that overcomes the weaknesses and the limitations of each approach. It relies on the Gene Ontology Annotation database (GOA Human) and the PubGene biomedical literature index. We support these annotations with statistically associated GO terms and retrieve associative relations across the three GO hierarchies to emphasise the major pathways involved by a gene cluster. Both annotation methods and associative relations were quantitatively evaluated with a reference set of 7397 genes and a multi-cluster study of 14 clusters. We also validated the biological appropriateness of our hybrid method with the annotation of a single gene (cdc2) and that of a down-regulated cluster of 37 genes identified by a transcriptome study of an in vitro enterocyte differentiation model (CaCo-2 cells). Conclusion The combination of both approaches is more informative than either separate approach: literature mining can enrich an annotation based only on evidence. Text-mining of the literature can also find valuable associated MEDLINE references that confirm the relevance of the annotation. Eventually, GO terms networks can be built with associative relations in order to highlight cooperative and competitive pathways and their connected molecular functions. PMID:16674810

  6. Combining genetic diversity, informatics and metabolomics to facilitate annotation of plant gene function

    Microsoft Academic Search

    Takayuki Tohge; Alisdair R Fernie

    2010-01-01

    Given the ever-increasing number of species for which full-genome sequencing has been realized, there is a rising burden for gene functional annotation. In this study, we provide a detailed protocol that combines co-response gene analysis (using target genes of known function to allow the identification of nonannotated genes likely to be involved in a certain metabolic process) with the identification

  7. Genome, Functional Gene Annotation, and Nuclear Transformation of the Heterokont Oleaginous Alga

    E-print Network

    Yandell, Mark

    [...] University, East Lansing, Michigan, United States of America. Abstract: Unicellular marine algae have promise for providing sustainable and scalable biofuel feedstocks, although no single species has emerged as a preferred

  8. Gene Expression and Functional Annotation of the Human and Mouse Choroid Plexus Epithelium

    PubMed Central

    Janssen, Sarah F.; van der Spek, Sophie J. F.; ten Brink, Jacoline B.; Essing, Anke H. W.; Gorgels, Theo G. M. F.; van der Spek, Peter J.; Jansonius, Nomdo M.; Bergen, Arthur A. B.

    2013-01-01

    Background The choroid plexus epithelium (CPE) is a lobed neuro-epithelial structure that forms the outer blood-brain barrier. The CPE protrudes into the brain ventricles and produces the cerebrospinal fluid (CSF), which is crucial for brain homeostasis. Malfunction of the CPE is possibly implicated in disorders like Alzheimer disease, hydrocephalus or glaucoma. To study human genetic diseases and potential new therapies, mouse models are widely used. This requires a detailed knowledge of similarities and differences in gene expression and functional annotation between the species. The aim of this study is to analyze and compare gene expression and functional annotation of healthy human and mouse CPE. Methods We performed 44k Agilent microarray hybridizations with RNA derived from laser dissected healthy human and mouse CPE cells. We functionally annotated and compared the gene expression data of human and mouse CPE using the knowledge database Ingenuity. We searched for common and species-specific gene expression patterns and functions between human and mouse CPE. We also made a comparison with previously published CPE human and mouse gene expression data. Results Overall, the human and mouse CPE transcriptomes are very similar. Their major functionalities included epithelial junctions, transport, energy production, neuro-endocrine signaling, as well as immunological, neurological and hematological functions and disorders. The mouse CPE presented two additional functions not found in the human CPE: carbohydrate metabolism and a more extensive list of (neural) developmental functions. We found three genes specifically expressed in the mouse CPE compared to human CPE (ACE, PON1 and TRIM3) and no genes specifically expressed in the human CPE compared to mouse CPE. Conclusion Human and mouse CPE transcriptomes are very similar, and display many common functionalities. Nonetheless, we also identified a few genes and pathways which suggest that the CPE of mouse and man differ with respect to transport and metabolic functions. PMID:24391755

  9. Genome, Functional Gene Annotation, and Nuclear Transformation of the Heterokont Oleaginous Alga Nannochloropsis oceanica CCMP1779

    PubMed Central

    Tsai, Chia-Hong; Bullard, Blair; Cornish, Adam J.; Harvey, Christopher; Reca, Ida-Barbara; Thornburg, Chelsea; Achawanantakun, Rujira; Buehl, Christopher J.; Campbell, Michael S.; Cavalier, David; Childs, Kevin L.; Clark, Teresa J.; Deshpande, Rahul; Erickson, Erika; Armenia Ferguson, Ann; Handee, Witawas; Kong, Que; Li, Xiaobo; Liu, Bensheng; Lundback, Steven; Peng, Cheng; Roston, Rebecca L.; Sanjaya; Simpson, Jeffrey P.; TerBush, Allan; Warakanont, Jaruswan; Zäuner, Simone; Farre, Eva M.; Hegg, Eric L.; Jiang, Ning; Kuo, Min-Hao; Lu, Yan; Niyogi, Krishna K.; Ohlrogge, John; Osteryoung, Katherine W.; Shachar-Hill, Yair; Sears, Barbara B.; Sun, Yanni; Takahashi, Hideki; Yandell, Mark; Shiu, Shin-Han; Benning, Christoph

    2012-01-01

    Unicellular marine algae have promise for providing sustainable and scalable biofuel feedstocks, although no single species has emerged as a preferred organism. Moreover, adequate molecular and genetic resources prerequisite for the rational engineering of marine algal feedstocks are lacking for most candidate species. Heterokonts of the genus Nannochloropsis naturally have high cellular oil content and are already in use for industrial production of high-value lipid products. First success in applying reverse genetics by targeted gene replacement makes Nannochloropsis oceanica an attractive model to investigate the cell and molecular biology and biochemistry of this fascinating organism group. Here we present the assembly of the 28.7 Mb genome of N. oceanica CCMP1779. RNA sequencing data from nitrogen-replete and nitrogen-depleted growth conditions support a total of 11,973 genes, of which in addition to automatic annotation some were manually inspected to predict the biochemical repertoire for this organism. Among others, more than 100 genes putatively related to lipid metabolism, 114 predicted transcription factors, and 109 transcriptional regulators were annotated. Comparison of the N. oceanica CCMP1779 gene repertoire with the recently published N. gaditana genome identified 2,649 genes likely specific to N. oceanica CCMP1779. Many of these N. oceanica–specific genes have putative orthologs in other species or are supported by transcriptional evidence. However, because similarity-based annotations are limited, functions of most of these species-specific genes remain unknown. Aside from the genome sequence and its analysis, protocols for the transformation of N. oceanica CCMP1779 are provided. 
The availability of genomic and transcriptomic data for Nannochloropsis oceanica CCMP1779, along with efficient transformation protocols, provides a blueprint for future detailed gene functional analysis and genetic engineering of Nannochloropsis species by a growing academic community focused on this genus. PMID:23166516

  10. Computational algorithms to predict Gene Ontology annotations

    PubMed Central

    2015-01-01

    Background Gene function annotations, which are associations between a gene and a term of a controlled vocabulary describing gene functional features, are of paramount importance in modern biology. Datasets of these annotations, such as the ones provided by the Gene Ontology Consortium, are used to design novel biological experiments and interpret their results. Despite their importance, these sources of information have some known issues. They are incomplete, since biological knowledge is far from definitive and evolves rapidly, and some erroneous annotations may be present. Since the curation of novel annotations is costly in both money and time, computational tools that can reliably predict likely annotations, and thus speed up the discovery of new gene annotations, are very useful. Methods We used a set of computational algorithms and weighting schemes to infer novel gene annotations from a set of known ones. We used the latent semantic analysis approach, implementing two popular algorithms (Latent Semantic Indexing and Probabilistic Latent Semantic Analysis), and propose a novel method, the Semantic IMproved Latent Semantic Analysis, which adds a clustering step on the set of considered genes. Furthermore, we propose improving these algorithms by weighting the annotations in the input set. Results We tested our methods and their weighted variants on the Gene Ontology annotation sets of the genes of three model organisms (Bos taurus, Danio rerio and Drosophila melanogaster). The methods proved able to predict novel gene annotations, and the weighting procedures led to a valuable improvement, although the results vary with the size of the input annotation set and the algorithm considered. Conclusions Of the three methods considered, the Semantic IMproved Latent Semantic Analysis provides the best results. In particular, when coupled with a proper weighting policy, it is able to predict a significant number of novel annotations, proving to be a helpful tool for supporting scientists in the curation of gene functional annotations. PMID:25916950

  11. Insyght: navigating amongst abundant homologues, syntenies and gene functional annotations in bacteria, it's that symbol!

    PubMed

    Lacroix, Thomas; Loux, Valentin; Gendrault, Annie; Hoebeke, Mark; Gibrat, Jean-François

    2014-12-01

    High-throughput techniques have considerably increased the potential of comparative genomics whilst simultaneously posing many new challenges. One of those challenges involves efficiently mining the large amount of data produced and exploring the landscape of both conserved and idiosyncratic genomic regions across multiple genomes. Domains of application of these analyses are diverse: identification of evolutionary events, inference of gene functions, detection of niche-specific genes or phylogenetic profiling. Insyght is a comparative genomic visualization tool that combines three complementary displays: (i) a table for thoroughly browsing amongst homologues, (ii) a comparator of orthologue functional annotations and (iii) a genomic organization view designed to improve the legibility of rearrangements and distinctive loci. The latter display combines symbolic and proportional graphical paradigms. Synchronized navigation across multiple species and interoperability between the views are core features of Insyght. A gene filter mechanism is provided that helps the user to build a biologically relevant gene set according to multiple criteria such as presence/absence of homologues and/or various annotations. We illustrate the use of Insyght with scenarios. Currently, only Bacteria and Archaea are supported. A public instance is available at http://genome.jouy.inra.fr/Insyght. The tool is freely downloadable for private data set analysis. PMID:25249626

  12. Insyght: navigating amongst abundant homologues, syntenies and gene functional annotations in bacteria, it's that symbol!

    PubMed Central

    Lacroix, Thomas; Loux, Valentin; Gendrault, Annie; Hoebeke, Mark; Gibrat, Jean-François

    2014-01-01

    High-throughput techniques have considerably increased the potential of comparative genomics whilst simultaneously posing many new challenges. One of those challenges involves efficiently mining the large amount of data produced and exploring the landscape of both conserved and idiosyncratic genomic regions across multiple genomes. Domains of application of these analyses are diverse: identification of evolutionary events, inference of gene functions, detection of niche-specific genes or phylogenetic profiling. Insyght is a comparative genomic visualization tool that combines three complementary displays: (i) a table for thoroughly browsing amongst homologues, (ii) a comparator of orthologue functional annotations and (iii) a genomic organization view designed to improve the legibility of rearrangements and distinctive loci. The latter display combines symbolic and proportional graphical paradigms. Synchronized navigation across multiple species and interoperability between the views are core features of Insyght. A gene filter mechanism is provided that helps the user to build a biologically relevant gene set according to multiple criteria such as presence/absence of homologues and/or various annotations. We illustrate the use of Insyght with scenarios. Currently, only Bacteria and Archaea are supported. A public instance is available at http://genome.jouy.inra.fr/Insyght. The tool is freely downloadable for private data set analysis. PMID:25249626

  13. Discovering gene functional relationships using FAUN (Feature Annotation Using Nonnegative matrix factorization)

    PubMed Central

    2010-01-01

    Background Searching the enormous amount of information available in biomedical literature to extract novel functional relationships among genes remains a challenge in the field of bioinformatics. While numerous (software) tools have been developed to extract and identify gene relationships from biological databases, few effectively deal with extracting new (or implied) gene relationships, a process which is useful in interpretation of discovery-oriented genome-wide experiments. Results In this study, we develop a Web-based bioinformatics software environment called FAUN or Feature Annotation Using Nonnegative matrix factorization (NMF) to facilitate both the discovery and classification of functional relationships among genes. Both the computational complexity and parameterization of NMF for processing gene sets are discussed. FAUN is tested on three manually constructed gene document collections. Its utility and performance as a knowledge discovery tool are demonstrated using a set of genes associated with autism. Conclusions FAUN not only helps researchers use biomedical literature efficiently, but also provides utilities for knowledge discovery. This Web-based software environment may be useful for the validation and analysis of functional associations in gene subsets identified by high-throughput experiments. PMID:20946597

  14. Gene Expression and Functional Annotation of the Human Ciliary Body Epithelia

    PubMed Central

    Janssen, Sarah F.; Gorgels, Theo G. M. F.; Bossers, Koen; ten Brink, Jacoline B.; Essing, Anke H. W.; Nagtegaal, Martijn; van der Spek, Peter J.; Jansonius, Nomdo M.; Bergen, Arthur A. B.

    2012-01-01

    Purpose The ciliary body (CB) of the human eye consists of the non-pigmented (NPE) and pigmented (PE) neuro-epithelia. We investigated the gene expression of NPE and PE, to shed light on the molecular mechanisms underlying the most important functions of the CB. We also developed molecular signatures for the NPE and PE and studied possible new clues for glaucoma. Methods We isolated NPE and PE cells from seven healthy human donor eyes using laser dissection microscopy. Next, we performed RNA isolation, amplification, labeling and hybridization against 44k Agilent microarrays. For microarray confirmations, we used a literature study, RT-PCRs, and immunohistochemical stainings. We analyzed the gene expression data with R and with the knowledge database Ingenuity. Results The gene expression profiles and functional annotations of the NPE and PE were highly similar. We found that the most important functionalities of the NPE and PE were related to developmental processes, neural nature of the tissue, endocrine and metabolic signaling, and immunological functions. In total 1576 genes differed statistically significantly between NPE and PE. From these genes, at least 3 were cell-specific for the NPE and 143 for the PE. Finally, we observed high expression in the (N)PE of 35 genes previously implicated in molecular mechanisms related to glaucoma. Conclusion Our gene expression analysis suggested that the NPE and PE of the CB were quite similar. Nonetheless, cell-type specific differences were found. The molecular machineries of the human NPE and PE are involved in a range of neuro-endocrinological, developmental and immunological functions, and perhaps glaucoma. PMID:23028713

  15. Annotation of gene function in citrus using gene expression information and co-expression networks

    PubMed Central

    2014-01-01

    Background The genus Citrus encompasses major cultivated plants such as sweet orange, mandarin, lemon and grapefruit, among the world’s most economically important fruit crops. With increasing volumes of transcriptomics data available for these species, Gene Co-expression Network (GCN) analysis is a viable option for predicting gene function at a genome-wide scale. GCN analysis is based on a “guilt-by-association” principle whereby genes encoding proteins involved in similar and/or related biological processes may exhibit similar expression patterns across diverse sets of experimental conditions. While bioinformatics resources such as GCN analysis are widely available for efficient gene function prediction in model plant species including Arabidopsis, soybean and rice, in citrus these tools are not yet developed. Results We have constructed a comprehensive GCN for citrus inferred from 297 publicly available Affymetrix Genechip Citrus Genome microarray datasets, providing gene co-expression relationships at a genome-wide scale (33,000 transcripts). The comprehensive citrus GCN consists of a global GCN (condition-independent) and four condition-dependent GCNs that survey the sweet orange species only, all citrus fruit tissues, all citrus leaf tissues, or stress-exposed plants. All of these GCNs are clustered using genome-wide, gene-centric (guide) and graph clustering algorithms for flexibility of gene function prediction. For each putative cluster, gene ontology (GO) enrichment and gene expression specificity analyses were performed to enhance gene function, expression and regulation pattern prediction. The guide-gene approach was used to infer novel roles of genes involved in disease susceptibility and vitamin C metabolism, and graph-clustering approaches were used to investigate isoprenoid/phenylpropanoid metabolism in citrus peel, and citric acid catabolism via the GABA shunt in citrus fruit. 
Conclusions Integration of citrus gene co-expression networks, functional enrichment analysis and gene expression information provides opportunities to infer gene function in citrus. We present a publicly accessible tool, Network Inference for Citrus Co-Expression (NICCE, http://citrus.adelaide.edu.au/nicce/home.aspx), for gene co-expression analysis in citrus. PMID:25023870
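
    The "guilt-by-association" principle described in this abstract can be illustrated with a minimal neighbour-voting sketch: genes whose expression profiles correlate strongly with an unannotated gene lend it their functional labels. This is an illustration under stated assumptions, not the NICCE implementation; the 0.8 correlation cutoff, the gene names, the expression values and the function labels are all hypothetical.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def predict_functions(target, expression, annotations, min_r=0.8):
    """Score each function by the fraction of co-expressed annotated
    neighbours of `target` that carry it (min_r is an assumed cutoff)."""
    votes = Counter()
    n_neighbours = 0
    for gene, profile in expression.items():
        if gene == target or gene not in annotations:
            continue
        if pearson(expression[target], profile) >= min_r:
            n_neighbours += 1
            votes.update(annotations[gene])
    if n_neighbours == 0:
        return {}
    return {term: count / n_neighbours for term, count in votes.items()}


# Hypothetical expression profiles across four conditions.
EXPRESSION = {
    "unknown_gene": [1.0, 2.0, 3.0, 4.0],
    "gene_a": [2.0, 4.0, 6.0, 8.0],   # perfectly co-expressed
    "gene_b": [1.0, 2.0, 3.0, 5.0],   # strongly co-expressed
    "gene_c": [4.0, 3.0, 2.0, 1.0],   # anti-correlated, not a neighbour
}
ANNOTATIONS = {
    "gene_a": {"vitamin C metabolism"},
    "gene_b": {"vitamin C metabolism", "stress response"},
    "gene_c": {"citric acid catabolism"},
}

if __name__ == "__main__":
    print(predict_functions("unknown_gene", EXPRESSION, ANNOTATIONS))
```

    Real analyses would cluster the network and test each cluster for term enrichment, as the abstract describes, rather than use raw vote fractions.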

  16. PFP: Automated prediction of gene ontology functional annotations with confidence scores using protein sequence data.

    PubMed

    Hawkins, Troy; Chitale, Meghana; Luban, Stanislav; Kihara, Daisuke

    2009-02-15

    Protein function prediction is a central problem in bioinformatics, increasing in importance recently due to the rapid accumulation of biological data awaiting interpretation. Sequence data represents the bulk of this new stock and is the obvious target for consideration as input, as newly sequenced organisms often lack any other type of biological characterization. We have previously introduced PFP (Protein Function Prediction) as our sequence-based predictor of Gene Ontology (GO) functional terms. PFP interprets the results of a PSI-BLAST search by extracting and scoring individual functional attributes, searching a wide range of E-value sequence matches, and utilizing conventional data mining techniques to fill in missing information. We have shown it to be effective in predicting both specific and low-resolution functional attributes when sufficient data is unavailable. Here we describe (1) significant improvements to the PFP infrastructure, including the addition of prediction significance and confidence scores, (2) a thorough benchmark of performance and comparisons to other related prediction methods, and (3) applications of PFP predictions to genome-scale data. We applied PFP predictions to uncharacterized protein sequences from 15 organisms. Among these sequences, 60-90% could be annotated with a GO molecular function term at high confidence (≥80%). We also applied our predictions to the protein-protein interaction network of the Malaria plasmodium (Plasmodium falciparum). High confidence GO biological process predictions (≥90%) from PFP increased the number of fully enriched interactions in this dataset from 23% of interactions to 94%. Our benchmark comparison shows significant performance improvement of PFP relative to GOtcha, InterProScan, and PSI-BLAST predictions. This is consistent with the performance of PFP as the overall best predictor in both the AFP-SIG '05 and CASP7 function (FN) assessments. 
PFP is available as a web service at http://dragon.bio.purdue.edu/pfp/. PMID:18655063

  17. Phylogenetic molecular function annotation

    NASA Astrophysics Data System (ADS)

    Engelhardt, Barbara E.; Jordan, Michael I.; Repo, Susanna T.; Brenner, Steven E.

    2009-07-01

    It is now easier to discover thousands of protein sequences in a new microbial genome than it is to biochemically characterize the specific activity of a single protein of unknown function. The molecular functions of protein sequences have typically been predicted using homology-based computational methods, which rely on the principle that homologous proteins share a similar function. However, some protein families include groups of proteins with different molecular functions. A phylogenetic approach for predicting molecular function (sometimes called "phylogenomics") is an effective means to predict protein molecular function. These methods incorporate functional evidence from all members of a family that have functional characterizations using the evolutionary history of the protein family to make robust predictions for the uncharacterized proteins. However, they are often difficult to apply on a genome-wide scale because of the time-consuming step of reconstructing the phylogenies of each protein to be annotated. Our automated approach for function annotation using phylogeny, the SIFTER (Statistical Inference of Function Through Evolutionary Relationships) methodology, uses a statistical graphical model to compute the probabilities of molecular functions for unannotated proteins. Our benchmark tests showed that SIFTER provides accurate functional predictions on various protein families, outperforming other available methods.

  18. Transitive functional annotation by shortest-path analysis of gene expression data

    PubMed Central

    Zhou, Xianghong; Kao, Ming-Chih J.; Wong, Wing Hung

    2002-01-01

    Current methods for the functional analysis of microarray gene expression data make the implicit assumption that genes with similar expression profiles have similar functions in cells. However, among genes involved in the same biological pathway, not all gene pairs show high expression similarity. Here, we propose that transitive expression similarity among genes can be used as an important attribute to link genes of the same biological pathway. Based on large-scale yeast microarray expression data, we use the shortest-path analysis to identify transitive genes between two given genes from the same biological process. We find that not only functionally related genes with correlated expression profiles are identified but also those without. In the latter case, we compare our method to hierarchical clustering, and show that our method can reveal functional relationships among genes in a more precise manner. Finally, we show that our method can be used to reliably predict the function of unknown genes from known genes lying on the same shortest path. We assigned functions for 146 yeast genes that are considered as unknown by the Saccharomyces Genome Database and by the Yeast Proteome Database. These genes constitute around 5% of the unknown yeast ORFome. PMID:12196633
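
    The idea of transitive expression similarity can be sketched as a shortest-path search over a co-expression graph: two genes with uncorrelated profiles may still be linked through an intermediate gene that correlates with both. The sketch below is illustrative only and is not the authors' implementation; the edge distance 1 - |r|, the 0.6 cutoff for drawing an edge, and the gene names and profiles are assumptions made for the example.

```python
import heapq
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def build_graph(profiles, min_abs_r=0.6):
    """Connect gene pairs with |r| >= min_abs_r; edge length is 1 - |r|."""
    genes = sorted(profiles)
    graph = {g: {} for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            r = abs(pearson(profiles[g], profiles[h]))
            if r >= min_abs_r:
                graph[g][h] = graph[h][g] = 1.0 - r
    return graph


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns the list of genes on the path or None."""
    best = {source: 0.0}
    previous = {}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            break
        if dist > best[node]:
            continue  # stale heap entry
        for neighbour, weight in graph[node].items():
            candidate = dist + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                previous[neighbour] = node
                heapq.heappush(heap, (candidate, neighbour))
    if target not in best:
        return None
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1]


# Hypothetical profiles: geneA and geneC are uncorrelated with each other,
# but both correlate (r of about 0.71) with geneB.
PROFILES = {
    "geneA": [1.0, -1.0, 1.0, -1.0],
    "geneB": [2.0, 0.0, 0.0, -2.0],
    "geneC": [1.0, 1.0, -1.0, -1.0],
}
GRAPH = build_graph(PROFILES)

if __name__ == "__main__":
    print(shortest_path(GRAPH, "geneA", "geneC"))
```

    In this toy graph no direct edge joins geneA and geneC, so a method based only on pairwise similarity would not link them, while the path through geneB does.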

  19. Annotating genes using textual patterns.

    PubMed

    Cakmak, Ali; Ozsoyoglu, Gultekin

    2007-01-01

    Annotating genes with Gene Ontology (GO) terms is crucial for biologists to characterize the traits of genes in a standardized way. However, manual curation of textual data, the most reliable form of gene annotation by GO terms, requires significant amounts of human effort, is very costly, and cannot keep up with the rate of increase in biomedical publications. In this paper, we present GEANN, a system to automatically infer new GO annotations for genes from biomedical papers based on the evidence support linked to PubMed, a biological literature database of 14 million papers. GEANN (i) extracts from text significant terms and phrases associated with a GO term, (ii) based on the extracted terms, constructs textual extraction patterns with reliability scores for GO terms, (iii) expands the pattern set through "pattern crosswalks", (iv) employs semantic pattern matching, rather than syntactic pattern matching, which allows for the recognition of phrases with close meanings, and (v) annotates genes based on the "quality" of the matched pattern to the genomic entity occurring in the text. On average, in our experiments, GEANN reached a precision of 78% at a recall of 57%. PMID:17990494
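
    Precision and recall, the two figures reported at the end of this abstract, can be computed by comparing predicted (gene, GO term) pairs with a curated reference set. The following is a generic sketch of the standard definitions, not GEANN's evaluation code; the gene names and GO identifiers are arbitrary placeholders.

```python
def precision_recall(predicted, curated):
    """Precision = correct predictions / all predictions;
    recall = correct predictions / all curated annotations."""
    predicted, curated = set(predicted), set(curated)
    true_positives = len(predicted & curated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(curated) if curated else 0.0
    return precision, recall


# Placeholder (gene, GO term) pairs: 4 predictions, 3 of them in the
# curated set of 6 annotations.
PREDICTED = [
    ("geneA", "GO:0000001"),
    ("geneA", "GO:0000002"),
    ("geneB", "GO:0000003"),
    ("geneC", "GO:0000009"),  # not in the curated set
]
CURATED = [
    ("geneA", "GO:0000001"),
    ("geneA", "GO:0000002"),
    ("geneB", "GO:0000003"),
    ("geneB", "GO:0000004"),
    ("geneC", "GO:0000005"),
    ("geneD", "GO:0000006"),
]

if __name__ == "__main__":
    print(precision_recall(PREDICTED, CURATED))  # (0.75, 0.5)
```

    Raising the reliability threshold for accepting a pattern match would typically trade recall for precision, which is why the abstract reports the two numbers as a pair.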

  20. AMIGene: Annotation of MIcrobial Genes

    Microsoft Academic Search

    Stéphanie Bocs; Stéphane Cruveiller; David Vallenet; Grégory Nuel; Claudine Médigue

    2003-01-01

    AMIGene (Annotation of MIcrobial Genes) is an application for automatically identifying the most likely coding sequences (CDSs) in a large contig or a complete bacterial genome sequence. The first step in AMIGene is dedicated to the construction of Markov models that fit the input genomic data (i.e. the gene model), followed by the combination of well-known gene-finding methods and an

  1. Functional annotation of novel lineage-specific genes using co-expression and promoter analysis

    PubMed Central

    2010-01-01

    Background The diversity of placental architectures within and among mammalian orders is believed to be the result of adaptive evolution. Although the genetic basis for these differences is unknown, some may arise from rapidly diverging and lineage-specific genes. Previously, we identified 91 novel lineage-specific transcripts (LSTs) from a cow term-placenta cDNA library, which are excellent candidates for adaptive placental functions acquired by the ruminant lineage. The aim of the present study was to infer functions of previously uncharacterized lineage-specific genes (LSGs) using co-expression, promoter, pathway and network analysis. Results Clusters of co-expressed genes preferentially expressed in liver, placenta and thymus were found using 49 previously uncharacterized LSTs as seeds. Over-represented composite transcription factor binding sites (TFBS) in promoters of clustered LSGs and known genes were then identified computationally. Functions were inferred for nine previously uncharacterized LSGs using co-expression analysis and pathway analysis tools. Our results predict that these LSGs may function in cell signaling, glycerophospholipid/fatty acid metabolism, protein trafficking, regulatory processes in the nucleus, and processes that initiate parturition and immune system development. Conclusions The placenta is a rich source of lineage-specific genes that function in the adaptive evolution of placental architecture and functions. We have shown that co-expression, promoter, and gene network analyses are useful methods to infer functions of LSGs with heretofore unknown functions. Our results indicate that many LSGs are involved in cellular recognition and developmental processes. Furthermore, they provide guidance for experimental approaches to validate the functions of LSGs and to study their evolution. PMID:20214810

  2. Comparison of four similarity measures based on GO annotations for Gene Clustering

    Microsoft Academic Search

    Hisham Al-mubaid; Anurag Nagar

    2008-01-01

    Gene ontology (GO) has fast become a dependable source for determining gene functions, gene similarity, and gene clustering. Furthermore, using GO and gene annotation databases, with semantic similarity measures, is now more acceptable in bioinformatics as means for gene functional analysis. In this paper, we compare four semantic similarity measures to compute the similarity between genes using GO annotations within

  3. SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data

    Microsoft Academic Search

    Maximilian Diehn; Gavin Sherlock; Gail Binkley; Heng Jin; John C. Matese; Tina Hernandez-boussard; Christian A. Rees; J. Michael Cherry; David Botstein; Patrick O. Brown; Ash A. Alizadeh

    2003-01-01

    The explosion in the number of functional genomic datasets generated with tools such as DNA microarrays has created a critical need for resources that facilitate the interpretation of large-scale biological data. SOURCE is a web-based database that brings together information from a broad range of resources, and provides it in a manner particularly useful for genome-scale analyses. SOURCE's GeneReports include

  4. Metagenomic gene annotation by a homology-independent approach

    SciTech Connect

    Froula, Jeff; Zhang, Tao; Salmeen, Annette; Hess, Matthias; Kerfeld, Cheryl A.; Wang, Zhong; Du, Changbin

    2011-06-02

    Fully understanding the genetic potential of a microbial community requires functional annotation of all the genes it encodes. The recently developed deep metagenome sequencing approach has enabled rapid identification of millions of genes from a complex microbial community without cultivation. Current homology-based gene annotation fails to detect distantly-related or structural homologs. Furthermore, homology searches with millions of genes are very computationally intensive. To overcome these limitations, we developed rhModeller, a homology-independent software pipeline to efficiently annotate genes from metagenomic sequencing projects. Using cellulases and carbonic anhydrases as two independent test cases, we demonstrated that rhModeller is much faster than HMMER but with comparable accuracy, at 94.5% and 99.9% accuracy, respectively. More importantly, rhModeller has the ability to detect novel proteins that do not share significant homology to any known protein families. As approximately 50% of the 2 million genes derived from the cow rumen metagenome failed to be annotated based on sequence homology, we tested whether rhModeller could be used to annotate these genes. Preliminary results suggest that rhModeller is robust in the presence of missense and frameshift mutations, two common errors in metagenomic genes. Applying the pipeline to the cow rumen genes identified 4,990 novel cellulase candidates and 8,196 novel carbonic anhydrase candidates. In summary, we expect rhModeller to dramatically increase the speed and quality of metagenomic gene annotation.

  5. JAFA: a protein function annotation meta-server

    Microsoft Academic Search

    Iddo Friedberg; Tim Harder; Adam Godzik

    2006-01-01

    With the high number of sequences and structures streaming in from genomic projects, there is a need for more powerful and sophisticated annotation tools. Most problematic of the annotation efforts is predicting gene and protein function. Over the past few years there has been considerable progress in automated protein function prediction, using a diverse set of methods. Nevertheless, no

  6. Gene Characterization Index: Assessing the Depth of Gene Annotation

    PubMed Central

    Yusuf, Dimas; Brumm, Jochen; Cheung, Warren; Wahlestedt, Claes; Lenhard, Boris; Wasserman, Wyeth W.

    2008-01-01

    Background We introduce the Gene Characterization Index, a bioinformatics method for scoring the extent to which a protein-encoding gene is functionally described. Inherently a reflection of human perception, the Gene Characterization Index is applied for assessing the characterization status of individual genes, thus serving the advancement of both genome annotation and applied genomics research by rapid and unbiased identification of groups of uncharacterized genes for diverse applications such as directed functional studies and delineation of novel drug targets. Methodology/Principal Findings The scoring procedure is based on a global survey of researchers, who assigned characterization scores from 1 (poor) to 10 (extensive) for a sample of genes based on major online resources. By evaluating the survey as training data, we developed a bioinformatics procedure to assign gene characterization scores to all genes in the human genome. We analyzed snapshots of functional genome annotation over a period of 6 years to assess temporal changes reflected by the increase of the average Gene Characterization Index. Applying the Gene Characterization Index to genes within pharmaceutically relevant classes, we confirmed known drug targets as high-scoring genes and revealed potentially interesting novel targets with low characterization indexes. Removing known drug targets and genes linked to sequence-related patent filings from the entirety of indexed genes, we identified sets of low-scoring genes particularly suited for further experimental investigation. Conclusions/Significance The Gene Characterization Index is intended to serve as a tool to the scientific community and granting agencies for focusing resources and efforts on unexplored areas of the genome. The Gene Characterization Index is available from http://cisreg.ca/gci/. PMID:18213364

  7. Using reasoning to guide annotation with gene ontology terms in GOAT

    Microsoft Academic Search

    Michael Bada; Daniele Turi; Robin McEntire; Robert Stevens

    2004-01-01

    High-quality annotation of biological data is central to bioinformatics. Annotation using terms from ontologies provides reliable computational access to data. The Gene Ontology (GO), a structured controlled vocabulary of nearly 17,000 terms, is becoming the de facto standard for describing the functionality of gene products. Many prominent biomedical databases use GO as a source of terms for functional annotation of

  8. Functional Annotation of Cotesia congregata Bracovirus: Identification of Viral Genes Expressed in Parasitized Host Immune Tissues

    PubMed Central

    Thézé, Julien; Cambier, Sébastien; Poulain, Julie; Da Silva, Corinne; Bézier, Annie; Musset, Karine; Moreau, Sébastien J. M.; Drezen, Jean-Michel

    2014-01-01

    ABSTRACT Bracoviruses (BVs) from the Polydnaviridae family are symbiotic viruses used as biological weapons by parasitoid wasps to manipulate lepidopteran host physiology and induce parasitism success. BV particles are produced by wasp ovaries and injected along with the eggs into the caterpillar host body, where viral gene expression is necessary for wasp development. Recent sequencing of the proviral genome of Cotesia congregata BV (CcBV) identified 222 predicted virulence genes present on 35 proviral segments integrated into the wasp genome. To date, the expressions of only a few selected candidate virulence genes have been studied in the caterpillar host, and we lacked a global vision of viral gene expression. In this study, a large-scale transcriptomic analysis by 454 sequencing of two immune tissues (fat body and hemocytes) of parasitized Manduca sexta caterpillar hosts allowed the detection of expression of 88 CcBV genes expressed 24 h after the onset of parasitism. We linked the expression profiles of these genes to several factors, showing that different regulatory mechanisms control viral gene expression in the host. These factors include the presence of signal peptides in encoded proteins, diversification of promoter regions, and, more surprisingly, gene position on the proviral genome. Indeed, most genes for which expression could be detected are localized in particular proviral regions globally producing higher numbers of circles. Moreover, this polydnavirus (PDV) transcriptomic analysis also reveals that a majority of CcBV genes possess at least one intron and an arthropod transcription start site, consistent with an insect origin of these virulence genes. IMPORTANCE Bracoviruses (BVs) are symbiotic polydnaviruses used by parasitoid wasps to manipulate lepidopteran host physiology, ensuring wasp offspring survival. To date, the expressions of only a few selected candidate BV virulence genes have been studied in caterpillar hosts. 
    We performed a large-scale analysis of BV gene expression in two immune tissues of Manduca sexta caterpillars parasitized by Cotesia congregata wasps. Genes for which expression could be detected corresponded to genes localized in particular regions of the viral genome globally producing higher numbers of circles. Our study thus brings an original global vision of viral gene expression and paves the way to the determination of the regulatory mechanisms enabling the expression of BV genes in targeted organisms, such as major insect pests. In addition, we identify sequence features suggesting that most BV virulence genes were acquired from insect genomes. PMID:24872581

  9. Functional Annotation Analytics of Rhodopseudomonas palustris Genomes

    PubMed Central

    Simmons, Shaneka S.; Isokpehi, Raphael D.; Brown, Shyretha D.; McAllister, Donee L.; Hall, Charnia C.; McDuffy, Wanaki M.; Medley, Tamara L.; Udensi, Udensi K.; Rajnarayanan, Rajendram V.; Ayensu, Wellington K.; Cohly, Hari H.P.

    2011-01-01

    Rhodopseudomonas palustris, a nonsulphur purple photosynthetic bacterium, has been extensively investigated for its metabolic versatility including the ability to produce hydrogen gas from sunlight and biomass. The availability of the finished genome sequences of six R. palustris strains (BisA53, BisB18, BisB5, CGA009, HaA2 and TIE-1) combined with online bioinformatics software for integrated analysis presents new opportunities to determine the genomic basis of metabolic versatility and ecological lifestyles of this bacterial species. The purpose of this investigation was to compare the functional annotations available for multiple R. palustris genomes to identify annotations that can be further investigated for strain-specific or uniquely shared phenotypic characteristics. A total of 2,355 protein family Pfam domain annotations were clustered based on presence or absence in the six genomes. The clustering process identified groups of functional annotations including those that could be verified as strain-specific or uniquely shared phenotypes. For example, genes encoding water/glycerol transport were present in the genome sequences of strains CGA009 and BisB5, but absent in strains BisA53, BisB18, HaA2 and TIE-1. Protein structural homology modeling predicted that the two orthologous 240 aa R. palustris aquaporins have water-specific transport function. Based on observations in other microbes, the presence of aquaporin in R. palustris strains may improve freeze tolerance in natural conditions of rapid freezing such as nitrogen fixation at low temperatures where access to liquid water is a limiting factor for nitrogenase activation. In the case of adaptive loss of aquaporin genes, strains may be better adapted to survive in conditions of high-sugar content such as fermentation of biomass for biohydrogen production. 
    Finally, web-based resources were developed to allow for interactive, user-defined selection of the relationship between protein family annotations and the R. palustris genomes. PMID:22084572
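
    Grouping annotations by presence or absence across the six strains can be illustrated with a short Python sketch. It is a simplification that only groups domains sharing an identical presence/absence profile, which is not necessarily the clustering procedure used in the study. The domain names are hypothetical placeholders; only the CGA009/BisB5 pattern for the water/glycerol transport genes is taken from the abstract.

```python
from collections import defaultdict

# Strain order fixes the meaning of each position in a profile.
STRAINS = ("BisA53", "BisB18", "BisB5", "CGA009", "HaA2", "TIE-1")

def presence_profile(strains_with_domain):
    """Encode presence (1) or absence (0) of a domain in each strain."""
    return tuple(int(s in strains_with_domain) for s in STRAINS)

def group_by_profile(domain_to_strains):
    """Group domains that share exactly the same presence/absence profile."""
    groups = defaultdict(list)
    for domain, strains in domain_to_strains.items():
        groups[presence_profile(strains)].append(domain)
    return dict(groups)

domain_to_strains = {
    "aquaporin_like": {"CGA009", "BisB5"},         # pattern from the abstract
    "hypothetical_domain_1": set(STRAINS),         # placeholder: in all strains
    "hypothetical_domain_2": {"CGA009", "BisB5"},  # placeholder: same pattern
    "hypothetical_domain_3": {"TIE-1"},            # placeholder: strain-specific
}
groups = group_by_profile(domain_to_strains)
for profile, domains in groups.items():
    print(profile, domains)
```

    Profiles with a single 1 correspond to strain-specific annotations, and profiles with a few 1s to uniquely shared ones.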

  10. Discovering gene annotations in biomedical text databases

    PubMed Central

    Cakmak, Ali; Ozsoyoglu, Gultekin

    2008-01-01

    Background Genes and gene products are frequently annotated with Gene Ontology concepts based on the evidence provided in genomics articles. Manually locating and curating information about a genomic entity from the biomedical literature requires vast amounts of human effort. Hence, there is clearly a need for automated computational tools to annotate the genes and gene products with Gene Ontology concepts by computationally capturing the related knowledge embedded in textual data. Results In this article, we present an automated genomic entity annotation system, GEANN, which extracts information about the characteristics of genes and gene products in article abstracts from PubMed, and translates the discovered knowledge into Gene Ontology (GO) concepts, a widely-used standardized vocabulary of genomic traits. GEANN utilizes textual "extraction patterns", and a semantic matching framework to locate phrases matching to a pattern and produce Gene Ontology annotations for genes and gene products. In our experiments, GEANN reached a precision of 78% at a recall of 61%. On a select set of Gene Ontology concepts, GEANN either outperforms or is comparable to two other automated annotation studies. Use of WordNet for semantic pattern matching improves the precision and recall by 24% and 15%, respectively, and the improvement due to semantic pattern matching becomes more apparent as the Gene Ontology terms become more general. Conclusion GEANN is useful for two distinct purposes: (i) automating the annotation of genomic entities with Gene Ontology concepts, and (ii) providing existing annotations with additional "evidence articles" from the literature. The use of textual extraction patterns that are constructed based on the existing annotations achieves high precision. 
    The semantic pattern matching framework provides a more flexible pattern matching scheme with respect to "exact matching" with the advantage of locating approximate pattern occurrences with similar semantics. Relatively low recall performance of our pattern-based approach may be enhanced either by employing a probabilistic annotation framework based on the annotation neighbourhoods in textual data, or, alternatively, the statistical enrichment threshold may be adjusted to lower values for applications that put more value on achieving higher recall values. PMID:18325104

  11. MUTANT MOUSE: bona fide Biosimulator for the Functional Annotation of Gene and Genome Networks

    Microsoft Academic Search

    Yoichi Gondo

    The advancements of genomics and genome projects led to the current paradigm that the blueprint of life is depicted in the genome sequences. To decipher the life system, deductive methods have been applied from genome sequences to genes, transcripts, proteins, organelles, cells, tissues, organs, organisms, and populations. As a result we encountered an astronomical scale of complicated molecular and cellular

  12. Characterizing the state of the art in the computational assignment of gene function: lessons from the first critical assessment of functional annotation (CAFA)

    PubMed Central

    2013-01-01

    The assignment of gene function remains a difficult but important task in computational biology. The establishment of the first Critical Assessment of Functional Annotation (CAFA) was aimed at increasing progress in the field. We present an independent analysis of the results of CAFA, aimed at identifying challenges in assessment and at understanding trends in prediction performance. We found that well-accepted methods based on sequence similarity (i.e., BLAST) have a dominant effect. Many of the most informative predictions turned out to be either recovering existing knowledge about sequence similarity or were "post-dictions" already documented in the literature. These results indicate that deep challenges remain in even defining the task of function assignment, with a particular difficulty posed by the problem of defining function in a way that is not dependent on either flawed gold standards or the input data itself. In particular, we suggest that using the Gene Ontology (or other similar systematizations of function) as a gold standard is unlikely to be the way forward. PMID:23630983

  13. ConFunc - functional annotation in the twilight zone

    Microsoft Academic Search

    Mark N. Wass; Michael J. E. Sternberg

    2008-01-01

    Motivation: The success of genome sequencing has resulted in many protein sequences without functional annotation. We present ConFunc, an automated Gene Ontology (GO)-based protein function prediction approach, which uses conserved residues to generate sequence profiles to infer function. ConFunc splits sets of sequences identified by PSI-BLAST into sub-alignments according to their GO annotations. Conserved residues are identified for each GO

  14. Gene calling and bacterial genome annotation with BG7.

    PubMed

    Tobes, Raquel; Pareja-Tobes, Pablo; Manrique, Marina; Pareja-Tobes, Eduardo; Kovach, Evdokim; Alekhin, Alexey; Pareja, Eduardo

    2015-01-01

    New massive sequencing technologies are providing many bacterial genome sequences from diverse taxa but a refined annotation of these genomes is crucial for obtaining scientific findings and new knowledge. Thus, bacterial genome annotation has emerged as a key point to investigate in bacteria. Any efficient tool designed specifically to annotate bacterial genomes sequenced with massively parallel technologies has to consider the specific features of bacterial genomes (absence of introns and scarcity of nonprotein-coding sequence) and of next-generation sequencing (NGS) technologies (presence of errors and not perfectly assembled genomes). These features make it convenient to focus on coding regions and, hence, on protein sequences that are the elements directly related to biological functions. In this chapter we describe how to annotate bacterial genomes with BG7, an open-source tool based on a protein-centered gene calling/annotation paradigm. BG7 is specifically designed for the annotation of bacterial genomes sequenced with NGS. This tool is tolerant of sequence errors, maintaining its capabilities for the annotation of highly fragmented genomes or for annotating mixed sequences coming from several genomes (such as those obtained from metagenomics samples). BG7 has been designed with scalability as a requirement, with a computing infrastructure completely based on cloud computing (Amazon Web Services). PMID:25343866

  15. Functional annotation and biological interpretation of proteomics data.

    PubMed

    Carnielli, Carolina M; Winck, Flavia V; Paes Leme, Adriana F

    2015-01-01

    Proteomics experiments often generate a vast amount of data. However, the simple identification and quantification of proteins from a cell proteome or subproteome is not sufficient for the full understanding of complex mechanisms occurring in the biological systems. Therefore, the functional annotation analysis of protein datasets using bioinformatics tools is essential for interpreting the results of high-throughput proteomics. Although large-scale proteomics data have rapidly increased, the biological interpretation of these results remains a challenging task. Here we reviewed basic concepts and different programs that are commonly used in proteomics data functional annotation, emphasizing the main strategies focused on the use of gene ontology annotations. Furthermore, we explored the characteristics of some tools developed for functional annotation analysis, concerning the ease of use and typical caveats on ontology annotations. The utility and variations between different tools were assessed through the comparison of the resulting outputs generated for an example proteomics dataset. PMID:25448015

  16. Detection of gene annotations and protein-protein interaction associated disorders through transitive relationships between integrated annotations

    PubMed Central

    2015-01-01

    Background Increasingly high amounts of heterogeneous and valuable controlled biomolecular annotations are available, but far from exhaustive and scattered in many databases. Several annotation integration and prediction approaches have been proposed, but these issues are still unsolved. We previously created a Genomic and Proteomic Knowledge Base (GPKB) that efficiently integrates many distributed biomolecular annotation and interaction data of several organisms, including 32,956,102 gene annotations, 273,522,470 protein annotations and 277,095 protein-protein interactions (PPIs). Results By comprehensively leveraging transitive relationships defined by the numerous association data integrated in GPKB, we developed a software procedure that effectively detects and supplements consistent biomolecular annotations not present in the integrated sources. According to some defined logic rules, it does so only when the semantic type of data and of their relationships, as well as the cardinality of the relationships, allow identifying molecular biology compliant annotations. Thanks to controlled consistency and quality enforced on data integrated in GPKB, and to the procedures used to avoid error propagation during their automatic processing, we could reliably identify many annotations, which we integrated in GPKB. They comprise 3,144 gene to pathway and 21,942 gene to biological function annotations of many organisms, and 1,027 candidate associations between 317 genetic disorders and 782 human PPIs. Overall estimated recall and precision of our approach were 90.56 % and 96.61 %, respectively. Co-functional evaluation of genes with known function showed high functional similarity between genes with new detected and known annotation to the same pathway; considering also the new detected gene functional annotations enhanced such functional similarity, which resembled the one existing between genes known to be annotated to the same pathway. 
    Strong evidence was also found in the literature for the candidate associations detected between Cystic fibrosis disorder and the PPIs between the CFTR_HUMAN, DERL1_HUMAN, RNF5_HUMAN, AHSA1_HUMAN and GOPC_HUMAN proteins, and between the CHIP_HUMAN and HSP7C_HUMAN proteins. Conclusions Although identified gene annotations and PPI-genetic disorder candidate associations require biological validation, our approach intrinsically provides their in silico evidence based on available data. Public availability within the GPKB (http://www.bioinformatics.deib.polimi.it/GPKB/) of all identified and integrated annotations offers a valuable resource fostering new biomedical-molecular knowledge discoveries. PMID:26046679
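
    The general idea of deriving an annotation through a transitive relationship can be sketched in Python as a two-step join. The gene-to-protein-to-pathway chain, the function name, and all identifiers below are assumptions made for illustration; the sketch also omits the logic rules on semantic type and relationship cardinality that the abstract says the actual procedure applies.

```python
def infer_transitive(gene_to_protein, protein_to_pathway, known):
    """Return (gene, pathway) pairs implied by gene -> protein -> pathway
    links that are not already present among the known annotations."""
    inferred = set()
    for gene, proteins in gene_to_protein.items():
        for protein in proteins:
            for pathway in protein_to_pathway.get(protein, ()):
                if (gene, pathway) not in known:
                    inferred.add((gene, pathway))
    return inferred

# Hypothetical toy associations.
gene_to_protein = {"geneX": ["protX"], "geneY": ["protY"], "geneZ": ["protZ"]}
protein_to_pathway = {"protX": ["pathway_1"], "protY": ["pathway_2"]}
known = {("geneY", "pathway_2")}

inferred = infer_transitive(gene_to_protein, protein_to_pathway, known)
print(sorted(inferred))
```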

  17. Identification and Functional Annotation of Genome-Wide ER-Regulated Genes in Breast Cancer Based on ChIP-Seq Data

    PubMed Central

    Ding, Min; Wang, Haiyun; Chen, Jiajia; Shen, Bairong; Xu, Zhonghua

    2012-01-01

    Estrogen receptor (ER) is a crucial molecule symbol of breast cancer. Molecular interactions between ER complexes and DNA regulate the expression of genes responsible for cancer cell phenotypes. However, the positions and mechanisms of the ER binding with downstream gene targets are far from being fully understood. ChIP-Seq is an important assay for the genome-wide study of protein-DNA interactions. In this paper, we explored the genome-wide chromatin localization of ER-DNA binding regions by analyzing ChIP-Seq data from MCF-7 breast cancer cell line. By integrating three peak detection algorithms and two datasets, we localized 933 ER binding sites, 92% among which were located far away from promoters, suggesting long-range control by ER. Moreover, 489 genes in the vicinity of ER binding sites were identified as estrogen response elements by comparison with expression data. In addition, 836 single nucleotide polymorphisms (SNPs) in or near 157 ER-regulated genes were found in the vicinity of ER binding sites. Furthermore, we annotated the function of the nearest-neighbor genes of these binding sites using Gene Ontology (GO), KEGG, and GeneGo pathway databases. The results revealed novel ER-regulated genes pathways for further experimental validation. ER was found to affect every developed stage of breast cancer by regulating genes related to the development, progression, and metastasis. This study provides a deeper understanding of the regulatory mechanisms of ER and its associated genes. PMID:23346221

  18. Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO)

    Microsoft Academic Search

    Selina S. Dwight; Midori A. Harris; Kara Dolinski; Catherine A. Ball; Gail Binkley; Karen R. Christie; Dianna G. Fisk; Laurie Issel-tarver; Mark Schroeder; Gavin Sherlock; Anand Sethuraman; Shuai Weng; David Botstein; J. Michael Cherry

    2002-01-01

    The Saccharomyces Genome Database (SGD) resources, ranging from genetic and physical maps to genome-wide analysis tools, reflect the scientific progress in identifying genes and their functions over the last decade. As emphasis shifts from identification of the genes to identification of the role of their gene products in the cell, SGD seeks to provide its users with annotations that

  19. Taxonomic Precision of Different Hypervariable Regions of 16S rRNA Gene and Annotation Methods for Functional Bacterial Groups in Biological Wastewater Treatment

    PubMed Central

    Guo, Feng; Ju, Feng; Cai, Lin; Zhang, Tong

    2013-01-01

    High throughput sequencing of the 16S rRNA gene leads us to a deeper understanding of bacterial diversity in complex environmental samples, but introduces blurring due to the relatively low taxonomic resolution of short reads. For wastewater treatment plants, only those functional bacterial genera categorized as nutrient remediators, bulk/foaming species, and potential pathogens are significant to biological wastewater treatment and environmental impacts. Precise taxonomic assignment of these bacteria at least at the genus level is important for microbial ecological research and routine wastewater treatment monitoring. Therefore, the focus of this study was to evaluate the taxonomic precision of different ribosomal RNA (rRNA) gene hypervariable regions generated from a mixed activated sludge sample. In addition, three commonly used classification methods including RDP Classifier, BLAST-based best-hit annotation, and the lowest common ancestor annotation by MEGAN were evaluated by comparing their consistency. In an unsupervised comparison, analysis of consistency among different classification methods suggests there are no hypervariable regions with good taxonomic coverage for all genera. Taxonomic assignments based on certain regions of the 16S rRNA gene, e.g. the V1&V2 regions, are fairly consistent for a relatively wide range of genera. Hence, it is recommended to use these regions for studying functional groups in activated sludge. Moreover, the inconsistency among methods also demonstrated that a specific method might not be suitable for identification of some bacterial genera using certain 16S rRNA gene regions. As a general rule, drawing conclusions based only on one sequencing region and one classification method should be avoided due to the potential false negative results. PMID:24146837
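
    The lowest common ancestor (LCA) annotation mentioned above can be illustrated with a minimal Python sketch: a read is assigned to the deepest taxon shared by the lineages of all its hits. This is a bare-bones version of the general LCA idea, not MEGAN's implementation, which also filters hits by score; the example lineages are illustrative inputs.

```python
def lowest_common_ancestor(lineages):
    """Return the longest shared prefix of root-to-leaf lineages (lists of taxa)."""
    if not lineages:
        return []
    shared = []
    # zip(*lineages) walks the lineages rank by rank and stops at the shortest.
    for taxa_at_rank in zip(*lineages):
        if len(set(taxa_at_rank)) != 1:
            break
        shared.append(taxa_at_rank[0])
    return shared

# Illustrative hits for one read: two genera from the same family.
hits = [
    ["Bacteria", "Proteobacteria", "Betaproteobacteria",
     "Nitrosomonadales", "Nitrosomonadaceae", "Nitrosomonas"],
    ["Bacteria", "Proteobacteria", "Betaproteobacteria",
     "Nitrosomonadales", "Nitrosomonadaceae", "Nitrosospira"],
]
lca = lowest_common_ancestor(hits)
print(lca[-1])
```

    Because the two hits disagree at the genus rank, the read is assigned only to the family level, which shows how LCA annotation trades resolution for consistency.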

  20. AutoFACT: An Automatic Functional Annotation and Classification Tool

    PubMed Central

    Koski, Liisa B; Gray, Michael W; Lang, B Franz; Burger, Gertraud

    2005-01-01

    Background Assignment of function to new molecular sequence data is an essential step in genomics projects. The usual process involves similarity searches of a given sequence against one or more databases, an arduous process for large datasets. Results We present AutoFACT, a fully automated and customizable annotation tool that assigns biologically informative functions to a sequence. Key features of this tool are that it (1) analyzes nucleotide and protein sequence data; (2) determines the most informative functional description by combining multiple BLAST reports from several user-selected databases; (3) assigns putative metabolic pathways, functional classes, enzyme classes, GeneOntology terms and locus names; and (4) generates output in HTML, text and GFF formats for the user's convenience. We have compared AutoFACT to four well-established annotation pipelines. The error rate of functional annotation is estimated to be only between 1–2%. Comparison of AutoFACT to the traditional top-BLAST-hit annotation method shows that our procedure increases the number of functionally informative annotations by approximately 50%. Conclusion AutoFACT will serve as a useful annotation tool for smaller sequencing groups lacking dedicated bioinformatics staff. It is implemented in PERL and runs on LINUX/UNIX platforms. AutoFACT is available at . PMID:15960857

  1. BABELOMICS: a suite of web tools for functional annotation and analysis of groups of genes in high-throughput experiments

    PubMed Central

    Al-Shahrour, Fátima; Minguez, Pablo; Vaquerizas, Juan M.; Conde, Lucía; Dopazo, Joaquín

    2005-01-01

    We present Babelomics, a complete suite of web tools for the functional analysis of groups of genes in high-throughput experiments, which includes the use of information on Gene Ontology terms, InterPro motifs, KEGG pathways, Swiss-Prot keywords, analysis of predicted transcription factor binding sites, chromosomal positions and presence in tissues with determined histological characteristics, through five integrated modules: FatiGO (fast assignment and transference of information), FatiWise, transcription factor association test, GenomeGO and tissues mining tool, respectively. Additionally, another module, FatiScan, provides a new procedure that integrates biological information in combination with experimental results in order to find groups of genes with modest but coordinate significant differential behaviour. FatiScan is highly sensitive and is capable of finding significant asymmetries in the distribution of genes of common function across a list of ordered genes even if these asymmetries were not extreme. The strong multiple-testing nature of the contrasts made by the tools is taken into account. All the tools are integrated in the gene expression analysis package GEPAS. Babelomics is the natural evolution of our tool FatiGO (which analysed almost 22 000 experiments during the last year) to include more sources of information and new modes of using it. Babelomics can be found at . PMID:15980512

  2. JAFA: a protein function annotation meta-server

    PubMed Central

    Friedberg, Iddo; Harder, Tim; Godzik, Adam

    2006-01-01

    With the high number of sequences and structures streaming in from genomic projects, there is a need for more powerful and sophisticated annotation tools. Most problematic of the annotation efforts is predicting gene and protein function. Over the past few years there has been considerable progress in automated protein function prediction, using a diverse set of methods. Nevertheless, no single method reports all the information possible, and molecular biologists resort to ‘shopping around’ using different methods: a cumbersome and time-consuming practice. Here we present the Joined Assembly of Function Annotations, or JAFA server. JAFA queries several function prediction servers with a protein sequence and assembles the returned predictions in a legible, non-redundant format. In this manner, JAFA combines the predictions of several servers to provide a comprehensive view of what are the predicted functions of the proteins. JAFA also offers its own output, and the individual programs' predictions for further processing. JAFA is available for use from . PMID:16845030
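
    The assembly step described above (collecting predictions from several servers and presenting them without redundancy) can be pictured with a minimal sketch. This is not JAFA's actual implementation: the server names, the input layout (a mapping from server to a list of GO term IDs), and the ordering by number of supporting servers are all assumptions made for illustration.

```python
from collections import defaultdict


def merge_predictions(per_server):
    """Merge {server: [GO term, ...]} into a non-redundant, ordered list.

    Hypothetical helper: each GO term appears once, paired with the
    sorted list of servers that predicted it.
    """
    merged = defaultdict(set)
    for server, terms in per_server.items():
        for term in terms:
            # Normalise case/whitespace so the same ID is not listed twice.
            merged[term.strip().upper()].add(server)
    # Terms supported by more servers come first; ties are broken by term ID.
    ordered = sorted(merged.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(term, sorted(servers)) for term, servers in ordered]


# Made-up predictions from three hypothetical servers.
example = {
    "server_a": ["GO:0003677", "GO:0005634"],
    "server_b": ["go:0003677"],
    "server_c": ["GO:0005634", "GO:0003677", "GO:0006355"],
}

for term, servers in merge_predictions(example):
    print(term, servers)
```

    Keeping the list of supporting servers next to each term preserves the individual programs' contributions, which the abstract says JAFA also makes available for further processing.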

  3. Protein function annotation by homology-based inference

    PubMed Central

    Loewenstein, Yaniv; Raimondo, Domenico; Redfern, Oliver C; Watson, James; Frishman, Dmitrij; Linial, Michal; Orengo, Christine; Thornton, Janet; Tramontano, Anna

    2009-01-01

    With many genomes now sequenced, computational annotation methods to characterize genes and proteins from their sequence are increasingly important. The BioSapiens Network has developed tools to address all stages of this process, and here we review progress in the automated prediction of protein function based on protein sequence and structure. PMID:19226439

  4. GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes

    Microsoft Academic Search

    David M. A. Martin; Matthew Berriman; Geoffrey J. Barton

    2004-01-01

    Background: The function of a novel gene product is typically predicted by transitive assignment of annotation from similar sequences. We describe a novel method, GOtcha, for predicting gene product function by annotation with Gene Ontology (GO) terms. GOtcha predicts GO term associations with term-specific probability (P-score) measures of confidence. Term-specific probabilities are a novel feature of GOtcha and allow the

  5. Evolutionary Trace Annotation of Protein Function in the Structural Proteome

    PubMed Central

    Erdin, Serkan; Ward, R. Matthew; Venner, Eric

    2010-01-01

    By design, structural genomics (SG) solves many structures that cannot be assigned function based on homology to known proteins. Alternative function annotation methods are therefore needed and this study focuses on function prediction with three-dimensional (3D) templates: small structural motifs built of just a few functionally critical residues. Although experimentally proven functional residues are scarce, we show here that Evolutionary Trace (ET) rankings of residue importance are sufficient to build 3D templates, match them, and then assign Gene Ontology (GO) functions in enzymes and non-enzymes alike. In a high specificity mode, this Evolutionary Trace Annotation (ETA) method covered half (53%) of the 2384 annotated SG protein controls. Three-quarters (76%) of predictions were both correct and complete. The positive predictive value for all GO depths (all-depth PPV) was 84%, and it rose to 94% over GO depths 1–3 (depth 3 PPV). In a high sensitivity mode coverage rose significantly (84%) while accuracy fell moderately: 68% of predictions were both correct and complete, all-depth PPV was 75%, and depth 3 PPV was 86%. These data concur with prior mutational experiments showing that ET rank information identifies key functional determinants in proteins. In practice, ETA predicted functions in 42% of 3461 un-annotated SG proteins. In 529 cases—including 280 non-enzymes and 21 for metal ion ligands—the expected accuracy is 84% at any GO depth and 94% down to GO depth 3, while for the remaining 931 the expected accuracies are 60% and 71%, respectively. Thus local structural comparisons of evolutionarily important residues can help decipher protein functions to known reliability levels and without prior assumption on functional mechanisms. ETA is available at http://mammoth.bcm.tmc.edu/eta. PMID:20036248
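
    The coverage and positive predictive value (PPV) figures quoted above follow the usual definitions, which the short sketch below spells out. The function names are hypothetical, the true/false positive counts in the PPV call are invented for illustration, and the coverage check assumes that the 529 and 931 cases together make up all predictions on the 3461 un-annotated proteins.

```python
def coverage(n_predicted, n_total):
    """Fraction of proteins for which at least one function was predicted."""
    return n_predicted / n_total


def ppv(true_pos, false_pos):
    """Positive predictive value: correct predictions over all predictions."""
    total = true_pos + false_pos
    return true_pos / total if total else 0.0


# Coverage on the un-annotated set, using the counts given in the abstract.
unannotated_coverage = coverage(529 + 931, 3461)
print(round(100 * unannotated_coverage))  # consistent with the reported 42%

# PPV with made-up counts: 84 correct out of 100 predictions.
print(ppv(84, 16))
```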

  6. Transcript Annotation in FANTOM3: Mouse Gene Catalog Based on Physical cDNAs

    Microsoft Academic Search

    Norihiro Maeda; Takeya Kasukawa; Rieko Oyama; Julian Gough; Martin Frith; Pär G. Engström; Boris Lenhard; Rajith N. Aturaliya; Serge Batalov; Kirk W. Beisel; Carol J. Bult; Colin F. Fletcher; Alistair R. R. Forrest; Masaaki Furuno; David Hill; Masayoshi Itoh; Mutsumi Kanamori-Katayama; Shintaro Katayama; Masaru Katoh; Tsugumi Kawashima; John Quackenbush; Timothy Ravasi; Brian Z. Ring; Kazuhiro Shibata; Koji Sugiura; Yoichi Takenaka; Rohan D. Teasdale; Christine A. Wells; Yunxia Zhu; Chikatoshi Kai; Jun Kawai; David A. Hume; Piero Carninci; Yoshihide Hayashizaki

    2006-01-01

    The international FANTOM consortium aims to produce a comprehensive picture of the mammalian transcriptome, based upon an extensive cDNA collection and functional annotation of full-length enriched cDNAs. The previous dataset, FANTOM2, comprised 60,770 full-length enriched cDNAs. Functional annotation revealed that this cDNA dataset contained only about half of the estimated number of mouse protein-coding genes, indicating that a number of

  7. Integrating Gene Ontology and Blast to predict gene functions

    Microsoft Academic Search

    WANG Cheng-gang; MO Zhi-hong

    2007-01-01

    A GoBlast system was built to predict gene function by integrating BLAST search and Gene Ontology (GO) annotations. The operating system was Debian Linux 3.1, with Apache as the web server and MySQL as the data storage system. FASTA files with GO annotations were taken as the sequence source for BLAST alignment, which were formatted by

  8. Improving functional annotation for industrial microbes: a case study with Pichia pastoris

    PubMed Central

    Dikicioglu, Duygu; Wood, Valerie; Rutherford, Kim M.; McDowall, Mark D.; Oliver, Stephen G.

    2014-01-01

    The research communities studying microbial model organisms, such as Escherichia coli or Saccharomyces cerevisiae, are well served by model organism databases that have extensive functional annotation. However, this is not true of many industrial microbes that are used widely in biotechnology. In this Opinion piece, we use Pichia (Komagataella) pastoris to illustrate the limitations of the available annotation. We consider the resources that can be implemented in the short term both to improve Gene Ontology (GO) annotation coverage based on annotation transfer, and to establish curation pipelines for the literature corpus of this organism. PMID:24929579

  9. Transcriptome assembly, gene annotation and tissue gene expression atlas of the rainbow trout.

    PubMed

    Salem, Mohamed; Paneru, Bam; Al-Tobasei, Rafet; Abdouni, Fatima; Thorgaard, Gary H; Rexroad, Caird E; Yao, Jianbo

    2015-01-01

    Efforts to obtain a comprehensive genome sequence for rainbow trout are ongoing and will be complemented by transcriptome information that will enhance genome assembly and annotation. Previously, transcriptome reference sequences were reported using data from different sources. Although the previous work added a great wealth of sequences, a complete and well-annotated transcriptome is still needed. In addition, gene expression in different tissues was not completely addressed in the previous studies. In this study, non-normalized cDNA libraries were sequenced from 13 different tissues of a single doubled haploid rainbow trout from the same source used for the rainbow trout genome sequence. A total of ~1.167 billion paired-end reads were de novo assembled using the Trinity RNA-Seq assembler yielding 474,524 contigs > 500 base-pairs. Of them, 287,593 had homologies to the NCBI non-redundant protein database. The longest contig of each cluster was selected as a reference, yielding 44,990 representative contigs. A total of 4,146 contigs (9.2%), including 710 full-length sequences, did not match any mRNA sequences in the current rainbow trout genome reference. Mapping reads to the reference genome identified an additional 11,843 transcripts not annotated in the genome. A digital gene expression atlas revealed 7,678 housekeeping and 4,021 tissue-specific genes. Expression of about 16,000-32,000 genes (35-71% of the identified genes) accounted for basic and specialized functions of each tissue. White muscle and stomach had the least complex transcriptomes, with high percentages of their total mRNA contributed by a small number of genes. Brain, testis and intestine, in contrast, had complex transcriptomes, with a large number of genes involved in their expression patterns. This study provides comprehensive de novo transcriptome information that is suitable for functional and comparative genomics studies in rainbow trout, including annotation of the genome. PMID:25793877
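
    The reference-selection step mentioned above (keeping the longest contig of each cluster) is simple enough to sketch. The input layout, a list of (cluster ID, contig ID, sequence) tuples, and the identifiers are assumptions for illustration and do not reproduce the Trinity output format or the authors' pipeline. The last line re-derives the 9.2% figure from the two counts given in the abstract.

```python
def pick_representatives(contigs):
    """Return {cluster_id: contig_id} keeping the longest contig per cluster.

    contigs: iterable of (cluster_id, contig_id, sequence) tuples.
    On a length tie, the contig seen first is kept.
    """
    best = {}
    for cluster_id, contig_id, seq in contigs:
        current = best.get(cluster_id)
        if current is None or len(seq) > len(current[1]):
            best[cluster_id] = (contig_id, seq)
    return {cluster: contig_id for cluster, (contig_id, _) in best.items()}


# Toy input with made-up identifiers and sequences.
toy = [
    ("c1", "c1_i1", "ATGC"),
    ("c1", "c1_i2", "ATGCATGC"),
    ("c2", "c2_i1", "GGG"),
]
reps = pick_representatives(toy)
print(reps)

# Share of representative contigs without an mRNA match, from the abstract's counts.
print(round(100 * 4146 / 44990, 1))
```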

  10. Transcriptome Assembly, Gene Annotation and Tissue Gene Expression Atlas of the Rainbow Trout

    PubMed Central

    Salem, Mohamed; Paneru, Bam; Al-Tobasei, Rafet; Abdouni, Fatima; Thorgaard, Gary H.; Rexroad, Caird E.; Yao, Jianbo

    2015-01-01

    Efforts to obtain a comprehensive genome sequence for rainbow trout are ongoing and will be complemented by transcriptome information that will enhance genome assembly and annotation. Previously, transcriptome reference sequences were reported using data from different sources. Although the previous work added a great wealth of sequences, a complete and well-annotated transcriptome is still needed. In addition, gene expression in different tissues was not completely addressed in the previous studies. In this study, non-normalized cDNA libraries were sequenced from 13 different tissues of a single doubled haploid rainbow trout from the same source used for the rainbow trout genome sequence. A total of ~1.167 billion paired-end reads were de novo assembled using the Trinity RNA-Seq assembler yielding 474,524 contigs > 500 base-pairs. Of them, 287,593 had homologies to the NCBI non-redundant protein database. The longest contig of each cluster was selected as a reference, yielding 44,990 representative contigs. A total of 4,146 contigs (9.2%), including 710 full-length sequences, did not match any mRNA sequences in the current rainbow trout genome reference. Mapping reads to the reference genome identified an additional 11,843 transcripts not annotated in the genome. A digital gene expression atlas revealed 7,678 housekeeping and 4,021 tissue-specific genes. Expression of about 16,000–32,000 genes (35–71% of the identified genes) accounted for basic and specialized functions of each tissue. White muscle and stomach had the least complex transcriptomes, with high percentages of their total mRNA contributed by a small number of genes. Brain, testis and intestine, in contrast, had complex transcriptomes, with a large number of genes involved in their expression patterns. This study provides comprehensive de novo transcriptome information that is suitable for functional and comparative genomics studies in rainbow trout, including annotation of the genome. PMID:25793877

  11. JAFA: a protein function annotation meta-server.

    PubMed

    Friedberg, Iddo; Harder, Tim; Godzik, Adam

    2006-07-01

    With the high number of sequences and structures streaming in from genomic projects, there is a need for more powerful and sophisticated annotation tools. Most problematic of the annotation efforts is predicting gene and protein function. Over the past few years there has been considerable progress in automated protein function prediction, using a diverse set of methods. Nevertheless, no single method reports all the information possible, and molecular biologists resort to 'shopping around' using different methods: a cumbersome and time-consuming practice. Here we present the Joined Assembly of Function Annotations, or JAFA server. JAFA queries several function prediction servers with a protein sequence and assembles the returned predictions in a legible, non-redundant format. In this manner, JAFA combines the predictions of several servers to provide a comprehensive view of what are the predicted functions of the proteins. JAFA also offers its own output, and the individual programs' predictions for further processing. JAFA is available for use from http://jafa.burnham.org. PMID:16845030

  12. Functional annotation of a full-length mouse cDNA collection

    Microsoft Academic Search

    J. Kawai; A. Shinagawa; K. Shibata; M. Yoshino; M. Itoh; Y. Ishii; T. Arakawa; A. Hara; Y. Fukunishi; H. Konno; J. Adachi; S. Fukuda; K. Aizawa; M. Izawa; K. Nishi; H. Kiyosawa; S. Kondo; I. Yamanaka; T. Saito; Y. Okazaki; T. Gojobori; H. Bono; T. Kasukawa; R. Saito; K. Kadota; H. Matsuda; M. Ashburner; S. Batalov; T. Casavant; W. Fleischmann; T. Gaasterland; C. Gissi; B. King; H. Kochiwa; P. Kuehl; S. Lewis; Y. Matsuo; I. Nikaido; G. Pesole; J. Quackenbush; L. M. Schriml; F. Staubli; R. Suzuki; M. Tomita; L. Wagner; T. Washio; K. Sakai; T. Okido; M. Furuno; H. Aono; R. Baldarelli; G. Barsh; J. Blake; D. Boffelli; N. Bojunga; P. Carninci; M. F. de Bonaldo; M. J. Brownstein; C. Bult; C. Fletcher; M. Fujita; M. Gariboldi; S. Gustincich; D. Hill; M. Hofmann; D. A. Hume; M. Kamiya; N. H. Lee; P. Lyons; L. Marchionni; J. Mashima; J. Mazzarelli; P. Mombaerts; P. Nordone; B. Ring; M. Ringwald; I. Rodriguez; N. Sakamoto; H. Sasaki; K. Sato; C. Schönbach; T. Seya; Y. Shibata; K.-F. Storch; H. Suzuki; K. Toyo-oka; K. H. Wang; C. Weitz; C. Whittaker; L. Wilming; A. Wynshaw-Boris; K. Yoshida; Y. Hasegawa; H. Kawaji; S. Kohtsuki; Y. Hayashizaki

    2001-01-01

    The RIKEN Mouse Gene Encyclopaedia Project, a systematic approach to determining the full coding potential of the mouse genome, involves collection and sequencing of full-length complementary DNAs and physical mapping of the corresponding genes to the mouse genome. We organized an international functional annotation meeting (FANTOM) to annotate the first 21,076 cDNAs to be analysed in this project. Here we

  13. A robust data-driven approach for gene ontology annotation

    PubMed Central

    Li, Yanpeng; Yu, Hong

    2014-01-01

    Gene ontology (GO) and GO annotation are important resources for biological information management and knowledge discovery, but the speed of manual annotation has become a major bottleneck of database curation. The BioCreative IV GO annotation task aims to evaluate the performance of systems that automatically assign GO terms to genes based on the narrative sentences in biomedical literature. This article presents our work in this task as well as the experimental results after the competition. For the evidence sentence extraction subtask, we built a binary classifier to identify evidence sentences using reference distance estimator (RDE), a recently proposed semi-supervised learning method that learns new features from around 10 million unlabeled sentences, achieving an F1 of 19.3% in exact match and 32.5% in relaxed match. In the post-submission experiment, we obtained 22.1% and 35.7% F1 performance by incorporating bigram features in RDE learning. In both development and test sets, the RDE-based method achieved over 20% relative improvement on F1 and AUC performance against classical supervised learning methods, e.g. support vector machine and logistic regression. For the GO term prediction subtask, we developed an information retrieval-based method to retrieve the GO term most relevant to each evidence sentence using a ranking function that combined cosine similarity and the frequency of GO terms in documents, and a filtering method based on high-level GO classes. The best performance of our submitted runs was 7.8% F1 and 22.2% hierarchy F1. We found that the incorporation of frequency information and hierarchy filtering substantially improved the performance. In the post-submission evaluation, we obtained a 10.6% F1 using a simpler setting. Overall, the experimental analysis showed our approaches were robust in both tasks. PMID:25425037
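
    The abstract above describes a ranking function that combines cosine similarity with the frequency of GO terms in documents, but does not say how the two are combined. The sketch below is one possible reading, not the authors' method: it assumes bag-of-words vectors, a frequency normalised by its maximum, and a linear mix controlled by a hypothetical `weight` parameter. The GO term names and counts are illustrative.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_terms(sentence, go_terms, doc_freq, weight=0.5):
    """Rank GO term IDs for one evidence sentence.

    score = weight * cosine(sentence, term name)
            + (1 - weight) * normalised document frequency  (assumed form)
    """
    sent_vec = Counter(sentence.lower().split())
    max_freq = max(doc_freq.values(), default=0) or 1
    scored = []
    for term_id, name in go_terms.items():
        sim = cosine(sent_vec, Counter(name.lower().split()))
        freq = doc_freq.get(term_id, 0) / max_freq
        scored.append((weight * sim + (1 - weight) * freq, term_id))
    # Highest score first; ties broken by term ID for a stable order.
    return [term_id for _, term_id in sorted(scored, key=lambda x: (-x[0], x[1]))]


go_terms = {
    "GO:0006914": "autophagy",
    "GO:0016301": "kinase activity",
    "GO:0003677": "DNA binding",
}
doc_freq = {"GO:0016301": 3, "GO:0003677": 1}  # made-up counts
ranking = rank_terms("the protein shows kinase activity in vitro", go_terms, doc_freq)
print(ranking)
```

    A filter on high-level GO classes, as the abstract mentions, would be applied to this ranked list afterwards; it is left out here because the abstract gives no details of it.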

  14. Re-annotation of genome microbial CoDing-Sequences: finding new genes and inaccurately annotated genes

    Microsoft Academic Search

    Stéphanie Bocs; Antoine Danchin; Claudine Médigue

    2002-01-01

    Background: Analysis of any newly sequenced bacterial genome starts with the identification of protein-coding genes. Despite the accumulation of multiple complete genome sequences, which provide useful comparisons with close relatives among other organisms during the annotation process, accurate gene prediction remains quite difficult. A major reason for this situation is that genes are tightly packed in prokaryotes, resulting in frequent

  15. Automatic extraction of gene ontology annotation and its correlation with clusters in protein networks

    PubMed Central

    Daraselia, Nikolai; Yuryev, Anton; Egorov, Sergei; Mazo, Ilya; Ispolatov, Iaroslav

    2007-01-01

    Background: Uncovering cellular roles of a protein is a task of tremendous importance and complexity that requires dedicated experimental work as well as often sophisticated data mining and processing tools. A protein's functions, often referred to as its annotations, are believed to manifest themselves through the topology of the networks of inter-protein interactions. In particular, there is a growing body of evidence that proteins performing the same function are more likely to interact with each other than with proteins with other functions. However, since functional annotation and protein network topology are often studied separately, the direct relationship between them has not been comprehensively demonstrated. In addition to having general biological significance, such a demonstration would further validate the data extraction and processing methods used to compose protein annotation and protein-protein interaction datasets. Results: We developed a method for automatic extraction of protein functional annotation from scientific text based on Natural Language Processing (NLP) technology. For the protein annotation extracted from the entire PubMed, we evaluated the precision and recall rates, and compared the performance of the automatic extraction technology to that of manual curation used in public Gene Ontology (GO) annotation. In the second part of our presentation, we reported a large-scale investigation into the correspondence between communities in the literature-based protein networks and GO annotation groups of functionally related proteins. We found a comprehensive two-way match: proteins within biological annotation groups form significantly denser linked network clusters than expected by chance and, conversely, densely linked network communities exhibit a pronounced non-random overlap with GO groups. We also expanded the publicly available GO biological process annotation using the relations extracted by our NLP technology. An increase in the number and size of GO groups without any noticeable decrease of the link density within the groups indicated that this expansion significantly broadens the public GO annotation without diluting its quality. We revealed that functional GO annotation correlates mostly with clustering in a physical interaction protein network, while its overlap with indirect regulatory network communities is two to three times smaller. Conclusion: Protein functional annotations extracted by the NLP technology expand and enrich the existing GO annotation system. The GO functional modularity correlates mostly with the clustering in the physical interaction network, suggesting the essential role of the structural organization maintained by these interactions. Reciprocally, clustering of proteins in physical interaction networks can serve as evidence for their functional similarity. PMID:17620146

  16. Gene Ontology annotations at SGD: new data sources and annotation methods

    Microsoft Academic Search

    Eurie L. Hong; Rama Balakrishnan; Qing Dong; Karen R. Christie; Julie Park; Gail Binkley; Maria C. Costanzo; Selina S. Dwight; Stacia R. Engel; Dianna G. Fisk; Jodi E. Hirschman; Benjamin C. Hitz; Cynthia J. Krieger; Michael S. Livstone; Stuart R. Miyasato; Robert S. Nash; Rose Oughtred; Marek S. Skrzypek; Shuai Weng; Edith D. Wong; Kathy K. Zhu; Kara Dolinski; David Botstein; J. Michael Cherry

    2008-01-01

    The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) collects and organizes biological information about the chromosomal features and gene products of the budding yeast Saccharomyces cerevisiae. Although published data from traditional experimental methods are the primary sources of evidence supporting Gene Ontology (GO) annotations for a gene product, high-throughput experiments and computational predictions can also provide valuable insights in the absence

  17. High-throughput functional annotation and data mining with the Blast2GO suite

    PubMed Central

    Götz, Stefan; García-Gómez, Juan Miguel; Terol, Javier; Williams, Tim D.; Nagaraj, Shivashankar H.; Nueda, María José; Robles, Montserrat; Talón, Manuel; Dopazo, Joaquín; Conesa, Ana

    2008-01-01

    Functional genomics technologies have been widely adopted in the biological research of both model and non-model species. An efficient functional annotation of DNA or protein sequences is a major requirement for the successful application of these approaches as functional information on gene products is often the key to the interpretation of experimental results. Therefore, there is an increasing need for bioinformatics resources which are able to cope with large amounts of sequence data, produce valuable annotation results and are easily accessible to laboratories where functional genomics projects are being undertaken. We present the Blast2GO suite as an integrated and biologist-oriented solution for the high-throughput and automatic functional annotation of DNA or protein sequences based on the Gene Ontology vocabulary. The most outstanding Blast2GO features are: (i) the combination of various annotation strategies and tools controlling type and intensity of annotation, (ii) the numerous graphical features such as the interactive GO-graph visualization for gene-set function profiling or descriptive charts, (iii) the general sequence management features and (iv) high-throughput capabilities. We used the Blast2GO framework to carry out a detailed analysis of annotation behaviour through homology transfer and its impact in functional genomics research. Our aim is to offer biologists useful information to take into account when addressing the task of functionally characterizing their sequence data. PMID:18445632

  18. GeneDB—an annotation database for pathogens

    PubMed Central

    Logan-Klumpler, Flora J.; De Silva, Nishadi; Boehme, Ulrike; Rogers, Matthew B.; Velarde, Giles; McQuillan, Jacqueline A.; Carver, Tim; Aslett, Martin; Olsen, Christian; Subramanian, Sandhya; Phan, Isabelle; Farris, Carol; Mitra, Siddhartha; Ramasamy, Gowthaman; Wang, Haiming; Tivey, Adrian; Jackson, Andrew; Houston, Robin; Parkhill, Julian; Holden, Matthew; Harb, Omar S.; Brunk, Brian P.; Myler, Peter J.; Roos, David; Carrington, Mark; Smith, Deborah F.; Hertz-Fowler, Christiane; Berriman, Matthew

    2012-01-01

    GeneDB (http://www.genedb.org) is a genome database for prokaryotic and eukaryotic pathogens and closely related organisms. The resource provides a portal to genome sequence and annotation data, which is primarily generated by the Pathogen Genomics group at the Wellcome Trust Sanger Institute. It combines data from completed and ongoing genome projects with curated annotation, which is readily accessible from a web based resource. The development of the database in recent years has focused on providing database-driven annotation tools and pipelines, as well as catering for increasingly frequent assembly updates. The website has been significantly redesigned to take advantage of current web technologies, and improve usability. The current release stores 41 data sets, of which 17 are manually curated and maintained by biologists, who review and incorporate data from the scientific literature, as well as other sources. GeneDB is primarily a production and annotation database for the genomes of predominantly pathogenic organisms. PMID:22116062

  19. GLAD: an Online Database of Gene List Annotation for Drosophila

    PubMed Central

    Hu, Yanhui; Comjean, Aram; Perkins, Lizabeth A.; Perrimon, Norbert; Mohr, Stephanie E.

    2015-01-01

    We present a resource of high quality lists of functionally related Drosophila genes, e.g. based on protein domains (kinases, transcription factors, etc.) or cellular function (e.g. autophagy, signal transduction). To establish these lists, we relied on different inputs, including curation from databases or the literature and mapping from other species. Moreover, as an added curation and quality control step, we asked experts in relevant fields to review many of the lists. The resource is available online for scientists to search and view, and is editable based on community input. Annotation of gene groups is an ongoing effort and scientific need will typically drive decisions regarding which gene lists to pursue. We anticipate that the number of lists will increase over time; that the composition of some lists will grow and/or change over time as new information becomes available; and that the lists will benefit the scientific community, e.g. at experimental design and data analysis stages. Based on this, we present an easily updatable online database, available at www.flyrnai.org/glad, at which gene group lists can be viewed, searched and downloaded.

  20. CATH: comprehensive structural and functional annotations for genome sequences

    PubMed Central

    Sillitoe, Ian; Lewis, Tony E.; Cuff, Alison; Das, Sayoni; Ashford, Paul; Dawson, Natalie L.; Furnham, Nicholas; Laskowski, Roman A.; Lee, David; Lees, Jonathan G.; Lehtinen, Sonja; Studer, Romain A.; Thornton, Janet; Orengo, Christine A.

    2015-01-01

    The latest version of the CATH-Gene3D protein structure classification database (4.0, http://www.cathdb.info) provides annotations for over 235 000 protein domain structures and includes 25 million domain predictions. This article provides an update on the major developments in the 2 years since the last publication in this journal including: significant improvements to the predictive power of our functional families (FunFams); the release of our ‘current’ putative domain assignments (CATH-B); a new, strictly non-redundant data set of CATH domains suitable for homology benchmarking experiments (CATH-40) and a number of improvements to the web pages. PMID:25348408

  1. GOToolBox: functional analysis of gene datasets based on Gene Ontology

    Microsoft Academic Search

    David Martin; Christine Brun; Elisabeth Remy; Pierre Mouren; Denis Thieffry; Bernard Jacq

    2004-01-01

    We have developed methods and tools based on the Gene Ontology (GO) resource allowing the identification of statistically over- or under-represented terms in a gene dataset; the clustering of functionally related genes within a set; and the retrieval of genes sharing annotations with a query gene. GO annotations can also be constrained to a slim hierarchy or a given level

  2. The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology

    Microsoft Academic Search

    Evelyn Camon; Michele Magrane; Daniel Barrell; Vivian Lee; Emily Dimmer; John Maslen; David Binns; Nicola Harte; Rodrigo Lopez; Rolf Apweiler

    2004-01-01

    The Gene Ontology Annotation (GOA) database (http://www.ebi.ac.uk/GOA) aims to provide high-quality electronic and manual annotations to the UniProt Knowledgebase (Swiss-Prot, TrEMBL and PIR-PSD) using the standardized vocabulary of the Gene Ontology (GO). As a supplementary archive of GO annotation, GOA promotes a high level of integration of the knowledge represented in UniProt with other databases. This is achieved

  3. Drosophila gene expression pattern annotation through multi-instance multi-label learning.

    PubMed

    Li, Ying-Xin; Ji, Shuiwang; Kumar, Sudhir; Ye, Jieping; Zhou, Zhi-Hua

    2012-01-01

    In studies of Drosophila embryogenesis, a large number of two-dimensional digital images of gene expression patterns have been produced to build an atlas of spatio-temporal gene expression dynamics across developmental time. Gene expressions captured in these images have been manually annotated with anatomical and developmental ontology terms using a controlled vocabulary (CV), which are useful in research aimed at understanding gene functions, interactions, and networks. With the rapid accumulation of images, the process of manual annotation has become increasingly cumbersome, and computational methods to automate this task are urgently needed. However, the automated annotation of embryo images is challenging. This is because the annotation terms spatially correspond to local expression patterns of images, yet they are assigned collectively to groups of images and it is unknown which term corresponds to which region of which image in the group. In this paper, we address this problem using a new machine learning framework, Multi-Instance Multi-Label (MIML) learning. We first show that the underlying nature of the annotation task is a typical MIML learning problem. Then, we propose two support vector machine algorithms under the MIML framework for the task. Experimental results on the FlyExpress database (a digital library of standardized Drosophila gene expression pattern images) reveal that the exploitation of the MIML framework leads to significant performance improvement over state-of-the-art approaches. PMID:21519115

  4. De Novo Assembly, Functional Annotation and Comparative Analysis of Withania somnifera Leaf and Root Transcriptomes to Identify Putative Genes Involved in the Withanolides Biosynthesis

    PubMed Central

    Gupta, Parul; Goel, Ridhi; Pathak, Sumya; Srivastava, Apeksha; Singh, Surya Pratap; Sangwan, Rajender Singh; Asif, Mehar Hasan; Trivedi, Prabodh Kumar

    2013-01-01

    Withania somnifera is one of the most valuable medicinal plants used in Ayurvedic and other indigenous medicine systems due to bioactive molecules known as withanolides. As genomic information regarding this plant is very limited, little information is available about biosynthesis of withanolides. To facilitate the basic understanding of the withanolide biosynthesis pathways, we performed transcriptome sequencing for Withania leaf (101L) and root (101R), which specifically synthesize withaferin A and withanolide A, respectively. Pyrosequencing yielded 834,068 and 721,755 reads, which were assembled into 89,548 and 114,814 unique sequences from 101L and 101R, respectively. A total of 47,885 (101L) and 54,123 (101R) could be annotated using TAIR10, NR, tomato and potato databases. Gene Ontology and KEGG analyses provided a detailed view of all the enzymes involved in withanolide backbone synthesis. Our analysis identified members of cytochrome P450, glycosyltransferase and methyltransferase gene families with unique presence or differential expression in leaf and root that might be involved in synthesis of tissue-specific withanolides. We also detected simple sequence repeats (SSRs) in transcriptome data for use in future genetic studies. The comprehensive sequence resource developed for Withania in this study will help to elucidate the biosynthetic pathway for tissue-specific synthesis of secondary plant products in non-model plant organisms and will be helpful in developing strategies for enhanced biosynthesis of withanolides through biotechnological approaches. PMID:23667511

  5. CATH FunFHMMer web server: protein functional annotations using functional family assignments.

    PubMed

    Das, Sayoni; Sillitoe, Ian; Lee, David; Lees, Jonathan G; Dawson, Natalie L; Ward, John; Orengo, Christine A

    2015-07-01

    The widening function annotation gap in protein databases and the increasing number and diversity of the proteins being sequenced presents new challenges to protein function prediction methods. Multidomain proteins complicate the protein sequence-structure-function relationship further as new combinations of domains can expand the functional repertoire, creating new proteins and functions. Here, we present the FunFHMMer web server, which provides Gene Ontology (GO) annotations for query protein sequences based on the functional classification of the domain-based CATH-Gene3D resource. Our server also provides valuable information for the prediction of functional sites. The predictive power of FunFHMMer has been validated on a set of 95 proteins where FunFHMMer performs better than BLAST, Pfam and CDD. Recent validation by an independent international competition ranks FunFHMMer as one of the top function prediction methods in predicting GO annotations for both the Biological Process and Molecular Function Ontology. The FunFHMMer web server is available at http://www.cathdb.info/search/by_funfhmmer. PMID:25964299

  6. CATH FunFHMMer web server: protein functional annotations using functional family assignments

    PubMed Central

    Das, Sayoni; Sillitoe, Ian; Lee, David; Lees, Jonathan G.; Dawson, Natalie L.; Ward, John; Orengo, Christine A.

    2015-01-01

    The widening function annotation gap in protein databases and the increasing number and diversity of the proteins being sequenced presents new challenges to protein function prediction methods. Multidomain proteins complicate the protein sequence–structure–function relationship further as new combinations of domains can expand the functional repertoire, creating new proteins and functions. Here, we present the FunFHMMer web server, which provides Gene Ontology (GO) annotations for query protein sequences based on the functional classification of the domain-based CATH-Gene3D resource. Our server also provides valuable information for the prediction of functional sites. The predictive power of FunFHMMer has been validated on a set of 95 proteins where FunFHMMer performs better than BLAST, Pfam and CDD. Recent validation by an independent international competition ranks FunFHMMer as one of the top function prediction methods in predicting GO annotations for both the Biological Process and Molecular Function Ontology. The FunFHMMer web server is available at http://www.cathdb.info/search/by_funfhmmer. PMID:25964299

  7. Lynx web services for annotations and systems analysis of multi-gene disorders

    PubMed Central

    Sulakhe, Dinanath; Taylor, Andrew; Balasubramanian, Sandhya; Feng, Bo; Xie, Bingqing; Börnigen, Daniela; Dave, Utpal J.; Foster, Ian T.; Gilliam, T. Conrad; Maltsev, Natalia

    2014-01-01

    Lynx is a web-based integrated systems biology platform that supports annotation and analysis of experimental data and generation of weighted hypotheses on molecular mechanisms contributing to human phenotypes and disorders of interest. Lynx has integrated multiple classes of biomedical data (genomic, proteomic, pathways, phenotypic, toxicogenomic, contextual and others) from various public databases as well as manually curated data from our group and collaborators (LynxKB). Lynx provides tools for gene list enrichment analysis using multiple functional annotations and network-based gene prioritization. Lynx provides access to the integrated database and the analytical tools via REST based Web Services (http://lynx.ci.uchicago.edu/webservices.html). This comprises data retrieval services for specific functional annotations, services to search across the complete LynxKB (powered by Lucene), and services to access the analytical tools built within the Lynx platform. PMID:24948611

  8. Insect Innate Immunity Database (IIID): An Annotation Tool for Identifying Immune Genes in Insect Genomes

    PubMed Central

    Brucker, Robert M.; Funkhouser, Lisa J.; Setia, Shefali; Pauly, Rini; Bordenstein, Seth R.

    2012-01-01

    The innate immune system is an ancient component of host defense. Since innate immunity pathways are well conserved throughout many eukaryotes, immune genes in model animals can be used to putatively identify homologous genes in newly sequenced genomes of non-model organisms. With the initiation of the “i5k” project, which aims to sequence 5,000 insect genomes by 2016, many novel insect genomes will soon become publicly available, yet few annotation resources are currently available for insects. Thus, we developed an online tool called the Insect Innate Immunity Database (IIID) to provide an open access resource for insect immunity and comparative biology research (http://www.vanderbilt.edu/IIID). The database provides users with simple exploratory tools to search the immune repertoires of five insect models (including Nasonia), spanning three orders, for specific immunity genes or genes within a particular immunity pathway. As a proof of principle, we used an initial database with only four insect models to annotate potential immune genes in the parasitoid wasp genus Nasonia. Results specify 306 putative immune genes in the genomes of N. vitripennis and its two sister species N. giraulti and N. longicornis. Of these genes, 146 were not found in previous annotations of Nasonia immunity genes. Combining these newly identified immune genes with those in previous annotations, Nasonia possess 489 putative immunity genes, the largest immune repertoire found in insects to date. While these computational predictions need to be complemented with functional studies, the IIID database can help initiate and augment annotations of the immune system in the plethora of insect genomes that will soon become available. PMID:22984621

  9. MeSH Key Terms for Validation and Annotation of Gene Expression Clusters

    E-print Network

    Rocha, Luis

    and Luis M Rocha 1 Keywords: gene expression analysis, validation, information retrieval, automated functional annotation Integration of different sources of information is a great challenge for the analysisRNA assaying with microarrays, for example, numerical analysis often attempts to identify clusters of co

  10. SUS-BAR: a database of pig proteins with statistically validated structural and functional annotation.

    PubMed

    Piovesan, Damiano; Profiti, Giuseppe; Martelli, Pier Luigi; Fariselli, Piero; Fontanesi, Luca; Casadio, Rita

    2013-01-01

    Given the relevance of the pig proteome in different studies, including human complex maladies, a statistical validation of the annotation is required for a better understanding of the role of specific genes and proteins in the complex networks underlying biological processes in the animal. Presently, approximately 80% of the pig proteome is still poorly annotated, and the existence of protein sequences is routinely inferred automatically by sequence alignment towards preexisting sequences. In this article, we introduce SUS-BAR, a database that derives information mainly from UniProt Knowledgebase and that includes 26 206 pig protein sequences. In SUS-BAR, 16 675 of the pig protein sequences are endowed with statistically validated functional and structural annotation. Our statistical validation is determined by adopting a cluster-centric annotation procedure that allows transfer of different types of annotation, including structure and function. Each sequence in the database can be associated with a set of statistically validated Gene Ontologies (GOs) of the three main sub-ontologies (Molecular Function, Biological Process and Cellular Component), with Pfam functional domains, and when possible, with a cluster Hidden Markov Model that allows modelling the 3D structure of the protein. A database search allows some statistics demonstrating the enrichment in both GO and Pfam annotations of the pig proteins as compared with UniProt Knowledgebase annotation. Searching in SUS-BAR allows retrieval of the pig protein annotation for further analysis. The search is also possible on the basis of specific GO terms and this allows retrieval of all the pig sequences participating into a given biological process, after annotation with our system. Alternatively, the search is possible on the basis of structural information, allowing retrieval of all the pig sequences with the same structural characteristics. PMID:24065691

  11. Dense Subgraphs with Restrictions and Applications to Gene Annotation Graphs

    NASA Astrophysics Data System (ADS)

    Saha, Barna; Hoch, Allison; Khuller, Samir; Raschid, Louiqa; Zhang, Xiao-Ning

    In this paper, we focus on finding complex annotation patterns representing novel and interesting hypotheses from gene annotation data. We define a generalization of the densest subgraph problem by adding an additional distance restriction (defined by a separate metric) to the nodes of the subgraph. We show that while this generalization makes the problem NP-hard for arbitrary metrics, when the metric comes from the distance metric of a tree, or an interval graph, the problem can be solved optimally in polynomial time. We also show that the densest subgraph problem with a specified subset of vertices that have to be included in the solution can be solved optimally in polynomial time. In addition, we consider other extensions when not just one solution needs to be found, but we wish to list all subgraphs of almost maximum density as well. We apply this method to a dataset of genes and their annotations obtained from The Arabidopsis Information Resource (TAIR). A user evaluation confirms that the patterns found in the distance restricted densest subgraph for a dataset of photomorphogenesis genes are indeed validated in the literature; a control dataset validates that these are not random patterns. Interestingly, the complex annotation patterns potentially lead to new and as yet unknown hypotheses. We perform experiments to determine the properties of the dense subgraphs, as we vary parameters, including the number of genes and the distance.
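    For orientation, the unrestricted densest subgraph problem (maximise edges divided by nodes) has a well-known greedy peeling heuristic that guarantees at least half the optimal density. The sketch below shows that baseline only. It does not implement the distance-restricted variant or the exact polynomial-time algorithms described in the abstract, and the node names are hypothetical.

```python
def greedy_densest_subgraph(edges):
    """Greedy peeling: repeatedly drop a minimum-degree node and keep the
    densest intermediate subgraph (density = number of edges / number of nodes)."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    best_nodes = set(adj)
    best_density = n_edges / len(adj) if adj else 0.0

    while len(adj) > 1:
        # Pick a node of minimum degree; ties are broken by name for determinism.
        node = min(adj, key=lambda x: (len(adj[x]), str(x)))
        for nbr in adj.pop(node):
            adj[nbr].discard(node)
            n_edges -= 1
        density = n_edges / len(adj)
        if density > best_density:
            best_density, best_nodes = density, set(adj)
    return best_nodes, best_density


# Hypothetical graph: a 4-clique (a, b, c, d) with one pendant node e.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
         ("b", "d"), ("c", "d"), ("d", "e")]
nodes, density = greedy_densest_subgraph(edges)
print(sorted(nodes), density)
```

    On this toy graph the pendant node is peeled first, leaving the 4-clique with density 6/4 = 1.5.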

  12. Algal Functional Annotation Tool from the DOE-UCLA Institute for Genomics and Proteomics

    DOE Data Explorer

    Lopez, David

    The Algal Functional Annotation Tool is a bioinformatics resource to visualize pathway maps, identify enriched biological terms, or convert gene identifiers to elucidate biological function in silico. These types of analysis have been catered to support lists of gene identifiers, such as those coming from transcriptome gene expression analysis. By analyzing the functional annotation of an interesting set of genes, common biological motifs may be elucidated and a first-pass analysis can point further research in the right direction. Currently, the following databases have been parsed, processed, and added to the tool: 1) Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathways Database, 2) MetaCyc Encyclopedia of Metabolic Pathways, 3) Panther Pathways Database, 4) Reactome Pathways Database, 5) Gene Ontology, 6) MapMan Ontology, 7) KOG (Eukaryotic Clusters of Orthologous Groups), 8) Pfam, 9) InterPro.

  13. Comparative Analysis of Functional Metagenomic Annotation and the Mappability of Short Reads

    PubMed Central

    Carr, Rogan; Borenstein, Elhanan

    2014-01-01

    To assess the functional capacities of microbial communities, including those inhabiting the human body, shotgun metagenomic reads are often aligned to a database of known genes. Such homology-based annotation practices critically rely on the assumption that short reads can map to orthologous genes of similar function. This assumption, however, and the various factors that impact short read annotation, have not been systematically evaluated. To address this challenge, we generated an extremely large database of simulated reads (totaling 15.9 Gb), spanning over 500,000 microbial genes and 170 curated genomes and including, for many genomes, every possible read of a given length. We annotated each read using common metagenomic protocols, fully characterizing the effect of read length, sequencing error, phylogeny, database coverage, and mapping parameters. We additionally rigorously quantified gene-, genome-, and protocol-specific annotation biases. Overall, our findings provide a first comprehensive evaluation of the capabilities and limitations of functional metagenomic annotation, providing crucial goal-specific best-practice guidelines to inform future metagenomic research. PMID:25148512
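    The phrase "every possible read of a given length" can be made concrete with a sliding window over a sequence. This is a minimal sketch under simplifying assumptions: forward strand only, no sequencing errors, and a made-up toy sequence. The function name is hypothetical and this is not the authors' simulation pipeline.

```python
def all_reads(sequence, read_length):
    """Yield (start, read) for every window of `read_length` bases.

    Assumes a linear sequence and the forward strand only; a sequence of
    length n yields n - read_length + 1 reads (none if it is too short).
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    for start in range(len(sequence) - read_length + 1):
        yield start, sequence[start:start + read_length]


# Toy example with a hypothetical 6-base sequence.
reads = list(all_reads("ATGCGT", 4))
print(reads)
```

    Because the read count grows with both genome length and the number of genomes, exhaustive enumeration of this kind quickly produces very large read sets.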

  14. GoMapMan: integration, consolidation and visualization of plant gene annotations within the MapMan ontology.

    PubMed

    Ramsak, Živa; Baebler, Špela; Rotter, Ana; Korbar, Matej; Mozetic, Igor; Usadel, Björn; Gruden, Kristina

    2014-01-01

    GoMapMan (http://www.gomapman.org) is an open web-accessible resource for gene functional annotations in the plant sciences. It was developed to facilitate improvement, consolidation and visualization of gene annotations across several plant species. GoMapMan is based on the MapMan ontology, organized in the form of a hierarchical tree of biological concepts, which describe gene functions. Currently, genes of the model species Arabidopsis and three crop species (potato, tomato and rice) are included. The main features of GoMapMan are (i) dynamic and interactive gene product annotation through various curation options; (ii) consolidation of gene annotations for different plant species through the integration of orthologue group information; (iii) traceability of gene ontology changes and annotations; (iv) integration of external knowledge about genes from different public resources; and (v) providing gathered information to high-throughput analysis tools via dynamically generated export files. All of the GoMapMan functionalities are openly available, with the restriction on the curation functions, which require prior registration to ensure traceability of the implemented changes. PMID:24194592

  15. Maize microarray annotation database

    PubMed Central

    2011-01-01

    Background Microarray technology has matured over the past fifteen years into a cost-effective solution with established data analysis protocols for global gene expression profiling. The Agilent-016047 maize 44 K microarray was custom-designed from EST sequences, but only reporter sequences with EST accession numbers are publicly available. The following information is lacking: (a) reporter - gene model match, (b) number of reporters per gene model, (c) potential for cross hybridization, (d) sense/antisense orientation of reporters, (e) position of reporter on B73 genome sequence (for eQTL studies), and (f) functional annotations of genes represented by reporters. To address this, we developed a strategy to annotate the Agilent-016047 maize microarray, and built a publicly accessible annotation database. Description Genomic annotation of the 42,034 reporters on the Agilent-016047 maize microarray was based on BLASTN results of the 60-mer reporter sequences and their corresponding ESTs against the maize B73 RefGen v2 "Working Gene Set" (WGS) predicted transcripts and the genome sequence. The agreement between the EST, WGS transcript and gDNA BLASTN results were used to assign the reporters into six genomic annotation groups. These annotation groups were: (i) "annotation by sense gene model" (23,668 reporters), (ii) "annotation by antisense gene model" (4,330); (iii) "annotation by gDNA" without a WGS transcript hit (1,549); (iv) "annotation by EST", in which case the EST from which the reporter was designed, but not the reporter itself, has a WGS transcript hit (3,390); (v) "ambiguous annotation" (2,608); and (vi) "inconclusive annotation" (6,489). Functional annotations of reporters were obtained by BLASTX and Blast2GO analysis of corresponding WGS transcripts against GenBank. 
    The annotations are available in the Maize Microarray Annotation Database http://MaizeArrayAnnot.bi.up.ac.za/, as well as through a GBrowse annotation file that can be uploaded to the MaizeGDB genome browser as a custom track. The database was used to re-annotate lists of differentially expressed genes reported in case studies of published work using the Agilent-016047 maize microarray. Up to 85% of reporters in each list could be annotated with confidence by a single gene model; however, up to 10% of reporters had ambiguous annotations. Overall, more than 57% of reporters gave a measurable signal in tissues as diverse as anthers and leaves. Conclusions The Maize Microarray Annotation Database will assist users of the Agilent-016047 maize microarray in (i) refining gene lists for global expression analysis, and (ii) confirming the annotation of candidate genes before functional studies. PMID:21961731
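    As a quick consistency check, the six annotation group sizes quoted in this record can be summed and converted to shares of the array. The counts are taken from the abstract; the short group labels and variable names are only illustrative.

```python
# Reporter counts per genomic annotation group, as quoted in the abstract.
groups = {
    "sense gene model": 23668,
    "antisense gene model": 4330,
    "gDNA without WGS transcript hit": 1549,
    "EST": 3390,
    "ambiguous": 2608,
    "inconclusive": 6489,
}

total = sum(groups.values())
# Share of each group as a percentage of all reporters, to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in groups.items()}

print(total)
for name, share in shares.items():
    print(f"{name}: {share}%")
```

    The total comes to 42,034, which matches the number of reporters stated for the array.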

  16. A Semi-Quantitative, Synteny-Based Method to Improve Functional Predictions for Hypothetical and Poorly Annotated Bacterial and Archaeal Genes

    Microsoft Academic Search

    Alexis P. Yelton; Brian C. Thomas; Sheri L. Simmons; Paul Wilmes; Adam Zemla; Michael P. Thelen; Nicholas Justice; Jillian F. Banfield

    2011-01-01

    During microbial evolution, genome rearrangement increases with increasing sequence divergence. If the relationship between synteny and sequence divergence can be modeled, gene clusters in genomes of distantly related organisms exhibiting anomalous synteny can be identified and used to infer functional conservation. We applied the phylogenetic pairwise comparison method to establish and model a strong correlation between synteny and sequence divergence

  17. Image-level and group-level models for Drosophila gene expression pattern annotation

    PubMed Central

    2013-01-01

    Background Drosophila melanogaster has been established as a model organism for investigating the developmental gene interactions. The spatio-temporal gene expression patterns of Drosophila melanogaster can be visualized by in situ hybridization and documented as digital images. Automated and efficient tools for analyzing these expression images will provide biological insights into the gene functions, interactions, and networks. To facilitate pattern recognition and comparison, many web-based resources have been created to conduct comparative analysis based on the body part keywords and the associated images. With the fast accumulation of images from high-throughput techniques, manual inspection of images will impose a serious impediment on the pace of biological discovery. It is thus imperative to design an automated system for efficient image annotation and comparison. Results We present a computational framework to perform anatomical keywords annotation for Drosophila gene expression images. The spatial sparse coding approach is used to represent local patches of images in comparison with the well-known bag-of-words (BoW) method. Three pooling functions including max pooling, average pooling and Sqrt (square root of mean squared statistics) pooling are employed to transform the sparse codes to image features. Based on the constructed features, we develop both an image-level scheme and a group-level scheme to tackle the key challenges in annotating Drosophila gene expression pattern images automatically. To deal with the imbalanced data distribution inherent in image annotation tasks, the undersampling method is applied together with majority vote. Results on Drosophila embryonic expression pattern images verify the efficacy of our approach. Conclusion In our experiment, the three pooling functions perform comparably well in feature dimension reduction. The undersampling with majority vote is shown to be effective in tackling the problem of imbalanced data. 
    Moreover, combining sparse coding and the image-level scheme leads to consistent performance improvement in keyword annotation. PMID:24299119
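    The three pooling functions named in this record each collapse the sparse codes of all patches in an image into one feature vector. A minimal sketch follows, assuming the codes are given as a list of per-patch vectors with one entry per dictionary atom. The function names and the toy values are hypothetical, and the paper's exact conventions (for example, whether absolute values are taken before max pooling) are not stated in the abstract.

```python
import math


def max_pool(codes):
    """Per atom, keep the largest coefficient over all patches."""
    return [max(col) for col in zip(*codes)]


def average_pool(codes):
    """Per atom, take the mean coefficient over all patches."""
    return [sum(col) / len(col) for col in zip(*codes)]


def sqrt_pool(codes):
    """Per atom, take the square root of the mean squared coefficient."""
    return [math.sqrt(sum(v * v for v in col) / len(col)) for col in zip(*codes)]


# Toy sparse codes: 4 patches (rows) by 2 dictionary atoms (columns).
codes = [
    [0.0, 3.0],
    [4.0, 0.0],
    [0.0, 0.0],
    [0.0, 4.0],
]
print(max_pool(codes), average_pool(codes), sqrt_pool(codes))
```

    Whatever the number of patches, each function returns a vector whose length equals the dictionary size, which is what makes the pooled output usable as a fixed-length image feature.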

  18. Towards integrative gene functional similarity measurement

    PubMed Central

    2014-01-01

    Background In Gene Ontology, the "Molecular Function" (MF) categorization is a widely used knowledge framework for gene function comparison and prediction. Its structure and annotation provide a convenient way to compare gene functional similarities at the molecular level. The existing gene similarity measures, however, solely rely on one or few aspects of MF without utilizing all the rich information available including structure, annotation, common terms, lowest common parents. Results We introduce a rank-based gene semantic similarity measure called InteGO by synergistically integrating the state-of-the-art gene-to-gene similarity measures. By integrating three GO based seed measures, InteGO significantly improves the performance by about two-fold in all the three species studied (yeast, Arabidopsis and human). Conclusions InteGO is a systematic and novel method to study gene functional associations. The software and description are available at http://www.msu.edu/~jinchen/InteGO. PMID:24564710

  19. Functional Annotation and Comparative Analysis of a Zygopteran Transcriptome

    PubMed Central

    Shanku, Alexander G.; McPeek, Mark A.; Kern, Andrew D.

    2013-01-01

    In this paper we present a de novo assembly of the transcriptome of the damselfly (Enallagma hageni) through the use of 454 pyrosequencing. E. hageni is a member of the suborder Zygoptera, in the order Odonata, and Odonata organisms form the basal lineage of the winged insects (Pterygota). To date, sequence data used in phylogenetic analysis of Enallagma species have been derived from either mitochondrial DNA or ribosomal nuclear DNA. This Enallagma transcriptome contained 31,661 contigs that were assembled and translated into 14,813 individual open reading frames. Using these data, we constructed an extensive dataset of 634 orthologous nuclear protein-encoding genes across 11 species of Arthropoda and used Bayesian techniques to elucidate the position of Enallagma in the arthropod phylogenetic tree. Additionally, we demonstrated that the Enallagma transcriptome contains 169 genes that are evolving at rates that differ relative to those of the rest of the transcriptome (29 accelerated and 140 decreased), and, through multiple Gene Ontology searches and clustering methods, we present the first functional annotation of any palaeopteran’s transcriptome in the literature. PMID:23550132

  20. Functional Annotation and Comparative Analysis of a Zygopteran Transcriptome.

    PubMed

    Shanku, Alexander G; McPeek, Mark A; Kern, Andrew D

    2013-03-11

    In this paper we present a de novo assembly of the transcriptome of the damselfly, Enallagma hageni, through the use of 454 pyrosequencing. E. hageni is a member of the suborder Zygoptera within the order Odonata, and the Odonata are the basal lineage of the winged insects (Pterygota). To date, sequence data used in phylogenetic analysis of Enallagma species have been derived from either mtDNA or ribosomal nuclear DNA. This transcriptome contained 31,661 contigs that were assembled and translated into 14,813 individual open reading frames. Using these data, we constructed an extensive dataset of 634 orthologous nuclear protein-coding genes across 11 species of Arthropoda, and used Bayesian techniques to elucidate Enallagma's place in the Arthropod phylogenetic tree. Additionally, we demonstrate that the Enallagma transcriptome contains 169 genes that are evolving at rates that differ relative to the rest of the transcriptome (29 accelerated and 140 decreased), and through multiple Gene Ontology searches and clustering methods, we present the first functional-annotation of any palaeopteran's transcriptome in the literature. PMID:23550132

  1. Analysis of CATMA transcriptome data identifies hundreds of novel functional genes and improves gene models in the Arabidopsis genome

    Microsoft Academic Search

    Sébastien Aubourg; Marie-Laure Martin-Magniette; Véronique Brunaud; Ludivine Taconnat; Frédérique Bitton; Sandrine Balzergue; Pauline E Jullien; Mathieu Ingouff; Vincent Thareau; Thomas Schiex; Alain Lecharny; Jean-Pierre Renou

    2007-01-01

    BACKGROUND: Since the finishing of the sequencing of the Arabidopsis thaliana genome, the Arabidopsis community and the annotator centers have been working on the improvement of gene annotation at the structural and functional levels. In this context, we have used the large CATMA resource on the Arabidopsis transcriptome to search for genes missed by different annotation processes. Probes on the

  2. Automated annotation of gene expression image sequences via non-parametric factor analysis and conditional random fields

    PubMed Central

    Pruteanu-Malinici, Iulian; Majoros, William H.; Ohler, Uwe

    2013-01-01

    Motivation: Computational approaches for the annotation of phenotypes from image data have shown promising results across many applications, and provide rich and valuable information for studying gene function and interactions. While data are often available both at high spatial resolution and across multiple time points, phenotypes are frequently annotated independently, for individual time points only. In particular, for the analysis of developmental gene expression patterns, it is biologically sensible when images across multiple time points are jointly accounted for, such that spatial and temporal dependencies are captured simultaneously. Methods: We describe a discriminative undirected graphical model to label gene-expression time-series image data, with an efficient training and decoding method based on the junction tree algorithm. The approach is based on an effective feature selection technique, consisting of a non-parametric sparse Bayesian factor analysis model. The result is a flexible framework, which can handle large-scale data with noisy incomplete samples, i.e. it can tolerate data missing from individual time points. Results: Using the annotation of gene expression patterns across stages of Drosophila embryonic development as an example, we demonstrate that our method achieves superior accuracy, gained by jointly annotating phenotype sequences, when compared with previous models that annotate each stage in isolation. The experimental results on missing data indicate that our joint learning method successfully annotates genes for which no expression data are available for one or more stages. Contact: uwe.ohler@duke.edu PMID:23812993

  3. Manual Gene Ontology annotation workflow at the Mouse Genome Informatics Database

    PubMed Central

    Drabkin, Harold J.; Blake, Judith A.

    2012-01-01

    The Mouse Genome Database, the Gene Expression Database and the Mouse Tumor Biology database are integrated components of the Mouse Genome Informatics (MGI) resource (http://www.informatics.jax.org). The MGI system presents both a consensus view and an experimental view of the knowledge concerning the genetics and genomics of the laboratory mouse. From genotype to phenotype, this information resource integrates information about genes, sequences, maps, expression analyses, alleles, strains and mutant phenotypes. Comparative mammalian data are also presented particularly in regards to the use of the mouse as a model for the investigation of molecular and genetic components of human diseases. These data are collected from literature curation as well as downloads of large datasets (SwissProt, LocusLink, etc.). MGI is one of the founding members of the Gene Ontology (GO) and uses the GO for functional annotation of genes. Here, we discuss the workflow associated with manual GO annotation at MGI, from literature collection to display of the annotations. Peer-reviewed literature is collected mostly from a set of journals available electronically. Selected articles are entered into a master bibliography and indexed to one of eight areas of interest such as ‘GO’ or ‘homology’ or ‘phenotype’. Each article is then either indexed to a gene already contained in the database or funneled through a separate nomenclature database to add genes. The master bibliography and associated indexing provide information for various curator-reports such as ‘papers selected for GO that refer to genes with NO GO annotation’. Once indexed, curators who have expertise in appropriate disciplines enter pertinent information. MGI makes use of several controlled vocabularies that ensure uniform data encoding, enable robust analysis and support the construction of complex queries. These vocabularies range from pick-lists to structured vocabularies such as the GO. 
    All data associations are supported with statements of evidence as well as access to source publications. PMID:23110975

  4. The H-Invitational Database (H-InvDB), a comprehensive annotation resource for human genes and transcripts*

    PubMed Central

    2008-01-01

    Here we report the new features and improvements in our latest release of the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/), a comprehensive annotation resource for human genes and transcripts. H-InvDB, originally developed as an integrated database of the human transcriptome based on extensive annotation of large sets of full-length cDNA (FLcDNA) clones, now provides annotation for 120 558 human mRNAs extracted from the International Nucleotide Sequence Databases (INSD), in addition to 54 978 human FLcDNAs, in the latest release H-InvDB_4.6. We mapped those human transcripts onto the human genome sequences (NCBI build 36.1) and determined 34 699 human gene clusters, which could define 34 057 (98.1%) protein-coding and 642 (1.9%) non-protein-coding loci; 858 (2.5%) transcribed loci overlapped with predicted pseudogenes. For all these transcripts and genes, we provide comprehensive annotation including gene structures, gene functions, alternative splicing variants, functional non-protein-coding RNAs, functional domains, predicted sub cellular localizations, metabolic pathways, predictions of protein 3D structure, mapping of SNPs and microsatellite repeat motifs, co-localization with orphan diseases, gene expression profiles, orthologous genes, protein–protein interactions (PPI) and annotation for gene families. The current H-InvDB annotation resources consist of two main views: Transcript view and Locus view and eight sub-databases: the DiseaseInfo Viewer, H-ANGEL, the Clustering Viewer, G-integra, the TOPO Viewer, Evola, the PPI view and the Gene family/group. PMID:18089548

  5. FSim: A Novel Functional Similarity Search Algorithm and Tool for Discovering Functionally Related Gene Products

    PubMed Central

    Hu, Qiang; Wang, ZhiGang; Zhang, ZhengGuo

    2014-01-01

    Background. During the analysis of genomics data, it is often required to quantify the functional similarity of genes and their products based on the annotation information from gene ontology (GO) with hierarchical structure. A flexible and user-friendly way to estimate the functional similarity of genes utilizing GO annotation is therefore highly desired. Results. We proposed a novel algorithm using a level coefficient-weighted model to measure the functional similarity of gene products based on multiple ontologies of hierarchical GO annotations. The performance of our algorithm was evaluated and found to be superior to the other tested methods. We implemented the proposed algorithm in a software package, FSim, based on R statistical and computing environment. It can be used to discover functionally related genes for a given gene, group of genes, or set of function terms. Conclusions. FSim is a flexible tool to analyze functional gene groups based on the GO annotation databases. PMID:25184141

  6. Annotator: postprocessing software for generating function-based signatures from quantitative mass spectrometry.

    PubMed

    Sylvester, Juliesta E; Bray, Tyler S; Kron, Stephen J

    2012-03-01

    Mass spectrometry is used to investigate global changes in protein abundance in cell lysates. Increasingly powerful methods of data collection have emerged over the past decade, but this has left researchers with the task of sifting through mountains of data for biologically significant results. Often, the end result is a list of proteins with no obvious quantitative relationships to define the larger context of changes in cell behavior. Researchers are often forced to perform a manual analysis from this list or to fall back on a range of disparate tools, which can hinder the communication of results and their reproducibility. To address these methodological problems, we developed Annotator, an application that filters validated mass spectrometry data and applies a battery of standardized heuristic and statistical tests to determine significance. To address systems-level interpretations, we incorporated UniProt and Gene Ontology keywords as statistical units of analysis, yielding quantitative information about changes in abundance for an entire functional category. This provides a consistent and quantitative method for formulating conclusions about cellular behavior, independent of network models or standard enrichment analyses. Annotator allows for "bottom-up" annotations that are based on experimental data and not inferred by comparison to external or hypothetical models. Annotator was developed as an independent postprocessing platform that runs on all common operating systems, thereby providing a useful tool for establishing the inherently dynamic nature of functional annotations, which depend on results from ongoing proteomic experiments. Annotator is available for download at http://people.cs.uchicago.edu/~tyler/annotator/annotator_desktop_0.1.tar.gz . PMID:22224429

  7. Canine candidate genes for dilated cardiomyopathy: annotation of and polymorphic markers for 14 genes

    PubMed Central

    Wiersma, Anje C; Leegwater, Peter AJ; van Oost, Bernard A; Ollier, William E; Dukes-McEwan, Joanna

    2007-01-01

    Background: Dilated cardiomyopathy is a myocardial disease occurring in humans and domestic animals and is characterized by dilatation of the left ventricle, reduced systolic function and increased sphericity of the left ventricle. Dilated cardiomyopathy has been observed in several, mostly large and giant, dog breeds, such as the Dobermann and the Great Dane. A number of genes have been identified, which are associated with dilated cardiomyopathy in the human, mouse and hamster. These genes mainly encode structural proteins of the cardiac myocyte. Results: We present the annotation of, and marker development for, 14 of these genes of the dog genome, i.e. α-cardiac actin, caveolin 1, cysteine-rich protein 3, desmin, lamin A/C, LIM-domain binding factor 3, myosin heavy polypeptide 7, phospholamban, sarcoglycan δ, titin cap, α-tropomyosin, troponin I, troponin T and vinculin. A total of 33 Single Nucleotide Polymorphisms were identified for these canine genes and 11 polymorphic microsatellite repeats were developed. Conclusion: The presented polymorphisms provide a tool to investigate the role of the corresponding genes in canine Dilated Cardiomyopathy by linkage analysis or association studies. PMID:17949487

  8. CARMO: a comprehensive annotation platform for functional exploration of rice multi-omics data.

    PubMed

    Wang, Jiawei; Qi, Meifang; Liu, Jian; Zhang, Yijing

    2015-07-01

    High-throughput technology is gradually becoming a powerful tool for routine research in rice. Interpretation of biological significance from the huge amount of data is a critical but non-trivial task, especially for rice, for which gene annotations rely heavily on sequence similarity rather than direct experimental evidence. Here we describe the annotation platform for comprehensive annotation of rice multi-omics data (CARMO), which provides multiple web-based analysis tools for in-depth data mining and visualization. The central idea involves systematic integration of 1819 samples from omics studies and diverse sources of functional evidence (15 401 terms), which are further organized into gene sets and higher-level gene modules. In this way, the high-throughput data may easily be compared across studies and platforms, and integration of multiple types of evidence allows biological interpretation from the level of gene functional modules with high confidence. In addition, the functions and pathways for thousands of genes lacking description or validation may be deduced based on concerted expression of genes within the constructed co-expression networks or gene modules. Overall, CARMO provides comprehensive annotations for transcriptomic datasets, epi-genomic modification sites, single nucleotide polymorphisms identified from genome re-sequencing, and the large gene lists derived from these omics studies. Well-organized results, as well as multiple tools for interactive visualization, are available through a user-friendly web interface. Finally, we illustrate how CARMO enables biological insights using four examples, demonstrating that CARMO is a highly useful resource for intensive data mining and hypothesis generation based on rice multi-omics data. CARMO is freely available online (http://bioinfo.sibs.ac.cn/carmo). PMID:26040787

  9. Use of Gene Ontology Annotation to understand the peroxisome proteome in humans

    PubMed Central

    Mutowo-Meullenet, Prudence; Huntley, Rachael P.; Dimmer, Emily C.; Alam-Faruque, Yasmin; Sawford, Tony; Jesus Martin, Maria; O’Donovan, Claire; Apweiler, Rolf

    2013-01-01

    The Gene Ontology (GO) is the de facto standard for the functional description of gene products, providing a consistent, information-rich terminology applicable across species and information repositories. The UniProt Consortium uses both manual and automatic GO annotation approaches to curate UniProt Knowledgebase (UniProtKB) entries. The selection of a protein set prioritized for manual annotation has implications for the characteristics of the information provided to users working in a specific field or interested in particular pathways or processes. In this article, we describe an organelle-focused, manual curation initiative targeting proteins from the human peroxisome. We discuss the steps taken to define the peroxisome proteome and the challenges encountered in defining the boundaries of this protein set. We illustrate with the use of examples how GO annotations now capture cell and tissue type information and the advantages that such an annotation approach provides to users. Database URL: http://www.ebi.ac.uk/GOA/ and http://www.uniprot.org PMID:23327938

  10. GOToolBox: functional analysis of gene datasets based on Gene Ontology

    PubMed Central

    Martin, David; Brun, Christine; Remy, Elisabeth; Mouren, Pierre; Thieffry, Denis; Jacq, Bernard

    2004-01-01

    We have developed methods and tools based on the Gene Ontology (GO) resource allowing the identification of statistically over- or under-represented terms in a gene dataset; the clustering of functionally related genes within a set; and the retrieval of genes sharing annotations with a query gene. GO annotations can also be constrained to a slim hierarchy or a given level of the ontology. The source code is available upon request and is distributed under the GPL license. PMID:15575967
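
    The over- or under-representation of a GO term in a gene dataset is commonly scored with a hypergeometric tail probability. The block below is a minimal Python sketch of that general idea, not GOToolBox's own code: the abstract does not say which statistical test the tool applies, and the function names and example counts are hypothetical.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(X = k): k term-annotated genes in a dataset of n genes drawn
    from a reference set of N genes, K of which carry the term."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def enrichment_pvalues(k, N, K, n):
    """Return (p_over, p_under), the one-sided tail probabilities
    P(X >= k) and P(X <= k) under the hypergeometric null."""
    lowest = max(0, n - (N - K))   # fewest annotated genes the dataset can hold
    highest = min(K, n)            # most annotated genes the dataset can hold
    p_over = sum(hypergeom_pmf(i, N, K, n) for i in range(k, highest + 1))
    p_under = sum(hypergeom_pmf(i, N, K, n) for i in range(lowest, k + 1))
    return p_over, p_under


# Hypothetical counts: 6000 reference genes, 40 annotated to the term,
# 100 genes in the dataset, 5 of which are annotated to the term.
p_over, p_under = enrichment_pvalues(k=5, N=6000, K=40, n=100)
print(f"over-representation p = {p_over:.3g}")
print(f"under-representation p = {p_under:.3g}")
```

    A small `p_over` would suggest that the term is over-represented in the dataset. When many terms are tested at once, the p-values would normally also be corrected for multiple testing, which this sketch leaves out.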

  11. High-throughput comparison, functional annotation, and metabolic modeling of plant genomes using the PlantSEED resource.

    PubMed

    Seaver, Samuel M D; Gerdes, Svetlana; Frelin, Océane; Lerma-Ortiz, Claudia; Bradbury, Louis M T; Zallot, Rémi; Hasnain, Ghulam; Niehaus, Thomas D; El Yacoubi, Basma; Pasternak, Shiran; Olson, Robert; Pusch, Gordon; Overbeek, Ross; Stevens, Rick; de Crécy-Lagard, Valérie; Ware, Doreen; Hanson, Andrew D; Henry, Christopher S

    2014-07-01

    The increasing number of sequenced plant genomes is placing new demands on the methods applied to analyze, annotate, and model these genomes. Today's annotation pipelines result in inconsistent gene assignments that complicate comparative analyses and prevent efficient construction of metabolic models. To overcome these problems, we have developed the PlantSEED, an integrated, metabolism-centric database to support subsystems-based annotation and metabolic model reconstruction for plant genomes. PlantSEED combines SEED subsystems technology, first developed for microbial genomes, with refined protein families and biochemical data to assign fully consistent functional annotations to orthologous genes, particularly those encoding primary metabolic pathways. Seamless integration with its parent, the prokaryotic SEED database, makes PlantSEED a unique environment for cross-kingdom comparative analysis of plant and bacterial genomes. The consistent annotations imposed by PlantSEED permit rapid reconstruction and modeling of primary metabolism for all plant genomes in the database. This feature opens the unique possibility of model-based assessment of the completeness and accuracy of gene annotation and thus allows computational identification of genes and pathways that are restricted to certain genomes or need better curation. We demonstrate the PlantSEED system by producing consistent annotations for 10 reference genomes. We also produce a functioning metabolic model for each genome, gapfilling to identify missing annotations and proposing gene candidates for missing annotations. Models are built around an extended biomass composition representing the most comprehensive published to date. To our knowledge, our models are the first to be published for seven of the genomes analyzed. PMID:24927599

  12. High-throughput comparison, functional annotation, and metabolic modeling of plant genomes using the PlantSEED resource

    PubMed Central

    Seaver, Samuel M. D.; Gerdes, Svetlana; Frelin, Océane; Lerma-Ortiz, Claudia; Bradbury, Louis M. T.; Zallot, Rémi; Hasnain, Ghulam; Niehaus, Thomas D.; El Yacoubi, Basma; Pasternak, Shiran; Olson, Robert; Pusch, Gordon; Overbeek, Ross; Stevens, Rick; de Crécy-Lagard, Valérie; Ware, Doreen; Hanson, Andrew D.; Henry, Christopher S.

    2014-01-01

    The increasing number of sequenced plant genomes is placing new demands on the methods applied to analyze, annotate, and model these genomes. Today’s annotation pipelines result in inconsistent gene assignments that complicate comparative analyses and prevent efficient construction of metabolic models. To overcome these problems, we have developed the PlantSEED, an integrated, metabolism-centric database to support subsystems-based annotation and metabolic model reconstruction for plant genomes. PlantSEED combines SEED subsystems technology, first developed for microbial genomes, with refined protein families and biochemical data to assign fully consistent functional annotations to orthologous genes, particularly those encoding primary metabolic pathways. Seamless integration with its parent, the prokaryotic SEED database, makes PlantSEED a unique environment for cross-kingdom comparative analysis of plant and bacterial genomes. The consistent annotations imposed by PlantSEED permit rapid reconstruction and modeling of primary metabolism for all plant genomes in the database. This feature opens the unique possibility of model-based assessment of the completeness and accuracy of gene annotation and thus allows computational identification of genes and pathways that are restricted to certain genomes or need better curation. We demonstrate the PlantSEED system by producing consistent annotations for 10 reference genomes. We also produce a functioning metabolic model for each genome, gapfilling to identify missing annotations and proposing gene candidates for missing annotations. Models are built around an extended biomass composition representing the most comprehensive published to date. To our knowledge, our models are the first to be published for seven of the genomes analyzed. PMID:24927599

  13. Functional annotations of diabetes nephropathy susceptibility loci through analysis of genome-wide renal gene expression in rat models of diabetes mellitus

    Microsoft Academic Search

    Yaomin Hu; Pamela J Kaisaki; Karène Argoud; Steven P Wilder; Karin J Wallace; Peng Y Woon; Christine Blancher; Lise Tarnow; Per-Henrik Groop; Samy Hadjadj; Michel Marre; Hans-Henrik Parving; Martin Farrall; Roger D Cox; Mark Lathrop; Nathalie Vionnet; Marie-Thérèse Bihoreau; Dominique Gauguier

    2009-01-01

    BACKGROUND: Hyperglycaemia in diabetes mellitus (DM) alters gene expression regulation in various organs and contributes to long term vascular and renal complications. We aimed to generate novel renal genome-wide gene transcription data in rat models of diabetes in order to test the responsiveness to hyperglycaemia and renal structural changes of positional candidate genes at selected diabetic nephropathy (DN) susceptibility loci.

  14. Annotation technology

    Microsoft Academic Search

    Ilia A. Ovsiannikov; Michael A. Arbib; Thomas H. Mcneill

    1999-01-01

    Annotation Technology is a systematized set of recommendations for design of successful advanced annotation software covering the architectural, functional and user-interface aspects. It is grounded in a careful examination of 17 existing systems accompanied by our own empirical study of annotation types, applications and desired functionality. To validate the recommendations of Annotation Technology, we have also developed Annotator, a system

  15. Comprehensive Functional Annotation of 77 Prostate Cancer Risk Loci

    PubMed Central

    Hazelett, Dennis J.; Rhie, Suhn Kyong; Gaddis, Malaina; Yan, Chunli; Lakeland, Daniel L.; Coetzee, Simon G.; Henderson, Brian E.; Noushmehr, Houtan; Cozen, Wendy; Kote-Jarai, Zsofia; Eeles, Rosalind A.; Easton, Douglas F.; Haiman, Christopher A.; Lu, Wange; Farnham, Peggy J.; Coetzee, Gerhard A.

    2014-01-01

    Genome-wide association studies (GWAS) have revolutionized the field of cancer genetics, but the causal links between increased genetic risk and onset/progression of disease processes remain to be identified. Here we report the first step in such an endeavor for prostate cancer. We provide a comprehensive annotation of the 77 known risk loci, based upon highly correlated variants in biologically relevant chromatin annotations; we identified 727 such potentially functional SNPs. We also provide a detailed account of possible protein disruption, microRNA target sequence disruption and regulatory response element disruption of all correlated SNPs at . 88% of the 727 SNPs fall within putative enhancers, and many alter critical residues in the response elements of transcription factors known to be involved in prostate biology. We define as risk enhancers those regions with enhancer chromatin biofeatures in prostate-derived cell lines with prostate-cancer correlated SNPs. To aid the identification of these enhancers, we performed genomewide ChIP-seq for H3K27-acetylation, a mark of actively engaged enhancers, as well as the transcription factor TCF7L2. We analyzed in depth three variants in risk enhancers, two of which show significantly altered androgen sensitivity in LNCaP cells. This includes rs4907792, which is in linkage disequilibrium (r² = 0.91) with an eQTL for NUDT11 (on the X chromosome) in prostate tissue, and rs10486567, the index SNP in intron 3 of the JAZF1 gene on chromosome 7. Rs4907792 is within a critical residue of a strong consensus androgen response element that is interrupted in the protective allele, resulting in a 56% decrease in its androgen sensitivity, whereas rs10486567 affects both NKX3-1 and FOXA-AR motifs where the risk allele results in a 39% increase in basal activity and a 28% fold-increase in androgen stimulated enhancer activity. Identification of such enhancer variants and their potential target genes represents a preliminary step in connecting risk to disease process. PMID:24497837

  16. Comprehensive functional annotation of 77 prostate cancer risk loci.

    PubMed

    Hazelett, Dennis J; Rhie, Suhn Kyong; Gaddis, Malaina; Yan, Chunli; Lakeland, Daniel L; Coetzee, Simon G; Henderson, Brian E; Noushmehr, Houtan; Cozen, Wendy; Kote-Jarai, Zsofia; Eeles, Rosalind A; Easton, Douglas F; Haiman, Christopher A; Lu, Wange; Farnham, Peggy J; Coetzee, Gerhard A

    2014-01-01

    Genome-wide association studies (GWAS) have revolutionized the field of cancer genetics, but the causal links between increased genetic risk and onset/progression of disease processes remain to be identified. Here we report the first step in such an endeavor for prostate cancer. We provide a comprehensive annotation of the 77 known risk loci, based upon highly correlated variants in biologically relevant chromatin annotations; we identified 727 such potentially functional SNPs. We also provide a detailed account of possible protein disruption, microRNA target sequence disruption and regulatory response element disruption of all correlated SNPs at r(2) ≥ 0.88%. 88% of the 727 SNPs fall within putative enhancers, and many alter critical residues in the response elements of transcription factors known to be involved in prostate biology. We define as risk enhancers those regions with enhancer chromatin biofeatures in prostate-derived cell lines with prostate-cancer correlated SNPs. To aid the identification of these enhancers, we performed genomewide ChIP-seq for H3K27-acetylation, a mark of actively engaged enhancers, as well as the transcription factor TCF7L2. We analyzed in depth three variants in risk enhancers, two of which show significantly altered androgen sensitivity in LNCaP cells. This includes rs4907792, that is in linkage disequilibrium (r(2) = 0.91) with an eQTL for NUDT11 (on the X chromosome) in prostate tissue, and rs10486567, the index SNP in intron 3 of the JAZF1 gene on chromosome 7. Rs4907792 is within a critical residue of a strong consensus androgen response element that is interrupted in the protective allele, resulting in a 56% decrease in its androgen sensitivity, whereas rs10486567 affects both NKX3-1 and FOXA-AR motifs where the risk allele results in a 39% increase in basal activity and a 28% fold-increase in androgen stimulated enhancer activity. Identification of such enhancer variants and their potential target genes represents a preliminary step in connecting risk to disease process. PMID:24497837

  17. miRDB: an online resource for microRNA target prediction and functional annotations

    PubMed Central

    Wong, Nathan; Wang, Xiaowei

    2015-01-01

    MicroRNAs (miRNAs) are small non-coding RNAs that are extensively involved in many physiological and disease processes. One major challenge in miRNA studies is the identification of genes regulated by miRNAs. To this end, we have developed an online resource, miRDB (http://mirdb.org), for miRNA target prediction and functional annotations. Here, we describe recently updated features of miRDB, including 2.1 million predicted gene targets regulated by 6709 miRNAs. In addition to presenting precompiled prediction data, a new feature is the web server interface that allows submission of user-provided sequences for miRNA target prediction. In this way, users have the flexibility to study any custom miRNAs or target genes of interest. Another major update of miRDB is related to functional miRNA annotations. Although thousands of miRNAs have been identified, many of the reported miRNAs are not likely to play active functional roles or may even have been falsely identified as miRNAs from high-throughput studies. To address this issue, we have performed combined computational analyses and literature mining, and identified 568 and 452 functional miRNAs in humans and mice, respectively. These miRNAs, as well as associated functional annotations, are presented in the FuncMir Collection in miRDB. PMID:25378301

  18. Integrating information retrieval with distant supervision for Gene Ontology annotation

    PubMed Central

    Zhu, Dongqing; Li, Dingcheng; Carterette, Ben; Liu, Hongfang

    2014-01-01

    This article describes our participation in the Gene Ontology Curation task (GO task) in BioCreative IV, where we participated in both subtasks: A) identification of GO evidence sentences (GOESs) for relevant genes in full-text articles and B) prediction of GO terms for relevant genes in full-text articles. For subtask A, we trained a logistic regression model to detect GOESs based on annotations in the training data supplemented with more noisy negatives from an external resource. Then, a greedy approach was applied to associate genes with sentences. For subtask B, we designed two types of systems: (i) search-based systems, which predict GO terms based on existing annotations for GOESs that are of different textual granularities (i.e., full-text articles, abstracts, and sentences) using state-of-the-art information retrieval techniques (i.e., a novel application of the idea of distant supervision) and (ii) a similarity-based system, which assigns GO terms based on the distance between words in sentences and GO terms/synonyms. Our best performing system for subtask A achieves an F1 score of 0.27 based on exact match and 0.387 allowing relaxed overlap match. Our best performing system for subtask B, a search-based system, achieves an F1 score of 0.075 based on exact match and 0.301 considering hierarchical matches. Our search-based systems for subtask B significantly outperformed the similarity-based system. Database URL: https://github.com/noname2020/Bioc PMID:25183856
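
    The exact-match F1 scores quoted above combine precision and recall over predicted annotations. The block below is a minimal Python sketch of how such a score is typically computed over (gene, GO term) pairs. It is an illustration only, not the BioCreative IV evaluation script: the function name, gene names and GO identifiers are hypothetical, and the relaxed and hierarchical matching variants are not covered.

```python
def f1_exact_match(predicted, gold):
    """Precision, recall and F1 when a prediction only counts if the
    exact (gene, GO term) pair also appears in the gold standard."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold-standard and predicted annotations.
gold = {("geneA", "GO:0005777"), ("geneA", "GO:0006631"), ("geneB", "GO:0005634")}
predicted = {("geneA", "GO:0005777"), ("geneB", "GO:0005737")}

precision, recall, f1 = f1_exact_match(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

    Only one of the two predictions matches a gold pair exactly, so precision is 1/2 and recall is 1/3. A hierarchical variant would also give credit for predicting an ancestor or descendant of the gold term, which is one reason such scores can be much higher than exact-match scores.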

  19. Semantic Multimedia Document Adaptation with Functional Annotations Sebastien Laborie

    E-print Network

    Joseph Fourier Grenoble-I, Université

    Semantic Multimedia Document Adaptation with Functional Annotations Sébastien Laborie IRIT - Paul of presentation contexts for multimedia documents requires the adaptation of document specifications. In an earlier work, we have proposed a semantic adaptation framework for multimedia documents. This framework

  20. Functional Annotation, Genome Organization and Phylogeny of the Grapevine (Vitis vinifera) Terpene Synthase Gene Family Based on Genome Assembly, FLcDNA Cloning, and Enzyme Assays

    Microsoft Academic Search

    Diane M Martin; Sébastien Aubourg; Marina B Schouwey; Laurent Daviet; Michel Schalk; Omid Toub; Steven T Lund; Jörg Bohlmann

    2010-01-01

    BACKGROUND: Terpenoids are among the most important constituents of grape flavour and wine bouquet, and serve as useful metabolite markers in viticulture and enology. Based on the initial 8-fold sequencing of a nearly homozygous Pinot noir inbred line, 89 putative terpenoid synthase genes (VvTPS) were predicted by in silico analysis of the grapevine (Vitis vinifera) genome assembly 1. The finding

  1. Optimizing high performance computing workflow for protein functional annotation.

    PubMed

    Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene

    2014-09-10

    Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool, the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data. PMID:25313296

  2. Optimizing high performance computing workflow for protein functional annotation

    PubMed Central

    Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene

    2014-01-01

    Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool, the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data. PMID:25313296

  3. An automated annotation tool for genomic DNA sequences using GeneScan and BLAST

    Microsoft Academic Search

    Andrew M. Lynn; Chakresh Kumar Jain; K. Kosalai; Pranjan Barman; Nupur Thakur; Harish Batra; Alok Bhattacharya

    2001-01-01

    Genomic sequence data are often available well before the annotated sequence is published. We present a method for analysis of genomic DNA to identify coding sequences using the GeneScan algorithm and characterize these resultant sequences by BLAST. The routines are used to develop a system for automated annotation of genome DNA sequences.

  4. Probabilistic annotation of protein sequences based on functional classifications

    E-print Network

    Levy, Emmanuel D; Ouzounis, Christos A; Gilks, Walter R; Audit, Benjamin

    2005-12-14

    was obtained by performing the re-annotation for different values of the threshold S0 between 45 (100% coverage by definition of the filtered ENZYME database) and 841. (?,?) correspond to the univariate and multivariate Bayesian methods at the highest... that gapA can acquire the gapB activity with only two amino acid mutations (D32A and L187N) [24]; actually, gapB possesses these mutations. Therefore, a reasonable hypothesis is that gapA and gapB originate from a gene duplication event followed...

  5. CDD: specific functional annotation with the Conserved Domain Database.

    PubMed

    Marchler-Bauer, Aron; Anderson, John B; Chitsaz, Farideh; Derbyshire, Myra K; DeWeese-Scott, Carol; Fong, Jessica H; Geer, Lewis Y; Geer, Renata C; Gonzales, Noreen R; Gwadz, Marc; He, Siqian; Hurwitz, David I; Jackson, John D; Ke, Zhaoxi; Lanczycki, Christopher J; Liebert, Cynthia A; Liu, Chunlei; Lu, Fu; Lu, Shennan; Marchler, Gabriele H; Mullokandov, Mikhail; Song, James S; Tasneem, Asba; Thanki, Narmada; Yamashita, Roxanne A; Zhang, Dachuan; Zhang, Naigong; Bryant, Stephen H

    2009-01-01

    NCBI's Conserved Domain Database (CDD) is a collection of multiple sequence alignments and derived database search models, which represent protein domains conserved in molecular evolution. The collection can be accessed at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml, and is also part of NCBI's Entrez query and retrieval system, cross-linked to numerous other resources. CDD provides annotation of domain footprints and conserved functional sites on protein sequences. Precalculated domain annotation can be retrieved for protein sequences tracked in NCBI's Entrez system, and CDD's collection of models can be queried with novel protein sequences via the CD-Search service at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi. Starting with the latest version of CDD, v2.14, information from redundant and homologous domain models is summarized at a superfamily level, and domain annotation on proteins is flagged as either 'specific' (identifying molecular function with high confidence) or as 'non-specific' (identifying superfamily membership only). PMID:18984618

  6. Integrative structural annotation of de novo RNA-Seq provides an accurate reference gene set of the enormous genome of the onion (Allium cepa L.)

    PubMed Central

    Kim, Seungill; Kim, Myung-Shin; Kim, Yong-Min; Yeom, Seon-In; Cheong, Kyeongchae; Kim, Ki-Tae; Jeon, Jongbum; Kim, Sunggil; Kim, Do-Sun; Sohn, Seong-Han; Lee, Yong-Hwan; Choi, Doil

    2015-01-01

    The onion (Allium cepa L.) is one of the most widely cultivated and consumed vegetable crops in the world. Although a considerable amount of onion transcriptome data has been deposited into public databases, the sequences of the protein-coding genes are not accurate enough to be used, owing to non-coding sequences intermixed with the coding sequences. We generated a high-quality, annotated onion transcriptome from de novo sequence assembly and intensive structural annotation using the integrated structural gene annotation pipeline (ISGAP), which identified 54,165 protein-coding genes among 165,179 assembled transcripts totalling 203.0 Mb by eliminating the intron sequences. ISGAP performed reliable annotation, recognizing accurate gene structures based on reference proteins, and ab initio gene models of the assembled transcripts. Integrative functional annotation and gene-based SNP analysis revealed a whole biological repertoire of genes and transcriptomic variation in the onion. The method developed in this study provides a powerful tool for the construction of reference gene sets for organisms based solely on de novo transcriptome data. Furthermore, the reference genes and their variation described here for the onion represent essential tools for molecular breeding and gene cloning in Allium spp. PMID:25362073

  7. GeneSense: a new approach for human gene annotation integrated with protein-protein interaction networks

    PubMed Central

    Chen, Zhongzhong; Zhang, Tianhong; Lin, Jun; Yan, Zidan; Wang, Yongren; Zheng, Weiqiang; Weng, Kevin C.

    2014-01-01

    Virtually all cellular functions involve protein-protein interactions (PPIs). As an increasing number of PPIs are identified and a vast amount of information accumulates, researchers are finding different ways to interrogate the data and understand the interactions in context. However, it is widely recognized that a significant portion of the data is scattered, redundant, not considered high quality, and not readily accessible to researchers in a systematic fashion. In addition, it is challenging to identify the optimal protein targets in the current PPI networks. The GeneSense server was developed to integrate gene annotation and PPI networks in an expandable architecture that incorporates selected databases with the aim of assembling, analyzing, evaluating and disseminating protein-protein association information in a comprehensive and user-friendly manner. Three network models including nodenet, leafnet and loopnet are used to identify the optimal protein targets in the complex networks. GeneSense is freely available at www.biomedsense.org/genesense.php. PMID:24667292

  8. Annotation of proteins of unknown function: initial enzyme results.

    PubMed

    McKay, Talia; Hart, Kaitlin; Horn, Alison; Kessler, Haeja; Dodge, Greg; Bardhi, Keti; Bardhi, Kostandina; Mills, Jeffrey L; Bernstein, Herbert J; Craig, Paul A

    2015-03-01

    Working with a combination of ProMOL (a plugin for PyMOL that searches a library of enzymatic motifs for local structural homologs), BLAST and Pfam (servers that identify global sequence homologs), and Dali (a server that identifies global structural homologs), we have begun the process of assigning functional annotations to the approximately 3,500 structures in the Protein Data Bank that are currently classified as having "unknown function". Using a limited template library of 388 motifs, over 500 promising in silico matches have been identified by ProMOL, among which 65 exceptionally good matches have been identified. The characteristics of the exceptionally good matches are discussed. PMID:25630330

  9. Using the Gene Ontology Hierarchy when Predicting Gene Function

    E-print Network

    Mostafavi, Sara

    2012-01-01

    The problem of multilabel classification when the labels are related through a hierarchical categorization scheme occurs in many application domains such as computational biology. For example, this problem arises naturally when trying to automatically assign gene function using a controlled vocabulary such as the Gene Ontology. However, most existing approaches for predicting gene functions solve independent classification problems to predict genes that are involved in a given function category, independently of the rest. Here, we propose two simple methods for incorporating information about the hierarchical nature of the categorization scheme. In the first method, we use information about a gene's previous annotation to set an initial prior on its label. In a second approach, we extend a graph-based semi-supervised learning algorithm for predicting gene function in a hierarchy. We show that we can efficiently solve this problem by solving a linear system of equations. We compare these approaches with a previous ...
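
    The idea of propagating a prior label score over a gene network can be illustrated with a minimal sketch. It is a generic label-propagation scheme, not the method of the record above: the update rule, the `alpha` parameter, the toy graph and the function name `propagate_labels` are all assumptions made for illustration, and the GO hierarchy itself is not modelled. The iteration converges to the solution of the linear system (I - alpha S) f = (1 - alpha) y, where S is the row-normalised weight matrix and y the prior.

```python
def propagate_labels(weights, prior, alpha=0.5, iterations=200):
    """Iterate f <- alpha * S f + (1 - alpha) * y on a row-normalised graph S."""
    n = len(prior)
    # Row-normalise so each gene averages over its neighbours (assumed scheme).
    norm = []
    for row in weights:
        total = sum(row)
        norm.append([w / total if total else 0.0 for w in row])
    scores = list(prior)
    for _ in range(iterations):
        scores = [
            alpha * sum(norm[i][j] * scores[j] for j in range(n))
            + (1 - alpha) * prior[i]
            for i in range(n)
        ]
    return scores


# Toy functional-linkage graph: a chain of four hypothetical genes.
weights = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
# Prior: gene 0 is already annotated to the term, the others are unknown.
prior = [1.0, 0.0, 0.0, 0.0]
scores = propagate_labels(weights, prior)
```

    In this toy chain the score decays with distance from the annotated gene, which is the behaviour one would expect from a guilt-by-association prior.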

  10. BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources

    PubMed Central

    2009-01-01

    Online gene annotation resources are indispensable for analysis of genomics data. However, the landscape of these online resources is highly fragmented, and scientists often visit dozens of these sites for each gene in a candidate gene list. Here, we introduce BioGPS (http://biogps.gnf.org), a centralized gene portal for aggregating distributed gene annotation resources. Moreover, BioGPS embraces the principle of community intelligence, enabling any user to easily and directly contribute to the BioGPS platform. PMID:19919682

  11. Initiating the mollusk genomics annotation community: toward creating the complete curated gene-set of the Japanese Pearl Oyster, Pinctada fucata.

    PubMed

    Kawashima, Takeshi; Takeuchi, Takeshi; Koyanagi, Ryo; Kinoshita, Shigeharu; Endo, Hirotoshi; Endo, Kazuyoshi

    2013-10-01

    The genome sequence of the Japanese pearl oyster, the first draft genome from a mollusk, was published in February 2012. In order to curate the draft genome assemblies and annotate the predicted gene models, two annotation Jamborees were held in Okinawa and Tokyo. To date, 761 genes have been surveyed and curated. A preparatory meeting and a debriefing were held at the Misaki Marine Biological Station before and after the Jamborees. These four events, in conjunction with the sequence-decoding project, have facilitated the first series of gene annotations. Genome annotators among the Jamboree participants added 22 functional categories to the annotation system to date. Of these, 17 are included in Generic Gene Ontology. The other five categories are specific to molluskan biology, such as "Byssus Formation" and "Shell Formation", including Biomineralization and Acidic Proteins. A total of 731 genes from our latest version of gene models are annotated and classified into these 22 categories. The resulting data will serve as a useful reference for future genomic analyses of this species as well as comparative analyses among mollusks. PMID:24125643

  12. Protein Function Annotation By Local Binding Site Surface Similarity

    PubMed Central

    Spitzer, Russell; Cleves, Ann E.; Varela, Rocco; Jain, Ajay N.

    2013-01-01

    Hundreds of protein crystal structures exist for proteins whose function cannot be confidently determined from sequence similarity. Surflex-PSIM, a previously reported surface-based protein similarity algorithm, provides an alternative method for hypothesizing function for such proteins. The method now supports fully automatic binding site detection and is fast enough to screen comprehensive databases of protein binding sites. The binding site detection methodology was validated on apo/holo cognate protein pairs, correctly identifying 91% of ligand binding sites in holo structures and 88% in apo structures where corresponding sites existed. For correctly detected apo binding sites, the cognate holo site was the most similar binding site 87% of the time. PSIM was used to screen a set of proteins that had poorly characterized functions at the time of crystallization, but were later biochemically annotated. Using a fully automated protocol, this set of 8 proteins was screened against approximately 60,000 ligand binding sites from the PDB. PSIM correctly identified functional matches that pre-dated query protein biochemical annotation for five out of the eight query proteins. A panel of twelve currently unannotated proteins was also screened, resulting in a large number of statistically significant binding site matches, some of which suggest likely functions for the poorly characterized proteins. PMID:24166661

  13. Genome Wide Re-Annotation of Caldicellulosiruptor saccharolyticus with New Insights into Genes Involved in Biomass Degradation and Hydrogen Production

    PubMed Central

    Chowdhary, Nupoor; Selvaraj, Ashok; KrishnaKumaar, Lakshmi; Kumar, Gopal Ramesh

    2015-01-01

    Caldicellulosiruptor saccharolyticus has proven itself to be an excellent candidate for biological hydrogen (H2) production, but it still has major drawbacks, such as sensitivity to high osmotic pressure and low volumetric H2 productivity, which should be considered before it can be used industrially. A whole genome re-annotation has been carried out as an attempt to update the incomplete genome information that leaves gaps in knowledge, especially in the area of metabolic engineering, and to improve the H2 producing capabilities of C. saccharolyticus. Whole genome re-annotation was performed through manual means for 2,682 Coding Sequences (CDSs). Bioinformatics tools based on sequence similarity, motif search, phylogenetic analysis and fold recognition were employed for re-annotation. Our methodology added functions for 409 hypothetical proteins (HPs) and 46 proteins previously annotated as putative, and assigned more accurate functions to the known protein sequences. Homology based gene annotation has been used as a standard method for assigning function to novel proteins, but over the past few years many non-homology based methods such as genomic context approaches for protein function prediction have been developed. Using non-homology based functional prediction methods, we were able to assign cellular processes or physical complexes for 249 hypothetical sequences. Our re-annotation pipeline highlights the addition of 231 new CDSs generated from MicroScope Platform, to the original genome with functional prediction for 49 of them. The re-annotation of HPs and new CDSs is stored in the relational database that is available on the MicroScope web-based platform. In parallel, comparative genome analyses were performed among the members of the genus Caldicellulosiruptor to understand the function and evolutionary processes.
    Further, with results from integrated re-annotation studies (homology and genomic context approach), we strongly suggest that Csac_0437 and Csac_0424 encode glycoside hydrolases (GH) and are proposed to be involved in the decomposition of recalcitrant plant polysaccharides. Similarly, the HPs Csac_0732, Csac_1862, Csac_1294 and Csac_0668 are suggested to play a significant role in biohydrogen production. Function prediction of these HPs by using our integrated approach will considerably enhance the interpretation of large-scale experiments targeting this industrially important organism. PMID:26196387

  14. Large-scale collection and annotation of gene models for date palm (Phoenix dactylifera, L.).

    PubMed

    Zhang, Guangyu; Pan, Linlin; Yin, Yuxin; Liu, Wanfei; Huang, Dawei; Zhang, Tongwu; Wang, Lei; Xin, Chengqi; Lin, Qiang; Sun, Gaoyuan; Ba Abdullah, Mohammed M; Zhang, Xiaowei; Hu, Songnian; Al-Mssallem, Ibrahim S; Yu, Jun

    2012-08-01

    The date palm (Phoenix dactylifera L.), famed for its sugar-rich fruits (dates) and cultivated by humans since 4,000 B.C., is an economically important crop in the Middle East, Northern Africa, and increasingly other places where climates are suitable. Despite a long history of human cultivation, the understanding of P. dactylifera genetics and molecular biology is rather limited, hindered by a lack of high-quality basic data from genomics and transcriptomics. Here we report a large-scale effort in generating gene models (expressed sequence tags, or ESTs, assembled and mapped to a genome assembly) for P. dactylifera, using the long-read pyrosequencing platform (Roche/454 GS FLX Titanium) at high coverage. We built fourteen cDNA libraries from different P. dactylifera tissues (cultivar Khalas) and acquired 15,778,993 raw sequencing reads (about one million sequencing reads per library), and the pooled sequences were assembled into 67,651 non-redundant contigs and 301,978 singletons. We annotated 52,725 contigs based on the plant databases and 45 contigs based on functional domains referencing to the Pfam database. From the annotated contigs, we assigned GO (Gene Ontology) terms to 36,086 contigs and KEGG pathways to 7,032 contigs. Our comparative analysis showed that 70.6 % (47,930), 69.4 % (47,089), 68.4 % (46,441), and 69.3 % (47,048) of the P. dactylifera gene models are shared with rice, sorghum, Arabidopsis, and grapevine, respectively. We also classified our gene models as house-keeping or tissue-specific genes based on their tissue specificity. PMID:22736259

  15. Annotation of a 95-kb Populus deltoides genomic sequence reveals a disease resistance gene cluster and novel class I and class II transposable elements

    Microsoft Academic Search

    M. Lescot; S. Rombauts; J. Zhang; S. Aubourg; C. Mathé; S. Jansson; P. Rouzé; W. Boerjan

    2004-01-01

    Poplar has become a model system for functional genomics in woody plants. Here, we report the sequencing and annotation of the first large contiguous stretch of genomic sequence (95 kb) of poplar, corresponding to a bacterial artificial chromosome clone mapped 0.6 centiMorgan from the Melampsora larici-populina resistance locus. The annotation revealed 15 putative genetic objects, of which five were classified as hypothetical genes

  16. Synergistic use of plant-prokaryote comparative genomics for functional annotations

    PubMed Central

    2011-01-01

    Background Identifying functions for all gene products in all sequenced organisms is a central challenge of the post-genomic era. However, at least 30-50% of the proteins encoded by any given genome are of unknown or vaguely known function, and a large number are wrongly annotated. Many of these ‘unknown’ proteins are common to prokaryotes and plants. We set out to predict and experimentally test the functions of such proteins. Our approach to functional prediction integrates comparative genomics based mainly on microbial genomes with functional genomic data from model microorganisms and post-genomic data from plants. This approach bridges the gap between automated homology-based annotations and the classical gene discovery efforts of experimentalists, and is more powerful than purely computational approaches to identifying gene-function associations. Results Among Arabidopsis genes, we focused on those (2,325 in total) that (i) are unique or belong to families with no more than three members, (ii) occur in prokaryotes, and (iii) have unknown or poorly known functions. Computer-assisted selection of promising targets for deeper analysis was based on homology-independent characteristics associated in the SEED database with the prokaryotic members of each family. In-depth comparative genomic analysis was performed for 360 top candidate families. From this pool, 78 families were connected to general areas of metabolism and, of these families, specific functional predictions were made for 41. Twenty-one predicted functions have been experimentally tested or are currently under investigation by our group in at least one prokaryotic organism (nine of them have been validated, four invalidated, and eight are in progress). Ten additional predictions have been independently validated by other groups. Discovering the function of very widespread but hitherto enigmatic proteins such as the YrdC or YgfZ families illustrates the power of our approach. 
    Conclusions Our approach correctly predicted functions for 19 uncharacterized protein families from plants and prokaryotes; none of these functions had previously been correctly predicted by computational methods. The resulting annotations could be propagated with confidence to over six thousand homologous proteins encoded in over 900 bacterial, archaeal, and eukaryotic genomes currently available in public databases. PMID:21810204

  17. Towards Experimental Annotation of Genes by High Throughput Sequencing

    SciTech Connect

    Bradbury, Andrew [Los Alamos National Laboratory

    2010-06-03

    Andrew Bradbury of Los Alamos National Laboratory discusses turning annotation into a sequencing pipeline on June 3, 2010 at the "Sequencing, Finishing, Analysis in the Future" meeting in Santa Fe, NM

  18. Approaching the functional annotation of fungal virulence factors using cross-species genetic interaction profiling.

    PubMed

    Brown, Jessica C S; Madhani, Hiten D

    2012-01-01

    In many human fungal pathogens, genes required for disease remain largely unannotated, limiting the impact of virulence gene discovery efforts. We tested the utility of a cross-species genetic interaction profiling approach to obtain clues to the molecular function of unannotated pathogenicity factors in the human pathogen Cryptococcus neoformans. This approach involves expression of C. neoformans genes of interest in each member of the Saccharomyces cerevisiae gene deletion library, quantification of their impact on growth, and calculation of the cross-species genetic interaction profiles. To develop functional predictions, we computed and analyzed the correlations of these profiles with existing genetic interaction profiles of S. cerevisiae deletion mutants. For C. neoformans LIV7, which has no S. cerevisiae ortholog, this profiling approach predicted an unanticipated role in the Golgi apparatus. Validation studies in C. neoformans demonstrated that Liv7 is a functional Golgi factor where it promotes the suppression of the exposure of a specific immunostimulatory molecule, mannose, on the cell surface, thereby inhibiting phagocytosis. The genetic interaction profile of another pathogenicity gene that lacks an S. cerevisiae ortholog, LIV6, strongly predicted a role in endosome function. This prediction was also supported by studies of the corresponding C. neoformans null mutant. Our results demonstrate the utility of quantitative cross-species genetic interaction profiling for the functional annotation of fungal pathogenicity proteins of unknown function including, surprisingly, those that are not conserved in sequence across fungi. PMID:23300468
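
    The profile-correlation step described above can be sketched as follows. This is an illustration only, assuming Pearson correlation as the similarity measure; the abstract does not state which correlation was used, and the mutant names, scores and helper names (`pearson`, `rank_similar_profiles`) are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length score vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rank_similar_profiles(query, reference_profiles):
    """Sort reference mutants by correlation with the query profile, best first."""
    scored = [
        (name, pearson(query, profile))
        for name, profile in reference_profiles.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical growth-effect scores across five deletion strains.
query_profile = [-1.2, 0.1, -0.8, 0.3, 0.0]
reference_profiles = {
    "mutant_A": [-1.0, 0.2, -0.9, 0.4, 0.1],  # similar pattern
    "mutant_B": [0.5, -0.3, 0.6, -0.2, 0.0],  # roughly opposite pattern
}
ranking = rank_similar_profiles(query_profile, reference_profiles)
```

    The top-ranked reference mutants would then be inspected for shared annotations, which is how a functional hypothesis for the query gene could be formed.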

  19. Approaching the Functional Annotation of Fungal Virulence Factors Using Cross-Species Genetic Interaction Profiling

    PubMed Central

    Brown, Jessica C. S.; Madhani, Hiten D.

    2012-01-01

    In many human fungal pathogens, genes required for disease remain largely unannotated, limiting the impact of virulence gene discovery efforts. We tested the utility of a cross-species genetic interaction profiling approach to obtain clues to the molecular function of unannotated pathogenicity factors in the human pathogen Cryptococcus neoformans. This approach involves expression of C. neoformans genes of interest in each member of the Saccharomyces cerevisiae gene deletion library, quantification of their impact on growth, and calculation of the cross-species genetic interaction profiles. To develop functional predictions, we computed and analyzed the correlations of these profiles with existing genetic interaction profiles of S. cerevisiae deletion mutants. For C. neoformans LIV7, which has no S. cerevisiae ortholog, this profiling approach predicted an unanticipated role in the Golgi apparatus. Validation studies in C. neoformans demonstrated that Liv7 is a functional Golgi factor where it promotes the suppression of the exposure of a specific immunostimulatory molecule, mannose, on the cell surface, thereby inhibiting phagocytosis. The genetic interaction profile of another pathogenicity gene that lacks an S. cerevisiae ortholog, LIV6, strongly predicted a role in endosome function. This prediction was also supported by studies of the corresponding C. neoformans null mutant. Our results demonstrate the utility of quantitative cross-species genetic interaction profiling for the functional annotation of fungal pathogenicity proteins of unknown function including, surprisingly, those that are not conserved in sequence across fungi. PMID:23300468

  20. Functional Annotation of Putative Regulatory Elements at Cancer Susceptibility Loci

    PubMed Central

    Rosse, Stephanie A; Auer, Paul L; Carlson, Christopher S

    2014-01-01

    Most cancer-associated genetic variants identified from genome-wide association studies (GWAS) do not obviously change protein structure, leading to the hypothesis that the associations are attributable to regulatory polymorphisms. Translating genetic associations into mechanistic insights can be facilitated by knowledge of the causal regulatory variant (or variants) responsible for the statistical signal. Experimental validation of candidate functional variants is onerous, making bioinformatic approaches necessary to prioritize candidates for laboratory analysis. Thus, a systematic approach for recognizing functional (and, therefore, likely causal) variants in noncoding regions is an important step toward interpreting cancer risk loci. This review provides a detailed introduction to current regulatory variant annotations, followed by an overview of how to leverage these resources to prioritize candidate functional polymorphisms in regulatory regions. PMID:25288875

  1. Predicting function: from genes to genomes and back

    Microsoft Academic Search

    Peer Bork; Thomas Dandekar; Yolande Diaz-Lazcoz; Frank Eisenhaber; Martijn Huynen; Yanping Yuan

    1998-01-01

    Predicting function from sequence using computational tools is a highly complicated procedure that is generally done for each gene individually. This review focuses on the added value that is provided by completely sequenced genomes in function prediction. Various levels of sequence annotation and function prediction are discussed, ranging from genomic sequence to that of complex cellular processes. Protein function is

  2. In Silico Functional Pathway Annotation of 86 Established Prostate Cancer Risk Variants

    PubMed Central

    Loo, Lenora W. M.; Fong, Aaron Y. W.; Cheng, Iona; Le Marchand, Loïc

    2015-01-01

    Heritability is one of the strongest risk factors of prostate cancer, emphasizing the importance of the genetic contribution towards prostate cancer risk. To date, 86 established prostate cancer risk variants have been identified by genome-wide association studies (GWAS). To determine if these risk variants are located near genes that interact together in biological networks or pathways contributing to prostate cancer initiation or progression, we generated gene sets based on proximity to the 86 prostate cancer risk variants. We took two approaches to generate gene lists. The first strategy included all immediate flanking genes, up- and downstream of the risk variant, regardless of distance from the index variant, and the second strategy included genes closest to the index GWAS marker and to variants in high LD (r² ≥ 0.8 in Europeans) with the index variant, within a 100 kb window up- and downstream. Pathway mapping of the two gene sets supported the importance of the androgen receptor-mediated signaling in prostate cancer biology. In addition, the hedgehog and Wnt/β-catenin signaling pathways were identified in pathway mapping for the flanking gene set. We also used the HaploReg resource to examine the 86 risk loci and variants in high LD (r² ≥ 0.8) for functional elements. We found that there was a 12.8 fold (p = 2.9 × 10⁻⁴) enrichment for enhancer motifs in a stem cell line and a 4.4 fold (p = 1.1 × 10⁻³) enrichment of DNase hypersensitivity in a prostate adenocarcinoma cell line, indicating that the risk and correlated variants are enriched for transcriptional regulatory motifs. Our pathway-based functional annotation of the prostate cancer risk variants highlights the potential regulatory function that GWAS risk markers, and their highly correlated variants, exert on genes. Our study also shows that these genes may function cooperatively in key signaling pathways in prostate cancer biology. PMID:25658610
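
    The proximity-based gene-set construction can be sketched as below. This is a simplified illustration under stated assumptions: only the 100 kb window comes from the abstract, while the gene coordinates, the overlap rule and the function names `genes_in_window` and `nearest_gene` are hypothetical, and LD-based expansion of the variant list is not shown.

```python
WINDOW_BP = 100_000  # 100 kb up- and downstream, as stated in the abstract


def genes_in_window(variant_pos, genes, window=WINDOW_BP):
    """Return names of genes overlapping [variant_pos - window, variant_pos + window]."""
    lo, hi = variant_pos - window, variant_pos + window
    return [name for name, start, end in genes if start <= hi and end >= lo]


def nearest_gene(variant_pos, genes):
    """Return the gene closest to the variant (distance 0 if the variant lies inside it)."""
    def distance(gene):
        _, start, end = gene
        if start <= variant_pos <= end:
            return 0
        return min(abs(variant_pos - start), abs(variant_pos - end))
    return min(genes, key=distance)[0]


# Hypothetical genes on one chromosome: (name, start, end).
genes = [
    ("GENE_A", 10_000, 50_000),
    ("GENE_B", 180_000, 220_000),
    ("GENE_C", 400_000, 450_000),
]
variant_pos = 120_000
nearby = genes_in_window(variant_pos, genes)
closest = nearest_gene(variant_pos, genes)
```

    Here `nearby` holds the two genes that overlap the window and `closest` is the single nearest gene; a real analysis would repeat this for each index variant and each correlated variant.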

  3. Ontology-based functional classification of genes: Evaluation with reference sets and overlap analysis

    Microsoft Academic Search

    Sidahmed Benabderrahmane; Marie Dominique Devignes; Malika Smail Tabbone; Amedeo Napoli; Olivier Poch

    2011-01-01

    Functional classification involves grouping genes according to their molecular functions or the biological processes they participate in. This unsupervised classification task is essential for interpreting gene datasets produced by post-genomic experiments. As the functional annotation of genes is mostly based on the Gene Ontology (GO), many similarity measures using the GO have been described, but few of them have been

  4. An integrated gene annotation and transcriptional profiling approach towards the full gene content of the Drosophila genome

    Microsoft Academic Search

    B Beckmann; B Koch; V Solovyev; C Busold; K Fellenberg; M Boutros; M Vingron; F Sauer; R Paro

    2003-01-01

    Background: While the genome sequences for a variety of organisms are now available, the precise number of the genes encoded is still a matter of debate. For the human genome several stringent annotation approaches have resulted in the same number of potential genes, but a careful comparison revealed only limited overlap. This indicates that only the combination of different computational

  5. A semi-automated genome annotation comparison and integration scheme

    PubMed Central

    2013-01-01

    Background Different genome annotation services have been developed in recent years and widely used. However, the functional annotation results from different services are often not the same and a scheme to obtain consensus functional annotations by integrating different results is in demand. Results This article presents a semi-automated scheme that is capable of comparing functional annotations from different sources and consequently obtaining a consensus genome functional annotation result. In this study, we used four automated annotation services to annotate a newly sequenced genome--Arcobacter butzleri ED-1. Our scheme is divided into annotation comparison and annotation determination sections. In the functional annotation comparison section, we employed gene synonym lists to tackle term difference problems. Multiple techniques from information retrieval were used to preprocess the functional annotations. Based on the functional annotation comparison results, we designed a decision tree to obtain a consensus functional annotation result. Experimental results show that our approach can greatly reduce the workload of manual comparison by automatically comparing 87% of the functional annotations. In addition, it automatically determined 87% of the functional annotations, leaving only 13% of the genes for manual curation. We applied this approach across six phylogenetically different genomes in order to assess the performance consistency. The results showed that our scheme is able to automatically perform, on average, 73% and 86% of the annotation comparison and determination tasks, respectively. Conclusions We propose a semi-automatic and effective scheme to compare and determine genome functional annotations. It greatly reduces the manual work required in genome functional annotation. As this scheme does not require any specific biological knowledge, it is readily applicable for genome annotation comparison and genome re-annotation projects. PMID:23725374
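
    A consensus step of this kind can be sketched in a few lines. The sketch below substitutes a simple majority vote for the decision tree mentioned in the abstract, whose rules are not given here; the synonym table, the `min_votes` threshold and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical synonym list mapping variant product names to one canonical term.
SYNONYMS = {
    "atp synthase subunit alpha": "atp synthase alpha chain",
    "f1 sector alpha subunit": "atp synthase alpha chain",
}


def normalise(term):
    """Lower-case, collapse whitespace, and map known synonyms to a canonical term."""
    cleaned = " ".join(term.lower().split())
    return SYNONYMS.get(cleaned, cleaned)


def consensus(annotations, min_votes=2):
    """Return (term, 'auto') if enough services agree, else (None, 'manual')."""
    counts = Counter(normalise(a) for a in annotations)
    term, votes = counts.most_common(1)[0]
    if votes >= min_votes and term != "hypothetical protein":
        return term, "auto"
    return None, "manual"


# Annotations for one gene from four hypothetical services.
agreeing = ["ATP synthase subunit alpha", "F1 sector alpha subunit",
            "ATP synthase alpha chain", "hypothetical protein"]
conflicting = ["hypothetical protein", "kinase", "transporter", "hydrolase"]
result_agree = consensus(agreeing)
result_conflict = consensus(conflicting)
```

    Genes returned with the `'manual'` flag correspond to the fraction left for human curation; ties between equally supported terms are not handled specially in this sketch.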

  6. PANDA: pathway and annotation explorer for visualizing and interpreting gene-centric data.

    PubMed

    Hart, Steven N; Moore, Raymond M; Zimmermann, Michael T; Oliver, Gavin R; Egan, Jan B; Bryce, Alan H; Kocher, Jean-Pierre A

    2015-01-01

    Objective. Bringing together genomics, transcriptomics, proteomics, and other -omics technologies is an important step towards developing highly personalized medicine. However, instrumentation has advanced far beyond expectations and now we are able to generate data faster than it can be interpreted. Materials and Methods. We have developed PANDA (Pathway AND Annotation) Explorer, a visualization tool that integrates gene-level annotation in the context of biological pathways to help interpret complex data from disparate sources. PANDA is a web-based application that displays data in the context of well-studied pathways like KEGG, BioCarta, and PharmGKB. PANDA represents data/annotations as icons in the graph while maintaining the other data elements (i.e., other columns for the table of annotations). Custom pathways from underrepresented diseases can be imported when existing data sources are inadequate. PANDA also allows sharing annotations among collaborators. Results. In our first use case, we show how easy it is to view supplemental data from a manuscript in the context of a user's own data. Another use case is provided describing how PANDA was leveraged to design a treatment strategy from the somatic variants found in the tumor of a patient with metastatic sarcomatoid renal cell carcinoma. Conclusion. PANDA facilitates the interpretation of gene-centric annotations by visually integrating this information with the context of biological pathways. The application can be downloaded or used directly from our website: http://bioinformaticstools.mayo.edu/research/panda-viewer/. PMID:26038725

  7. PANDA: pathway and annotation explorer for visualizing and interpreting gene-centric data

    PubMed Central

    Zimmermann, Michael T.; Oliver, Gavin R.; Egan, Jan B.; Bryce, Alan H.

    2015-01-01

    Objective. Bringing together genomics, transcriptomics, proteomics, and other -omics technologies is an important step towards developing highly personalized medicine. However, instrumentation has advanced far beyond expectations and now we are able to generate data faster than it can be interpreted. Materials and Methods. We have developed PANDA (Pathway AND Annotation) Explorer, a visualization tool that integrates gene-level annotation in the context of biological pathways to help interpret complex data from disparate sources. PANDA is a web-based application that displays data in the context of well-studied pathways like KEGG, BioCarta, and PharmGKB. PANDA represents data/annotations as icons in the graph while maintaining the other data elements (i.e., other columns for the table of annotations). Custom pathways from underrepresented diseases can be imported when existing data sources are inadequate. PANDA also allows sharing annotations among collaborators. Results. In our first use case, we show how easy it is to view supplemental data from a manuscript in the context of a user’s own data. Another use case is provided describing how PANDA was leveraged to design a treatment strategy from the somatic variants found in the tumor of a patient with metastatic sarcomatoid renal cell carcinoma. Conclusion. PANDA facilitates the interpretation of gene-centric annotations by visually integrating this information with the context of biological pathways. The application can be downloaded or used directly from our website: http://bioinformaticstools.mayo.edu/research/panda-viewer/.

  8. Functional annotation of the transcriptome of Sorghum bicolor in response to osmotic stress and abscisic acid

    PubMed Central

    2011-01-01

    Background Higher plants exhibit remarkable phenotypic plasticity allowing them to adapt to an extensive range of environmental conditions. Sorghum is a cereal crop that exhibits exceptional tolerance to adverse conditions, in particular, water-limiting environments. This study utilized next generation sequencing (NGS) technology to examine the transcriptome of sorghum plants challenged with osmotic stress and exogenous abscisic acid (ABA) in order to elucidate genes and gene networks that contribute to sorghum's tolerance to water-limiting environments with a long-term aim of developing strategies to improve plant productivity under drought. Results RNA-Seq results revealed transcriptional activity of 28,335 unique genes from sorghum root and shoot tissues subjected to polyethylene glycol (PEG)-induced osmotic stress or exogenous ABA. Differential gene expression analyses in response to osmotic stress and ABA revealed a strong interplay among various metabolic pathways including abscisic acid and 13-lipoxygenase, salicylic acid, jasmonic acid, and plant defense pathways. Transcription factor analysis indicated that groups of genes may be co-regulated by similar regulatory sequences to which the expressed transcription factors bind. We successfully exploited the data presented here in conjunction with published transcriptome analyses for rice, maize, and Arabidopsis to discover more than 50 differentially expressed, drought-responsive gene orthologs for which no function had been previously ascribed. Conclusions The present study provides an initial assemblage of sorghum genes and gene networks regulated by osmotic stress and hormonal treatment. We are providing an RNA-Seq data set and an initial collection of transcription factors, which offer a preliminary look into the cascade of global gene expression patterns that arise in a drought tolerant crop subjected to abiotic stress. 
    These resources will allow scientists to query gene expression and functional annotation in response to drought. PMID:22008187

  9. Joint stage recognition and anatomical annotation of drosophila gene expression patterns

    PubMed Central

    Cai, Xiao; Wang, Hua; Huang, Heng; Ding, Chris

    2012-01-01

    Motivation: Staining the mRNA of a gene via in situ hybridization (ISH) during the development of a Drosophila melanogaster embryo delivers the detailed spatio-temporal patterns of the gene expression. Many related biological problems such as the detection of co-expressed genes, co-regulated genes and transcription factor binding motifs rely heavily on the analysis of these image patterns. To provide text-based pattern searching that facilitates related biological studies, the images in the Berkeley Drosophila Genome Project (BDGP) study are annotated manually by domain experts with a developmental stage term and anatomical ontology terms. Due to the rapid increase in the number of such images and the inevitably biased annotations by human curators, it is necessary to develop an automatic method to recognize the developmental stage and annotate anatomical terms. Results: In this article, we propose a novel computational model for joint stage classification and anatomical term annotation of Drosophila gene expression patterns. We propose a novel Tri-Relational Graph (TG) model that comprises the data graph, the anatomical term graph and the developmental stage term graph, connected by two additional graphs induced from stage or annotation label assignments. On the TG model, we introduce a Preferential Random Walk (PRW) method to jointly recognize developmental stage and annotate anatomical terms by utilizing the interrelations between the two tasks. The experimental results on two refined BDGP datasets demonstrate that our joint learning method can achieve better prediction results on both tasks than state-of-the-art methods. Availability: http://ranger.uta.edu/%7eheng/Drosophila/ Contact: heng@uta.edu PMID:22689756

  10. Comparison of Gene Coexpression Profiles and Construction of Conserved Gene Networks to Find Functional Modules

    PubMed Central

    Okamura, Yasunobu; Obayashi, Takeshi; Kinoshita, Kengo

    2015-01-01

    Background Computational approaches toward gene annotation are a formidable challenge, now that many genome sequences have been determined. Each gene has its own function, but complicated cellular functions are achieved by sets of genes. Therefore, sets of genes with strong functional relationships must be identified. For this purpose, the similarities of gene expression patterns and gene sequences have been separately utilized, although the combined information will provide a better solution. Result & Discussion We propose a new method to find functional modules, by comparing gene coexpression profiles among species. A coexpression pattern is represented as a list of coexpressed genes with each guide gene. We compared two coexpression lists, one from a human guide gene and the other from a homologous mouse gene, and defined a measure to evaluate the similarity between the lists. Based on this coexpression similarity, we detected the highly conserved genes, and constructed human gene networks with conserved coexpression between human and mouse. Some of the tightly coupled genes (modules) showed clear functional enrichment, such as immune system and cell cycle, indicating that our method could identify functionally related genes without any prior knowledge. We also found a few functional modules without any annotations, which may be good candidates for novel functional modules. All of the comparisons are available at the http://v1.coxsimdb.info web database. PMID:26147120
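    The abstract above compares a human coexpression list with the list of a homologous mouse gene but does not state the similarity measure. As a minimal sketch only, the comparison could look like the following, assuming a Jaccard index over the top-ranked genes. The function name, the `top_n` cutoff, the ortholog table and the toy gene lists are all hypothetical and are not taken from the paper.

```python
def coexpression_list_similarity(human_list, mouse_list, orthologs, top_n=100):
    """Jaccard index between two ranked coexpression lists.

    ASSUMPTION: the Jaccard index stands in for the paper's (unstated) measure.
    human_list / mouse_list: genes ranked by coexpression with the guide gene.
    orthologs: hypothetical mapping from mouse gene symbol to human gene symbol.
    """
    human_top = set(human_list[:top_n])
    # Translate mouse genes to their human orthologs; skip genes with no mapping.
    mouse_top = {orthologs[g] for g in mouse_list[:top_n] if g in orthologs}
    union = human_top | mouse_top
    if not union:
        return 0.0
    return len(human_top & mouse_top) / len(union)


# Toy data for illustration only.
human = ["CDK1", "CCNB1", "PLK1", "ACTB"]
mouse = ["Cdk1", "Ccnb1", "Gapdh", "Plk1"]
orthologs = {"Cdk1": "CDK1", "Ccnb1": "CCNB1", "Plk1": "PLK1", "Gapdh": "GAPDH"}

score = coexpression_list_similarity(human, mouse, orthologs)
print(score)  # 3 shared genes out of 5 distinct genes
```

    A set-based overlap ignores rank order within the lists; a rank-aware measure may be closer to what the authors used.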

  11. Coordinated international action to accelerate Genome to Phenome- The Functional Annotation of Animal Genomes (FAANG) Project

    Technology Transfer Automated Retrieval System (TEKTRAN)

    We describe the organization of a nascent international effort - the "Functional Annotation of ANimal Genomes" project - whose aim is to produce comprehensive maps of functional elements in the genomes of domesticated animal species....

  12. IDconverter and IDClight: Conversion and annotation of gene and protein IDs

    PubMed Central

    Alibés, Andreu; Yankilevich, Patricio; Cañada, Andrés; Díaz-Uriarte, Ramón

    2007-01-01

    Background Researchers involved in the annotation of large numbers of gene, clone or protein identifiers are usually required to perform a one-by-one conversion for each identifier. When the field of research is one such as microarray experiments, this number may be around 30,000. Results To help researchers map accession numbers and identifiers among clones, genes, proteins and chromosomal positions, we have designed and developed IDconverter and IDClight. They are two user-friendly, freely available web server applications that also provide additional functional information by mapping the identifiers on to pathways, Gene Ontology terms, and literature references. Both tools are high-throughput oriented and include identifiers for the most common genomic databases. These tools have been compared to other similar tools, showing that they are among the fastest and the most up-to-date. Conclusion These tools provide a fast and intuitive way of enriching the information coming out of high-throughput experiments like microarrays. They can be valuable both to wet-lab researchers and to bioinformaticians. PMID:17214880

  13. Gene Function Prediction Based on the Gene Ontology Hierarchical Structure

    PubMed Central

    Cheng, Liangxi; Lin, Hongfei; Hu, Yuncui; Wang, Jian; Yang, Zhihao

    2014-01-01

    The information of the Gene Ontology annotation is helpful in the explanation of life science phenomena, and can provide great support for the research of the biomedical field. The use of the Gene Ontology is gradually affecting the way people store and understand bioinformatic data. To facilitate the prediction of gene functions with the aid of text mining methods and existing resources, we transform it into a multi-label top-down classification problem and develop a method that uses the hierarchical relationships in the Gene Ontology structure to relieve the quantitative imbalance of positive and negative training samples. Meanwhile the method enhances the discriminating ability of classifiers by retaining and highlighting the key training samples. Additionally, the top-down classifier based on a tree structure takes the relationship of target classes into consideration and thus solves the incompatibility between the classification results and the Gene Ontology structure. Our experiment on the Gene Ontology annotation corpus achieves an F-value performance of 50.7% (precision: 52.7% recall: 48.9%). The experimental results demonstrate that when the size of training set is small, it can be expanded via topological propagation of associated documents between the parent and child nodes in the tree structure. The top-down classification model applies to the set of texts in an ontology structure or with a hierarchical relationship. PMID:25192339
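    The reported F-value can be checked against the reported precision and recall. The sketch below assumes the F-value is the balanced F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly. The function name `f_measure` is illustrative.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values quoted in the abstract: precision 52.7%, recall 48.9%.
f1 = f_measure(0.527, 0.489)
print(f"{f1:.3f}")  # 0.507
```

    Under that assumption the three figures in the abstract are mutually consistent.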

  14. Gene function prediction based on the Gene Ontology hierarchical structure.

    PubMed

    Cheng, Liangxi; Lin, Hongfei; Hu, Yuncui; Wang, Jian; Yang, Zhihao

    2014-01-01

    The information of the Gene Ontology annotation is helpful in the explanation of life science phenomena, and can provide great support for the research of the biomedical field. The use of the Gene Ontology is gradually affecting the way people store and understand bioinformatic data. To facilitate the prediction of gene functions with the aid of text mining methods and existing resources, we transform it into a multi-label top-down classification problem and develop a method that uses the hierarchical relationships in the Gene Ontology structure to relieve the quantitative imbalance of positive and negative training samples. Meanwhile the method enhances the discriminating ability of classifiers by retaining and highlighting the key training samples. Additionally, the top-down classifier based on a tree structure takes the relationship of target classes into consideration and thus solves the incompatibility between the classification results and the Gene Ontology structure. Our experiment on the Gene Ontology annotation corpus achieves an F-value performance of 50.7% (precision: 52.7% recall: 48.9%). The experimental results demonstrate that when the size of training set is small, it can be expanded via topological propagation of associated documents between the parent and child nodes in the tree structure. The top-down classification model applies to the set of texts in an ontology structure or with a hierarchical relationship. PMID:25192339

  15. De novo transcriptome assembly, gene annotation, marker development, and miRNA potential target genes validation under abiotic stresses in Oenanthe javanica.

    PubMed

    Jiang, Qian; Wang, Feng; Tan, Hua-Wei; Li, Meng-Yao; Xu, Zhi-Sheng; Tan, Guo-Fei; Xiong, Ai-Sheng

    2015-04-01

    Oenanthe javanica is an aquatic perennial herb with known medicinal properties and an edible vegetable with high vitamin and mineral content. The understanding of the biology of O. javanica is limited by the absence of information on its genome, transcriptome, and small RNA. In this study, transcriptome sequencing and small RNA sequencing were performed to annotate functional genes, develop SSR markers and analyze potential target genes of miRNAs in O. javanica. All reads, with a total nucleotide count of 1,440,321,408 bp, were assembled into 58,072 transcripts and 40,208 unigenes. A total of 1,233 SSRs were identified from O. javanica. Generated unigenes were aligned against seven databases and annotated with functions. A total of 29 potential targets were predicted. Expression of 10 miRNAs and their corresponding target genes under abiotic stresses (heat, cold, salinity, and drought) was validated. All ten miRNAs were confirmed to respond to abiotic stresses. A pair of miRNA and its target gene was found. This study can serve as a valuable resource for future studies on O. javanica, which may focus on novel gene discovery, SSR development, gene mapping, and miRNA-affected processes and pathways. This can promote the development of the useful medicinal properties of O. javanica in medical science. PMID:25416420

  16. The DAWGPAWS pipeline for the annotation of genes and transposable elements in plant genomes

    Microsoft Academic Search

    James C Estill; Jeffrey L Bennetzen

    2009-01-01

    BACKGROUND: High quality annotation of the genes and transposable elements in complex genomes requires a human-curated integration of multiple sources of computational evidence. These evidences include results from a diversity of ab initio prediction programs as well as homology-based searches. Most of these programs operate on a single contiguous sequence at a time, and the results are generated in a

  17. Gene3D: Multi-domain annotations for protein sequence and comparative genome analysis.

    PubMed

    Lees, Jonathan G; Lee, David; Studer, Romain A; Dawson, Natalie L; Sillitoe, Ian; Das, Sayoni; Yeats, Corin; Dessailly, Benoit H; Rentzsch, Robert; Orengo, Christine A

    2014-01-01

    Gene3D (http://gene3d.biochem.ucl.ac.uk) is a database of protein domain structure annotations for protein sequences. Domains are predicted using a library of profile HMMs from 2738 CATH superfamilies. Gene3D assigns domain annotations to Ensembl and UniProt sequence sets including >6000 cellular genomes and >20 million unique protein sequences. This represents an increase of 45% in the number of protein sequences since our last publication. Thanks to improvements in the underlying data and pipeline, we see large increases in the domain coverage of sequences. We have expanded this coverage by integrating Pfam and SUPERFAMILY domain annotations, and we now resolve domain overlaps to provide highly comprehensive composite multi-domain architectures. To make these data more accessible for comparative genome analyses, we have developed novel search algorithms for searching genomes to identify related multi-domain architectures. In addition to providing domain family annotations, we have now developed a pipeline for 3D homology modelling of domains in Gene3D. This has been applied to the human genome and will be rolled out to other major organisms over the next year. PMID:24270792

  18. In Silico Structural and Functional Annotation of Hypothetical Proteins of Vibrio cholerae O139.

    PubMed

    Islam, Md Saiful; Shahik, Shah Md; Sohel, Md; Patwary, Noman I A; Hasan, Md Anayet

    2015-06-01

    In developing countries, the threat of cholera is a significant health concern whenever water purification and sewage disposal systems are inadequate. Vibrio cholerae is one of the bacteria responsible for cholera disease. The complete genome sequence of V. cholerae reveals the presence of various genes and hypothetical proteins whose functions are not yet understood. Hence, analyzing and annotating the structure and function of hypothetical proteins is important for understanding V. cholerae. V. cholerae O139 is the most common and pathogenic bacterial strain among various V. cholerae strains. In this study, the sequences of six hypothetical proteins of V. cholerae O139 have been annotated from NCBI. Various computational tools and databases have been used to determine domain family, protein-protein interaction, solubility of protein, ligand binding sites, etc. The three-dimensional structures of two proteins were modeled and their ligand binding sites were identified. We have found domains and families of only one protein. The analysis revealed that these proteins might have antibiotic resistance activity, DNA breaking-rejoining activity, integrase enzyme activity, restriction endonuclease activity, etc. Structural prediction of these proteins and detection of binding sites from this study would indicate a potential target aiding docking studies for therapeutic design against cholera. PMID:26175663

  19. In Silico Structural and Functional Annotation of Hypothetical Proteins of Vibrio cholerae O139

    PubMed Central

    Islam, Md. Saiful; Shahik, Shah Md.; Sohel, Md.; Patwary, Noman I. A.

    2015-01-01

    In developing countries, the threat of cholera is a significant health concern whenever water purification and sewage disposal systems are inadequate. Vibrio cholerae is one of the bacteria responsible for cholera disease. The complete genome sequence of V. cholerae reveals the presence of various genes and hypothetical proteins whose functions are not yet understood. Hence, analyzing and annotating the structure and function of hypothetical proteins is important for understanding V. cholerae. V. cholerae O139 is the most common and pathogenic bacterial strain among various V. cholerae strains. In this study, the sequences of six hypothetical proteins of V. cholerae O139 have been annotated from NCBI. Various computational tools and databases have been used to determine domain family, protein-protein interaction, solubility of protein, ligand binding sites, etc. The three-dimensional structures of two proteins were modeled and their ligand binding sites were identified. We have found domains and families of only one protein. The analysis revealed that these proteins might have antibiotic resistance activity, DNA breaking-rejoining activity, integrase enzyme activity, restriction endonuclease activity, etc. Structural prediction of these proteins and detection of binding sites from this study would indicate a potential target aiding docking studies for therapeutic design against cholera. PMID:26175663

  20. IDconverter and IDClight: Conversion and annotation of gene and protein IDs

    Microsoft Academic Search

    Andreu Alibés; Patricio Yankilevich; Ramón Díaz-uriarte

    2007-01-01

    BACKGROUND: Researchers involved in the annotation of large numbers of gene, clone or protein identifiers are usually required to perform a one-by-one conversion for each identifier. When the field of research is one such as microarray experiments, this number may be around 30,000. RESULTS: To help researchers map accession numbers and identifiers among clones, genes, proteins and chromosomal positions, we

  1. Comparison of assembly algorithms for improving rate of metatranscriptomic functional annotation

    PubMed Central

    2014-01-01

    Background Microbiome-wide gene expression profiling through high-throughput RNA sequencing (‘metatranscriptomics’) offers a powerful means to functionally interrogate complex microbial communities. Key to successful exploitation of these datasets is the ability to confidently match relatively short sequence reads to known bacterial transcripts. In the absence of reference genomes, such annotation efforts may be enhanced by assembling reads into longer contiguous sequences (‘contigs’), prior to database search strategies. Since reads from homologous transcripts may derive from several species, represented at different abundance levels, it is not clear how well current assembly pipelines perform for metatranscriptomic datasets. Here we evaluate the performance of four currently employed assemblers including de novo transcriptome assemblers - Trinity and Oases; the metagenomic assembler - Metavelvet; and the recently developed metatranscriptomic assembler IDBA-MT. Results We evaluated the performance of the assemblers on a previously published dataset of single-end RNA sequence reads derived from the large intestine of an inbred non-obese diabetic mouse model of type 1 diabetes. We found that Trinity performed best as judged by contigs assembled, reads assigned to contigs, and number of reads that could be annotated to a known bacterial transcript. Only 15.5% of RNA sequence reads could be annotated to a known transcript in contrast to 50.3% with Trinity assembly. Paired-end reads generated from the same mouse samples resulted in modest performance gains. A database search estimated that the assemblies are unlikely to erroneously merge multiple unrelated genes sharing a region of similarity (<2% of contigs). A simulated dataset based on ten species confirmed these findings. A more complex simulated dataset based on 72 species found that greater assembly errors were introduced than is expected by sequencing quality. 
Through the detailed evaluation of assembly performance, the insights provided by this study will help drive the design of future metatranscriptomic analyses. Conclusion Assembly of metatranscriptome datasets greatly improved read annotation. Of the four assemblers evaluated, Trinity provided the best performance. For more complex datasets, reads generated from transcripts sharing considerable sequence similarity can be a source of significant assembly error, suggesting a need to collate reads on the basis of common taxonomic origin prior to assembly. PMID:25411636

  2. PIPA: A High-Throughput Pipeline for Protein Function Annotation

    E-print Network

    Yu, Chenggang; Desai, Valmik; Zavaljevski, Nela; Reifman, Jaques

    ... of multisource predictions. We developed Pipeline for Protein Annotation (PIPA), a genome-wide protein function annotation pipeline that runs in a high-performance computing environment. PIPA integrates different tools

  3. Gene networks in Drosophila melanogaster: integrating experimental data to predict gene function

    PubMed Central

    Costello, James C; Dalkilic, Mehmet M; Beason, Scott M; Gehlhausen, Jeff R; Patwardhan, Rupali; Middha, Sumit; Eads, Brian D; Andrews, Justen R

    2009-01-01

    Background Discovering the functions of all genes is a central goal of contemporary biomedical research. Despite considerable effort, we are still far from achieving this goal in any metazoan organism. Collectively, the growing body of high-throughput functional genomics data provides evidence of gene function, but remains difficult to interpret. Results We constructed the first network of functional relationships for Drosophila melanogaster by integrating most of the available, comprehensive sets of genetic interaction, protein-protein interaction, and microarray expression data. The complete integrated network covers 85% of the currently known genes, which we refined to a high confidence network that includes 20,000 functional relationships among 5,021 genes. An analysis of the network revealed a remarkable concordance with prior knowledge. Using the network, we were able to infer a set of high-confidence Gene Ontology biological process annotations on 483 of the roughly 5,000 previously unannotated genes. We also show that this approach is a means of inferring annotations on a class of genes that cannot be annotated based solely on sequence similarity. Lastly, we demonstrate the utility of the network through reanalyzing gene expression data to both discover clusters of coregulated genes and compile a list of candidate genes related to specific biological processes. Conclusions Here we present the first genome-wide functional gene network in D. melanogaster. The network enables the exploration, mining, and reanalysis of experimental data, as well as the interpretation of new data. The inferred annotations provide testable hypotheses of previously uncharacterized genes. PMID:19758432

  4. Likelihood-Based Gene Annotations for Gap Filling and Quality Assessment in Genome-Scale Metabolic Models

    PubMed Central

    Benedict, Matthew N.; Mundy, Michael B.; Henry, Christopher S.; Chia, Nicholas; Price, Nathan D.

    2014-01-01

    Genome-scale metabolic models provide a powerful means to harness information from genomes to deepen biological insights. With exponentially increasing sequencing capacity, there is an enormous need for automated reconstruction techniques that can provide more accurate models in a short time frame. Current methods for automated metabolic network reconstruction rely on gene and reaction annotations to build draft metabolic networks and algorithms to fill gaps in these networks. However, automated reconstruction is hampered by database inconsistencies, incorrect annotations, and gap filling largely without considering genomic information. Here we develop an approach for applying genomic information to predict alternative functions for genes and estimate their likelihoods from sequence homology. We show that computed likelihood values were significantly higher for annotations found in manually curated metabolic networks than those that were not. We then apply these alternative functional predictions to estimate reaction likelihoods, which are used in a new gap filling approach called likelihood-based gap filling to predict more genomically consistent solutions. To validate the likelihood-based gap filling approach, we applied it to models where essential pathways were removed, finding that likelihood-based gap filling identified more biologically relevant solutions than parsimony-based gap filling approaches. We also demonstrate that models gap filled using likelihood-based gap filling provide greater coverage and genomic consistency with metabolic gene functions compared to parsimony-based approaches. Interestingly, despite these findings, we found that likelihoods did not significantly affect consistency of gap filled models with Biolog and knockout lethality data. 
This indicates that the phenotype data alone cannot necessarily be used to discriminate between alternative solutions for gap filling and therefore, that the use of other information is necessary to obtain a more accurate network. All described workflows are implemented as part of the DOE Systems Biology Knowledgebase (KBase) and are publicly available via API or command-line web interface. PMID:25329157

  5. Comparative Analysis of Chloroplast Genomes: Functional Annotation, Genome-Based Phylogeny, and Deduced Evolutionary Patterns

    PubMed Central

    Rivas, Javier De Las; Lozano, Juan Jose; Ortiz, Angel R.

    2002-01-01

    All protein sequences from 19 complete chloroplast genomes (cpDNA) have been studied using a new computational method able to analyze functional correlations among series of protein sequences contained in complete proteomes. First, all open reading frames (ORFs) from the cpDNAs, comprising a total of 2266 protein sequences, were compared against the 3168 proteins from Synechocystis PCC6803 complete genome to find functionally related orthologous proteins. Additionally, all cpDNA genomes were pairwise compared to find orthologous groups not present in cyanobacteria. Annotations in the cluster of orthologous proteins database and CyanoBase were used as reference for the functional assignments. Following this protocol, new functional assignments were made for ORFs of unknown function and for ycfs (hypothetical chloroplast frames), which still lack a functional assignment. Using this information, a matrix of functional relationships was derived from profiles of the presence and/or absence of orthologous proteins; the matrix included 1837 proteins in 277 orthologous clusters. A factor analysis study of this matrix, followed by cluster analysis, allowed us to obtain accurate phylogenetic reconstructions and the detection of genes probably involved in speciation as phylogenetic correlates. Finally, by grouping common evolutionary patterns, we show that it is possible to determine functionally linked protein networks. This has allowed us to suggest putative associations for some unknown ORFs. PMID:11932241
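    The presence/absence matrix described above can be illustrated with a small sketch. The abstract's method applies factor analysis followed by cluster analysis; the code below deliberately swaps in a much simpler technique (matching profiles by Hamming distance) to show only the general idea of linking an unknown ORF to clusters with the same evolutionary pattern. All genome names, cluster names and function names are hypothetical.

```python
from itertools import combinations


def build_profiles(genome_clusters):
    """Return (genomes, profiles): one 0/1 presence tuple per orthologous cluster."""
    genomes = sorted(genome_clusters)
    clusters = sorted(set().union(*genome_clusters.values()))
    profiles = {
        c: tuple(int(c in genome_clusters[g]) for g in genomes)
        for c in clusters
    }
    return genomes, profiles


def hamming(p, q):
    """Number of genomes in which the two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))


def matching_pairs(profiles, max_distance=0):
    """Cluster pairs whose presence/absence profiles differ in at most max_distance genomes."""
    return [
        (a, b)
        for a, b in combinations(sorted(profiles), 2)
        if hamming(profiles[a], profiles[b]) <= max_distance
    ]


# Toy input: which orthologous clusters occur in which genome (illustrative only).
toy = {
    "genome_A": {"cluster_1", "cluster_2", "orf_unknown"},
    "genome_B": {"cluster_1", "cluster_2", "orf_unknown"},
    "genome_C": {"cluster_1"},
}

genomes, profiles = build_profiles(toy)
pairs = matching_pairs(profiles)
print(pairs)  # [('cluster_2', 'orf_unknown')]
```

    Identical profiles only suggest a putative association; with 19 genomes, many unrelated clusters share a profile by chance, which is one reason a statistical treatment such as the factor analysis in the abstract is preferable.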

  6. An integrated gene annotation and transcriptional profiling approach towards the full gene content of the Drosophila genome

    Microsoft Academic Search

    M Hild; B Beckmann; SA Haas; B Koch; V Solovyev; C Busold; K Fellenberg; M Boutros; M Vingron; F Sauer; JD Hoheisel; R Paro

    2003-01-01

    Background While the genome sequences for a variety of organisms are now available, the precise number of the genes encoded is still a matter of debate. For the human genome several stringent annotation approaches have resulted in the same number of potential genes, but a careful comparison revealed only limited overlap. This indicates that only the combination of different computational prediction

  7. Rapid Annotation of Anonymous Sequences from Genome Projects Using Semantic Similarities and a Weighting Scheme in Gene Ontology

    PubMed Central

    Fontana, Paolo; Cestaro, Alessandro; Velasco, Riccardo; Formentin, Elide; Toppo, Stefano

    2009-01-01

    Background Large-scale sequencing projects have now become routine lab practice and this has led to the development of a new generation of tools involving function prediction methods, bringing the latter back to the fore. The advent of Gene Ontology, with its structured vocabulary and paradigm, has provided computational biologists with an appropriate means for this task. Methodology We present here a novel method called ARGOT (Annotation Retrieval of Gene Ontology Terms) that is able to process quickly thousands of sequences for functional inference. The tool exploits for the first time an integrated approach which combines clustering of GO terms, based on their semantic similarities, with a weighting scheme which assesses retrieved hits sharing a certain number of biological features with the sequence to be annotated. These hits may be obtained by different methods and in this work we have based ARGOT processing on BLAST results. Conclusions The extensive benchmark involved 10,000 protein sequences, the complete S. cerevisiae genome and a small subset of proteins for purposes of comparison with other available tools. The algorithm was proven to outperform existing methods and to be suitable for function prediction of single proteins due to its high degree of sensitivity, specificity and coverage. PMID:19247487

  8. New in protein structure and function annotation: hotspots, single nucleotide polymorphisms and the 'Deep Web'.

    PubMed

    Bromberg, Yana; Yachdav, Guy; Ofran, Yanay; Schneider, Reinhard; Rost, Burkhard

    2009-05-01

    The rapidly increasing quantity of protein sequence data continues to widen the gap between available sequences and annotations. Comparative modeling suggests some aspects of the 3D structures of approximately half of all known proteins; homology- and network-based inferences annotate some aspect of function for a similar fraction of the proteome. For most known protein sequences, however, there is detailed knowledge about neither their function nor their structure. Comprehensive efforts towards the expert curation of sequence annotations have failed to meet the demand of the rapidly increasing number of available sequences. Only the automated prediction of protein function in the absence of homology can close the gap between available sequences and annotations in the foreseeable future. This review focuses on two novel methods for automated annotation, and briefly presents an outlook on how modern web software may revolutionize the field of protein sequence annotation. First, predictions of protein binding sites and functional hotspots, and the evolution of these into the most successful type of prediction of protein function from sequence will be discussed. Second, a new tool, comprehensive in silico mutagenesis, which contributes important novel predictions of function and at the same time prepares for the onset of the next sequencing revolution, will be described. While these two new sub-fields of protein prediction represent the breakthroughs that have been achieved methodologically, it will then be argued that a different development might further change the way biomedical researchers benefit from annotations: modern web software can connect the worldwide web in any browser with the 'Deep Web' (ie, proprietary data resources). The availability of this direct connection, and the resulting access to a wealth of data, may impact drug discovery and development more than any existing method that contributes to protein annotation. PMID:19396742

  9. Evading the annotation bottleneck: using sequence similarity to search non-sequence gene data

    PubMed Central

    Gilchrist, Michael J; Christensen, Mikkel B; Harland, Richard; Pollet, Nicolas; Smith, James C; Ueno, Naoto; Papalopulu, Nancy

    2008-01-01

    Background Non-sequence gene data (images, literature, etc.) can be found in many different public databases. Access to these data is mostly by text based methods using gene names; however, gene annotation is neither complete, nor fully systematic between organisms, and is also not generally stable over time. This provides some challenges for text based access, especially for cross-species searches. We propose a method for non-sequence data retrieval based on sequence similarity, which removes dependence on annotation and text searches. This work was motivated by the need to provide better access to large numbers of in situ images, and the observation that such image data were usually associated with a specific gene sequence. Sequence similarity searches are found in existing gene oriented databases, but mostly give indirect access to non-sequence data via navigational links. Results Three applications were built to explore the proposed method: accessing image data, literature and gene names. Searches are initiated with the sequence of the user's gene of interest, which is searched against a database of sequences associated with the target data. The matching (non-sequence) target data are returned directly to the user's browser, organised by sequence similarity. The method worked well for the intended application in image data management. Comparison with text based searches of the image data set showed the accuracy of the method. Applied to literature searches it facilitated retrieval of mostly high relevance references. Applied to gene name data it provided a useful analysis of name variation of related genes within and between species. Conclusion This method makes a powerful and useful addition to existing methods for searching gene data based on text retrieval or curated gene lists. In particular the method facilitates cross-species comparisons, and enables the handling of novel or otherwise un-annotated genes. 
Applications using the method are quick and easy to build, and the data require little maintenance. This approach largely circumvents the need for annotation, which can be a major obstacle to the development of genomic scale data resources. PMID:18928517

  10. Annotating functional RNAs in genomes using Eric P. Nawrocki

    E-print Network

    Eddy, Sean

    -free grammars; homology search; genome annotation; tRNA; rRNA; SRP RNA; RNase P RNA; CRISPR; riboswitch; 1 involved in the transport or biosynthesis of their target metabolite [35]. The bacterial 6S RNA promotes stationary phase of bacterial growth [71]. Other RNA elements are important for defending cells against

  11. Gene Ontology Automatic Annotation Using a Domain Based Gene Product Similarity Measure

    Microsoft Academic Search

    Mihail Popescu; James M. Keller; Joyce A Mitchell

    2005-01-01

    Recent years have seen an explosive growth in the amount of biological data available for analysis. The large volume of data collected makes it necessary to automatically classify and sort such data on a very large scale. Typically, investigators use computational sequence analysis tools to assign functions to newly found gene products. The problem is to find the functions of

  12. Integrative Annotation of 21,037 Human Genes Validated by Full-Length cDNA Clones

    SciTech Connect

    Imanishi, Tadashi; Itoh, Takeshi; Suzuki, Yutaka; O'Donovan, Claire; Fukuchi, Satoshi; Koyanagi, Kanako O.; Barrero, Roberto A.; Tamura, Takuro; Yamaguchi-Kabata, Yumi; Tanino, Motohiko; Yura, Kei; Miyazaki, Satoru; Ikeo, Kazuho; Homma, Keiichi; Kasprzyk, Arek; Nishikawa, Tetsuo; Hirakawa, Mika; Thierry-Mieg, Jean; Thierry-Mieg, Danielle; Ashurst, Jennifer; Jia, Libin; Nakao, Mitsuteru; Thomas, Michael A.; Mulder, Nicola; Karavidopoulou, Youla; Jin, Lihua; Kim, Sangsoo; Yasuda, Tomohiro; Lenhard, Boris; Eveno, Eric; Suzuki, Yoshiyuki; Yamasaki, Chisato; Takeda, Jun-ichi; Gough, Craig; Hilton, Phillip; Fujii, Yasuyuki; Sakai, Hiroaki; Tanaka, Susumu; Amid, Clara; Bellgard, Matthew; de Fatima Bonaldo, Maria; Bono, Hidemasa; Bromberg, Susan K.; Brookes, Anthony J.; Bruford, Elspeth; Carninci, Piero; Chelala, Claude; Couillault, Christine; de Souza, Sandro J.; Debily, Marie-Anne; Devignes, Marie-Dominique; Dubchak, Inna; Endo, Toshinori; Estreicher, Anne; Eyras, Eduardo; Fukami-Kobayashi, Kaoru; Gopinath, Gopal R.; Graudens, Esther; Hahn, Yoonsoo; Han, Michael; Han, Ze-Guang; Hanada, Kousuke; Hanaoka, Hideki; Harada, Erimi; Hashimoto, Katsuyuki; Hinz, Ursula; Hirai, Momoki; Hishiki, Teruyoshi; Hopkinson, Ian; Imbeaud, Sandrine; Inoko, Hidetoshi; Kanapin, Alexander; Kaneko, Yayoi; Kasukawa, Takeya; Kelso, Janet; Kersey, Paul; Kikuno, Reiko; Kimura, Kouichi; Korn, Bernhard; Kuryshev, Vladimir; Makalowska, Izabela; Makino, Takashi; Mano, Shuhei; Mariage-Samson, Regine; Mashima, Jun; Matsuda, Hideo; Mewes, Hans-Werner; Minoshima, Shinsei; Nagai, Keiichi; Nagasaki, Hideki; Nagata, Naoki; Nigam, Rajni; Ogasawara, Osamu; Ohara, Osamu; Ohtsubo, Masafumi; Okada, Norihiro; Okido, Toshihisa; Oota, Satoshi; Ota, Motonori; Ota, Toshio; Otsuki, Tetsuji; Piatier-Tonneau, Dominique; Poustka, Annemarie; Ren, Shuang-Xi; Saitou, Naruya; Sakai, Katsunaga; Sakamoto, Shigetaka; Sakate, Ryuichi; Schupp, Ingo; Servant, Florence; Sherry, Stephen; Shiba, Rie; et al.

    2004-01-15

    The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult task, as the varieties of transcripts of a gene may vary to a great extent. We thus performed an exhaustive integrative characterization of 41,118 full-length cDNAs that capture the gene transcripts as complete functional cassettes, providing an unequivocal report of structural and functional diversity at the gene level. Our international collaboration has validated 21,037 human gene candidates by analysis of high-quality full-length cDNA clones through curation using unified criteria. This led to the identification of 5,155 new gene candidates. It also manifested the most reliable way to control the quality of the cDNA clones. We have developed a human gene database, called the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/). It provides the following: integrative annotation of human genes, description of gene structures, details of novel alternative splicing isoforms, non-protein-coding RNAs, functional domains, subcellular localizations, metabolic pathways, predictions of protein three-dimensional structure, mapping of known single nucleotide polymorphisms (SNPs), identification of polymorphic microsatellite repeats within human genes, and comparative results with mouse full-length cDNAs. The H-InvDB analysis has shown that up to 4 percent of the human genome sequence (National Center for Biotechnology Information build 34 assembly) may contain misassembled or missing regions. 
We found that 6.5 percent of the human gene candidates (1,377 loci) did not have a good protein-coding open reading frame, of which 296 loci are strong candidates for non-protein-coding RNA genes. In addition, among 72,027 uniquely mapped SNPs and insertions/deletions localized within human genes, 13,215 nonsynonymous SNPs, 315 nonsense SNPs, and 452 indels occurred in coding regions. Together with 25 polymorphic microsatellite repeats present in coding regions, they may alter protein structure, causing phenotypic effects or resulting in disease. The H-InvDB platform represents a substantial contribution to resources needed for the exploration of human biology and pathology.

  13. Functional Annotation of Proteomic Data from Chicken Heterophils and Macrophages Induced by Carbon Nanotube Exposure

    PubMed Central

    Li, Yun-Ze; Cheng, Chung-Shi; Chen, Chao-Jung; Li, Zi-Lin; Lin, Yao-Tung; Chen, Shuen-Ei; Huang, San-Yuan

    2014-01-01

    With the expanding applications of carbon nanotubes (CNT) in biomedicine and agriculture, questions about the toxicity and biocompatibility of CNT in humans and domestic animals are becoming matters of serious concern. This study used proteomic methods to profile gene expression in chicken macrophages and heterophils in response to CNT exposure. Two-dimensional gel electrophoresis identified 12 proteins in macrophages and 15 in heterophils, with differential expression patterns in response to CNT co-incubation (0, 1, 10, and 100 µg/mL of CNT for 6 h) (p < 0.05). Gene ontology analysis showed that most of the differentially expressed proteins are associated with protein interactions, cellular metabolic processes, and cell mobility, suggesting activation of innate immune functions. Western blot analysis with heat shock protein 70, high mobility group protein, and peptidylprolyl isomerase A confirmed the alterations of the profiled proteins. The functional annotations were further confirmed by effective cell migration, promoted interleukin-1β secretion, and more cell death in both macrophages and heterophils exposed to CNT (p < 0.05). In conclusion, results of this study suggest that CNT exposure affects protein expression, leading to activation of macrophages and heterophils, resulting in altered cytoskeleton remodeling, cell migration, and cytokine production, and thereby mediates tissue immune responses. PMID:24823882

  14. Applying Support Vector Machines for Gene ontology based gene function prediction

    Microsoft Academic Search

    Arunachalam Vinayagam; Rainer König; Jutta Moormann; Falk Schubert; Roland Eils; Karl-heinz Glatting; Sándor Suhai

    2004-01-01

    BACKGROUND: The current progress in sequencing projects calls for rapid, reliable and accurate function assignments of gene products. A variety of methods has been designed to annotate sequences on a large scale. However, these methods can either only be applied for specific subsets, or their results are not formalised, or they do not provide precise confidence estimates for their predictions.

  15. Learning Statistical Models for Annotating Proteins with Function Information using Biomedical Text

    Microsoft Academic Search

    Soumya Ray; Mark Craven

    2005-01-01

    Background: The BioCreative text mining evaluation investigated the application of text mining methods to the task of automatically extracting information from text in biomedical research articles. We participated in Task 2 of the evaluation. For this task, we built a system to automatically annotate a given protein with codes from the Gene Ontology (GO) using the text of an article

  16. A statistical framework to predict functional non-coding regions in the human genome through integrated analysis of annotation data.

    PubMed

    Lu, Qiongshi; Hu, Yiming; Sun, Jiehuan; Cheng, Yuwei; Cheung, Kei-Hoi; Zhao, Hongyu

    2015-01-01

    Identifying functional regions in the human genome is a major goal in human genetics. Great efforts have been made to functionally annotate the human genome either through computational predictions, such as genomic conservation, or high-throughput experiments, such as the ENCODE project. These efforts have resulted in a rich collection of functional annotation data of diverse types that need to be jointly analyzed for integrated interpretation and annotation. Here we present GenoCanyon, a whole-genome annotation method that performs unsupervised statistical learning using 22 computational and experimental annotations thereby inferring the functional potential of each position in the human genome. With GenoCanyon, we are able to predict many of the known functional regions. The ability of predicting functional regions as well as its generalizable statistical framework makes GenoCanyon a unique and powerful tool for whole-genome annotation. The GenoCanyon web server is available at http://genocanyon.med.yale.edu. PMID:26015273

  17. A Statistical Framework to Predict Functional Non-Coding Regions in the Human Genome Through Integrated Analysis of Annotation Data

    PubMed Central

    Lu, Qiongshi; Hu, Yiming; Sun, Jiehuan; Cheng, Yuwei; Cheung, Kei-Hoi; Zhao, Hongyu

    2015-01-01

    Identifying functional regions in the human genome is a major goal in human genetics. Great efforts have been made to functionally annotate the human genome either through computational predictions, such as genomic conservation, or high-throughput experiments, such as the ENCODE project. These efforts have resulted in a rich collection of functional annotation data of diverse types that need to be jointly analyzed for integrated interpretation and annotation. Here we present GenoCanyon, a whole-genome annotation method that performs unsupervised statistical learning using 22 computational and experimental annotations thereby inferring the functional potential of each position in the human genome. With GenoCanyon, we are able to predict many of the known functional regions. The ability of predicting functional regions as well as its generalizable statistical framework makes GenoCanyon a unique and powerful tool for whole-genome annotation. The GenoCanyon web server is available at http://genocanyon.med.yale.edu. PMID:26015273

  18. Woods: A fast and accurate functional annotator and classifier of genomic and metagenomic sequences.

    PubMed

    Sharma, Ashok K; Gupta, Ankit; Kumar, Sanjiv; Dhakan, Darshan B; Sharma, Vineet K

    2015-07-01

    Functional annotation of the gigantic metagenomic data is one of the major time-consuming and computationally demanding tasks, which is currently a bottleneck for the efficient analysis. The commonly used homology-based methods to functionally annotate and classify proteins are extremely slow. Therefore, to achieve faster and accurate functional annotation, we have developed an orthology-based functional classifier 'Woods' by using a combination of machine learning and similarity-based approaches. Woods displayed a precision of 98.79% on independent genomic dataset, 96.66% on simulated metagenomic dataset and >97% on two real metagenomic datasets. In addition, it performed >87 times faster than BLAST on the two real metagenomic datasets. Woods can be used as a highly efficient and accurate classifier with high-throughput capability which facilitates its usability on large metagenomic datasets. PMID:25863333

  19. De Novo Assembly, Gene Annotation, and Marker Discovery in Stored-Product Pest Liposcelis entomophila (Enderlein) Using Transcriptome Sequences

    PubMed Central

    Wei, Dan-Dan; Chen, Er-Hu; Ding, Tian-Bo; Chen, Shi-Chun; Dou, Wei; Wang, Jin-Jun

    2013-01-01

    Background: As a major stored-product pest insect, Liposcelis entomophila has developed high levels of resistance to various insecticides in grain storage systems. However, the molecular mechanisms underlying resistance and environmental stress have not been characterized. To date, there is a lack of genomic information for this species. Therefore, studies aimed at profiling the L. entomophila transcriptome would provide a better understanding of the biological functions at the molecular level. Methodology/Principal Findings: We applied Illumina sequencing technology to sequence the transcriptome of L. entomophila. A total of 54,406,328 clean reads were obtained and de novo assembled into 54,220 unigenes, with an average length of 571 bp. Through a similarity search, 33,404 (61.61%) unigenes were matched to known proteins in the NCBI non-redundant (Nr) protein database. These unigenes were further functionally annotated with gene ontology (GO), cluster of orthologous groups of proteins (COG), and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. A large number of genes potentially involved in insecticide resistance were manually curated, including 68 putative cytochrome P450 genes, 37 putative glutathione S-transferase (GST) genes, 19 putative carboxyl/cholinesterase (CCE) genes, and 126 other transcripts containing target site sequences or encoding detoxification genes representing eight types of resistance enzymes. Furthermore, to gain insight into the molecular basis of the L. entomophila response to thermal stresses, 25 heat shock protein (Hsp) genes were identified. In addition, 1,100 SSRs and 57,757 SNPs were detected and 231 pairs of SSR primers were designed for investigating genetic diversity in the future. Conclusions/Significance: We developed a comprehensive transcriptomic database for L. entomophila. These sequences and putative molecular markers would further promote our understanding of the molecular mechanisms underlying insecticide resistance or environmental stress, and will facilitate studies on population genetics for psocids, as well as providing useful information for functional genomic research in the future. PMID:24244605

  20. VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment

    PubMed Central

    Habegger, Lukas; Balasubramanian, Suganthi; Chen, David Z.; Khurana, Ekta; Sboner, Andrea; Harmanci, Arif; Rozowsky, Joel; Clarke, Declan; Snyder, Michael; Gerstein, Mark

    2012-01-01

    Summary: The functional annotation of variants obtained through sequencing projects is generally assumed to be a simple intersection of genomic coordinates with genomic features. However, complexities arise for several reasons, including the differential effects of a variant on alternatively spliced transcripts, as well as the difficulty in assessing the impact of small insertions/deletions and large structural variants. Taking these factors into consideration, we developed the Variant Annotation Tool (VAT) to functionally annotate variants from multiple personal genomes at the transcript level as well as obtain summary statistics across genes and individuals. VAT also allows visualization of the effects of different variants, integrates allele frequencies and genotype data from the underlying individuals and facilitates comparative analysis between different groups of individuals. VAT can either be run through a command-line interface or as a web application. Finally, in order to enable on-demand access and to minimize unnecessary transfers of large data files, VAT can be run as a virtual machine in a cloud-computing environment. Availability and Implementation: VAT is implemented in C and PHP. The VAT web service, Amazon Machine Image, source code and detailed documentation are available at vat.gersteinlab.org. Contact: lukas.habegger@yale.edu or mark.gerstein@yale.edu Supplementary Information: Supplementary data are available at Bioinformatics online. PMID:22743228

  1. Biases in the Experimental Annotations of Protein Function and Their Effect on Our Understanding of Protein Function Space

    PubMed Central

    Schnoes, Alexandra M.; Ream, David C.; Thorman, Alexander W.; Babbitt, Patricia C.; Friedberg, Iddo

    2013-01-01

    The ongoing functional annotation of proteins relies upon the work of curators to capture experimental findings from scientific literature and apply them to protein sequence and structure data. However, with the increasing use of high-throughput experimental assays, a small number of experimental studies dominate the functional protein annotations collected in databases. Here, we investigate just how prevalent is the “few articles - many proteins” phenomenon. We examine the experimentally validated annotation of proteins provided by several groups in the GO Consortium, and show that the distribution of proteins per published study is exponential, with 0.14% of articles providing the source of annotations for 25% of the proteins in the UniProt-GOA compilation. Since each of the dominant articles describes the use of an assay that can find only one function or a small group of functions, this leads to substantial biases in what we know about the function of many proteins. Mass-spectrometry, microscopy and RNAi experiments dominate high throughput experiments. Consequently, the functional information derived from these experiments is mostly of the subcellular location of proteins, and of the participation of proteins in embryonic developmental pathways. For some organisms, the information provided by different studies overlap by a large amount. We also show that the information provided by high throughput experiments is less specific than those provided by low throughput experiments. Given the experimental techniques available, certain biases in protein function annotation due to high-throughput experiments are unavoidable. Knowing that these biases exist and understanding their characteristics and extent is important for database curators, developers of function annotation programs, and anyone who uses protein function annotation data to plan experiments. PMID:23737737

  2. Involving undergraduates in the annotation and analysis of global gene expression studies: creation of a maize shoot apical meristem expression database.

    PubMed

    Buckner, Brent; Beck, Jon; Browning, Kate; Fritz, Ashleigh; Grantham, Lisa; Hoxha, Eneda; Kamvar, Zhian; Lough, Ashley; Nikolova, Olga; Schnable, Patrick S; Scanlon, Michael J; Janick-Buckner, Diane

    2007-06-01

    Through a multi-university and interdisciplinary project we have involved undergraduate biology and computer science research students in the functional annotation of maize genes and the analysis of their microarray expression patterns. We have created a database to house the results of our functional annotation of >4400 genes identified as being differentially regulated in the maize shoot apical meristem (SAM). This database is located at http://sam.truman.edu and is now available for public use. The undergraduate students involved in constructing this unique SAM database received hands-on training in an intellectually challenging environment, which has prepared them for graduate and professional careers in biological sciences. We describe our experiences with this project as a model for effective research-based teaching of undergraduate biology and computer science students, as well as for a rich professional development experience for faculty at predominantly undergraduate institutions. PMID:17409087

  3. An algorithm for identifying clusters of functionally related genes in genomes 

    E-print Network

    Yi, Gang Man

    2009-05-15

    as in properties of gene clusters, including size distribution and functional annotation. These properties may be diagnostic of the evolutionary forces that lead to the formation of gene clusters. The approach finds all gene clusters in the data set and ranks them...

  4. UniqTag: Content-Derived Unique and Stable Identifiers for Gene Annotation

    PubMed Central

    Jackman, Shaun D.; Bohlmann, Joerg; Birol, İnanç

    2015-01-01

    When working on an ongoing genome sequencing and assembly project, it is rather inconvenient when gene identifiers change from one build of the assembly to the next. The gene labelling system described here, UniqTag, addresses this common challenge. UniqTag assigns a unique identifier to each gene that is a representative k-mer, a string of length k, selected from the sequence of that gene. Unlike serial numbers, these identifiers are stable between different assemblies and annotations of the same data without requiring that previous annotations be lifted over by sequence alignment. We assign UniqTag identifiers to ten builds of the Ensembl human genome spanning eight years to demonstrate this stability. The implementation of UniqTag in Ruby and an R package are available at https://github.com/sjackman/uniqtag. The R package is also available from CRAN: install.packages("uniqtag"). Supplementary material and code to reproduce it are available at https://github.com/sjackman/uniqtag-paper. PMID:26020645
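
    The abstract above does not say how the representative k-mer is chosen. The following is a minimal Python sketch of one plausible rule, stated here as an assumption rather than as the published method: for each gene, take the k-mers that occur in the fewest genes of the set and pick the lexicographically smallest of them. The function names `kmers` and `assign_tags` and the toy sequences are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def assign_tags(genes, k):
    """Pick one representative k-mer per gene.

    Assumed rule (not taken from the abstract): among the k-mers of a
    gene, prefer those shared with the fewest genes, and break ties by
    lexicographic order.
    """
    # Count in how many genes each k-mer occurs (each gene counts once).
    gene_count = Counter()
    for seq in genes.values():
        gene_count.update(kmers(seq, k))

    tags = {}
    for name, seq in genes.items():
        candidates = kmers(seq, k)
        if not candidates:
            # Sequence shorter than k: fall back to the whole sequence.
            tags[name] = seq
            continue
        tags[name] = min(candidates, key=lambda km: (gene_count[km], km))
    return tags


genes = {"g1": "ACGTAC", "g2": "ACGTTT"}
tags = assign_tags(genes, 3)
```

    Because the tag depends only on sequence content, renaming or reordering genes between builds leaves it unchanged in this sketch; a tag can still change if the gene sequence itself or the set of competing genes changes.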

  5. UTMGO: A Tool for Searching a Group of Semantically Related Gene Ontology Terms and Application to Annotation of Anonymous Protein Sequence

    Microsoft Academic Search

    Razib M. Othman; Safaai Deris; Rosli M. Illias

    Gene Ontology terms have been actively used to annotate various protein sets. SWISS-PROT, TrEMBL, and InterPro are protein databases that are annotated according to the Gene Ontology terms. However, direct implementation of the Gene Ontology terms for annotation of anonymous protein sequences is not easy, especially for species not commonly represented in biological databases. UTMGO is developed as a tool

  6. Annotation and re-sequencing of genes from de novo transcriptome assembly of Abies alba (Pinaceae)

    PubMed Central

    Roschanski, Anna M.; Fady, Bruno; Ziegenhagen, Birgit; Liepelt, Sascha

    2013-01-01

    • Premise of the study: We present a protocol for the annotation of transcriptome sequence data and the identification of candidate genes therein using the example of the nonmodel conifer Abies alba. • Methods and Results: A normalized cDNA library was built from an A. alba seedling. The sequencing on a 454 platform yielded more than 1.5 million reads that were de novo assembled into 25149 contigs. Two complementary approaches were applied to annotate gene fragments that code for (1) well-known proteins and (2) proteins that are potentially adaptively relevant. Primer development and testing yielded 88 amplicons that could successfully be resequenced from genomic DNA. • Conclusions: The annotation workflow offers an efficient way to identify potential adaptively relevant genes from the large quantity of transcriptome sequence data. The primer set presented should be prioritized for single-nucleotide polymorphism detection in adaptively relevant genes in A. alba. PMID:25202477

  7. A meta-approach for improving the prediction and the functional annotation of ortholog groups

    PubMed Central

    2014-01-01

    Background In comparative genomics, orthologs are used to transfer annotation from genes already characterized to newly sequenced genomes. Many methods have been developed for finding orthologs in sets of genomes. However, the application of different methods on the same proteome set can lead to distinct orthology predictions. Methods We developed a method based on a meta-approach that is able to combine the results of several methods for orthologous group prediction. The purpose of this method is to produce better quality results by using the overlapping results obtained from several individual orthologous gene prediction procedures. Our method proceeds in two steps. The first aims to construct seeds for groups of orthologous genes; these seeds correspond to the exact overlaps between the results of all or several methods. In the second step, these seed groups are expanded by using HMM profiles. Results We evaluated our method on two standard reference benchmarks, OrthoBench and Orthology Benchmark Service. Our method presents a higher level of accurately predicted groups than the individual input methods of orthologous group prediction. Moreover, our method increases the number of annotated orthologous pairs without decreasing the annotation quality compared to twelve state-of-the-art methods. Conclusions The meta-approach based method appears to be a reliable procedure for predicting orthologous groups. Since a large number of methods for predicting groups of orthologous genes exist, it is quite conceivable to apply this meta-approach to several combinations of different methods. PMID:25573073
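
    The first step described above, building seeds from exact overlaps between methods, can be illustrated with a short Python sketch. It assumes each method's output is a list of gene-identifier sets and that a seed is any group predicted identically by at least `min_methods` methods; the function name `seed_groups`, the threshold parameter, and the toy data are hypothetical, and the HMM-profile expansion step is not shown.

```python
from collections import Counter


def seed_groups(predictions, min_methods):
    """Return groups predicted identically by at least min_methods methods.

    predictions: one entry per method, each a list of sets of gene IDs.
    """
    support = Counter()
    for groups in predictions:
        # Deduplicate within a method so each method votes once per group.
        for group in {frozenset(g) for g in groups}:
            support[group] += 1
    return {group for group, n in support.items() if n >= min_methods}


predictions = [
    [{"a", "b"}, {"c", "d"}],       # method 1
    [{"a", "b"}, {"c", "d", "e"}],  # method 2
    [{"a", "b"}, {"c", "d"}],       # method 3
]
strict_seeds = seed_groups(predictions, 3)
relaxed_seeds = seed_groups(predictions, 2)
```

    Requiring agreement from all methods gives fewer but more reliable seeds, while a lower threshold corresponds to the "all or several methods" wording of the abstract.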

  8. Annotated genetic linkage maps of Pinus pinaster Ait. from a Central Spain population using microsatellite and gene based markers

    PubMed Central

    2012-01-01

    Background Pinus pinaster Ait. is a major resin producing species in Spain. Genetic linkage mapping can facilitate marker-assisted selection (MAS) through the identification of Quantitative Trait Loci and selection of allelic variants of interest in breeding populations. In this study, we report annotated genetic linkage maps for two individuals (C14 and C15) belonging to a breeding program aiming to increase resin production. We use different types of DNA markers, including last-generation molecular markers. Results We obtained 13 and 14 linkage groups for C14 and C15 maps, respectively. A total of 211 and 215 markers were positioned on each map and estimated genome length was between 1,870 and 2,166 cM respectively, which represents near 65% of genome coverage. Comparative mapping with previously developed genetic linkage maps for P. pinaster based on about 60 common markers enabled aligning linkage groups to this reference map. The comparison of our annotated linkage maps and linkage maps reporting QTL information revealed 11 annotated SNPs in candidate genes that co-localized with previously reported QTLs for wood properties and water use efficiency. Conclusions This study provides genetic linkage maps from a Spanish population that shows high levels of genetic divergence with French populations from which segregating progenies have been previously mapped. These genetic maps will be of interest to construct a reliable consensus linkage map for the species. The importance of developing functional genetic linkage maps is highlighted, especially when working with breeding populations for its future application in MAS for traits of interest. PMID:23036012

  9. Semantic Particularity Measure for Functional Characterization of Gene Sets Using Gene Ontology

    PubMed Central

    Bettembourg, Charles; Diot, Christian; Dameron, Olivier

    2014-01-01

    Background: Genetic and genomic data analyses are outputting large sets of genes. Functional comparison of these gene sets is a key part of the analysis, as it identifies their shared functions, and the functions that distinguish each set. The Gene Ontology (GO) initiative provides a unified reference for analyzing the genes' molecular functions, biological processes and cellular components. Numerous semantic similarity measures have been developed to systematically quantify the weight of the GO terms shared by two genes. We studied how gene set comparisons can be improved by considering gene set particularity in addition to gene set similarity. Results: We propose a new approach to compute gene set particularities based on the information conveyed by GO terms. A GO term's informativeness can be computed using either its information content based on the term frequency in a corpus, or a function of the term's distance to the root. We defined the semantic particularity of a set of GO terms Sg1 compared to another set of GO terms Sg2. We combined our particularity measure with a similarity measure to compare gene sets. We demonstrated that the combination of semantic similarity and semantic particularity measures was able to identify genes with particular functions from among similar genes. This differentiation was not recognized using only a semantic similarity measure. Conclusion: Semantic particularity should be used in conjunction with semantic similarity to perform functional analysis of GO-annotated gene sets. The principle is generalizable to other ontologies. PMID:24489737
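
    As a small illustration of the frequency-based informativeness mentioned above, the Python sketch below computes the usual information content, IC(t) = -log p(t), where p(t) is the fraction of annotated genes carrying term t. It assumes annotations have already been propagated to ancestor terms, so the root is carried by every gene and has IC 0. The particularity measure itself is not defined in the abstract and is not reproduced here; the function name and the toy corpus are hypothetical.

```python
import math


def information_content(annotations):
    """Map each term to -log of the fraction of genes annotated with it.

    annotations: dict of gene ID -> set of terms, assumed to already
    include all ancestor terms of each direct annotation.
    """
    n_genes = len(annotations)
    counts = {}
    for terms in annotations.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    return {term: -math.log(c / n_genes) for term, c in counts.items()}


corpus = {
    "g1": {"root", "A"},
    "g2": {"root", "A", "B"},
    "g3": {"root"},
    "g4": {"root", "A"},
}
ic = information_content(corpus)
```

    Rare terms receive a high score and terms shared by every gene receive zero, which is why frequency-based scores depend on the corpus chosen.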

  10. Assessing the impact of comparative genomic sequence data on the functional annotation of the Drosophila genome

    Microsoft Academic Search

    Casey M Bergman; Barret D Pfeiffer; Diego E Rincón-Limas; Roger A Hoskins; Andreas Gnirke; Chris J Mungall; Adrienne M Wang; Brent Kronmiller; Joanne Pacleb; Soo Park; Mark Stapleton; Kenneth Wan; Reed A George; Pieter J de Jong; Juan Botas; Gerald M Rubin; Susan E Celniker

    2002-01-01

    Background: It is widely accepted that comparative sequence data can aid the functional annotation of genome sequences; however, the most informative species and features of genome evolution for comparison remain to be determined. Results: We analyzed conservation in eight genomic regions (apterous, even-skipped, fushi tarazu, twist, and Rhodopsins 1, 2, 3 and 4) from four Drosophila species (D. erecta, D.

  11. Comparative genomic analysis of the family Iridoviridae: re-annotating and defining the core set of iridovirus genes

    PubMed Central

    Eaton, Heather E; Metcalf, Julie; Penny, Emily; Tcherepanov, Vasily; Upton, Chris; Brunetti, Craig R

    2007-01-01

    Background Members of the family Iridoviridae can cause severe diseases resulting in significant economic and environmental losses. Very little is known about how iridoviruses cause disease in their host. In the present study, we describe the re-analysis of the Iridoviridae family of complex DNA viruses using a variety of comparative genomic tools to yield a greater consensus among the annotated sequences of its members. Results A series of genomic sequence comparisons were made among, and between the Ranavirus and Megalocytivirus genera in order to identify novel conserved ORFs. Of these two genera, the Megalocytivirus genomes required the greatest number of altered annotations. Prior to our re-analysis, the Megalocytivirus species orange-spotted grouper iridovirus and rock bream iridovirus shared 99% sequence identity, but only 82 out of 118 potential ORFs were annotated; in contrast, we predict that these species share an identical complement of genes. These annotation changes allowed the redefinition of the group of core genes shared by all iridoviruses. Seven new core genes were identified, bringing the total number to 26. Conclusion Our re-analysis of genomes within the Iridoviridae family provides a unifying framework to understand the biology of these viruses. Further re-defining the core set of iridovirus genes will continue to lead us to a better understanding of the phylogenetic relationships between individual iridoviruses as well as giving us a much deeper understanding of iridovirus replication. In addition, this analysis will provide a better framework for characterizing and annotating currently unclassified iridoviruses. PMID:17239238

  12. Functional Annotation Analytics of Bacillus Genomes Reveals Stress Responsive Acetate Utilization and Sulfate Uptake in the Biotechnologically Relevant Bacillus megaterium

    PubMed Central

    Williams, Baraka S.; Isokpehi, Raphael D.; Mbah, Andreas N.; Hollman, Antoinesha L.; Bernard, Christina O.; Simmons, Shaneka S.; Ayensu, Wellington K.; Garner, Bianca L.

    2012-01-01

    Bacillus species form a heterogeneous group of Gram-positive bacteria that include members that are disease-causing, biotechnologically-relevant, and can serve as biological research tools. A common feature of Bacillus species is their ability to survive in harsh environmental conditions by formation of resistant endospores. Genes encoding the universal stress protein (USP) domain confer cellular and organismal survival during unfavorable conditions such as nutrient depletion. As of February 2012, the genome sequences and a variety of functional annotations for at least 123 Bacillus isolates including 45 Bacillus cereus isolates were available in public domain bioinformatics resources. Additionally, the genome sequencing status of 10 of the B. cereus isolates was annotated as finished, with each genome encoding 3 USP genes. The conservation of gene neighborhood of the 140 aa universal stress protein in the B. cereus genomes led to the identification of a predicted plasmid-encoded transcriptional unit that includes a USP gene and a sulfate uptake gene in the soil-inhabiting Bacillus megaterium. Gene neighborhood analysis combined with visual analytics of chemical ligand binding sites data provided knowledge-building biological insights on possible cellular functions of B. megaterium universal stress proteins. These functions include sulfate and potassium uptake, acid extrusion, cellular energy-level sensing, survival in high oxygen conditions and acetate utilization. Of particular interest was a two-gene transcriptional unit that consisted of genes for a universal stress protein and a sirtuin Sir2 (deacetylase enzyme for NAD+-dependent acetate utilization). The predicted transcriptional units for stress responsive inorganic sulfate uptake and acetate utilization could explain biological mechanisms for survival of soil-inhabiting Bacillus species in sulfate and acetate limiting conditions. Considering the key role of sirtuins in mammalian physiology, additional research on the USP-Sir2 transcriptional unit of B. megaterium could help explain mammalian acetate metabolism in glucose-limiting conditions such as caloric restriction. Finally, the deep-rooted position of B. megaterium in the phylogeny of Bacillus species makes the investigation of the functional coupling of acetate utilization and stress response compelling. PMID:23226010

  13. A statistical framework for improving genomic annotations of transposon mutagenesis (TM) assigned essential genes.

    PubMed

    Deng, Jingyuan

    2015-01-01

    Whole-genome transposon mutagenesis (TM) followed by sequence-based identification of insertion sites is the most popular genome-wide experiment to identify essential genes in Prokaryota. However, due to the limitations of the high-throughput technique, this approach yields substantial systematic biases resulting in the incorrect assignment of many essential genes. To obtain unbiased and accurate annotations of essential genes from TM experiments, we developed a novel Poisson model based statistical framework to refine these TM assignments. In the model, we first identified several potential factors that may cause TM assignment biases, such as gene length and TM insertion information, and incorporated them into the basic Poisson model. Then we calculated the conditional probability of an essential gene given the observed TM insertion number. By factorizing this probability through introducing a latent variable, the real insertion number, we formalized the statistical framework. Through iteratively updating and optimizing model parameters to maximize the goodness-of-fit of the model to the observed TM insertion data, we finalized the model. Using this model, we are able to assign a probability score of essentiality to each individual gene given its TM assignment, which subsequently corrects the experimental biases. To make our model widely usable, we established a user-friendly Web-server that is accessible to the public: http://research.cchmc.org/essentialgene/. PMID:25636618
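
    The idea of a conditional probability of essentiality given an observed insertion count can be illustrated with a much simpler two-class Poisson mixture and Bayes' rule. The Python sketch below is only that simplified illustration: it omits the latent real insertion number and the iterative parameter fitting described in the abstract, and the per-base insertion rates, the prior, and the function names are hypothetical values chosen for the example.

```python
import math


def poisson_log_pmf(n, lam):
    """Log of the Poisson probability of n events with mean lam (lam > 0)."""
    return n * math.log(lam) - lam - math.lgamma(n + 1)


def prob_essential(n_insertions, gene_length, rate_essential,
                   rate_nonessential, prior_essential=0.5):
    """Posterior probability that a gene is essential, by Bayes' rule.

    Assumes insertion counts are Poisson with a mean proportional to
    gene length, using a low per-base rate for essential genes and a
    higher one for non-essential genes (both hypothetical inputs).
    """
    log_e = math.log(prior_essential) + poisson_log_pmf(
        n_insertions, rate_essential * gene_length)
    log_n = math.log(1.0 - prior_essential) + poisson_log_pmf(
        n_insertions, rate_nonessential * gene_length)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_e, log_n)
    w_e = math.exp(log_e - top)
    w_n = math.exp(log_n - top)
    return w_e / (w_e + w_n)


# A 1000 bp gene with expected counts of 0.5 (essential) and 10 (non-essential).
p_none = prob_essential(0, 1000, 0.0005, 0.01)
p_few = prob_essential(3, 1000, 0.0005, 0.01)
p_many = prob_essential(10, 1000, 0.0005, 0.01)
```

    Scaling the expected count by gene length is one way to express the gene-length bias the abstract mentions: a short gene with zero insertions is weaker evidence of essentiality than a long gene with zero insertions.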

  14. Genome Annotation by Shotgun Inactivation of a Native Gene in Hemizygous Cells: Application to BRCA2 with Implication of Hypomorphic Variants

    PubMed Central

    Ghosh, Soma; Bhunia, Anil K.; Paun, Bogdan C.; Gilbert, Samuel F.; Dhru, Urmil; Patel, Kalpesh; Kern, Scott E.

    2015-01-01

    The greatest interpretive challenge of modern medicine may be to functionally annotate the vast variation of human genomes. Demonstrating a proposed approach, we created a library of BRCA2 exon 27 shotgun-mutant plasmids including solitary and multiplex mutations to generate human knockin clones using homologous recombination. This 55-mutation, 13-clone syngeneic variance library (SyVaL) comprised severely affected clones having early-stop nonsense mutations, functionally hypomorphic clones having multiple missense mutations emphasizing the potential to identify and assess hypomorphic mutations in novel proteomic and epidemiologic studies, and neutral clones having multiple missense mutations. Efficient coverage of nonessential amino acids was provided by mutation multiplexing. Severe mutations were distinguished from hypomorphic or neutral changes by chemosensitivity assays (hypersensitivity to mitomycin C and acetaldehyde), by analysis of RAD51 focus formation, and by mitotic multipolarity. A multiplex unbiased approach of generating all-human SyVaLs in medically important genes, with random mutations in native genes, would provide databases of variants that could be functionally annotated without concerns arising from exogenous cDNA constructs or interspecies interactions, as a basis for subsequent proteomic domain mapping or clinical calibration if desired. Such gene-irrelevant approaches could be scaled up for multiple genes of clinical interest, providing distributable cellular libraries linked to public-shared functional databases. PMID:25451944

  15. tagtog: interactive and text-mining-assisted annotation of gene mentions in PLOS full-text articles.

    PubMed

    Cejuela, Juan Miguel; McQuilton, Peter; Ponting, Laura; Marygold, Steven J; Stefancsik, Raymund; Millburn, Gillian H; Rost, Burkhard

    2014-01-01

    The breadth and depth of biomedical literature are increasing year upon year. To keep abreast of these increases, FlyBase, a database for Drosophila genomic and genetic information, is constantly exploring new ways to mine the published literature to increase the efficiency and accuracy of manual curation and to automate some aspects, such as triaging and entity extraction. Toward this end, we present the 'tagtog' system, a web-based annotation framework that can be used to mark up biological entities (such as genes) and concepts (such as Gene Ontology terms) in full-text articles. tagtog leverages manual user annotation in combination with automatic machine-learned annotation to provide accurate identification of gene symbols and gene names. As part of the BioCreative IV Interactive Annotation Task, FlyBase has used tagtog to identify and extract mentions of Drosophila melanogaster gene symbols and names in full-text biomedical articles from the PLOS stable of journals. We show here the results of three experiments with different sized corpora and assess gene recognition performance and curation speed. We conclude that tagtog-named entity recognition improves with a larger corpus and that tagtog-assisted curation is quicker than manual curation. DATABASE URL: www.tagtog.net, www.flybase.org. PMID:24715220

  16. Functional Analysis of the Molecular Interactions of TATA Box-Containing Genes and Essential Genes

    PubMed Central

    Moon, Jisook

    2015-01-01

    Genes can be divided into TATA-containing genes and TATA-less genes according to the presence of TATA box elements at promoter regions. TATA-containing genes tend to be stress-responsive, whereas many TATA-less genes are known to be related to cell growth or “housekeeping” functions. In a previous study, we demonstrated that there are striking differences among four gene sets defined by the presence of TATA box (TATA-containing) and essentiality (TATA-less) with respect to number of associated transcription factors, amino acid usage, and functional annotation. Extending this research in yeast, we identified KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways that are statistically enriched in TATA-containing or TATA-less genes and evaluated the possibility that the enriched pathways are related to stress or growth as reflected by the individual functions of the genes involved. According to their enrichment for either of these two gene sets, we sorted KEGG pathways into TATA-containing-gene-enriched pathways (TEPs) and essential-gene-enriched pathways (EEPs). As expected, genes in TEPs and EEPs exhibited opposite results in terms of functional category, transcriptional regulation, codon adaptation index, and network properties, suggesting the possibility that the bipolar patterns in these pathways also contribute to the regulation of the stress response and to cell survival. Our findings provide the novel insight that significant enrichment of TATA-binding or TATA-less genes defines pathways as stress-responsive or growth-related. PMID:25789484
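
Pathway enrichment of the kind described here is commonly assessed with a one-sided hypergeometric test. The abstract does not name the test that was used, so the following is a sketch under that assumption; the function name and the gene counts in the usage example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(
    genome_genes: int,       # all genes considered (population size)
    genome_hits: int,        # genes in the set of interest, e.g. TATA-containing genes
    pathway_genes: int,      # genes annotated to the pathway (sample size)
    pathway_hits: int,       # pathway genes that are also in the set of interest
) -> float:
    """Upper-tail probability P(X >= pathway_hits) for a hypergeometric X.

    A small value indicates that the pathway contains more genes from the
    set of interest than expected when drawing pathway members at random.
    """
    if not (0 <= genome_hits <= genome_genes and 0 <= pathway_genes <= genome_genes):
        raise ValueError("counts are inconsistent with the population size")
    upper = min(genome_hits, pathway_genes)
    total = comb(genome_genes, pathway_genes)
    # Exact integer arithmetic; math.comb returns 0 when the draw is impossible.
    tail = sum(
        comb(genome_hits, i) * comb(genome_genes - genome_hits, pathway_genes - i)
        for i in range(pathway_hits, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical counts: 6,000 genes, 1,000 in the set, a 50-gene pathway with 20 hits.
    print(hypergeom_enrichment_pvalue(6000, 1000, 50, 20))
```

When many pathways are tested, the resulting p-values would normally be corrected for multiple testing before pathways are sorted into enriched groups.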

  17. Coordinated international action to accelerate genome-to-phenome with FAANG, the Functional Annotation of Animal Genomes project.

    PubMed

    Andersson, Leif; Archibald, Alan L; Bottema, Cynthia D; Brauning, Rudiger; Burgess, Shane C; Burt, Dave W; Casas, Eduardo; Cheng, Hans H; Clarke, Laura; Couldrey, Christine; Dalrymple, Brian P; Elsik, Christine G; Foissac, Sylvain; Giuffra, Elisabetta; Groenen, Martien A; Hayes, Ben J; Huang, LuSheng S; Khatib, Hassan; Kijas, James W; Kim, Heebal; Lunney, Joan K; McCarthy, Fiona M; McEwan, John C; Moore, Stephen; Nanduri, Bindu; Notredame, Cedric; Palti, Yniv; Plastow, Graham S; Reecy, James M; Rohrer, Gary A; Sarropoulou, Elena; Schmidt, Carl J; Silverstein, Jeffrey; Tellam, Ross L; Tixier-Boichard, Michele; Tosser-Klopp, Gwenola; Tuggle, Christopher K; Vilkki, Johanna; White, Stephen N; Zhao, Shuhong; Zhou, Huaijun

    2015-01-01

    We describe the organization of a nascent international effort, the Functional Annotation of Animal Genomes (FAANG) project, whose aim is to produce comprehensive maps of functional elements in the genomes of domesticated animal species. PMID:25854118

  18. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes

    Microsoft Academic Search

    Andreas Ruepp; Alfred Zollner; Dieter Maier; Kaj Albermann; Jean Hani; Martin Mokrejs; Igor Tetko; Ulrich G; Gertrud Mannhaupt; H. Werner Mewes

    2004-01-01

    In this paper, we present the Functional Catalogue (FunCat), a hierarchically structured, organism-independent, flexible and scalable controlled classification system enabling the functional description of proteins from any organism. FunCat has been applied for the manual annotation of prokaryotes, fungi, plants and animals. We describe how FunCat is implemented as a highly efficient and robust tool for the manual

  19. Gene Ontology annotation of sequence-specific DNA binding transcription factors: setting the stage for a large-scale curation effort.

    PubMed

    Tripathi, Sushil; Christie, Karen R; Balakrishnan, Rama; Huntley, Rachael; Hill, David P; Thommesen, Liv; Blake, Judith A; Kuiper, Martin; Lægreid, Astrid

    2013-01-01

    Transcription factors control which information in a genome becomes transcribed to produce RNAs that function in the biological systems of cells and organisms. Reliable and comprehensive information about transcription factors is invaluable for large-scale network-based studies. However, existing transcription factor knowledge bases are still lacking in well-documented functional information. Here, we provide guidelines for a curation strategy, which constitutes a robust framework for using the controlled vocabularies defined by the Gene Ontology Consortium to annotate specific DNA binding transcription factors (DbTFs) based on experimental evidence reported in literature. Our standardized protocol and workflow for annotating specific DNA binding RNA polymerase II transcription factors is designed to document high-quality and decisive evidence from valid experimental methods. Within a collaborative biocuration effort involving the user community, we are now in the process of exhaustively annotating the full repertoire of human, mouse and rat proteins that qualify as DbTFs in as much as they are experimentally documented in the biomedical literature today. The completion of this task will significantly enrich Gene Ontology-based information resources for the research community. Database URL: www.tfcheckpoint.org. PMID:23981286

  20. Assessing functional annotation transfers with inter-species conserved coexpression: application to Plasmodium falciparum

    Microsoft Academic Search

    Laurent Bréhélin; Isabelle Florent; Olivier Gascuel; Éric Maréchal

    2010-01-01

    BACKGROUND: Plasmodium falciparum is the main causative agent of malaria. Of the 5 484 predicted genes of P. falciparum, about 57% do not have sufficient sequence similarity to characterized genes in other species to warrant functional assignments. Non-homology methods are thus needed to obtain functional clues for these uncharacterized genes. Gene expression data have been widely used in the recent

  1. Proteogenomic analysis of polymorphisms and gene annotation divergences in prokaryotes using a clustered mass spectrometry-friendly database

    Microsoft Academic Search

    G. A. de Souza; M. O. Arntzen; S. Fortuin; A. C. Schurch; H. Malen; C. R. McEvoy; D. van Soolingen; B. Thiede; R. M. Warren; H. G. Wiker

    2011-01-01

    Precise annotation of genes or open reading frames is still a difficult task that results in divergence even for data generated from the same genomic sequence. This has an impact in further proteomic studies, and also compromises the characterization of clinical isolates with many specific genetic variations that may not be represented in the selected database. We recently developed software

  2. Gene prediction and annotation in Penstemon (Plantaginaceae): A workflow for marker development from extremely low-coverage genome sequencing

    PubMed Central

    Blischak, Paul D.; Wenzel, Aaron J.; Wolfe, Andrea D.

    2014-01-01

    • Premise of the study: Penstemon (Plantaginaceae) is a large and diverse genus endemic to North America. However, determining the phylogenetic relationships among its 280 species has been difficult due to its recent evolutionary radiation. The development of a large, multilocus data set can help to resolve this challenge. • Methods: Using both previously sequenced genomic libraries and our own low-coverage whole-genome shotgun sequencing libraries, we used the MAKER2 Annotation Pipeline to identify gene regions for the development of sequencing loci from six extremely low-coverage Penstemon genomes (~0.005×–0.007×). We also compared this approach to BLAST searches, and conducted analyses to characterize sequence divergence across the species sequenced. • Results: Annotations and gene predictions were successfully added to more than 10,000 contigs for potential use in downstream primer design. Primers were then designed for chloroplast, mitochondrial, and nuclear loci from these annotated sequences. MAKER2 identified longer gene regions in all six Penstemon genomes when compared with BLASTN and BLASTX searches. The average level of sequence divergence among the six species was 7.14%. • Discussion: Combining bioinformatics tools into a workflow that produces annotations can be useful for creating potential phylogenetic markers from thousands of sequences even when genome coverage is extremely low and reference data are only available from distant relatives. Furthermore, the output from MAKER2 contains information about important gene features, such as exon boundaries, and can be easily integrated with visualization tools to facilitate the process of marker development. PMID:25506519

  3. The FEATURE framework for protein function annotation: modeling new functions, improving performance, and extending to novel applications

    PubMed Central

    Halperin, Inbal; Glazer, Dariya S; Wu, Shirley; Altman, Russ B

    2008-01-01

    Structural genomics efforts contribute new protein structures that often lack significant sequence and fold similarity to known proteins. Traditional sequence and structure-based methods may not be sufficient to annotate the molecular functions of these structures. Techniques that combine structural and functional modeling can be valuable for functional annotation. FEATURE is a flexible framework for modeling and recognition of functional sites in macromolecular structures. Here, we present an overview of the main components of the FEATURE framework, and describe the recent developments in its use. These include automating training sets selection to increase functional coverage, coupling FEATURE to structural diversity generating methods such as molecular dynamics simulations and loop modeling methods to improve performance, and using FEATURE in large-scale modeling and structure determination efforts. PMID:18831785

  4. Ribosome Profiling Reveals Pervasive Translation Outside of Annotated Protein-Coding Genes

    PubMed Central

    Ingolia, Nicholas T.; Brar, Gloria A.; Stern-Ginossar, Noam; Harris, Michael S.; Talhouarne, Gaëlle J. S.; Jackson, Sarah E.; Wills, Mark R.; Weissman, Jonathan S.

    2014-01-01

    SUMMARY Ribosome profiling suggests that ribosomes occupy many regions of the transcriptome thought to be non-coding, including 5′ UTRs and lncRNAs. Apparent ribosome footprints outside of protein-coding regions raise the possibility of artifacts unrelated to translation, particularly when they occupy multiple, overlapping open reading frames (ORFs). Here we show hallmarks of translation in these footprints: co-purification with the large ribosomal subunit, response to drugs targeting elongation, trinucleotide periodicity, and initiation at early AUGs. We develop a metric for distinguishing between 80S footprints and nonribosomal sources using footprint size distributions, which validates the vast majority of footprints outside of coding regions. We present evidence for polypeptide production beyond annotated genes, including induction of immune responses following human cytomegalovirus (HCMV) infection. Translation is pervasive on cytosolic transcripts outside of conserved reading frames, and direct detection of this expanded universe of translated products enables efforts to understand how cells manage and exploit its consequences. PMID:25159147
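
Trinucleotide periodicity, one of the hallmarks of translation named in this abstract, can be checked by tallying the reading frame of footprint positions. The following is a minimal sketch, not the authors' analysis: it assumes each footprint has already been reduced to a single nucleotide offset measured from the first base of a candidate start codon, and the function name and input values are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def frame_fractions(offsets: Iterable[int]) -> Tuple[float, float, float]:
    """Fraction of footprint positions falling in reading frames 0, 1 and 2.

    Offsets are nucleotide distances from the first base of the start codon.
    Translating ribosomes advance one codon at a time, so genuine footprints
    are expected to concentrate in one frame; a roughly even split across the
    three frames argues against translation.
    """
    offsets = list(offsets)
    if not offsets:
        raise ValueError("no footprint positions supplied")
    # Python's modulo is non-negative for a positive divisor, so upstream
    # (negative) offsets are assigned to a frame consistently.
    counts = Counter(offset % 3 for offset in offsets)
    total = len(offsets)
    return tuple(counts.get(frame, 0) / total for frame in range(3))


if __name__ == "__main__":
    example_offsets = [0, 3, 6, 9, 1, 2]  # hypothetical footprint positions
    print(frame_fractions(example_offsets))
```

In practice the position assigned to each footprint depends on read length and on the offset to the ribosomal P site, which this sketch leaves to the caller.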

  5. Escherichia coli K-12: a cooperatively developed annotation snapshot--2005

    Microsoft Academic Search

    Monica Riley; Takashi Abe; Martha B. Arnaud; Mary K. B. Berlyn; Frederick R. Blattner; Roy R. Chaudhuri; Jeremy D. Glasner; Takashi Horiuchi; Ingrid M. Keseler; Takehide Kosuge; Hirotada Mori; Nicole T. Perna; Guy Plunkett; Kenneth E. Rudd; Margrethe H. Serres; Gavin H. Thomas; Nicholas R. Thomson; David Wishart; Barry L. Wanner

    2006-01-01

    The goal of this group project has been to coordinate and bring up-to-date information on all genes of Escherichia coli K-12. Annotation of the genome of an organism entails identification of genes, the boundaries of genes in terms of precise start and end sites, and description of the gene products. Known and predicted functions were assigned to each gene product

  6. FAST-NMR: Functional Annotation Screening Technology Using NMR Spectroscopy

    E-print Network

    Powers, Robert

    .; Fellenberg, M.; Heumann, K.; Mewes, H.-W. Nucleic Acids Res. 2003, 31, 207. (4) Kanehisa, M.; Goto, S.; Kawashima, S.; Okuno, Y.; Hattori, M. Nucleic Acids Res. 2004, 32, D277. Figure 1. Functional information-ligand interactions are determined through a tiered NMR screen using a library composed of compounds with known

  7. Phylogenetic molecular function annotation Barbara E Engelhardt

    E-print Network

    , they are often difficult to apply on a genome-wide scale because of the time-consuming step of reconstructing to discover thousands of protein sequences in a new microbial genome than it is to biochemically characterize a statistical graphical model to compute the probabilities of molecular functions for unannotated proteins. Our

  8. Culturable diversity and functional annotation of psychrotrophic bacteria from cold desert of Leh Ladakh (India).

    PubMed

    Yadav, Ajar Nath; Sachan, Shashwati Ghosh; Verma, Priyanka; Tyagi, Satya Prakash; Kaushik, Rajeev; Saxena, Anil K

    2015-01-01

    To study culturable bacterial diversity under subzero temperature conditions and their possible functional annotation, soil and water samples from Leh Ladakh region were analysed. Ten different nutrient combinations were used to isolate the maximum possible culturable morphotypes. A total of 325 bacterial isolates were characterized employing 16S rDNA-Amplified Ribosomal DNA Restriction Analysis with three restriction endonucleases AluI, MspI and HaeIII, which led to formation of 23-40 groups for the different sites at 75 % similarity index, adding up to 175 groups. Phylogenetic analysis based on 16S rRNA gene sequencing led to the identification of 175 bacteria, grouped in four phyla, Firmicutes (54 %), Proteobacteria (28 %), Actinobacteria (16 %) and Bacteroidetes (3 %), and included 29 different genera with 57 distinct species. Overall 39 % of the total morphotypes belonged to the Bacillus and Bacillus derived genera (BBDG) followed by Pseudomonas (14 %), Arthrobacter (9 %), Exiguobacterium (8 %), Alishewanella (4 %), Brachybacterium, Providencia, Planococcus (3 %), Janthinobacterium, Sphingobacterium, Kocuria (2 %) and Aurantimonas, Citricoccus, Cellulosimicrobium, Brevundimonas, Desemzia, Flavobacterium, Klebsiella, Paracoccus, Psychrobacter, Sporosarcina, Staphylococcus, Sinobaca, Stenotrophomonas, Sanguibacter, Vibrio (1 %). The representative isolates from each cluster were screened for their plant growth promoting characteristics at low temperature (5-15 °C). Variations were observed among strains for production of ammonia, hydrogen cyanide, indole-3-acetic acid and siderophore, solubilisation of phosphate, 1-aminocyclopropane-1-carboxylate deaminase activity and biocontrol activity against Rhizoctonia solani and Macrophomina phaseolina. Cold adapted microbes may have application as inoculants and biocontrol agents in crops growing at high altitudes under cold climate condition. PMID:25371316

  9. Emerging applications of read profiles towards the functional annotation of the genome

    PubMed Central

    Pundhir, Sachin; Poirazi, Panayiota; Gorodkin, Jan

    2015-01-01

    Functional annotation of the genome is important to understand the phenotypic complexity of various species. The road toward functional annotation involves several challenges ranging from experiments on individual molecules to large-scale analysis of high-throughput sequencing (HTS) data. HTS data is typically a result of the protocol designed to address specific research questions. The sequencing results in reads, which when mapped to a reference genome often leads to the formation of distinct patterns (read profiles). Interpretation of these read profiles is essential for their analysis in relation to the research question addressed. Several strategies have been employed at varying levels of abstraction ranging from a somewhat ad hoc to a more systematic analysis of read profiles. These include methods which can compare read profiles, e.g., from direct (non-sequence based) alignments to classification of patterns into functional groups. In this review, we highlight the emerging applications of read profiles for the annotation of non-coding RNA and cis-regulatory elements (CREs) such as enhancers and promoters. We also discuss the biological rationale behind their formation. PMID:26042150

  10. Automated Update, Revision, and Quality Control of the Maize Genome Annotations Using MAKER-P Improves the B73 RefGen_v3 Gene Models and Identifies New Genes

    PubMed Central

    Law, MeiYee; Childs, Kevin L.; Campbell, Michael S.; Stein, Joshua C.; Olson, Andrew J.; Holt, Carson; Panchy, Nicholas; Lei, Jikai; Jiao, Dian; Andorf, Carson M.; Lawrence, Carolyn J.; Ware, Doreen; Shiu, Shin-Han; Sun, Yanni; Jiang, Ning; Yandell, Mark

    2015-01-01

    The large size and relative complexity of many plant genomes make creation, quality control, and dissemination of high-quality gene structure annotations challenging. In response, we have developed MAKER-P, a fast and easy-to-use genome annotation engine for plants. Here, we report the use of MAKER-P to update and revise the maize (Zea mays) B73 RefGen_v3 annotation build (5b+) in less than 3 h using the iPlant Cyberinfrastructure. MAKER-P identified and annotated 4,466 additional, well-supported protein-coding genes not present in the 5b+ annotation build, added additional untranslated regions to 1,393 5b+ gene models, identified 2,647 5b+ gene models that lack any supporting evidence (despite the use of large and diverse evidence data sets), identified 104,215 pseudogene fragments, and created an additional 2,522 noncoding gene annotations. We also describe a method for de novo training of MAKER-P for the annotation of newly sequenced grass genomes. Collectively, these results lead to the 6a maize genome annotation and demonstrate the utility of MAKER-P for rapid annotation, management, and quality control of grasses and other difficult-to-annotate plant genomes. PMID:25384563

  11. PHYLOGENOMICS - GUIDED VALIDATION OF FUNCTION FOR CONSERVED UNKNOWN GENES

    SciTech Connect

    De Crecy-Lagard, V; Hanson, A D

    2012-01-03

    Identifying functions for all gene products in all sequenced organisms is a central challenge of the post-genomic era. However, at least 30-50% of the proteins encoded by any given genome are of unknown function, or wrongly or vaguely annotated. Many of these 'unknown' proteins are common to prokaryotes and plants. We accordingly set out to predict and experimentally test the functions of such proteins. Our approach to functional prediction is integrative, coupling the extensive post-genomic resources available for plants with comparative genomics based on hundreds of microbial genomes, and functional genomic datasets from model microorganisms. The early phase is computer-assisted; later phases incorporate intellectual input from expert plant and microbial biochemists. The approach thus bridges the gap between automated homology-based annotations and the classical gene discovery efforts of experimentalists, and is much more powerful than purely computational approaches to identifying gene-function associations. Among Arabidopsis genes, we focused on those (2,325 in total) that (i) are unique or belong to families with no more than three members, (ii) are conserved between plants and prokaryotes, and (iii) have unknown or poorly known functions. Computer-assisted selection of promising targets for deeper analysis was based on homology-independent characteristics associated in the SEED database with the prokaryotic members of each family, specifically gene clustering and phyletic spread, as well as availability of functional genomics data, and publications that could link candidate families to general metabolic areas, or to specific functions. In-depth comparative genomic analysis was then performed for about 500 top candidate families, which connected ~55 of them to general areas of metabolism and led to specific functional predictions for a subset of ~25 more. Twenty predicted functions were experimentally tested in at least one prokaryotic organism via reverse genetics, metabolic profiling, functional complementation, and recombinant protein biochemistry. Our approach predicted and validated functions for 10 formerly uncharacterized protein families common to plants and prokaryotes; none of these functions had previously been correctly predicted by computational methods. The functions of five more are currently being validated. Experimental testing of diverse representatives of these families combined with in silico analysis allowed accurate projection of the annotations to hundreds more sequenced genomes.

  12. Recurrent use of evolutionary importance for functional annotation of proteins based on local structural similarity.

    PubMed

    Kristensen, David M; Chen, Brian Y; Fofanov, Viacheslav Y; Ward, R Matthew; Lisewski, Andreas Martin; Kimmel, Marek; Kavraki, Lydia E; Lichtarge, Olivier

    2006-06-01

    The annotation of protein function has not kept pace with the exponential growth of raw sequence and structure data. An emerging solution to this problem is to identify 3D motifs or templates in protein structures that are necessary and sufficient determinants of function. Here, we demonstrate the recurrent use of evolutionary trace information to construct such 3D templates for enzymes, search for them in other structures, and distinguish true from spurious matches. Serine protease templates built from evolutionarily important residues distinguish between proteases and other proteins nearly as well as the classic Ser-His-Asp catalytic triad. In 53 enzymes spanning 33 distinct functions, an automated pipeline identifies functionally related proteins with an average positive predictive power of 62%, including correct matches to proteins with the same function but with low sequence identity (the average identity for some templates is only 17%). Although these template building, searching, and match classification strategies are not yet optimized, their sequential implementation demonstrates a functional annotation pipeline which does not require experimental information, but only local molecular mimicry among a small number of evolutionarily important residues. PMID:16672239

  13. Gene function, gene networks and the fate of duplicated genes.

    PubMed

    Shimeld, S M

    1999-10-01

    For both copies of a duplicated gene to become fixed in a population and subsequently maintained, selection must favour individuals with both genes over individuals with one. Here I review and assess some of the proposed ways that gene structure and function might affect the likelihood of both copies acquiring distinct functions and therefore positive selection. In particular I focus on the interacting pathways of genes that make up gene networks, and how these may affect genes duplicated both singly and en masse. Using the Wnt and hedgehog pathways as examples and data from developmental and genome analyses, I show that, while some of these theories may genuinely reflect what has occurred in animal evolution, there are still insufficient data to rigorously assess their relative importance. This, however, is likely to change in the near future. PMID:10597639

  14. The NFI-Regulome Database: A tool for annotation and analysis of control regions of genes regulated by Nuclear Factor I transcription factors

    PubMed Central

    2011-01-01

    Background: Genome annotation plays an essential role in the interpretation and use of genome sequence information. While great strides have been made in the annotation of coding regions of genes, less success has been achieved in the annotation of the regulatory regions of genes, including promoters, enhancers/silencers, and other regulatory elements. One reason for this disparity in annotated information is that coding regions can be assessed using high-throughput techniques such as EST sequencing, while annotation of regulatory regions often requires a gene-by-gene approach. Results: The NFI-Regulome database (http://nfiregulome.ccr.buffalo.edu) was designed to promote easy annotation of the regulatory regions of genes that contain binding sites for the NFI (Nuclear Factor I) family of transcription factors, using data from the published literature. Binding sites are annotated together with the sequence of the gene, obtained from the UCSC Genome site, and the locations of all binding sites for multiple genes can be displayed in a number of formats designed to facilitate inter-gene comparisons. Classes of genes based on expression pattern, disease involvement, or types of binding sites present can be readily compared in order to assess common "architectural" structures in the regulatory regions. Conclusions: The NFI-Regulome database allows rapid display of the relative locations and number of transcription factor binding sites of individual or defined sets of genes that contain binding sites for NFI transcription factors. This database may in the future be expanded into a distributed database structure including other families of transcription factors. Such databases may be useful for identifying common regulatory structures in genes essential for organ development, tissue-specific gene expression or those genes related to specific diseases. PMID:21884625

  15. WS-SNPs&GO: a web server for predicting the deleterious effect of human protein variants using functional annotation

    PubMed Central

    2013-01-01

    Background: SNPs&GO is a method for the prediction of deleterious Single Amino acid Polymorphisms (SAPs) using protein functional annotation. In this work, we present the web server implementation of SNPs&GO (WS-SNPs&GO). The server is based on Support Vector Machines (SVM) and for a given protein, its input comprises: the sequence and/or its three-dimensional structure (when available), a set of target variations and its functional Gene Ontology (GO) terms. The output of the server provides, for each protein variation, the probabilities to be associated to human diseases. Results: The server consists of two main components, including updated versions of the sequence-based SNPs&GO (recently scored as one of the best algorithms for predicting deleterious SAPs) and of the structure-based SNPs&GO3d programs. Sequence and structure based algorithms are extensively tested on a large set of annotated variations extracted from the SwissVar database. Selecting a balanced dataset with more than 38,000 SAPs, the sequence-based approach achieves 81% overall accuracy, 0.61 correlation coefficient and an Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve of 0.88. For the subset of ~6,600 variations mapped on protein structures available at the Protein Data Bank (PDB), the structure-based method scores with 84% overall accuracy, 0.68 correlation coefficient, and 0.91 AUC. When tested on a new blind set of variations, the results of the server are 79% and 83% overall accuracy for the sequence-based and structure-based inputs, respectively. Conclusions: WS-SNPs&GO is a valuable tool that includes in a unique framework information derived from protein sequence, structure, evolutionary profile, and protein function. WS-SNPs&GO is freely available at http://snps.biofold.org/snps-and-go. PMID:23819482
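
The overall accuracy and correlation coefficient quoted in this abstract are standard summaries of a binary confusion matrix. The sketch below assumes that the "correlation coefficient" is the Matthews correlation coefficient, which the abstract does not state explicitly; the function names and the counts in the usage example are hypothetical and are not the paper's data.

```python
import math


def overall_accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of variants classified correctly (disease-related vs. neutral)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("empty confusion matrix")
    return (tp + tn) / total


def matthews_correlation(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient, ranging from -1 to +1.

    Returns 0.0 when any marginal total is zero, a common convention for the
    otherwise undefined case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Hypothetical balanced test set of 100 variants.
    counts = dict(tp=40, tn=45, fp=5, fn=10)
    print(overall_accuracy(**counts))
    print(matthews_correlation(**counts))
```

Unlike accuracy, the Matthews coefficient stays informative when the two classes differ greatly in size, which is one reason both figures are often reported together.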

  16. The Zebrafish GenomeWiki: a crowdsourcing approach to connect the long tail for zebrafish gene annotation.

    PubMed

    Singh, Meghna; Bhartiya, Deeksha; Maini, Jayant; Sharma, Meenakshi; Singh, Angom Ramcharan; Kadarkaraisamy, Subburaj; Rana, Rajiv; Sabharwal, Ankit; Nanda, Srishti; Ramachandran, Aravindhakshan; Mittal, Ashish; Kapoor, Shruti; Sehgal, Paras; Asad, Zainab; Kaushik, Kriti; Vellarikkal, Shamsudheen Karuthedath; Jagga, Divya; Muthuswami, Muthulakshmi; Chauhan, Rajendra K; Leonard, Elvin; Priyadarshini, Ruby; Halimani, Mahantappa; Malhotra, Sunny; Patowary, Ashok; Vishwakarma, Harinder; Joshi, Prateek; Bhardwaj, Vivek; Bhaumik, Arijit; Bhatt, Bharat; Jha, Aamod; Kumar, Aalok; Budakoti, Prerna; Lalwani, Mukesh Kumar; Meli, Rajeshwari; Jalali, Saakshi; Joshi, Kandarp; Pal, Koustav; Dhiman, Heena; Laddha, Saurabh V; Jadhav, Vaibhav; Singh, Naresh; Pandey, Vikas; Sachidanandan, Chetana; Ekker, Stephen C; Klee, Eric W; Scaria, Vinod; Sivasubbu, Sridhar

    2014-01-01

    A large repertoire of gene-centric data has been generated in the field of zebrafish biology. Although the bulk of these data are available in the public domain, most of them are not readily accessible or available in nonstandard formats. One major challenge is to unify and integrate these widely scattered data sources. We tested the hypothesis that active community participation could be a viable option to address this challenge. We present here our approach to create standards for assimilation and sharing of information and a system of open standards for database intercommunication. We have attempted to address this challenge by creating a community-centric solution for zebrafish gene annotation. The Zebrafish GenomeWiki is a 'wiki'-based resource, which aims to provide an altruistic shared environment for collective annotation of the zebrafish genes. The Zebrafish GenomeWiki has features that enable users to comment, annotate, edit and rate this gene-centric information. The credits for contributions can be tracked through a transparent microattribution system. In contrast to other wikis, the Zebrafish GenomeWiki is a 'structured wiki' or rather a 'semantic wiki'. The Zebrafish GenomeWiki implements a semantically linked data structure, which in the future would be amenable to semantic search. Database URL: http://genome.igib.res.in/twiki. PMID:24578356

  17. Annotation extension through protein family annotation coherence metrics

    PubMed Central

    Bastos, Hugo P.; Clarke, Luka A.; Couto, Francisco M.

    2013-01-01

    Protein functional annotation consists in associating proteins with textual descriptors elucidating their biological roles. The bulk of annotation is done via automated procedures that ultimately rely on annotation transfer. Despite a large number of existing protein annotation procedures the ever growing protein space is never completely annotated. One of the facets of annotation incompleteness derives from annotation uncertainty. Often when protein function cannot be predicted with enough specificity it is instead conservatively annotated with more generic terms. In a scenario of protein families or functionally related (or even dissimilar) sets this leads to a more difficult task of using annotations to compare the extent of functional relatedness among all family or set members. However, we postulate that identifying sub-sets of functionally coherent proteins annotated at a very specific level, can help the annotation extension of other incompletely annotated proteins within the same family or functionally related set. As an example we analyse the status of annotation of a set of CAZy families belonging to the Polysaccharide Lyase class. We show that through the use of visualization methods and semantic similarity based metrics it is possible to identify families and respective annotation terms within them that are suitable for possible annotation extension. Based on our analysis we then propose a semi-automatic methodology leading to the extension of single annotation terms within these partially annotated protein sets or families. PMID:24130572

  18. Towards multidimensional genome annotation

    Microsoft Academic Search

    Jennifer L. Reed; Iman Famili; Ines Thiele; Bernhard O. Palsson

    2006-01-01

    Our information about the gene content of organisms continues to grow as more genomes are sequenced and gene products are characterized. Sequence-based annotation efforts have led to a list of cellular components, which can be thought of as a one-dimensional annotation. With growing information about component interactions, facilitated by the advancement of various high-throughput technologies, systemic, or two-dimensional, annotations can

  19. De novo Cloning and Annotation of Genes Associated with Immunity, Detoxification and Energy Metabolism from the Fat Body of the Oriental Fruit Fly, Bactrocera dorsalis

    PubMed Central

    Yang, Wen-Jia; Yuan, Guo-Rui; Cong, Lin; Xie, Yi-Fei; Wang, Jin-Jun

    2014-01-01

    The oriental fruit fly, Bactrocera dorsalis, is a destructive pest in tropical and subtropical areas. In this study, we performed transcriptome-wide analysis of the fat body of B. dorsalis and obtained more than 59 million sequencing reads, which were assembled into 27,787 unigenes with an average length of 591 bp. Among them, 17,442 (62.8%) unigenes matched known proteins in the NCBI database. The assembled sequences were further annotated with gene ontology, cluster of orthologous group terms, and Kyoto encyclopedia of genes and genomes. In-depth analysis was performed to identify genes putatively involved in immunity, detoxification, and energy metabolism. Many new genes were identified including serpins, peptidoglycan recognition proteins and defensins, which were potentially linked to immune defense. Many detoxification genes were identified, including cytochrome P450s, glutathione S-transferases and ATP-binding cassette (ABC) transporters. Many new transcripts possibly involved in energy metabolism, including fatty acid desaturases, lipases, alpha amylases, and trehalose-6-phosphate synthases, were identified. Moreover, we randomly selected some genes to examine their expression patterns in different tissues by quantitative real-time PCR, which indicated that some genes exhibited fat body-specific expression in B. dorsalis. The identification of numerous transcripts in the fat body of B. dorsalis laid the foundation for future studies on the functions of these genes. PMID:24710118

  20. Protein function annotation with Structurally Aligned Local Sites of Activity (SALSAs)

    PubMed Central

    2013-01-01

    Background The prediction of biochemical function from the 3D structure of a protein has proved to be much more difficult than was originally foreseen. A reliable method to test the likelihood of putative annotations and to predict function from structure would add tremendous value to structural genomics data. We report on a new method, Structurally Aligned Local Sites of Activity (SALSA), for the prediction of biochemical function based on a local structural match at the predicted catalytic or binding site. Results Implementation of the SALSA method is described. For the structural genomics protein PY01515 (PDB ID 2aqw) from Plasmodium yoelii, it is shown that the putative annotation, Orotidine 5'-monophosphate decarboxylase (OMPDC), is most likely correct. SALSA analysis of YP_001304206.1 (PDB ID 3h3l), a putative sugar hydrolase from Parabacteroides distasonis, shows that its active site does not bear close resemblance to any previously characterized member of its superfamily, the Concanavalin A-like lectins/glucanases. It is noted that three residues in the active site of the thermophilic beta-1,4-xylanase from Nonomuraea flexuosa (PDB ID 1m4w), Y78, E87, and E176, overlap with POOL-predicted residues of similar type, Y168, D153, and E232, in YP_001304206.1. The substrate recognition regions of the two proteins are rather different, suggesting that YP_001304206.1 is a new functional type within the superfamily. A structural genomics protein from Mycobacterium avium (PDB ID 3q1t) has been reported to be an enoyl-CoA hydratase (ECH), but SALSA analysis shows a poor match between the predicted residues for the SG protein and those of known ECHs. A better local structural match is obtained with Anabaena beta-diketone hydrolase (ABDH), a known β-diketone hydrolase from Cyanobacterium anabaena (PDB ID 2j5s). This suggests that the reported ECH function of the SG protein is incorrect and that it is more likely a β-diketone hydrolase. Conclusions A local site match provides a more compelling function prediction than that obtainable from a simple 3D structure match. The present method can confirm putative annotations, identify misannotation, and in some cases suggest a more probable annotation. PMID:23514271

  1. ASAP: the Alternative Splicing Annotation Project

    Microsoft Academic Search

    Christopher Lee; Levan Atanelov; Barmak Modrek; Yi Xing

    2003-01-01

    Recently, genomics analyses have demonstrated that alternative splicing is widespread in mammalian genomes (30-60% of genes reported to have multiple isoforms), and may be one of their most important mechanisms of functional regulation. However, by comparison with other genomics data such as genome annotation, SNPs, or gene expression, there exists relatively little database infrastructure for the study of alternative splicing. We have constructed

  2. BambooGDB: a bamboo genome database with functional annotation and an analysis platform.

    PubMed

    Zhao, Hansheng; Peng, Zhenhua; Fei, Benhua; Li, Lubin; Hu, Tao; Gao, Zhimin; Jiang, Zehui

    2014-01-01

    Bamboo, as one of the most important non-timber forest products and fastest-growing plants in the world, represents the only major lineage of grasses that is native to forests. Recent success on the first high-quality draft genome sequence of moso bamboo (Phyllostachys edulis) provides new insights on bamboo genetics and evolution. To further extend our understanding of the bamboo genome and facilitate future studies on the basis of previous achievements, here we have developed BambooGDB, a bamboo genome database with functional annotation and an analysis platform. The de novo sequencing data, together with the full-length complementary DNA and RNA-seq data of moso bamboo, compose the main contents of this database. Based on these sequence data, a comprehensive functional annotation of the bamboo genome was made. In addition, an analytical platform composed of comparative genomic analysis, protein-protein interaction networks, pathway analysis and visualization of genomic data was also constructed. As a discovery tool to understand and identify biological mechanisms of bamboo, the platform can be used as a systematic framework for designing experiments for further validation. Moreover, diverse and powerful search tools and a convenient browser were incorporated to facilitate the navigation of these data. As far as we know, this is the first genome database for bamboo. Through integrating high-throughput sequencing data, a full functional annotation and several analysis modules, BambooGDB aims to provide worldwide researchers with a central genomic resource and an extensible analysis platform for the bamboo genome. BambooGDB is freely available at http://www.bamboogdb.org/. Database URL: http://www.bamboogdb.org. PMID:24602877

  3. Assessing the impact of comparative genomic sequence data on the functional annotation of the Drosophila genome

    Microsoft Academic Search

    Casey M Bergman; Barret D Pfeiffer; Diego E Rincón-Limas; Roger A Hoskins; Andreas Gnirke; Chris J Mungall; Adrienne M Wang; Brent Kronmiller; Joanne Pacleb; Soo Park; Mark Stapleton; Kenneth Wan; Reed A George; Pieter J de Jong; Juan Botas; Gerald M Rubin; Susan E Celniker

    2002-01-01

    Background It is widely accepted that comparative sequence data can aid the functional annotation of genome sequences; however, the most informative species and features of genome evolution for comparison remain to be determined. Results We analyzed conservation in eight genomic regions (apterous, even-skipped, fushi tarazu, twist, and Rhodopsins 1, 2, 3 and 4) from four Drosophila species (D. erecta, D. pseudoobscura, D.

  4. Investigating Semantic Similarity Measures Across the Gene Ontology: The Relationship Between Sequence and Annotation

    Microsoft Academic Search

    Phillip W. Lord; Robert D. Stevens; Andy Brass; Carole A. Goble

    2003-01-01

    Motivation: Many bioinformatics data resources not only hold data in the form of sequences, but also as annotation. In the majority of cases, annotation is written as scientific natural language: this is suitable for humans, but not particularly useful for machine processing. Ontologies offer a mechanism by which knowledge can be represented in a form capable of such processing.

  5. De Novo Assembly and Annotation of the Transcriptome of the Agricultural Weed Ipomoea purpurea Uncovers Gene Expression Changes Associated with Herbicide Resistance

    PubMed Central

    Leslie, Trent; Baucom, Regina S.

    2014-01-01

    Human-mediated selection can lead to rapid evolution in very short time scales, and the evolution of herbicide resistance in agricultural weeds is an excellent example of this phenomenon. The common morning glory, Ipomoea purpurea, is resistant to the herbicide glyphosate, but genetic investigations of this trait have been hampered by the lack of genomic resources for this species. Here, we present the annotated transcriptome of the common morning glory, Ipomoea purpurea, along with an examination of whole genome expression profiling to assess potential gene expression differences between three artificially selected herbicide resistant lines and three susceptible lines. The assembled Ipomoea transcriptome reported in this work contains 65,459 assembled transcripts, ~28,000 of which were functionally annotated by assignment to Gene Ontology categories. Our RNA-seq survey using this reference transcriptome identified 19 differentially expressed genes associated with resistance—one of which, a cytochrome P450, belongs to a large plant family of genes involved in xenobiotic detoxification. The differentially expressed genes also broadly implicated receptor-like kinases, which were down-regulated in the resistant lines, and other growth and defense genes, which were up-regulated in resistant lines. Interestingly, the target of glyphosate—EPSP synthase—was not overexpressed in the resistant Ipomoea lines as in other glyphosate resistant weeds. Overall, this work identifies potential candidate resistance loci for future investigations and dramatically increases genomic resources for this species. The assembled transcriptome presented herein will also provide a valuable resource to the Ipomoea community, as well as to those interested in utilizing the close relationship between the Convolvulaceae and the Solanaceae for phylogenetic and comparative genomics examinations. PMID:25155274

  6. De novo assembly and annotation of the transcriptome of the agricultural weed Ipomoea purpurea uncovers gene expression changes associated with herbicide resistance.

    PubMed

    Leslie, Trent; Baucom, Regina S

    2014-10-01

    Human-mediated selection can lead to rapid evolution in very short time scales, and the evolution of herbicide resistance in agricultural weeds is an excellent example of this phenomenon. The common morning glory, Ipomoea purpurea, is resistant to the herbicide glyphosate, but genetic investigations of this trait have been hampered by the lack of genomic resources for this species. Here, we present the annotated transcriptome of the common morning glory, Ipomoea purpurea, along with an examination of whole genome expression profiling to assess potential gene expression differences between three artificially selected herbicide resistant lines and three susceptible lines. The assembled Ipomoea transcriptome reported in this work contains 65,459 assembled transcripts, ~28,000 of which were functionally annotated by assignment to Gene Ontology categories. Our RNA-seq survey using this reference transcriptome identified 19 differentially expressed genes associated with resistance, one of which, a cytochrome P450, belongs to a large plant family of genes involved in xenobiotic detoxification. The differentially expressed genes also broadly implicated receptor-like kinases, which were down-regulated in the resistant lines, and other growth and defense genes, which were up-regulated in resistant lines. Interestingly, the target of glyphosate, EPSP synthase, was not overexpressed in the resistant Ipomoea lines as in other glyphosate resistant weeds. Overall, this work identifies potential candidate resistance loci for future investigations and dramatically increases genomic resources for this species. The assembled transcriptome presented herein will also provide a valuable resource to the Ipomoea community, as well as to those interested in utilizing the close relationship between the Convolvulaceae and the Solanaceae for phylogenetic and comparative genomics examinations. PMID:25155274

  7. Functionally Enigmatic Genes: A Case Study of the Brain Ignorome

    PubMed Central

    Pandey, Ashutosh K.; Lu, Lu; Wang, Xusheng; Homayouni, Ramin; Williams, Robert W.

    2014-01-01

    What proportion of genes with intense and selective expression in specific tissues, cells, or systems are still almost completely uncharacterized with respect to biological function? In what ways do these functionally enigmatic genes differ from well-studied genes? To address these two questions, we devised a computational approach that defines so-called ignoromes. As proof of principle, we extracted and analyzed a large subset of genes with intense and selective expression in brain. We find that publications associated with this set are highly skewed—the top 5% of genes absorb 70% of the relevant literature. In contrast, approximately 20% of genes have essentially no neuroscience literature. Analysis of the ignorome over the past decade demonstrates that it is stubbornly persistent, and the rapid expansion of the neuroscience literature has not had the expected effect on numbers of these genes. Surprisingly, ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, early discovery being associated with greater research momentum—a genomic bandwagon effect. Finally we ask to what extent massive genomic, imaging, and phenotype data sets can be used to provide high-throughput functional annotation for an entire ignorome. In a majority of cases we have been able to extract and add significant information for these neglected genes. In several cases—ELMOD1, TMEM88B, and DZANK1—we have exploited sequence polymorphisms, large phenome data sets, and reverse genetic methods to evaluate the function of ignorome genes. PMID:24523945

  8. Correlation between Gene Expression and GO Semantic Similarity

    Microsoft Academic Search

    Jose L. Sevilla; Victor Segura; Adam Podhorski; Elizabeth Guruceaga; Jose M. Mato; Luis A. Martinez-Cruz; Fernando J. Corrales; Angel Rubio

    2005-01-01

    This research analyzes some aspects of the relationship between gene expression, gene function, and gene annotation. Many recent studies are implicitly based on the assumption that gene products that are biologically and functionally related would maintain this similarity both in their expression profiles as well as in their Gene Ontology (GO) annotation. We analyze how accurate this assumption proves to

  9. FIGENIX: Intelligent automation of genomic annotation: expertise integration in a new software platform

    Microsoft Academic Search

    Philippe Gouret; Vérane Vitiello; Nathalie Balandraud; André Gilles; Pierre Pontarotti; Etienne G. J. Danchin

    2005-01-01

    Background: Two of the main objectives of the genomic and post-genomic era are to structurally and functionally annotate genomes which consists of detecting genes' position and structure, and inferring their function (as well as of other features of genomes). Structural and functional annotation both require the complex chaining of numerous different software, algorithms and methods under the supervision of a

  10. Taxonomic and functional annotation of gut bacterial communities of Eisenia foetida and Perionyx excavatus.

    PubMed

    Singh, Arjun; Singh, Dushyant P; Tiwari, Rameshwar; Kumar, Kanika; Singh, Ran Vir; Singh, Surender; Prasanna, Radha; Saxena, Anil K; Nain, Lata

    2015-06-01

    Epigeic earthworms can significantly hasten the decomposition of organic matter, which is known to be mediated by gut associated microflora. However, there is scant information on the abundance and diversity of the gut bacterial flora in different earthworm genera fed with a similar diet, particularly Eisenia foetida and Perionyx excavatus. In this context, a 16S rDNA based clonal survey of gut metagenomic DNA was carried out after growth of these two earthworms on lignocellulosic biomass. A set of 67 clonal sequences belonging to E. foetida and 75 to P. excavatus were taxonomically annotated using MG-RAST and RDP pipeline servers. The highest number of sequences was annotated to Proteobacteria (38-44%), followed by unclassified bacteria (14-18%) and Firmicutes (9.3-11%). Comparative analyses revealed significantly higher abundance of Actinobacteria and Firmicutes in the gut of P. excavatus. The functional annotation for the 16S rDNA clonal libraries of both the metagenomes revealed a high abundance of xylan degraders (12.1-24.1%). However, chitin degraders (16.7%), ammonia oxidizers (24.1%) and nitrogen fixers (7.4%) were relatively higher in E. foetida, while in P. excavatus, sulphate reducers and sulphate oxidizers (12.1-29.6%) were more abundant. Lignin degradation was detected in 3.7% clones of E. foetida, while cellulose degraders represented 1.7%. The gut microbiomes showed relative abundance of dehalogenators (17.2-22.2%) and aromatic hydrocarbon degraders (1.7-5.6%), illustrating their role in bioremediation. This study highlights the significance of differences in the inherent microbiome of these two earthworms in shaping the metagenome for effective degradation of different types of biomass under tropical conditions. PMID:25813857

  11. Towards revealing the functions of all genes in plants.

    PubMed

    Rhee, Seung Yon; Mutwil, Marek

    2014-04-01

    The great recent progress made in identifying the molecular parts lists of organisms revealed the paucity of our understanding of what most of the parts do. In this review, we introduce computational and statistical approaches and omics data used for inferring gene function in plants, with an emphasis on network-based inference. We also discuss caveats associated with network-based function predictions such as performance assessment, annotation propagation, the guilt-by-association concept, and the meaning of hubs. Finally, we note the current limitations and possible future directions such as the need for gold standard data from several species, unified access to data and tools, quantitative comparison of data and tool quality, and high-throughput experimental validation platforms for systematic gene function elucidation in plants. PMID:24231067

  12. A weighted power framework for integrating multisource information: gene function prediction in yeast.

    PubMed

    Ray, Shubhra Sankar; Bandyopadhyay, Sanghamitra; Pal, Sankar K

    2012-04-01

    Predicting the functions of unannotated genes is one of the major challenges of biological investigation. In this study, we propose a weighted power scoring framework, called weighted power biological score (WPBS), for combining different biological data sources and predicting the function of some of the unclassified yeast Saccharomyces cerevisiae genes. The relative power and weight coefficients of different data sources, in the proposed score, are estimated systematically by utilizing functional annotations [yeast Gene Ontology (GO)-Slim: Process] of classified genes, available from Saccharomyces Genome Database. Genes are then clustered by applying the k-medoids algorithm on WPBS, and functional categories of 334 unclassified genes are predicted using a P-value cutoff of 1 × 10^-5. The WPBS is available online at http://www.isical.ac.in/~shubhra/WPBS/WPBS.html, where one can download WPBS, related files, and a MATLAB code to predict functions of unclassified genes. PMID:22318478
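    The clustering step mentioned above can be illustrated with a generic k-medoids sketch. This is not the authors' MATLAB code: the abstract does not give the WPBS formula, so the example assumes a precomputed pairwise dissimilarity matrix (here a hypothetical one built from six points on a line) and uses a simple alternating assign/update scheme rather than any particular published variant.

```python
from typing import Dict, List, Sequence, Tuple


def assign(dist: Sequence[Sequence[float]], medoids: List[int]) -> Dict[int, List[int]]:
    """Assign every item to its nearest medoid (ties go to the first medoid)."""
    clusters: Dict[int, List[int]] = {m: [] for m in medoids}
    for i in range(len(dist)):
        nearest = min(medoids, key=lambda m: dist[i][m])
        clusters[nearest].append(i)
    return clusters


def k_medoids(
    dist: Sequence[Sequence[float]],
    initial_medoids: Sequence[int],
    max_iter: int = 100,
) -> Tuple[List[int], Dict[int, List[int]]]:
    """Alternate between assignment and medoid update until medoids stop changing."""
    medoids = sorted(initial_medoids)
    for _ in range(max_iter):
        clusters = assign(dist, medoids)
        # New medoid of a cluster: the member with the smallest total
        # dissimilarity to the other members; empty clusters keep their medoid.
        updated = sorted(
            min(members, key=lambda c: sum(dist[c][j] for j in members)) if members else m
            for m, members in clusters.items()
        )
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(dist, medoids)


# Hypothetical dissimilarities: absolute differences between six values.
values = [0, 1, 2, 10, 11, 12]
dist_matrix = [[abs(a - b) for b in values] for a in values]
final_medoids, final_clusters = k_medoids(dist_matrix, initial_medoids=[0, 1])
```

    Because k-medoids needs only pairwise dissimilarities and always picks an actual item as each cluster centre, it suits a composite score such as WPBS, where averaging genes into a centroid would have no obvious meaning.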

  13. Comprehensive investigation of parameter choice in viral integration site analysis and its effects on the gene annotations produced.

    PubMed

    Huston, Marshall W; Brugman, Martijn H; Horsman, Sebastiaan; Stubbs, Andrew; van der Spek, Peter; Wagemaker, Gerard

    2012-11-01

    Introducing therapeutic genes into hematopoietic stem cells using retroviral vector-mediated gene transfer is an effective treatment for monogenic diseases. The risks of therapeutic gene integration include aberrant expression of a neighboring gene, resulting in oncogenesis at low frequencies (10^-7 to 10^-6 per transduced cell). Mechanisms governing insertional mutagenesis are the subject of intensive ongoing studies that produce large amounts of sequencing data representing genomic regions flanking viral integration sites (IS). Validating and analyzing these data require automated bioinformatics applications. The exact methods used vary between applications, based on the requirements and preferences of the designer. The parameters used to analyze sequence data are capable of shaping the resulting integration site annotations, but a comprehensive examination of these effects is lacking. Here we present a web-based tool for integration site analysis, called Methods for Analyzing ViRal Integration Collections (MAVRIC), and use its highly customizable interface to look at how IS annotations can vary based on the analysis parameters. We used the integration data of the previously published adenosine deaminase severe combined immunodeficiency (ADA-SCID) gene therapy trials for evaluation of MAVRIC. The output illustrates how MAVRIC allows for direct multiparameter comparison of integration patterns. Careful analysis of the SCID data and reanalyses using different parameters for trimming, alignment, and repeat masking revealed the degree of variation that can be expected to arise due to changes in these parameters. We observed mainly small differences in annotation, with the largest effects caused by masking repeat sequences and by changing the size of the window around the IS. PMID:22909036
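    The effect of the window size around an integration site can be shown with a small sketch. It is not MAVRIC's implementation: the `Gene` record, the `genes_near_site` helper, and the coordinates are all hypothetical, and a gene is counted as a neighbor simply when it overlaps the interval from `pos - window` to `pos + window`.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    """Hypothetical gene record with 1-based inclusive coordinates."""
    name: str
    chrom: str
    start: int
    end: int


def genes_near_site(chrom: str, pos: int, genes: List[Gene], window: int) -> List[str]:
    """Return names of genes overlapping [pos - window, pos + window] on chrom."""
    lo, hi = pos - window, pos + window
    return [
        g.name
        for g in genes
        if g.chrom == chrom and g.start <= hi and g.end >= lo
    ]


# Hypothetical annotation and integration site.
annotation = [
    Gene("GENE_A", "chr1", 1_000, 2_000),
    Gene("GENE_B", "chr1", 60_000, 70_000),
    Gene("GENE_C", "chr2", 4_000, 6_000),
]
narrow = genes_near_site("chr1", 5_000, annotation, window=10_000)
wide = genes_near_site("chr1", 5_000, annotation, window=100_000)
```

    With these made-up coordinates the narrow window reports one neighboring gene and the wide window two, which is the kind of parameter-driven difference in annotation the abstract describes.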

  14. Community annotation in biology

    PubMed Central

    2010-01-01

    Attempts to engage the scientific community to annotate biological data (such as protein/gene function) stored in databases have not been overly successful. There are several hypotheses on why this has not been successful but it is not clear which of these hypotheses are correct. In this study we have surveyed 50 biologists (who have recently published a paper characterizing a gene or protein) to better understand what would make them interested in providing input/contributions to biological databases. Based on our survey two things become clear: a) database managers need to proactively contact biologists to solicit contributions; and b) potential contributors need to be provided with an easy-to-use interface and clear instructions on what to annotate. Other factors such as 'reward' and 'employer/funding agency recognition', previously perceived as motivators, were found to be less important. Based on this study we propose that community annotation projects should devote resources to direct solicitation for input and streamlining of the processes or interfaces used to collect this input. Reviewers This article was reviewed by I. King Jordan, Daniel Haft and Yuriy Gusev. PMID:20167071

  15. Genomic Sequence around Butterfly Wing Development Genes: Annotation and Comparative Analysis

    Microsoft Academic Search

    Inês C. Conceição; Anthony D. Long; Jonathan D. Gruber; Patrícia Beldade

    2011-01-01

    Background Analysis of genomic sequence allows characterization of genome content and organization, and access beyond gene-coding regions for identification of functional elements. BAC libraries, where relatively large genomic regions are made readily available, are especially useful for species without a fully sequenced genome and can increase genomic coverage of phylogenetic and biological diversity. For example, no butterfly genome is yet available

  16. Data-poor categorization and passage retrieval for Gene Ontology Annotation in Swiss-Prot

    PubMed Central

    Ehrler, Frédéric; Geissbühler, Antoine; Jimeno, Antonio; Ruch, Patrick

    2005-01-01

    Background In the context of the BioCreative competition, where training data were very sparse, we investigated two complementary tasks: 1) given a Swiss-Prot triplet, containing a protein, a GO (Gene Ontology) term and a relevant article, extraction of a short passage that justifies the GO category assignment; 2) given a Swiss-Prot pair, containing a protein and a relevant article, automatic assignment of a set of categories. Methods The sentence is the basic retrieval unit. Our classifier computes a distance between each sentence and the GO category provided with the Swiss-Prot entry. The Text Categorizer computes a distance between each GO term and the text of the article. Evaluations are reported both based on annotator judgements as established by the competition and based on mean average precision measures computed using a curated sample of Swiss-Prot. Results Our system achieved the best recall and precision combination both for passage retrieval and text categorization as evaluated by official evaluators. However, text categorization results were far below those in other data-poor text categorization experiments. The top proposed term is relevant in less than 20% of cases, whereas for categorization with other biomedical controlled vocabularies, such as the Medical Subject Headings, we achieved more than 90% precision. We also observe that the scoring methods used in our experiments, based on the retrieval status value of our engines, exhibit effective confidence estimation capabilities. Conclusion From a comparative perspective, the combination of retrieval and natural language processing methods we designed achieved very competitive performance. Largely data-independent, our systems were no less effective than data-intensive approaches. These results suggest that the overall strategy could benefit a large class of information extraction tasks, especially when training data are missing. However, from a user perspective, results were disappointing. Further investigations are needed to design applicable end-user text mining tools for biologists. PMID:15960836
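    The mean average precision measure mentioned in the abstract can be sketched as follows. This is the standard textbook definition, not the authors' evaluation script; the function names and the GO identifiers in the usage note are hypothetical placeholders.

```python
from typing import Iterable, List, Set, Tuple


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average of precision@k taken at each rank k where a relevant term appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, term in enumerate(ranked, start=1):
        if term in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(queries: Iterable[Tuple[List[str], Set[str]]]) -> float:
    """Mean of per-query average precision; each query is (ranking, gold set)."""
    scores = [average_precision(ranked, relevant) for ranked, relevant in queries]
    return sum(scores) / len(scores)


# Hypothetical rankings of GO terms proposed for two articles.
example_queries = [
    (["GO:A", "GO:B", "GO:C"], {"GO:A", "GO:C"}),  # AP = (1/1 + 2/3) / 2
    (["GO:X", "GO:Y"], {"GO:Y"}),                  # AP = 1/2
]
example_map = mean_average_precision(example_queries)
```

    Because the measure rewards relevant terms that appear early in the ranking, it complements the abstract's observation about how often the top proposed term is relevant.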

  17. Oncotator: cancer variant annotation tool.

    PubMed

    Ramos, Alex H; Lichtenstein, Lee; Gupta, Manaswi; Lawrence, Michael S; Pugh, Trevor J; Saksena, Gordon; Meyerson, Matthew; Getz, Gad

    2015-04-01

    Oncotator is a tool for annotating genomic point mutations and short nucleotide insertions/deletions (indels) with variant- and gene-centric information relevant to cancer researchers. This information is drawn from 14 different publicly available resources that have been pooled and indexed, and we provide an extensible framework to add additional data sources. Annotations linked to variants range from basic information, such as gene names and functional classification (e.g. missense), to cancer-specific data from resources such as the Catalogue of Somatic Mutations in Cancer (COSMIC), the Cancer Gene Census, and The Cancer Genome Atlas (TCGA). For local use, Oncotator is freely available as a python module hosted on Github (https://github.com/broadinstitute/oncotator). Furthermore, Oncotator is also available as a web service and web application at http://www.broadinstitute.org/oncotator/. PMID:25703262

  18. Functional annotation and three-dimensional structure of an incorrectly annotated dihydroorotase from cog3964 in the amidohydrolase superfamily.

    PubMed

    Ornelas, Argentina; Korczynska, Magdalena; Ragumani, Sugadev; Kumaran, Desigan; Narindoshvili, Tamari; Shoichet, Brian K; Swaminathan, Subramanyam; Raushel, Frank M

    2013-01-01

    The substrate specificities of two incorrectly annotated enzymes belonging to cog3964 from the amidohydrolase superfamily were determined. This group of enzymes is currently misannotated as either dihydroorotases or adenine deaminases. Atu3266 from Agrobacterium tumefaciens C58 and Oant2987 from Ochrobactrum anthropi ATCC 49188 were found to catalyze the hydrolysis of acetyl-(R)-mandelate and similar esters with values of k_cat/K_m that exceed 10^5 M^-1 s^-1. These enzymes do not catalyze the deamination of adenine or the hydrolysis of dihydroorotate. Atu3266 was crystallized and the structure determined to a resolution of 2.62 Å. The protein folds as a distorted (β/α)8 barrel and binds two zincs in the active site. The substrate profile was determined via a combination of computational docking to the three-dimensional structure of Atu3266 and screening of a highly focused library of potential substrates. The initial weak hit was the hydrolysis of N-acetyl-D-serine (k_cat/K_m = 4 M^-1 s^-1). This was followed by the progressive identification of acetyl-(R)-glycerate (k_cat/K_m = 4 × 10^2 M^-1 s^-1), acetyl glycolate (k_cat/K_m = 1.3 × 10^4 M^-1 s^-1), and ultimately acetyl-(R)-mandelate (k_cat/K_m = 2.8 × 10^5 M^-1 s^-1). PMID:23214420

  19. A transcriptomic analysis of striped catfish (Pangasianodon hypophthalmus) in response to salinity adaptation: De novo assembly, gene annotation and marker discovery.

    PubMed

    Thanh, Nguyen Minh; Jung, Hyungtaek; Lyons, Russell E; Chand, Vincent; Tuan, Nguyen Viet; Thu, Vo Thi Minh; Mather, Peter

    2014-06-01

    The striped catfish (Pangasianodon hypophthalmus) culture industry in the Mekong Delta in Vietnam has developed rapidly over the past decade. The culture industry now, however, faces some significant challenges, especially those related to climate change impacts, notably predicted extensive saltwater intrusion into many low-lying coastal provinces across the Mekong Delta. This problem highlights a need for development of culture stocks that can tolerate more saline culture environments as a response to expansion of saline water-intruded land. While a traditional artificial selection program can potentially address this need, understanding the genomic basis of salinity tolerance can assist development of more productive culture lines. The current study applied a transcriptomic approach using Ion PGM technology to generate expressed sequence tag (EST) resources from the intestine and swim bladder of striped catfish reared at a salinity level of 9 ppt, which showed the best growth performance. Total sequence data generated was 467.8 Mbp, consisting of 4,116,424 reads with an average length of 112 bp. De novo assembly generated 51,188 contigs and allowed identification of 16,116 putative genes based on the GenBank non-redundant database. GO annotation, KEGG pathway mapping, and functional annotation of the EST sequences recovered a wide diversity of biological functions and processes. In addition, more than 11,600 simple sequence repeats were detected. This is the first comprehensive analysis of a striped catfish transcriptome, and provides a valuable genomic resource for future selective breeding programs and functional or evolutionary studies of genes that influence salinity tolerance in this important culture species. PMID:24841517

  20. Validation of a novel expressed sequence tag (EST) clustering method and development of a phylogenetic annotation pipeline for livestock gene families 

    E-print Network

    Venkatraman, Anand

    2009-05-15

    -species comparisons. In the absence of completed genomes and the accompanying high-quality annotations, expressed sequence tags (ESTs) from random cDNA clones are the primary tools for functional genomics. EST datasets are fragmented and redundant, necessitating...

  1. On Anomalies in Annotation Systems

    E-print Network

    Brust, Matthias R

    2007-01-01

    Today's computer-based annotation systems implement a wide range of functionalities that often go beyond those available in traditional paper-and-pencil annotations. Conceptually, annotation systems are based on thoroughly investigated psycho-sociological and pedagogical learning theories. They offer a huge diversity of annotation types that can be placed in textual as well as in multimedia format. Additionally, annotations can be published or shared with a group of interested parties via well-organized repositories. Although highly sophisticated annotation systems exist both conceptually and technologically, we still observe that their acceptance is somewhat limited. In this paper, we argue that today's annotation systems suffer from several fundamental problems that are inherent in the traditional paper-and-pencil annotation paradigm. As a solution, we propose to shift the annotation paradigm for the implementation of annotation systems.

  2. Transcriptome sequencing and annotation of the microalgae Dunaliella tertiolecta: Pathway description and gene discovery for production of next-generation biofuels

    PubMed Central

    2011-01-01

    Background: Biodiesel or ethanol derived from lipids or starch produced by microalgae may overcome many of the sustainability challenges previously ascribed to petroleum-based fuels and first generation plant-based biofuels. The paucity of microalgae genome sequences, however, limits gene-based biofuel feedstock optimization studies. Here we describe the sequencing and de novo transcriptome assembly for the non-model microalgae species, Dunaliella tertiolecta, and identify pathways and genes of importance related to biofuel production. Results: Next generation DNA pyrosequencing technology applied to D. tertiolecta transcripts produced 1,363,336 high quality reads with an average length of 400 bases. Following quality and size trimming, ~45% of the high quality reads were assembled into 33,307 isotigs with a 31-fold coverage and 376,482 singletons. Assembled sequences and singletons were subjected to BLAST similarity searches and annotated with Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology (KO) identifiers. These analyses identified the majority of lipid and starch biosynthesis and catabolism pathways in D. tertiolecta. Conclusions: The construction of metabolic pathways involved in the biosynthesis and catabolism of fatty acids, triacylglycerols, and starch in D. tertiolecta as well as the assembled transcriptome provide a foundation for the molecular genetics and functional genomics required to direct metabolic engineering efforts that seek to enhance the quantity and character of microalgae-based biofuel feedstock. PMID:21401935

  3. Improving the gene structure annotation of the apicomplexan parasite Neospora caninum fulfils a vital requirement towards an in silico-derived vaccine.

    PubMed

    Goodswen, Stephen J; Barratt, Joel L N; Kennedy, Paul J; Ellis, John T

    2015-04-01

    Neospora caninum is an apicomplexan parasite which can cause abortion in cattle, imposing a major economic burden. Vaccination has been proposed as the most cost-effective control measure to alleviate this burden. Consequently the overriding aspiration for N. caninum research is the identification and subsequent evaluation of vaccine candidates in animal models. To save time, cost and effort, it is now feasible to use an in silico approach for vaccine candidate prediction. Precise protein sequences, derived from the correct open reading frame, are paramount and arguably the most important factor determining the success or failure of this approach. The challenge is that publicly available N. caninum sequences are mostly derived from gene predictions. Annotation inaccuracies can lead bioinformatics programs to predict vaccine candidates erroneously. This study evaluates the current N. caninum annotation for potential inaccuracies. Comparisons with annotation from a closely related pathogen, Toxoplasma gondii, are also made to distinguish patterns of inconsistency. More importantly, an mRNA sequencing (RNA-Seq) experiment is used to validate the annotation. Potential discrepancies originating from a questionable start codon context and exon boundaries were identified in 1943 protein coding sequences. We conclude, where experimental data were available, that the majority of N. caninum gene sequences were reliably predicted. Nevertheless, almost 28% of genes were identified as questionable. Given the limitations of RNA-Seq, the intention of this study was not to replace the existing annotation but to support or oppose particular aspects of it. Ideally, many studies aimed at improving the annotation are required to build a consensus. We believe this study, in providing a new resource on gene structure and annotation, is a worthy contributor to this endeavour. PMID:25747726

  4. Multidimensional annotation of the Escherichia coli K-12 genome

    Microsoft Academic Search

    Peter D. Karp; Ingrid M. Keseler; Alexander Shearer; Mario Latendresse; Markus Krummenacker; Suzanne M. Paley; Ian Paulsen; Julio Collado-Vides; Socorro Gama-Castro; Martin Peralta-Gil; Alberto Santos-Zavaleta; M. I. Penaloza-Spinola; C. Bonavides-Martinez; J. Ingraham

    2007-01-01

    The annotation of the Escherichia coli K-12 genome in the EcoCyc database is one of the most accurate, complete and multidimensional genome annotations. Of the 4460 E. coli genes, EcoCyc assigns biochemical functions to 76%, and 66% of all genes had their functions determined experimentally. EcoCyc assigns E. coli genes to Gene Ontology and to MultiFun. Seventy-five percent of

  5. Elucidating gene function and function evolution through comparison of co-expression networks of plants

    PubMed Central

    Hansen, Bjoern O.; Vaid, Neha; Musialak-Lange, Magdalena; Janowski, Marcin; Mutwil, Marek

    2014-01-01

    The analysis of gene expression data has shown that transcriptionally coordinated (co-expressed) genes are often functionally related, enabling scientists to use expression data in gene function prediction. This Focused Review discusses our original paper (Large-scale co-expression approach to dissect secondary cell wall formation across plant species, Frontiers in Plant Science 2:23). In this paper we applied cross-species analysis to co-expression networks of genes involved in cellulose biosynthesis. We showed that the co-expression networks from different species are highly similar, indicating that whole biological pathways are conserved across species. This finding has two important implications. First, the analysis can transfer gene function annotation from well-studied plants, such as Arabidopsis, to other, uncharacterized plant species. As the analysis finds genes that have similar sequence and similar expression pattern across different organisms, functionally equivalent genes can be identified. Second, since co-expression analyses are often noisy, a comparative analysis should have higher performance, as parts of co-expression networks that are conserved are more likely to be functionally relevant. In this Focused Review, we outline the comparative analysis done in the original paper and comment on the recent advances and approaches that allow comparative analyses of co-function networks. We hypothesize that in comparison to simple co-expression analysis, comparative analysis would yield more accurate gene function predictions. Finally, by combining comparative analysis with genomic information of green plants, we propose a possible composition of cellulose biosynthesis machinery during earlier stages of plant evolution. PMID:25191328
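
    The idea that co-expression links conserved across species are more likely to be functionally relevant can be sketched in a few lines. The example below is an illustrative toy, not the method of the paper: it assumes Pearson correlation as the co-expression measure, a fixed threshold of 0.8, a one-to-one ortholog table, and made-up expression profiles for hypothetical genes `a1`..`a3` and `b1`..`b3`.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def coexpression_edges(profiles, threshold=0.8):
    """Link every gene pair whose correlation reaches the (assumed) threshold."""
    genes = sorted(profiles)
    edges = set()
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if pearson(profiles[g], profiles[h]) >= threshold:
                edges.add((g, h))
    return edges


def conserved_edges(edges_a, edges_b, orthologs):
    """Keep species-A edges whose ortholog pair is also linked in species B."""
    kept = set()
    for g, h in edges_a:
        og, oh = orthologs.get(g), orthologs.get(h)
        if og is None or oh is None:
            continue
        if (og, oh) in edges_b or (oh, og) in edges_b:
            kept.add((g, h))
    return kept


# Hypothetical toy data: three genes per species, four conditions each.
species_a = {"a1": [1, 2, 3, 4], "a2": [2, 4, 6, 8], "a3": [4, 1, 3, 2]}
species_b = {"b1": [1, 3, 5, 7], "b2": [2, 3, 6, 7], "b3": [5, 5, 1, 2]}
orthologs = {"a1": "b1", "a2": "b2", "a3": "b3"}

edges_a = coexpression_edges(species_a)
edges_b = coexpression_edges(species_b)
conserved = conserved_edges(edges_a, edges_b, orthologs)

if __name__ == "__main__":
    print(conserved)
```

    In this toy case only the a1/a2 link survives, because its ortholog pair b1/b2 is also co-expressed in the second species.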

  6. Function of the DISC1 Gene

    NSDL National Science Digital Library

    2009-04-14

    As a result of the human genome project, we now know largely where our genes are, and what structure they have. The search to uncover each gene's function, on the other hand, is only in its infancy. Functional genomics is an area of research dedicated to studying what protein is produced by a gene, and what happens in the body when it is activated. Understanding gene function is the next major hurdle in genomic research, which holds the key to developing revolutionary therapeutics.

  7. Protein structure annotation resources.

    PubMed

    Gabanyi, Margaret J; Berman, Helen M

    2015-01-01

    A key reason three-dimensional (3-D) protein structures are annotated with supporting or derived information is to understand the molecular basis of protein function. To this end, protein structure annotation databases curate key facts and observations, based on community-accepted standards, about the ~100,000 3-D experimental protein structures to date. This review will introduce the primary structure repositories, databases, and value-added structural annotation databases, as well as the range of information they provide. The different levels of annotation data (primary vs. derived vs. inferred) and how they should all be considered accordingly will also be described. PMID:25502191

  8. Mouse Genetics: Determining gene function

    E-print Network

    Goldschmidt, Christina

    mutagenesis Phenotype Driven Gene Driven · Gene traps · Gene targeting · Gene driven ENU · RNAi EUCOMM, Europe of offspring for transgenic DNA and expression in tissue X Construction of transgene driven by promoter X · Gene driven ENU · RNAi EUCOMM, Europe European Conditional Mouse Mutagenesis KOMP, US Knock-out Mouse

  9. The UniProt-GO Annotation database in 2011.

    PubMed

    Dimmer, Emily C; Huntley, Rachael P; Alam-Faruque, Yasmin; Sawford, Tony; O'Donovan, Claire; Martin, Maria J; Bely, Benoit; Browne, Paul; Mun Chan, Wei; Eberhardt, Ruth; Gardner, Michael; Laiho, Kati; Legge, Duncan; Magrane, Michele; Pichler, Klemens; Poggioli, Diego; Sehra, Harminder; Auchincloss, Andrea; Axelsen, Kristian; Blatter, Marie-Claude; Boutet, Emmanuel; Braconi-Quintaje, Silvia; Breuza, Lionel; Bridge, Alan; Coudert, Elizabeth; Estreicher, Anne; Famiglietti, Livia; Ferro-Rojas, Serenella; Feuermann, Marc; Gos, Arnaud; Gruaz-Gumowski, Nadine; Hinz, Ursula; Hulo, Chantal; James, Janet; Jimenez, Silvia; Jungo, Florence; Keller, Guillaume; Lemercier, Phillippe; Lieberherr, Damien; Masson, Patrick; Moinat, Madelaine; Pedruzzi, Ivo; Poux, Sylvain; Rivoire, Catherine; Roechert, Bernd; Schneider, Michael; Stutz, Andre; Sundaram, Shyamala; Tognolli, Michael; Bougueleret, Lydie; Argoud-Puy, Ghislaine; Cusin, Isabelle; Duek-Roggli, Paula; Xenarios, Ioannis; Apweiler, Rolf

    2012-01-01

    The GO annotation dataset provided by the UniProt Consortium (GOA: http://www.ebi.ac.uk/GOA) is a comprehensive set of evidenced-based associations between terms from the Gene Ontology resource and UniProtKB proteins. Currently supplying over 100 million annotations to 11 million proteins in more than 360,000 taxa, this resource has increased 2-fold over the last 2 years and has benefited from a wealth of checks to improve annotation correctness and consistency as well as now supplying a greater information content enabled by GO Consortium annotation format developments. Detailed, manual GO annotations obtained from the curation of peer-reviewed papers are directly contributed by all UniProt curators and supplemented with manual and electronic annotations from 36 model organism and domain-focused scientific resources. The inclusion of high-quality, automatic annotation predictions ensures the UniProt GO annotation dataset supplies functional information to a wide range of proteins, including those from poorly characterized, non-model organism species. UniProt GO annotations are freely available in a range of formats accessible by both file downloads and web-based views. In addition, the introduction of a new, normalized file format in 2010 has made for easier handling of the complete UniProt-GOA data set. PMID:22123736

  10. Comparative annotation of functional regions in the human genome using epigenomic data

    PubMed Central

    Won, Kyoung-Jae; Zhang, Xian; Wang, Tao; Ding, Bo; Raha, Debasish; Snyder, Michael; Ren, Bing; Wang, Wei

    2013-01-01

    Epigenetic regulation is dynamic and cell-type dependent. The recently available epigenomic data in multiple cell types provide an unprecedented opportunity for a comparative study of the epigenetic landscape. We developed a machine-learning method called ChroModule to annotate the epigenetic states in eight ENCyclopedia Of DNA Elements cell types. The trained model successfully captured the characteristic histone-modification patterns associated with regulatory elements, such as promoters and enhancers, and showed superior performance on identifying enhancers compared with the state-of-the-art methods. In addition, given the fixed number of epigenetic states in the model, ChroModule allows straightforward illustration of epigenetic variability in multiple cell types. Using this feature, we found that invariable and variable epigenetic states across cell types correspond to housekeeping functions and stimulus response, respectively. In particular, we observed that enhancers, but not the other regulatory elements, dictate cell specificity, as similar cell types share common enhancers, and cell-type–specific enhancers are often bound by transcription factors playing critical roles in that cell type. More interestingly, we found some genomic regions are dormant in one cell type but primed to become active in other cell types. These observations highlight the usefulness of ChroModule in comparative analysis and interpretation of multiple epigenomes. PMID:23482391

  11. RNA-seq-Based Gene Annotation and Comparative Genomics of Four Fungal Grass Pathogens in the Genus Zymoseptoria Identify Novel Orphan Genes and Species-Specific Invasions of Transposable Elements

    PubMed Central

    Grandaubert, Jonathan; Bhattacharyya, Amitava; Stukenbrock, Eva H.

    2015-01-01

    The fungal pathogen Zymoseptoria tritici (synonym Mycosphaerella graminicola) is a prominent pathogen of wheat. The reference genome of the isolate IPO323 is one of the best-assembled eukaryotic genomes and encodes more than 10,000 predicted genes. However, a large proportion of the previously annotated gene models are incomplete, with either no start or no stop codons. The availability of RNA-seq data allows better predictions of gene structure. We here used two different RNA-seq datasets, de novo transcriptome assemblies, homology-based comparisons, and trained ab initio gene callers to generate a new gene annotation of Z. tritici IPO323. The annotation pipeline was also applied to re-sequenced genomes of three closely related species of Z. tritici: Z. pseudotritici, Z. ardabiliae, and Z. brevis. Comparative analyses of the predicted gene models using the four Zymoseptoria species revealed sets of species-specific orphan genes enriched with putative pathogenicity-related genes encoding small secreted proteins that may play essential roles in virulence and host specificity. De novo repeat identification allowed us to show that few families of transposable elements are shared between Zymoseptoria species while we observe many species-specific invasions and expansions. The annotation data presented here provide a high-quality resource for future studies of Z. tritici and its sister species and provide detailed insight into gene and genome evolution of fungal plant pathogens. PMID:25917918

  12. Integrative analysis of functional genomic annotations and sequencing data to identify rare causal variants via hierarchical modeling

    PubMed Central

    Capanu, Marinela; Ionita-Laza, Iuliana

    2015-01-01

    Identifying the small number of rare causal variants contributing to disease has been a major focus of investigation in recent years, but represents a formidable statistical challenge due to the rare frequencies with which these variants are observed. In this commentary we draw attention to a formal statistical framework, namely hierarchical modeling, to combine functional genomic annotations with sequencing data with the objective of enhancing our ability to identify rare causal variants. Using simulations we show that in all configurations studied, the hierarchical modeling approach has superior discriminatory ability compared to a recently proposed aggregate measure of deleteriousness, the Combined Annotation-Dependent Depletion (CADD) score, supporting our premise that aggregate functional genomic measures can more accurately identify causal variants when used in conjunction with sequencing data through a hierarchical modeling approach. PMID:26005447

  13. Comprehensive Functional Annotation of Seventy-One Breast Cancer Risk Loci

    PubMed Central

    Rhie, Suhn Kyong; Coetzee, Simon G.; Noushmehr, Houtan; Yan, Chunli; Kim, Jae Mun; Haiman, Christopher A.; Coetzee, Gerhard A.

    2013-01-01

    Breast Cancer (BCa) genome-wide association studies revealed allelic frequency differences between cases and controls at index single nucleotide polymorphisms (SNPs). To date, 71 loci have thus been identified and replicated. More than 320,000 SNPs at these loci define BCa risk due to linkage disequilibrium (LD). We propose that BCa risk resides in a subgroup of SNPs that functionally affects breast biology. Such a shortlist will aid in framing hypotheses to prioritize a manageable number of likely disease-causing SNPs. We extracted all the SNPs residing in 1 Mb windows around breast cancer risk index SNPs from the 1000 Genomes Project to find correlated SNPs. We used FunciSNP, an R/Bioconductor package developed in-house, to identify potentially functional SNPs at 71 risk loci by coinciding them with chromatin biofeatures. We identified 1,005 SNPs in LD with the index SNPs (r^2 ≥ 0.5) in three categories: 21 in exons of 18 genes, 76 in transcription start site (TSS) regions of 25 genes, and 921 in enhancers. Thirteen SNPs were found in more than one category. We found two correlated and predicted non-benign coding variants (rs8100241 in exon 2 and rs8108174 in exon 3) of the gene ANKLE1. Most putative functional LD SNPs, however, were found in either epigenetically defined enhancers or in gene TSS regions. Fifty-five percent of these non-coding SNPs are likely functional, since they affect response element (RE) sequences of transcription factors. Functionality of these SNPs was assessed by expression quantitative trait loci (eQTL) analysis and allele-specific enhancer assays. Unbiased analyses of SNPs at BCa risk loci revealed new and overlooked mechanisms that may affect risk of the disease, thereby providing a valuable resource for follow-up studies. PMID:23717510
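
    The category counts in this abstract can be reconciled with the stated total by a simple inclusion-exclusion check. The sketch below assumes that each of the thirteen multi-category SNPs falls in exactly two categories; the abstract does not state this, but it is the assumption under which the numbers agree. All variable names are illustrative.

```python
# Per-category SNP counts as reported in the abstract above.
category_counts = {"exon": 21, "TSS region": 76, "enhancer": 921}

# SNPs reported in more than one category.
# Assumption (not stated in the abstract): each lies in exactly two categories,
# so each is counted exactly once too often in the per-category sum.
multi_category_snps = 13

total_memberships = sum(category_counts.values())
distinct_snps = total_memberships - multi_category_snps

if __name__ == "__main__":
    print(f"category memberships: {total_memberships}")
    print(f"distinct SNPs: {distinct_snps}")
```

    Under that assumption the 1,018 category memberships reduce to the 1,005 distinct SNPs quoted in the abstract.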

  14. Disease candidate gene identification and prioritization using protein interaction networks

    Microsoft Academic Search

    Jing Chen; Bruce J. Aronow; Anil G. Jegga

    2009-01-01

    BACKGROUND: Although most of the current disease candidate gene identification and prioritization methods depend on functional annotations, the coverage of the gene functional annotations is a limiting factor. In the current study, we describe a candidate gene prioritization method that is entirely based on protein-protein interaction network (PPIN) analyses. RESULTS: For the first time, extended versions of the PageRank and

  15. The functional diversity of essential genes required for mammalian cardiac development

    PubMed Central

    Clowes, Christopher; Boylan, Michael GS; Ridge, Liam A; Barnes, Emma; Wright, Jayne A; Hentges, Kathryn E

    2014-01-01

    Genes required for an organism to develop to maturity (for which no other gene can compensate) are considered essential. The continuing functional annotation of the mouse genome has enabled the identification of many essential genes required for specific developmental processes including cardiac development. Patterns are now emerging regarding the functional nature of genes required at specific points throughout gestation. Essential genes required for development beyond cardiac progenitor cell migration and induction include a small and functionally homogenous group encoding transcription factors, ligands and receptors. Actions of core cardiogenic transcription factors from the Gata, Nkx, Mef, Hand, and Tbx families trigger a marked expansion in the functional diversity of essential genes from midgestation onwards. As the embryo grows in size and complexity, genes required to maintain a functional heartbeat and to provide muscular strength and regulate blood flow are well represented. These essential genes regulate further specialization and polarization of cell types along with proliferative, migratory, adhesive, contractile, and structural processes. The identification of patterns regarding the functional nature of essential genes across numerous developmental systems may aid prediction of further essential genes and those important to development and/or progression of disease. genesis 52:713–737, 2014. PMID:24866031

  16. Use of Modern Chemical Protein Synthesis and Advanced Fluorescent Assay Techniques to Experimentally Validate the Functional Annotation of Microbial Genomes

    SciTech Connect

    Kent, Stephen [University of Chicago

    2012-07-20

    The objective of this research program was to prototype methods for the chemical synthesis of predicted protein molecules in annotated microbial genomes. High throughput chemical methods were to be used to make large numbers of predicted proteins and protein domains, based on microbial genome sequences. Microscale chemical synthesis methods for the parallel preparation of peptide-thioester building blocks were developed; these peptide segments are used for the parallel chemical synthesis of proteins and protein domains. Ultimately, it is envisaged that these synthetic molecules would be ‘printed’ in spatially addressable arrays. The unique ability of total synthesis to precision-label protein molecules with dyes and with chemical or biochemical ‘tags’ can be used to facilitate novel assay technologies adapted from state-of-the-art single molecule fluorescence detection techniques. In the future, in conjunction with modern laboratory automation, this integrated set of techniques will enable high throughput experimental validation of the functional annotation of microbial genomes.

  17. Annotated embryonic CNS expression patterns of 5000 GMR GAL4 lines: a resource for manipulating gene expression and analyzing cis-regulatory modules

    PubMed Central

    Manning, Laurina; Heckscher, Ellie S.; Purice, Maria D.; Roberts, Jourdain; Bennett, Alysha L.; Kroll, Jason R.; Pollard, Jill L.; Strader, Marie E.; Lupton, Josh R.; Dyukareva, Anna V.; Doan, Phuong Nam; Bauer, David M.; Wilbur, Allison N.; Tanner, Stephanie; Kelly, Jimmy J.; Lai, Sen-Lin; Tran, Khoa D.; Kohwi, Minoree; Laverty, Todd R.; Pearson, Joseph C.; Crews, Stephen T.; Rubin, Gerald M.; Doe, Chris Q.

    2012-01-01

    Here we describe the embryonic CNS expression of 5,000 GAL4 lines made using molecularly defined cis-regulatory DNA inserted into a single attP genomic location. We document and annotate the patterns in early embryos when neurogenesis is at its peak, and in older embryos where there is maximal neuronal diversity and the first neural circuits are established. We note expression in other tissues such as the lateral body wall (muscle, sensory neurons, trachea) and viscera. Companion papers report on the adult brain and larval imaginal discs, and the integrated datasets are available online (www.janelia.org/flylight/gal4-gen1). This collection of embryonically-expressed GAL4 lines will be valuable for determining neuronal morphology and function; the 1862 lines expressed in small subsets of neurons (<20/segment) will be especially valuable for characterizing interneuronal diversity and function, as interneurons comprise the majority of all CNS neurons, yet their gene expression profile and function remain virtually unexplored. PMID:23063363

  18. Evolution and Biochemistry of Family 4 Glycosidases: Implications for Assigning Enzyme Function in Sequence Annotations

    PubMed Central

    Pikis, Andreas; Thompson, John

    2009-01-01

    Glycosyl hydrolase Family 4 (GH4) is exceptional among the 114 families in this enzyme superfamily. Members of GH4 exhibit unusual cofactor requirements for activity, and an essential cysteine residue is present at the active site. Of greatest significance is the fact that members of GH4 employ a unique catalytic mechanism for cleavage of the glycosidic bond. By phylogenetic analysis, and from available substrate specificities, we have assigned a majority of the enzymes of GH4 to five subgroups. Our classification revealed an unexpected relationship between substrate specificity and the presence, in each subgroup, of a motif of four amino acids that includes the active-site Cys residue: α-glucosidase, CHE(I/V); α-galactosidase, CHSV; α-glucuronidase, CHGx; 6-phospho-α-glucosidase, CDMP; and 6-phospho-β-glucosidase, CN(V/I)P. The question arises: Does the presence of a particular motif sufficiently predict the catalytic function of an unassigned GH4 protein? To test this hypothesis, we have purified and characterized the α-glucoside–specific GH4 enzyme (PalH) from the phytopathogen, Erwinia rhapontici. The CHEI motif in this protein has been changed by site-directed mutagenesis, and the effects upon substrate specificity have been determined. The change to CHSV caused the loss of all α-glucosidase activity, but the mutant protein exhibited none of the anticipated α-galactosidase activity. The Cys-containing motif may be suggestive of enzyme specificity, but phylogenetic placement is required for confidence in that specificity. The Acholeplasma laidlawii GH4 protein is phylogenetically a phospho-β-glucosidase but has a unique SSSP motif. Lacking the initial Cys in that motif, it cannot hydrolyze glycosides by the normal GH4 mechanism because the Cys is required to position the metal ion for hydrolysis, nor can it use the more common single or double-displacement mechanism of Koshland. Several considerations suggest that the protein has acquired a new function as the consequence of positive selection. This study emphasizes the importance of automatic annotation systems that, by integrating phylogenetic analysis, functional motifs, and bioinformatics data, may lead to innovative experiments that further our understanding of biological systems. PMID:19625389

  19. Microarray annotation Benedikt Brors

    E-print Network

    Spang, Rainer

    German Cancer Research Center b.brors@dkfz.de Why do we need microarray clone annotation? · Often for protein domain structure, and GeneCards for comprehensive information from other databases on human genes. The relation of clone information to genes and proteins · Microarrays are produced using

  20. Microarray Annotation Marc Zapatka

    E-print Network

    Spang, Rainer

    Cards for comprehensive information from other databases on human genes. The relation of clone information to genes German Cancer Research Center 2005-11-29 Why do we need microarray clone annotation? Often, the result and proteins Microarrays are produced using information on expressed sequences as EST clones, cDNAs, partial c

  1. Discovery of Tumor Suppressor Gene Function.

    ERIC Educational Resources Information Center

    Oppenheimer, Steven B.

    1995-01-01

    This is an update of a 1991 review on tumor suppressor genes written at a time when understanding of how the genes work was limited. A recent major breakthrough in the understanding of the function of tumor suppressor genes is discussed. (LZ)

  2. Gene Ontology Driven Classification of Gene

    E-print Network

    Spang, Rainer

    expression patterns Gene Ontology · Structure knowledge about genes · Directed acyclic graph · Represents · No biological knowledge Introduction 23-Jul-02 5 / 17Claudio Lottaz: GO driven classification of gene knowledge on · Molecular function · Biological process · Cellular component · Genes are annotated to nodes

  3. Variation ontology: annotator guide

    PubMed Central

    2014-01-01

    Background: Systematic representation of information related to genetic and non-genetic variations is required to allow large scale studies, data mining and data integration, and to make it possible to reveal novel relationships between genotype and phenotype. Although a great deal of variation data is available, it is often difficult to use due to a lack of systematics. Results: A novel ontology, Variation Ontology (VariO, http://variationontology.org), was developed for annotation of effects, consequences and mechanisms of variations. In this article instructions are provided on how VariO annotations are made. The major levels for description are the three molecules, namely DNA, RNA and protein. They are further divided into four major sublevels: variation type, function, structure, and property, and further into up to eight sublevels. VariO annotation summarizes existing knowledge about a variation and its effects and formalizes it so that computational analyses are efficient. The annotations should be made on as many levels as possible. VariO annotations are made in reference to normal states, which vary for each data item including e.g. reference sequences, wild type properties, and activities. Conclusions: Detailed instructions together with examples are provided to indicate how VariO can be used for annotation of variations and their effects. A dedicated tool has been developed for annotation and will be further developed to cover also evidence for the annotations. VariO is suitable for annotation of data in many types of databases. As several different kinds of databases are in the process of adapting VariO annotations it is important to have guidelines to guarantee consistent annotation. PMID:24533660

  4. A bi-ordering approach to linking gene expression with clinical annotations in gastric cancer

    PubMed Central

    2010-01-01

    Background: In the study of cancer genomics, gene expression microarrays, which measure thousands of genes in a single assay, provide abundant information for the investigation of interesting genes or biological pathways. However, in order to analyze the large number of noisy measurements in microarrays, effective and efficient bioinformatics techniques are needed to identify the associations between genes and relevant phenotypes. Moreover, systematic tests are needed to validate the statistical and biological significance of those discoveries. Results: In this paper, we develop a robust and efficient method for exploratory analysis of microarray data, which produces a number of different orderings (rankings) of both genes and samples (reflecting correlation among those genes and samples). The core algorithm is closely related to biclustering, and so we first compare its performance with several existing biclustering algorithms on two real datasets: gastric cancer and lymphoma. We then show on the gastric cancer data, using the Jonckheere trend test, that the sample orderings generated by our method are highly statistically significant with respect to the histological classification of samples, while the gene modules are biologically significant with respect to biological processes (from the Gene Ontology). In particular, some of the gene modules associated with biclusters are closely linked to gastric cancer tumorigenesis reported in previous literature, while others are potentially novel discoveries. Conclusion: We have developed an effective and efficient method, Bi-Ordering Analysis, to detect informative patterns in gene expression microarrays by ranking genes and samples. In addition, a number of evaluation metrics were applied to assess both the statistical and biological significance of the resulting bi-orderings. The methodology was validated on gastric cancer and lymphoma datasets. PMID:20860844
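
    The abstract above evaluates sample orderings against histological classes with the Jonckheere trend test. As a rough illustration of what that statistic measures, the sketch below computes the Jonckheere-Terpstra statistic and its normal-approximation z-score with no tie correction. The grouping of samples into ordered classes and the toy values are assumptions made here for illustration, not the authors' data or code.

```python
import math
from itertools import combinations


def jonckheere_terpstra(groups):
    """Jonckheere-Terpstra trend statistic for ordered groups.

    `groups` is a list of lists of numbers, given in the hypothesized
    increasing order. Returns (J, z), where z uses the large-sample normal
    approximation without a correction for ties. Requires at least two
    non-empty groups.
    """
    j_stat = 0.0
    # For every pair of groups (earlier, later), count pairs that increase.
    for lower, higher in combinations(groups, 2):
        for x in lower:
            for y in higher:
                if x < y:
                    j_stat += 1.0
                elif x == y:
                    j_stat += 0.5  # ties contribute one half
    sizes = [len(g) for g in groups]
    n = sum(sizes)
    mean = (n * n - sum(s * s for s in sizes)) / 4.0
    var = (n * n * (2 * n + 3) - sum(s * s * (2 * s + 3) for s in sizes)) / 72.0
    z = (j_stat - mean) / math.sqrt(var)
    return j_stat, z


if __name__ == "__main__":
    # Hypothetical example: positions of samples in a computed ordering,
    # grouped by three ordered classes.
    j, z = jonckheere_terpstra([[1, 2], [3, 4], [5, 6]])
    print(j, round(z, 3))
```

    A large positive z indicates that values tend to increase across the ordered groups, which is the kind of agreement between an ordering and a graded classification that the abstract reports.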

  5. The prediction of protein subcellular localization from sequence: a shortcut to functional genome annotation

    Microsoft Academic Search

    Rita Casadio; Pier Luigi Martelli; Andrea Pierleoni

    2008-01-01

    Automated sequence annotation is a major goal of the post-genomic era, with hundreds of genomes in the databases, from both prokaryotes and eukaryotes. While the number of fully sequenced chromosomes from microbial organisms has increased exponentially in the last decade to more than 600, presently we know the whole DNA content of only 25 eukaryotic organisms, including Homo sapiens. However, the process of

  6. Systematic Learning of Gene Functional Classes From DNA Array Expression Data by Using Multilayer Perceptrons

    PubMed Central

    Mateos, Alvaro; Dopazo, Joaquín; Jansen, Ronald; Tu, Yuhai; Gerstein, Mark; Stolovitzky, Gustavo

    2002-01-01

    Recent advances in microarray technology have opened new ways for functional annotation of previously uncharacterised genes on a genomic scale. This has been demonstrated by unsupervised clustering of co-expressed genes and, more importantly, by supervised learning algorithms. Using prior knowledge, these algorithms can assign functional annotations based on more complex expression signatures found in existing functional classes. Previously, support vector machines (SVMs) and other machine-learning methods have been applied to a limited number of functional classes for this purpose. Here we present, for the first time, the comprehensive application of supervised neural networks (SNNs) for functional annotation. Our study is novel in that we report systematic results for ~100 classes in the Munich Information Center for Protein Sequences (MIPS) functional catalog. We found that only ~10% of these are learnable (based on the rate of false negatives). A closer analysis reveals that false positives (and negatives) in a machine-learning context are not necessarily “false” in a biological sense. We show that the high degree of interconnections among functional classes confounds the signatures that ought to be learned for a unique class. We term this the “Borges effect” and introduce two new numerical indices for its quantification. Our analysis indicates that classification systems with a lower Borges effect are better suited for machine learning. Furthermore, we introduce a learning procedure for combining false positives with the original class. We show that in a few iterations this process converges to a gene set that is learnable with considerably low rates of false positives and negatives and contains genes that are biologically related to the original class, allowing for a coarse reconstruction of the interactions between associated biological pathways. We exemplify this methodology using the well-studied tricarboxylic acid cycle. PMID:12421757

  7. A memetic co-clustering algorithm for gene expression profiles and biological annotation

    Microsoft Academic Search

    Nora Speer; Christian Spieth; Andreas Zell

    2004-01-01

    With the invention of microarrays, researchers are capable of measuring thousands of gene expression levels in parallel at various time points of the biological process. To investigate general regulatory mechanisms, biologists cluster genes based on their expression patterns. In this paper, we propose a new memetic co-clustering algorithm for expression profiles, which incorporates a priori knowledge in the form of

  8. GENOMIC SEQUENCE AND ANNOTATION OF A REGION THAT HARBORS MAJOR HISTOCOMPATIBILITY GENES IN RAINBOW TROUT

    Technology Transfer Automated Retrieval System (TEKTRAN)

    We have previously shown that the rainbow trout genome contains at least 4 unlinked regions of major histocompatibility (MH) genes. One of the regions which we previously dubbed the extended MH class II region is located on chromosome arm 3p and harbors the TAP1 (aka ABCB2) gene. TAP1 is one of the ...

  9. Annotation of metabolic and biosynthesis genes from Hessian fly (Diptera: Cecidomyiidae)

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The Hessian fly is the major insect pest of wheat in the southeastern United States and has traditionally been controlled through the utilization of Hessian fly resistance (R) genes in wheat. Such R genes are a limited resource, and once deployed lose their field effectiveness with time. Using 21 ...

  10. Using gene ontology annotations in exploratory microarray clustering to understand cancer etiology

    E-print Network

    Bailey, James

    Department of Computer Science and Software Engineering, University of Melbourne, Victoria, Australia; (b) NICTA, Victorian Research Lab, Australia; (c) Ian Potter Centre for Cancer Genomics and Predictive Medicine, Peter Mac ... for the exploration of cancer etiology. Key words: Microarray, Gene Ontology, Clustering, Cancer. 1. Introduction. Gene ...

  11. IMG ER: A System for Microbial Genome Annotation Expert Review and Curation

    SciTech Connect

    Markowitz, Victor M.; Mavromatis, Konstantinos; Ivanova, Natalia N.; Chen, I-Min A.; Chu, Ken; Kyrpides, Nikos C.

    2009-05-25

    A rapidly increasing number of microbial genomes are sequenced by organizations worldwide and are eventually included into various public genome data resources. The quality of the annotations depends largely on the original dataset providers, with erroneous or incomplete annotations often carried over into the public resources and difficult to correct. We have developed an Expert Review (ER) version of the Integrated Microbial Genomes (IMG) system, with the goal of supporting systematic and efficient revision of microbial genome annotations. IMG ER provides tools for the review and curation of annotations of both new and publicly available microbial genomes within IMG's rich integrated genome framework. New genome datasets are included into IMG ER prior to their public release either with their native annotations or with annotations generated by IMG ER's annotation pipeline. IMG ER tools allow addressing annotation problems detected with IMG's comparative analysis tools, such as genes missed by gene prediction pipelines or genes without an associated function. Over the past year, IMG ER was used for improving the annotations of about 150 microbial genomes.

  12. Gene Transfer Strategies for Augmenting Cardiac Function

    Microsoft Academic Search

    Karsten Peppel; Walter J Koch; Robert J Lefkowitz

    1997-01-01

    Recent transgenic as well as gene-targeted animal models have greatly increased our understanding of the molecular mechanisms of normal and compromised heart function. These studies have raised the possibility of using somatic gene transfer as a means for improving cardiac function. DNA transfer to a significant portion of the myocardium has thus far been difficult to accomplish. This review describes

  13. Functional Annotation of the Ophiostoma novo-ulmi Genome: Insights into the Phytopathogenicity of the Fungal Agent of Dutch Elm Disease

    PubMed Central

    Comeau, André M.; Dufour, Josée; Bouvet, Guillaume F.; Jacobi, Volker; Nigg, Martha; Henrissat, Bernard; Laroche, Jérôme; Levesque, Roger C.; Bernier, Louis

    2015-01-01

    The ascomycete fungus Ophiostoma novo-ulmi is responsible for the pandemic of Dutch elm disease that has been ravaging Europe and North America for 50 years. We proceeded to annotate the genome of the O. novo-ulmi strain H327 that was sequenced in 2012. The 31.784-Mb nuclear genome (50.1% GC) is organized into 8 chromosomes containing a total of 8,640 protein-coding genes that we validated with RNA sequencing analysis. Approximately 53% of these genes have their closest match to Grosmannia clavigera kw1407, followed by 36% in other close Sordariomycetes, 5% in other Pezizomycotina, and surprisingly few (5%) orphans. A relatively small portion (~3.4%) of the genome is occupied by repeat sequences; however, the mechanism of repeat-induced point mutation appears active in this genome. Approximately 76% of the proteins could be assigned functions using Gene Ontology analysis; we identified 311 carbohydrate-active enzymes, 48 cytochrome P450s, and 1,731 proteins potentially involved in pathogen–host interaction, along with 7 clusters of fungal secondary metabolites. Complementary mating-type locus sequencing, mating tests, and culturing in the presence of elm terpenes were conducted. Our analysis identified a specific genetic arsenal impacting the sexual and vegetative growth, phytopathogenicity, and signaling/plant–defense–degradation relationship between O. novo-ulmi and its elm host and insect vectors. PMID:25539722

  14. The maize ALDH protein superfamily: linking structural features to functional specificities

    Microsoft Academic Search

    Jose C Jimenez-Lopez; Emma W Gachomo; Manfredo J Seufferheld; Simeon O Kotchoni

    2010-01-01

    BACKGROUND: The completion of maize genome sequencing has resulted in the identification of a large number of uncharacterized genes. Gene annotation and functional characterization of gene products are important to uncover novel protein functionality. RESULTS: In this paper, we identify and annotate all members of the maize aldehyde dehydrogenase (ALDH) gene superfamily according to the revised nomenclature criteria developed by

  15. Combined QTL and Selective Sweep Mappings with Coding SNP Annotation and cis-eQTL Analysis Revealed PARK2 and JAG2 as New Candidate Genes for Adiposity Regulation

    PubMed Central

    Roux, Pierre-François; Boitard, Simon; Blum, Yuna; Parks, Brian; Montagner, Alexandra; Mouisel, Etienne; Djari, Anis; Esquerré, Diane; Désert, Colette; Boutin, Morgane; Leroux, Sophie; Lecerf, Frédéric; Le Bihan-Duval, Elisabeth; Klopp, Christophe; Servin, Bertrand; Pitel, Frédérique; Duclos, Michel Jean; Guillou, Hervé; Lusis, Aldons J.; Demeure, Olivier; Lagarrigue, Sandrine

    2015-01-01

    Very few causal genes have been identified by quantitative trait loci (QTL) mapping because of the large size of QTL, and most of them were identified thanks to functional links already known with the targeted phenotype. Here, we propose to combine selection signature detection, coding SNP annotation, and cis-expression QTL analyses to identify potential causal genes underlying QTL identified in divergent line designs. As a model, we chose experimental chicken lines divergently selected for only one trait, the abdominal fat weight, in which several QTL were previously mapped. Using new haplotype-based statistics exploiting the very high SNP density generated through whole-genome resequencing, we found 129 significant selective sweeps. Most of the QTL colocalized with at least one sweep, which markedly narrowed candidate region size. Some of those sweeps contained only one gene, therefore making them strong positional causal candidates with no presupposed function. We then focused on two of these QTL/sweeps. The absence of nonsynonymous SNPs in their coding regions strongly suggests the existence of causal mutations acting in cis on their expression, confirmed by cis-eQTL identification using either allele-specific expression or genetic mapping analyses. Additional expression analyses of those two genes in chicken and mice contrasted for adiposity reinforce their link with this phenotype. This study shows for the first time the value of combining selective sweep mapping, coding SNP annotation and cis-eQTL analyses for identifying causative genes for a complex trait, in the context of divergent lines selected for this specific trait. Moreover, it highlights two genes, JAG2 and PARK2, as new potential negative and positive key regulators of adiposity in chicken and mice. PMID:25653314

  16. A manual curation strategy to improve genome annotation: application to a set of haloarchael genomes.

    PubMed

    Pfeiffer, Friedhelm; Oesterhelt, Dieter

    2015-01-01

    Genome annotation errors are a persistent problem that impedes research in the biosciences. A manual curation effort is described that attempts to produce high-quality genome annotations for a set of haloarchaeal genomes (Halobacterium salinarum and Hbt. hubeiense, Haloferax volcanii and Hfx. mediterranei, Natronomonas pharaonis and Nmn. moolapensis, Haloquadratum walsbyi strains HBSQ001 and C23, Natrialba magadii, Haloarcula marismortui and Har. hispanica, and Halohasta litchfieldiae). Genomes are checked for missing genes, start codon misassignments, and disrupted genes. Assignments of a specific function are preferably based on experimentally characterized homologs (Gold Standard Proteins). To avoid overannotation, which is a major source of database errors, we restrict annotation to only general function assignments when support for a specific substrate assignment is insufficient. This strategy results in annotations that are resistant to the plethora of errors that compromise public databases. Annotation consistency is rigorously validated for ortholog pairs from the genomes surveyed. The annotation is regularly crosschecked against the UniProt database to further improve annotations and increase the level of standardization. Enhanced genome annotations are submitted to public databases (EMBL/GenBank, UniProt), to the benefit of the scientific community. The enhanced annotations are also publicly available via HaloLex. PMID:26042526

  17. MaGe: a microbial genome annotation system supported by synteny results.

    PubMed

    Vallenet, David; Labarre, Laurent; Rouy, Zoé; Barbe, Valérie; Bocs, Stéphanie; Cruveiller, Stéphane; Lajus, Aurélie; Pascal, Géraldine; Scarpelli, Claude; Médigue, Claudine

    2006-01-01

    Magnifying Genomes (MaGe) is a microbial genome annotation system based on a relational database containing information on bacterial genomes, as well as a web interface to achieve genome annotation projects. Our system allows one to initiate the annotation of a genome at the early stage of the finishing phase. MaGe's main features are (i) integration of annotation data from bacterial genomes enhanced by a gene coding re-annotation process using accurate gene models, (ii) integration of results obtained with a wide range of bioinformatics methods, among which exploration of gene context by searching for conserved synteny and reconstruction of metabolic pathways, (iii) an advanced web interface allowing multiple users to refine the automatic assignment of gene product functions. MaGe is also linked to numerous well-known biological databases and systems. Our system has been thoroughly tested during the annotation of complete bacterial genomes (Acinetobacter baylyi ADP1, Pseudoalteromonas haloplanktis, Frankia alni) and is currently used in the context of several new microbial genome annotation projects. In addition, MaGe allows for annotation curation and exploration of already published genomes from various genera (e.g. Yersinia, Bacillus and Neisseria). MaGe can be accessed at http://www.genoscope.cns.fr/agc/mage. PMID:16407324

  18. Interferome v2.0: an updated database of annotated interferon-regulated genes.

    PubMed

    Rusinova, Irina; Forster, Sam; Yu, Simon; Kannan, Anitha; Masse, Marion; Cumming, Helen; Chapman, Ross; Hertzog, Paul J

    2013-01-01

    Interferome v2.0 (http://interferome.its.monash.edu.au/interferome/) is an update of an earlier version of the Interferome DB published in the 2009 NAR database edition. Vastly improved computational infrastructure now enables more complex and faster queries, and supports more data sets from types I, II and III interferon (IFN)-treated cells, mice or humans. Quantitative, MIAME compliant data are collected, subjected to thorough, standardized, quantitative and statistical analyses and then significant changes in gene expression are uploaded. Comprehensive manual collection of metadata in v2.0 allows flexible, detailed search capacity including the parameters: range of -fold change, IFN type, concentration and time, and cell/tissue type. There is no limit to the number of genes that can be used to search the database in a single query. Secondary analysis such as gene ontology, regulatory factors, chromosomal location or tissue expression plots of IFN-regulated genes (IRGs) can be performed in Interferome v2.0, or data can be downloaded in convenient text formats compatible with common secondary analysis programs. Given the importance of IFN to innate immune responses in infectious, inflammatory diseases and cancer, this upgrade of the Interferome to version 2.0 will facilitate the identification of gene signatures of importance in the pathogenesis of these diseases. PMID:23203888

  19. Screening and functional pathway analysis of genes associated with pediatric allergic asthma using a DNA microarray.

    PubMed

    Lu, Li-Qun; Liao, Wei

    2015-06-01

    The present study aimed to identify differentially expressed genes (DEGs) associated with pediatric allergic asthma, and to analyze the functional pathways of the selected target genes, in order to explore the pathogenesis of the disease. The GSE18965 gene expression profile was downloaded from the Gene Expression Omnibus database and was preprocessed. This gene expression profile consisted of seven normal samples and nine samples from patients with pediatric allergic asthma. The DEGs between the normal and pediatric allergic asthma samples were screened using the limma package in R, and the cut-off value was set at false discovery rate <0.05 and log fold change >1. Following hierarchical clustering of the DEGs based on the expression profiles, the up- and downregulated genes underwent a functional enrichment analysis by topological approach (P<0.05), using the Database for Annotation, Visualization and Integrated Discovery. A total of 127 DEGs were identified between the normal and pediatric allergic asthma samples. The up- and downregulated genes were significantly enriched in the actin filament-based process and the monosaccharide metabolic process, respectively. Seven downregulated DEGs (M6PR, TPP1, GLB1, NEU1, ACP2, LAMP1 and HGSNAT) were identified in the lysosomal pathway, with P=6.4×10^-9. These results suggested that variation in lysosomal function, triggered by the seven downregulated genes, may lead to aberrant functioning of the T lymphocytes, resulting in asthma. Further research regarding the treatment of pediatric allergic asthma through targeting lysosomal function is required. PMID:25633562

  20. Screening and functional pathway analysis of genes associated with pediatric allergic asthma using a DNA microarray

    PubMed Central

    LU, LI-QUN; LIAO, WEI

    2015-01-01

    The present study aimed to identify differentially expressed genes (DEGs) associated with pediatric allergic asthma, and to analyze the functional pathways of the selected target genes, in order to explore the pathogenesis of the disease. The GSE18965 gene expression profile was downloaded from the Gene Expression Omnibus database and was preprocessed. This gene expression profile consisted of seven normal samples and nine samples from patients with pediatric allergic asthma. The DEGs between the normal and pediatric allergic asthma samples were screened using the limma package in R, and the cut-off value was set at false discovery rate <0.05 and log fold change >1. Following hierarchical clustering of the DEGs based on the expression profiles, the up- and downregulated genes underwent a functional enrichment analysis by topological approach (P<0.05), using the Database for Annotation, Visualization and Integrated Discovery. A total of 127 DEGs were identified between the normal and pediatric allergic asthma samples. The up- and downregulated genes were significantly enriched in the actin filament-based process and the monosaccharide metabolic process, respectively. Seven downregulated DEGs (M6PR, TPP1, GLB1, NEU1, ACP2, LAMP1 and HGSNAT) were identified in the lysosomal pathway, with P=6.4×10^-9. These results suggested that variation in lysosomal function, triggered by the seven downregulated genes, may lead to aberrant functioning of the T lymphocytes, resulting in asthma. Further research regarding the treatment of pediatric allergic asthma through targeting lysosomal function is required. PMID:25633562
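
    The screening step in the abstract above (false discovery rate <0.05 and log fold change >1) can be pictured as a simple threshold filter over per-gene statistics. The sketch below is in Python rather than the limma/R workflow the study used. The `GeneStat` record, the gene names and the values are hypothetical. Applying the fold-change cutoff to the absolute value, so that both up- and downregulated genes are retained, is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneStat:
    gene: str      # gene identifier (hypothetical placeholder names below)
    log_fc: float  # log fold change, case versus control
    fdr: float     # multiple-testing-adjusted p-value


def screen_degs(stats, fdr_cutoff=0.05, log_fc_cutoff=1.0):
    """Split genes passing both cutoffs into (upregulated, downregulated)."""
    up, down = [], []
    for s in stats:
        # Assumption: the fold-change threshold applies to |log_fc|.
        if s.fdr < fdr_cutoff and abs(s.log_fc) > log_fc_cutoff:
            (up if s.log_fc > 0 else down).append(s.gene)
    return up, down


if __name__ == "__main__":
    toy = [
        GeneStat("GENE_A", 1.5, 0.01),   # passes, upregulated
        GeneStat("GENE_B", -2.0, 0.03),  # passes, downregulated
        GeneStat("GENE_C", 0.4, 0.001),  # fold change too small
        GeneStat("GENE_D", 3.0, 0.20),   # not significant after adjustment
    ]
    print(screen_degs(toy))
```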

  1. Automatic Assignment of Protein Function with Supervised Classifiers 

    E-print Network

    Jung, Jae

    2010-01-16

    High-throughput genome sequencing and sequence analysis technologies have created the need for automated annotation and analysis of large sets of genes. The Gene Ontology (GO) provides a common controlled vocabulary for describing gene function...

  2. Argot2: a large scale function prediction tool relying on semantic similarity of weighted Gene Ontology terms

    PubMed Central

    2012-01-01

    Background: Predicting protein function has become increasingly demanding in the era of next generation sequencing technology. The task to assign a curator-reviewed function to every single sequence is impracticable. Bioinformatics tools, easy to use and able to provide automatic and reliable annotations at a genomic scale, are necessary and urgent. In this scenario, the Gene Ontology has provided the means to standardize the annotation classification with a structured vocabulary which can be easily exploited by computational methods. Results: Argot2 is a web-based function prediction tool able to annotate nucleic or protein sequences from small datasets up to entire genomes. It accepts as input a list of sequences in FASTA format, which are processed using BLAST and HMMER searches vs UniProtKB and Pfam databases respectively; these sequences are then annotated with GO terms retrieved from the UniProtKB-GOA database and the terms are weighted using the e-values from BLAST and HMMER. The weighted GO terms are processed according to both their semantic similarity relations described by the Gene Ontology and their associated score. The algorithm is based on the original idea developed in a previous tool called Argot. The entire engine has been completely rewritten to improve both accuracy and computational efficiency, thus allowing for the annotation of complete genomes. Conclusions: The revised algorithm has already been employed and successfully tested during in-house genome projects of grape and apple, and has proven to have a high precision and recall in all our benchmark conditions. It has also been successfully compared with Blast2GO, one of the methods most commonly employed for sequence annotation. The server is freely accessible at http://www.medcomp.medicina.unipd.it/Argot2. PMID:22536960
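
    The abstract above says that GO terms retrieved for each hit are weighted using the e-values from BLAST and HMMER, without giving the formula. The sketch below shows one plausible way such a weighting could work: each hit contributes -log10(e-value) to the terms it carries, and contributions are summed per term. Both the transform and the summation are assumptions made for illustration, not Argot2's documented scoring, and the later semantic-similarity step is left out.

```python
import math
from collections import defaultdict


def weight_go_terms(hits, min_evalue=1e-300):
    """Accumulate an e-value-based weight per GO term.

    `hits` is an iterable of (go_term, e_value) pairs, one per database hit
    carrying that term. E-values are clamped at `min_evalue` so that a
    reported e-value of 0 does not break the logarithm, and hits with an
    e-value above 1 contribute nothing instead of a negative weight.
    """
    scores = defaultdict(float)
    for term, evalue in hits:
        weight = -math.log10(max(evalue, min_evalue))
        scores[term] += max(weight, 0.0)
    return dict(scores)


if __name__ == "__main__":
    # Hypothetical hits: two support the same term, one supports another.
    example_hits = [
        ("GO:0003824", 1e-10),
        ("GO:0003824", 1e-5),
        ("GO:0005515", 1e-2),
    ]
    for term, score in sorted(weight_go_terms(example_hits).items()):
        print(term, round(score, 2))
```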

  3. GRYFUN: A Web Application for GO Term Annotation Visualization and Analysis in Protein Sets

    PubMed Central

    Bastos, Hugo P.; Sousa, Lisete; Clarke, Luka A.; Couto, Francisco M.

    2015-01-01

    Functional context for biological sequence is provided in the form of annotations. However, within a group of similar sequences there can be annotation heterogeneity in terms of coverage and specificity. This in turn can introduce issues regarding the interpretation of actual functional similarity and overall functional coherence of such a group. One way to mitigate such issues is through the use of visualization and statistical techniques. Therefore, in order to help interpret this annotation heterogeneity we created a web application that generates Gene Ontology annotation graphs for protein sets and their associated statistics from simple frequencies to enrichment values and Information Content based metrics. The publicly accessible website http://xldb.di.fc.ul.pt/gryfun/ currently accepts lists of UniProt accession numbers in order to create user-defined protein sets for subsequent annotation visualization and statistical assessment. GRYFUN is a freely available web application that allows GO annotation visualization of protein sets and which can be used for annotation coherence and cohesiveness analysis and annotation extension assessments within under-annotated protein sets. PMID:25794277
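
    The abstract above mentions statistics ranging from simple frequencies to Information Content based metrics. A common definition of information content for a term is the negative logarithm of its annotation frequency. The sketch below computes that over a protein set. The accession-like names are hypothetical, and it counts only direct annotations (no propagation to ancestor terms), which is a simplification and may differ from what GRYFUN does.

```python
import math
from collections import Counter


def term_frequencies(annotations):
    """Count how many proteins in the set carry each GO term."""
    return Counter(t for terms in annotations.values() for t in set(terms))


def information_content(annotations):
    """IC(t) = -log2(fraction of proteins in the set annotated with t)."""
    n = len(annotations)
    return {t: -math.log2(c / n) for t, c in term_frequencies(annotations).items()}


if __name__ == "__main__":
    # Hypothetical protein set: accession -> set of GO term IDs.
    protein_set = {
        "P_0001": {"GO:A", "GO:B"},
        "P_0002": {"GO:A"},
        "P_0003": {"GO:A"},
        "P_0004": {"GO:A", "GO:B"},
    }
    print(information_content(protein_set))
```

    A term shared by every protein in the set carries no information, while rarer terms score higher, which is why such metrics help flag heterogeneous annotation specificity within a group.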

  4. RNA Interference for Wheat Functional Gene Analysis

    Technology Transfer Automated Retrieval System (TEKTRAN)

    RNA interference (RNAi) refers to a common mechanism of RNA-based post-transcriptional gene silencing in eukaryotic cells. In model plant species such as Arabidopsis and rice, RNAi has been routinely used to characterize gene function and to engineer novel phenotypes. In polyploid species, this appr...

  5. IIS – Integrated Interactome System: A Web-Based Platform for the Annotation, Analysis and Visualization of Protein-Metabolite-Gene-Drug Interactions by Integrating a Variety of Data Sources and Tools

    PubMed Central

    Carazzolle, Marcelo Falsarella; de Carvalho, Lucas Miguel; Slepicka, Hugo Henrique; Vidal, Ramon Oliveira; Pereira, Gonçalo Amarante Guimarães; Kobarg, Jörg; Vaz Meirelles, Gabriela

    2014-01-01

    Background: High-throughput screening of physical, genetic and chemical-genetic interactions brings important perspectives in the Systems Biology field, as the analysis of these interactions provides new insights into protein/gene function, cellular metabolic variations and the validation of therapeutic targets and drug design. However, such analysis depends on a pipeline connecting different tools that can automatically integrate data from diverse sources and result in a more comprehensive dataset that can be properly interpreted. Results: We describe here the Integrated Interactome System (IIS), an integrative platform with a web-based interface for the annotation, analysis and visualization of the interaction profiles of proteins/genes, metabolites and drugs of interest. IIS works in four connected modules: (i) Submission module, which receives raw data derived from Sanger sequencing (e.g. two-hybrid system); (ii) Search module, which enables the user to search for the processed reads to be assembled into contigs/singlets, or for lists of proteins/genes, metabolites and drugs of interest, and add them to the project; (iii) Annotation module, which assigns annotations from several databases for the contigs/singlets or lists of proteins/genes, generating tables with automatic annotation that can be manually curated; and (iv) Interactome module, which maps the contigs/singlets or the uploaded lists to entries in our integrated database, building networks that gather novel identified interactions, protein and metabolite expression/concentration levels, subcellular localization and computed topological metrics, GO biological processes and KEGG pathways enrichment. This module generates an XGMML file that can be imported into Cytoscape or be visualized directly on the web. Conclusions: We have developed IIS by the integration of diverse databases following the need of appropriate tools for a systematic analysis of physical, genetic and chemical-genetic interactions. IIS was validated with yeast two-hybrid, proteomics and metabolomics datasets, but it is also extendable to other datasets. IIS is freely available online at: http://www.lge.ibi.unicamp.br/lnbio/IIS/. PMID:24949626

  6. Functional genomics: Probing plant gene function and expression with transposons

    PubMed Central

    Martienssen, Robert A.

    1998-01-01

    Transposable elements provide a convenient and flexible means to disrupt plant genes, so allowing their function to be assessed. By engineering transposons to carry reporter genes and regulatory signals, the expression of target genes can be monitored and to some extent manipulated. Two strategies for using transposons to assess gene function are outlined here: First, the PCR can be used to identify plants that carry insertions into specific genes from among pools of heavily mutagenized individuals (site-selected transposon mutagenesis). This method requires that high copy transposons be used and that a relatively large number of reactions be performed to identify insertions into genes of interest. Second, a large library of plants, each carrying a unique insertion, can be generated. Each insertion site then can be amplified and sequenced systematically. These two methods have been demonstrated in maize, Arabidopsis, and other plant species, and the relative merits of each are discussed in the context of plant genome research. PMID:9482828

  7. GeneCards Version 3: the human gene integrator.

    PubMed

    Safran, Marilyn; Dalah, Irina; Alexander, Justin; Rosen, Naomi; Iny Stein, Tsippi; Shmoish, Michael; Nativ, Noam; Bahir, Iris; Doniger, Tirza; Krug, Hagit; Sirota-Madi, Alexandra; Olender, Tsviya; Golan, Yaron; Stelzer, Gil; Harel, Arye; Lancet, Doron

    2010-01-01

    GeneCards (www.genecards.org) is a comprehensive, authoritative compendium of annotative information about human genes, widely used for nearly 15 years. Its gene-centric content is automatically mined and integrated from over 80 digital sources, resulting in a web-based deep-linked card for each of >73,000 human gene entries, encompassing the following categories: protein coding, pseudogene, RNA gene, genetic locus, cluster and uncategorized. We now introduce GeneCards Version 3, featuring a speedy and sophisticated search engine and a revamped, technologically enabling infrastructure, catering to the expanding needs of biomedical researchers. A key focus is on gene-set analyses, which leverage GeneCards' unique wealth of combinatorial annotations. These include the GeneALaCart batch query facility, which tabulates user-selected annotations for multiple genes and GeneDecks, which identifies similar genes with shared annotations, and finds set-shared annotations by descriptor enrichment analysis. Such set-centric features address a host of applications, including microarray data analysis, cross-database annotation mapping and gene-disorder associations for drug targeting. We highlight the new Version 3 database architecture, its multi-faceted search engine, and its semi-automated quality assurance system. Data enhancements include an expanded visualization of gene expression patterns in normal and cancer tissues, an integrated alternative splicing pattern display, and augmented multi-source SNPs and pathways sections. GeneCards now provides direct links to gene-related research reagents such as antibodies, recombinant proteins, DNA clones and inhibitory RNAs and features gene-related drugs and compounds lists. We also portray the GeneCards Inferred Functionality Score annotation landscape tool for scoring a gene's functional information status. Finally, we delineate examples of applications and collaborations that have benefited from the GeneCards suite. Database URL: www.genecards.org. PMID:20689021

  8. A novel analytical brain block tool to enable functional annotation of discriminatory transcript biomarkers among discrete regions of the fronto-limbic circuit in primate brain.

    PubMed

    Dalgard, Clifton L; Jacobowitz, David M; Singh, Vijay K; Saleem, Kadharbatcha S; Ursano, Robert J; Starr, Joshua M; Pollard, Harvey B

    2015-03-10

    Fronto-limbic circuits in the primate brain are responsible for executive function, learning and memory, and emotions, including fear. Consequently, changes in gene expression in cortical and subcortical brain regions housing these circuits are associated with many important psychiatric and neurological disorders. While high quality gene expression profiles can be identified in brains from model organisms, primate brains have unique features such as Brodmann Area 25, which is absent in rodents, yet profoundly important in primates, including humans. The potential insights to be gained from studying the human brain are complicated by the fact that the post-mortem interval (PMI) is variable, and most repositories keep solid tissue in the deep frozen state. Consequently, sampling the important medial and internal regions of these brains is difficult. Here we describe a novel method for obtaining discrete regions from the fronto-limbic circuits of a 4 year old and a 5 year old, male, intact, frozen non-human primate (NHP) brain, for which the PMI is exactly known. The method also preserves high quality RNA, from which we use transcriptional profiling and a new algorithm to identify region-exclusive RNA signatures for Area 25 (NF-κB and dopamine receptor signaling), the anterior cingulate cortex (LXR/RXR signaling), the amygdala (semaphorin signaling), and the hippocampus (Ca(++) and retinoic acid signaling). The RNA signatures not only reflect function of the different regions, but also include highly expressed RNAs for which function is either poorly understood, or which generate proteins presently lacking annotated functions. We suggest that this new approach will provide a useful strategy for identifying changes in fronto-limbic system biology underlying normal development, aging and disease in the human brain. PMID:25529630

  9. Approaches to Fungal Genome Annotation

    PubMed Central

    Haas, Brian J.; Zeng, Qiandong; Pearson, Matthew D.; Cuomo, Christina A.; Wortman, Jennifer R.

    2011-01-01

    Fungal genome annotation is the starting point for analysis of genome content. This generally involves the application of diverse methods to identify features on a genome assembly such as protein-coding and non-coding genes, repeats and transposable elements, and pseudogenes. Here we describe tools and methods leveraged for eukaryotic genome annotation with a focus on the annotation of fungal nuclear and mitochondrial genomes. We highlight the application of the latest technologies and tools to improve the quality of predicted gene sets. The Broad Institute eukaryotic genome annotation pipeline is described as one example of how such methods and tools are integrated into a sequencing center’s production genome annotation environment. PMID:22059117

  10. Novel genes identified by manual annotation and microarray expression analysis in the pancreas.

    PubMed

    Mazzarelli, Joan M; White, Peter; Gorski, Regina; Brestelli, John; Pinney, Deborah F; Arsenlis, Athanasios; Katokhin, Alexey; Belova, Olga; Bogdanova, Vera; Elisafenko, Eugenij; Gubina, Marina; Nizolenko, Lilia; Perelman, Polina; Puzakov, Mikhail; Shilov, Alexandre; Trifonoff, Vladimir; Vorobjeva, Nadezhda; Kolchanov, Nikolay; Kaestner, Klaus H; Stoeckert, Christian J

    2006-12-01

    The mouse PancChip, a microarray developed for studying endocrine pancreatic development and diabetes, represents over 13,000 cDNAs. After computationally assigning the cDNAs on the array to known genes, manual curation of the remaining sequences identified 211 novel transcripts. In microarray experiments, we found that 196 of these transcripts were expressed in total pancreas and/or pancreatic islets. Of 50 randomly selected clones from these 196 transcripts, 92% were confirmed as expressed by qRT-PCR. We evaluated the coding potential of the novel transcripts and found that 74% of the clones had low coding potential. Since the transcripts may be partial mRNAs, we examined their translated proteins for transmembrane or signal peptide domains and found that about 40 proteins had one of these predicted domains. Interestingly, when we investigated the novel transcripts for their overlap with noncoding microRNAs, we found that 1 of the novel transcripts overlapped a known microRNA gene. PMID:16725306

  11. NetAffx: Affymetrix probesets and annotations

    Microsoft Academic Search

    Guoying Liu; Ann E. Loraine; Ron Shigeta; Melissa S. Cline; Jill Cheng; Venu Valmeekam; Shaw Sun; David Kulp; Michael A. Siani-rose

    2003-01-01

    NetAffx (http://www.affymetrix.com) details and annotates probesets on Affymetrix GeneChip microarrays. These annotations include (i) static information specific to the probeset composition; (ii) sequence annotations extracted from public databases; and (iii) protein sequence-level annotations derived from public domain programs, as well as libraries of hidden Markov models (HMMs) developed at Affymetrix. For each probeset, NetAffx lists the

  12. Gene function prediction with gene interaction networks: a context graph kernel approach

    Microsoft Academic Search

    Xin Li; Hsinchun Chen; Jiexun Li; Zhu Zhang

    2010-01-01

    Predicting gene functions is a challenge for biologists in the postgenomic era. Interactions among genes and their products compose networks that can be used to infer gene functions. Most previous studies adopt a linkage assumption, i.e., they assume that gene interactions indicate functional similarities between connected genes. In this study, we propose to use a gene's context graph, i.e., the

  13. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST).

    PubMed

    Overbeek, Ross; Olson, Robert; Pusch, Gordon D; Olsen, Gary J; Davis, James J; Disz, Terry; Edwards, Robert A; Gerdes, Svetlana; Parrello, Bruce; Shukla, Maulik; Vonstein, Veronika; Wattam, Alice R; Xia, Fangfang; Stevens, Rick

    2014-01-01

    In 2004, the SEED (http://pubseed.theseed.org/) was created to provide consistent and accurate genome annotations across thousands of genomes and as a platform for discovering and developing de novo annotations. The SEED is a constantly updated integration of genomic data with a genome database, web front end, API and server scripts. It is used by many scientists for predicting gene functions and discovering new pathways. In addition to being a powerful database for bioinformatics research, the SEED also houses subsystems (collections of functionally related protein families) and their derived FIGfams (protein families), which represent the core of the RAST annotation engine (http://rast.nmpdr.org/). When a new genome is submitted to RAST, genes are called and their annotations are made by comparison to the FIGfam collection. If the genome is made public, it is then housed within the SEED and its proteins populate the FIGfam collection. This annotation cycle has proven to be a robust and scalable solution to the problem of annotating the exponentially increasing number of genomes. To date, >12 000 users worldwide have annotated >60 000 distinct genomes using RAST. Here we describe the interconnectedness of the SEED database and RAST, the RAST annotation pipeline and updates to both resources. PMID:24293654

  14. Large gene overlaps in prokaryotic genomes: result of functional constraints or mispredictions?

    PubMed Central

    Pallejà, Albert; Harrington, Eoghan D; Bork, Peer

    2008-01-01

    Background Across the fully sequenced microbial genomes there are thousands of examples of overlapping genes. Many of these are only a few nucleotides long and are thought to function by permitting the coordinated regulation of gene expression. However, there should also be selective pressure against long overlaps, as the existence of overlapping reading frames increases the risk of deleterious mutations. Here we examine the longest overlaps and assess whether they are the product of special functional constraints or of erroneous annotation. Results We analysed the genes that overlap by 60 bps or more among 338 fully-sequenced prokaryotic genomes. The likely functional significance of an overlap was determined by comparing each of the genes to its respective orthologs. If a gene showed a significantly different length from its orthologs it was considered unlikely to be functional and therefore the result of an error either in sequencing or gene prediction. Focusing on 715 co-directional overlaps longer than 60 bps, we classified the erroneous ones into five categories: i) 5'-end extension of the downstream gene due to either a mispredicted start codon or a frameshift at 5'-end of the gene (409 overlaps), ii) fragmentation of a gene caused by a frameshift (163), iii) 3'-end extension of the upstream gene due to either a frameshift at 3'-end of a gene or point mutation at the stop codon (68), iv) Redundant gene predictions (4), v) 5' & 3'-end extension which is a combination of i) and iii) (71). We also studied 75 divergent overlaps that could be classified as misannotations of group i). Nevertheless we found some convergent long overlaps (54) that might be true overlaps, although an important part of convergent overlaps could be classified as group iii) (124). 
Conclusion Among the 968 overlaps larger than 60 bps which we analysed, we did not find a single real one among the co-directional and divergent orientations and concluded that there had been an excessive number of misannotations. Only convergent orientation seems to permit some long overlaps, although convergent overlaps are also hampered by misannotations. We propose a simple rule to flag these erroneous gene length predictions to facilitate automatic annotation. PMID:18627618
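
    As a hedged illustration of the kind of flagging rule the abstract proposes, the Python sketch below reports neighbouring gene pairs that overlap by 60 bp or more and labels their relative orientation. The `Gene` record, the 1-based inclusive coordinate convention, and the function names are assumptions made for this example and are not taken from the paper. The ortholog length comparison the authors used to decide whether an overlap is erroneous is not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple

MIN_OVERLAP_BP = 60  # threshold quoted in the abstract


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; coordinates are 1-based and inclusive."""
    name: str
    start: int   # leftmost coordinate, regardless of strand
    end: int     # rightmost coordinate, regardless of strand
    strand: str  # "+" or "-"


def overlap_length(a: Gene, b: Gene) -> int:
    """Number of base pairs shared by two genes (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def orientation(left: Gene, right: Gene) -> str:
    """Classify a pair where `left` has the smaller start coordinate."""
    if left.strand == right.strand:
        return "co-directional"
    # "+" followed by "-": the 3' ends face each other (convergent).
    # "-" followed by "+": the 5' ends face each other (divergent).
    return "convergent" if left.strand == "+" else "divergent"


def flag_long_overlaps(genes: List[Gene]) -> List[Tuple[str, str, int, str]]:
    """Return (left, right, overlap_bp, orientation) for long overlaps.

    Only genes that are neighbours after sorting by start are compared,
    so nested or non-adjacent overlaps are outside this sketch.
    """
    ordered = sorted(genes, key=lambda g: (g.start, g.end))
    flagged = []
    for left, right in zip(ordered, ordered[1:]):
        n = overlap_length(left, right)
        if n >= MIN_OVERLAP_BP:
            flagged.append((left.name, right.name, n, orientation(left, right)))
    return flagged
```

    Pairs returned by `flag_long_overlaps` would be candidates for manual or ortholog-based review, not confirmed misannotations.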

  15. B2G-FAR, a species-centered GO annotation repository

    PubMed Central

    Götz, Stefan; Arnold, Roland; Sebastián-León, Patricia; Martín-Rodríguez, Samuel; Tischler, Patrick; Jehl, Marc-André; Dopazo, Joaquín; Rattei, Thomas; Conesa, Ana

    2011-01-01

    Motivation: Functional genomics research has expanded enormously in the last decade thanks to the cost reduction in high-throughput technologies and the development of computational tools that generate, standardize and share information on gene and protein function such as the Gene Ontology (GO). Nevertheless, many biologists, especially those working with non-model organisms, still suffer from non-existing or low-coverage functional annotation, or simply struggle retrieving, summarizing and querying these data. Results: The Blast2GO Functional Annotation Repository (B2G-FAR) is a bioinformatics resource envisaged to provide functional information for otherwise uncharacterized sequence data and offers data mining tools to analyze a larger repertoire of species than currently available. This new annotation resource has been created by applying the Blast2GO functional annotation engine in a strongly high-throughput manner to the entire space of publicly available sequences. The resulting repository contains GO term predictions for over 13.2 million non-redundant protein sequences based on BLAST search alignments from the SIMAP database. We generated GO annotation for approximately 150 000 different taxa making available 2000 species with the highest coverage through B2G-FAR. A second section within B2G-FAR holds functional annotations for 17 non-model organism Affymetrix GeneChips. Conclusions: B2G-FAR provides easy access to exhaustive functional annotation for 2000 species offering a good balance between quality and quantity, thereby supporting functional genomics research especially in the case of non-model organisms. Availability: The annotation resource is available at http://www.b2gfar.org. Contact: aconesa@cipf.es; sgoetz@cipf.es Supplementary information: Supplementary data are available at Bioinformatics online. PMID:21335611

  16. Integrator: surprisingly diverse functions in gene expression.

    PubMed

    Baillat, David; Wagner, Eric J

    2015-05-01

    The discovery of the metazoan-specific Integrator (INT) complex represented a breakthrough in our understanding of noncoding U-rich small nuclear RNA (UsnRNA) maturation and has triggered a reevaluation of their biosynthesis mechanism. In the decade since, significant progress has been made in understanding the details of its recruitment, specificity, and assembly. While some discrepancies remain on how it interacts with the C-terminal domain (CTD) of the RNA polymerase II (RNAPII) and the details of its recruitment to UsnRNA genes, preliminary models have emerged. Recent provocative studies now implicate INT in the regulation of protein-coding gene transcription initiation and RNAPII pause-release, thereby broadening the scope of INT functions in gene expression regulation. We discuss the implications of these findings while putting them into the context of what is understood about INT function at UsnRNA genes. PMID:25882383

  17. Studying Functions of All Yeast Genes Simultaneously

    NASA Technical Reports Server (NTRS)

    Stolc, Viktor; Eason, Robert G.; Poumand, Nader; Herman, Zelek S.; Davis, Ronald W.; Anthony, Kevin; Jejelowo, Olufisayo

    2006-01-01

    A method of studying the functions of all the genes of a given species of microorganism simultaneously has been developed in experiments on Saccharomyces cerevisiae (commonly known as baker's or brewer's yeast). It is already known that many yeast genes perform functions similar to those of corresponding human genes; therefore, by facilitating understanding of yeast genes, the method may ultimately also contribute to the knowledge needed to treat some diseases in humans. Because of the complexity of the method and the highly specialized nature of the underlying knowledge, it is possible to give only a brief and sketchy summary here. The method involves the use of unique synthetic deoxyribonucleic acid (DNA) sequences that are denoted as DNA bar codes because of their utility as molecular labels. The method also involves the disruption of gene functions through deletion of genes. Saccharomyces cerevisiae is a particularly powerful experimental system in that multiple deletion strains easily can be pooled for parallel growth assays. Individual deletion strains recently have been created for 5,918 open reading frames, representing nearly all of the estimated 6,000 genetic loci of Saccharomyces cerevisiae. Tagging of each deletion strain with one or two unique 20-nucleotide sequences enables identification of genes affected by specific growth conditions, without prior knowledge of gene functions. Hybridization of bar-code DNA to oligonucleotide arrays can be used to measure the growth rate of each strain over several cell-division generations. The growth rate thus measured serves as an index of the fitness of the strain.
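
    The abstract does not say how the growth rate is computed from the array signal. As a minimal sketch under that caveat, one common-sense reading is to take the least-squares slope of the log2 bar-code signal against the number of generations for each deletion strain. The function names, the input layout, and the use of a log2 slope are all assumptions made for illustration, not the authors' documented analysis.

```python
import math
from typing import Dict, Sequence


def log2_slope(generations: Sequence[float], signals: Sequence[float]) -> float:
    """Least-squares slope of log2(signal) versus generations.

    A negative slope means the strain's bar-code signal is shrinking in
    the pool, i.e. the strain is being outgrown under this condition.
    """
    if len(generations) != len(signals) or len(generations) < 2:
        raise ValueError("need at least two matched time points")
    if any(s <= 0 for s in signals):
        raise ValueError("signals must be positive to take a logarithm")
    ys = [math.log2(s) for s in signals]
    n = len(ys)
    mean_x = sum(generations) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in generations)
    if sxx == 0:
        raise ValueError("generations must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(generations, ys))
    return sxy / sxx


def fitness_index(
    generations: Sequence[float],
    pool_signals: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """Map each strain name (hypothetical labels) to its log2 signal slope."""
    return {
        strain: log2_slope(generations, signals)
        for strain, signals in pool_signals.items()
    }
```

    Strains with the most negative slopes under a given growth condition would be the first candidates for genes required in that condition.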

  18. RNA interference for wheat functional gene analysis

    Microsoft Academic Search

    Daolin Fu; Cristobal Uauy; Ann Blechl; Jorge Dubcovsky

    2007-01-01

    RNA interference (RNAi) refers to a common mechanism of RNA-based post-transcriptional gene silencing in eukaryotic cells. In model plant species such as Arabidopsis and rice, RNAi has been routinely used to characterize gene function and to engineer novel phenotypes. In polyploid species, this approach is in its early stages, but has great potential since multiple homoeologous copies can be simultaneously

  19. The DNA sequence and biological annotation of human chromosome 1.

    PubMed

    Gregory, S G; Barlow, K F; McLay, K E; Kaul, R; Swarbreck, D; Dunham, A; Scott, C E; Howe, K L; Woodfine, K; Spencer, C C A; Jones, M C; Gillson, C; Searle, S; Zhou, Y; Kokocinski, F; McDonald, L; Evans, R; Phillips, K; Atkinson, A; Cooper, R; Jones, C; Hall, R E; Andrews, T D; Lloyd, C; Ainscough, R; Almeida, J P; Ambrose, K D; Anderson, F; Andrew, R W; Ashwell, R I S; Aubin, K; Babbage, A K; Bagguley, C L; Bailey, J; Beasley, H; Bethel, G; Bird, C P; Bray-Allen, S; Brown, J Y; Brown, A J; Buckley, D; Burton, J; Bye, J; Carder, C; Chapman, J C; Clark, S Y; Clarke, G; Clee, C; Cobley, V; Collier, R E; Corby, N; Coville, G J; Davies, J; Deadman, R; Dunn, M; Earthrowl, M; Ellington, A G; Errington, H; Frankish, A; Frankland, J; French, L; Garner, P; Garnett, J; Gay, L; Ghori, M R J; Gibson, R; Gilby, L M; Gillett, W; Glithero, R J; Grafham, D V; Griffiths, C; Griffiths-Jones, S; Grocock, R; Hammond, S; Harrison, E S I; Hart, E; Haugen, E; Heath, P D; Holmes, S; Holt, K; Howden, P J; Hunt, A R; Hunt, S E; Hunter, G; Isherwood, J; James, R; Johnson, C; Johnson, D; Joy, A; Kay, M; Kershaw, J K; Kibukawa, M; Kimberley, A M; King, A; Knights, A J; Lad, H; Laird, G; Lawlor, S; Leongamornlert, D A; Lloyd, D M; Loveland, J; Lovell, J; Lush, M J; Lyne, R; Martin, S; Mashreghi-Mohammadi, M; Matthews, L; Matthews, N S W; McLaren, S; Milne, S; Mistry, S; Moore, M J F; Nickerson, T; O'Dell, C N; Oliver, K; Palmeiri, A; Palmer, S A; Parker, A; Patel, D; Pearce, A V; Peck, A I; Pelan, S; Phelps, K; Phillimore, B J; Plumb, R; Rajan, J; Raymond, C; Rouse, G; Saenphimmachak, C; Sehra, H K; Sheridan, E; Shownkeen, R; Sims, S; Skuce, C D; Smith, M; Steward, C; Subramanian, S; Sycamore, N; Tracey, A; Tromans, A; Van Helmond, Z; Wall, M; Wallis, J M; White, S; Whitehead, S L; Wilkinson, J E; Willey, D L; Williams, H; Wilming, L; Wray, P W; Wu, Z; Coulson, A; Vaudin, M; Sulston, J E; Durbin, R; Hubbard, T; Wooster, R; Dunham, I; Carter, N P; McVean, G; Ross, M T; Harrow, J; Olson, M V; 
Beck, S; Rogers, J; Bentley, D R; Banerjee, R; Bryant, S P; Burford, D C; Burrill, W D H; Clegg, S M; Dhami, P; Dovey, O; Faulkner, L M; Gribble, S M; Langford, C F; Pandian, R D; Porter, K M; Prigmore, E

    2006-05-18

    The reference sequence for each human chromosome provides the framework for understanding genome function, variation and evolution. Here we report the finished sequence and biological annotation of human chromosome 1. Chromosome 1 is gene-dense, with 3,141 genes and 991 pseudogenes, and many coding sequences overlap. Rearrangements and mutations of chromosome 1 are prevalent in cancer and many other diseases. Patterns of sequence variation reveal signals of recent selection in specific genes that may contribute to human fitness, and also in regions where no function is evident. Fine-scale recombination occurs in hotspots of varying intensity along the sequence, and is enriched near genes. These and other studies of human biology and disease encoded within chromosome 1 are made possible with the highly accurate annotated sequence, as part of the completed set of chromosome sequences that comprise the reference human genome. PMID:16710414

  20. Wiki-pi: a web-server of annotated human protein-protein interactions to aid in discovery of protein function.

    PubMed

    Orii, Naoki; Ganapathiraju, Madhavi K

    2012-01-01

    Protein-protein interactions (PPIs) are the basis of biological functions. Knowledge of the interactions of a protein can help understand its molecular function and its association with different biological processes and pathways. Several publicly available databases provide comprehensive information about individual proteins, such as their sequence, structure, and function. There also exist databases that are built exclusively to provide PPIs by curating them from published literature. The information provided in these web resources is protein-centric, and not PPI-centric. The PPIs are typically provided as lists of interactions of a given gene with links to interacting partners; they do not present a comprehensive view of the nature of both the proteins involved in the interactions. A web database that allows search and retrieval based on biomedical characteristics of PPIs is lacking, and is needed. We present Wiki-Pi (read Wiki-π), a web-based interface to a database of human PPIs, which allows users to retrieve interactions by their biomedical attributes such as their association to diseases, pathways, drugs and biological functions. Each retrieved PPI is shown with annotations of both of the participant proteins side-by-side, creating a basis to hypothesize the biological function facilitated by the interaction. Conceptually, it is a search engine for PPIs analogous to PubMed for scientific literature. Its usefulness in generating novel scientific hypotheses is demonstrated through the study of IGSF21, a little-known gene that was recently identified to be associated with diabetic retinopathy. Using Wiki-Pi, we infer that its association to diabetic retinopathy may be mediated through its interactions with the genes HSPB1, KRAS, TMSB4X and DGKD, and that it may be involved in cellular response to external stimuli, cytoskeletal organization and regulation of molecular activity.
The website also provides a wiki-like capability allowing users to describe or discuss an interaction. Wiki-Pi is available publicly and freely at http://severus.dbmi.pitt.edu/wiki-pi/. PMID:23209562

  1. Human Genome Annotation

    NASA Astrophysics Data System (ADS)

    Gerstein, Mark

    A central problem for 21st century science is annotating the human genome and making this annotation useful for the interpretation of personal genomes. My talk will focus on annotating the 99% of the genome that does not code for canonical genes, concentrating on intergenic features such as structural variants (SVs), pseudogenes (protein fossils), binding sites, and novel transcribed RNAs (ncRNAs). In particular, I will describe how we identify regulatory sites and variable blocks (SVs) based on processing next-generation sequencing experiments. I will further explain how we cluster together groups of sites to create larger annotations. Next, I will discuss a comprehensive pseudogene identification pipeline, which has enabled us to identify >10K pseudogenes in the genome and analyze their distribution with respect to age, protein family, and chromosomal location. Throughout, I will try to introduce some of the computational algorithms and approaches that are required for genome annotation. Much of this work has been carried out in the framework of the ENCODE, modENCODE, and 1000 genomes projects.

  2. Neural networks approaches for discovering the learnable correlation between gene function and gene expression in mouse

    E-print Network

    Morris, Quaid

    Keywords: gene function prediction; self-organizing maps (SOM); multilayer perceptrons (MLP); gene expression. ... function based on gene expression data is much easier in prokaryotes than eukaryotes due to the relatively ... between gene function and gene expression. In previous work, we presented novel clustering and neural

  3. Drosophila Genomic Sequence Annotation Using the BLOCKS+ Database

    Microsoft Academic Search

    Jorja G. Henikoff; Steven Henikoff

    2008-01-01

    A simple and general homology-based method for gene finding was applied to the 2.9-Mb Drosophila melanogaster Adh region, the target sequence of the Genome Annotation Assessment Project (GASP). Each strand of the entire sequence was used as query of the BLOCKS+ database of conserved regions of proteins. This led to functional assignments for more than one-third of the genes and

  4. Rapid Annotation of Anonymous Sequences from Genome Projects Using Semantic Similarities and a Weighting Scheme in Gene Ontology

    Microsoft Academic Search

    Paolo Fontana; Alessandro Cestaro; Riccardo Velasco; Elide Formentin; Stefano Toppo; Sridhar Hannenhalli

    2009-01-01

    Background: Large-scale sequencing projects have now become routine lab practice and this has led to the development of a new generation of tools involving function prediction methods, bringing the latter back to the fore. The advent of Gene Ontology, with its structured vocabulary and paradigm, has provided computational biologists with an appropriate means for this task. Methodology: We present here a novel method

  5. Next generation models for storage and representation of microbial biological annotation

    PubMed Central

    2010-01-01

    Background Traditional genome annotation systems were developed in a very different computing era, one where the World Wide Web was just emerging. Consequently, these systems are built as centralized black boxes focused on generating high quality annotation submissions to GenBank/EMBL supported by expert manual curation. The exponential growth of sequence data drives a growing need for increasingly higher quality and automatically generated annotation. Typical annotation pipelines utilize traditional database technologies, clustered computing resources, Perl, C, and UNIX file systems to process raw sequence data, identify genes, and predict and categorize gene function. These technologies tightly couple the annotation software system to hardware and third party software (e.g. relational database systems and schemas). This makes annotation systems hard to reproduce, inflexible to modification over time, difficult to assess, difficult to partition across multiple geographic sites, and difficult to understand for those who are not domain experts. These systems are not readily open to scrutiny and therefore not scientifically tractable. The advent of Semantic Web standards such as Resource Description Framework (RDF) and OWL Web Ontology Language (OWL) enables us to construct systems that address these challenges in a new comprehensive way. Results Here, we develop a framework for linking traditional data to OWL-based ontologies in genome annotation. We show how data standards can decouple hardware and third party software tools from annotation pipelines, thereby making annotation pipelines easier to reproduce and assess. An illustrative example shows how TURTLE (Terse RDF Triple Language) can be used as a human readable, but also semantically-aware, equivalent to GenBank/EMBL files. 
Conclusions The power of this approach lies in its ability to assemble annotation data from multiple databases across multiple locations into a representation that is understandable to researchers. In this way, all researchers, experimental and computational, will more easily understand the informatics processes constructing genome annotation and ultimately be able to help improve the systems that produce them. PMID:20946598

  6. Next Generation Models for Storage and Representation of Microbial Biological Annotation

    SciTech Connect

    Quest, Daniel J [ORNL; Land, Miriam L [ORNL; Brettin, Thomas S [ORNL; Cottingham, Robert W [ORNL

    2010-01-01

    Background Traditional genome annotation systems were developed in a very different computing era, one where the World Wide Web was just emerging. Consequently, these systems are built as centralized black boxes focused on generating high quality annotation submissions to GenBank/EMBL supported by expert manual curation. The exponential growth of sequence data drives a growing need for increasingly higher quality and automatically generated annotation. Typical annotation pipelines utilize traditional database technologies, clustered computing resources, Perl, C, and UNIX file systems to process raw sequence data, identify genes, and predict and categorize gene function. These technologies tightly couple the annotation software system to hardware and third party software (e.g. relational database systems and schemas). This makes annotation systems hard to reproduce, inflexible to modification over time, difficult to assess, difficult to partition across multiple geographic sites, and difficult to understand for those who are not domain experts. These systems are not readily open to scrutiny and therefore not scientifically tractable. The advent of Semantic Web standards such as Resource Description Framework (RDF) and OWL Web Ontology Language (OWL) enables us to construct systems that address these challenges in a new comprehensive way. Results Here, we develop a framework for linking traditional data to OWL-based ontologies in genome annotation. We show how data standards can decouple hardware and third party software tools from annotation pipelines, thereby making annotation pipelines easier to reproduce and assess. An illustrative example shows how TURTLE (Terse RDF Triple Language) can be used as a human readable, but also semantically-aware, equivalent to GenBank/EMBL files. 
Conclusions The power of this approach lies in its ability to assemble annotation data from multiple databases across multiple locations into a representation that is understandable to researchers. In this way, all researchers, experimental and computational, will more easily understand the informatics processes constructing genome annotation and ultimately be able to help improve the systems that produce them.

  7. ESTExplorer: an expressed sequence tag (EST) assembly and annotation platform

    PubMed Central

    Nagaraj, Shivashankar H.; Deshpande, Nandan; Gasser, Robin B.; Ranganathan, Shoba

    2007-01-01

    The analysis of expressed sequence tag (EST) datasets offers a rapid and cost-effective approach to elucidate the transcriptome of an organism, but requires several computational methods for assembly and annotation. ESTExplorer is a comprehensive workflow system for EST data management and analysis. The pipeline uses a ‘distributed control approach’ in which the most appropriate bioinformatics tools are implemented over different dedicated processors. Species-specific repeat masking and conceptual translation are in-built. ESTExplorer accepts a set of ESTs in FASTA format which can be analysed using programs selected by the user. After pre-processing and assembly, the dataset is annotated at the nucleotide and protein levels, following conceptual translation. Users may optionally provide ESTExplorer with assembled contigs for annotation purposes. Functionally annotated contigs/ESTs can be analysed individually. The overall outputs are gene ontologies, protein functional identifications in terms of mapping to protein domains and metabolic pathways. ESTExplorer has been applied successfully to annotate large EST datasets from parasitic nematodes and to identify novel genes as potential targets for parasite intervention. ESTExplorer runs on a Linux cluster and is freely available for the academic community at http://estexplorer.biolinfo.org. PMID:17545197

  8. Functionalization of a protosynaptic gene expression network.

    PubMed

    Conaco, Cecilia; Bassett, Danielle S; Zhou, Hongjun; Arcila, Mary Luz; Degnan, Sandie M; Degnan, Bernard M; Kosik, Kenneth S

    2012-06-26

    Assembly of a functioning neuronal synapse requires the precisely coordinated synthesis of many proteins. To understand the evolution of this complex cellular machine, we tracked the developmental expression patterns of a core set of conserved synaptic genes across a representative sampling of the animal kingdom. Coregulation, as measured by correlation of gene expression over development, showed a marked increase as functional nervous systems emerged. In the earliest branching animal phyla (Porifera), in which a nearly complete set of synaptic genes exists in the absence of morphological synapses, these "protosynaptic" genes displayed a lack of global coregulation although small modules of coexpressed genes are readily detectable by using network analysis techniques. These findings suggest that functional synapses evolved by exapting preexisting cellular machines, likely through some modification of regulatory circuitry. Evolutionarily ancient modules continue to operate seamlessly within the synapses of modern animals. This work shows that the application of network techniques to emerging genomic and expression data can provide insights into the evolution of complex cellular machines such as the synapse. PMID:22723359
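
    The abstract describes coregulation as correlation of gene expression over development. A minimal Python sketch of that idea, assuming a simple summary of "mean pairwise Pearson correlation across a gene set", is given below. The function names, the input layout (one expression series per gene, ordered by developmental stage), and the choice of averaging are assumptions for illustration; the paper's actual statistics may differ.

```python
import math
from itertools import combinations
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    norm = math.sqrt(sum(v * v for v in dx) * sum(v * v for v in dy))
    if norm == 0:
        raise ValueError("a constant series has no defined correlation")
    return sum(a * b for a, b in zip(dx, dy)) / norm


def mean_coregulation(expression: Dict[str, Sequence[float]]) -> float:
    """Average Pearson correlation over all gene pairs in the set.

    `expression` maps a gene name to its expression values across
    developmental stages (same stage order for every gene).
    """
    pairs = list(combinations(sorted(expression), 2))
    if not pairs:
        raise ValueError("need at least two genes")
    total = sum(pearson(expression[a], expression[b]) for a, b in pairs)
    return total / len(pairs)
```

    Under this reading, a gene set with a high mean value is tightly coregulated over development, while a value near zero would correspond to the lack of global coregulation reported for the sponge.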

  9. Functionalization of a protosynaptic gene expression network

    PubMed Central

    Conaco, Cecilia; Bassett, Danielle S.; Zhou, Hongjun; Arcila, Mary Luz; Degnan, Sandie M.; Degnan, Bernard M.; Kosik, Kenneth S.

    2012-01-01

    Assembly of a functioning neuronal synapse requires the precisely coordinated synthesis of many proteins. To understand the evolution of this complex cellular machine, we tracked the developmental expression patterns of a core set of conserved synaptic genes across a representative sampling of the animal kingdom. Coregulation, as measured by correlation of gene expression over development, showed a marked increase as functional nervous systems emerged. In the earliest branching animal phyla (Porifera), in which a nearly complete set of synaptic genes exists in the absence of morphological synapses, these “protosynaptic” genes displayed a lack of global coregulation although small modules of coexpressed genes are readily detectable by using network analysis techniques. These findings suggest that functional synapses evolved by exapting preexisting cellular machines, likely through some modification of regulatory circuitry. Evolutionarily ancient modules continue to operate seamlessly within the synapses of modern animals. This work shows that the application of network techniques to emerging genomic and expression data can provide insights into the evolution of complex cellular machines such as the synapse. PMID:22723359
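
    The abstract above measures coregulation as the correlation of gene expression over development. A minimal sketch of that idea follows, assuming Pearson correlation averaged over all gene pairs; the summary score, the function names, and the toy expression values are illustrative assumptions and are not taken from the study.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_coregulation(profiles):
    """Average pairwise correlation over all gene pairs (hypothetical summary score)."""
    pairs = list(combinations(profiles.values(), 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Toy expression values across four developmental stages (invented numbers).
coregulated = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [0.5, 1.0, 1.5, 2.0],
}
uncoordinated = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [4.0, 3.0, 2.0, 1.0],
    "geneC": [1.0, 3.0, 1.0, 3.0],
}

print(round(mean_coregulation(coregulated), 3))
print(round(mean_coregulation(uncoordinated), 3))
```

    In this toy setting the first gene set scores close to 1 and the second scores below zero, which is the kind of contrast the abstract describes between animals with functional nervous systems and the earliest branching phyla.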

  10. EST Express: PHP/MySQL based automated annotation of ESTs from expression libraries

    Microsoft Academic Search

    Robin P. Smith; William J. Buchser; Marcus B. Lemmon; Jose R. Pardinas; John L. Bixby; Vance P. Lemmon

    2008-01-01

    BACKGROUND: Several biological techniques result in the acquisition of functional sets of cDNAs that must be sequenced and analyzed. The emergence of redundant databases such as UniGene and centralized annotation engines such as Entrez Gene has allowed the development of software that can analyze a great number of sequences in a matter of seconds. RESULTS: We have developed

  11. The Structure of a Gene Co-Expression Network Reveals Biological Functions Underlying eQTLs

    PubMed Central

    Villa-Vialaneix, Nathalie; Liaubet, Laurence; Laurent, Thibault; Cherel, Pierre; Gamot, Adrien; SanCristobal, Magali

    2013-01-01

    What are the commonalities between genes, whose expression level is partially controlled by eQTL, especially with regard to biological functions? Moreover, how are these genes related to a phenotype of interest? These issues are particularly difficult to address when the genome annotation is incomplete, as is the case for mammalian species. Moreover, the direct link between gene expression and a phenotype of interest may be weak, and thus difficult to handle. In this framework, the use of a co-expression network has proven useful: it is a robust approach for modeling a complex system of genetic regulations, and to infer knowledge for yet unknown genes. In this article, a case study was conducted with a mammalian species. It showed that the use of a co-expression network based on partial correlation, combined with a relevant clustering of nodes, leads to an enrichment of biological functions of around 83%. Moreover, the use of a spatial statistics approach allowed us to superimpose additional information related to a phenotype; this led to highlighting specific genes or gene clusters that are related to the network structure and the phenotype. Three main results are worth noting: first, key genes were highlighted as a potential focus for forthcoming biological experiments; second, a set of biological functions, which support a list of genes under partial eQTL control, was set up by an overview of the global structure of the gene expression network; third, pH was found correlated with gene clusters, and then with related biological functions, as a result of a spatial analysis of the network topology. PMID:23577081
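
    The abstract above builds its co-expression network on partial correlation rather than plain correlation. The sketch below illustrates why the distinction matters, using the textbook first-order partial correlation for three variables; this is a simplification chosen for illustration (the study's own estimator is not described in the abstract), and the gene names and values are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Invented toy data: a shared regulator drives two genes that have
# independent residual variation.
regulator = [1, 1, 1, 1, -1, -1, -1, -1]
noise_x = [1, 1, -1, -1, 1, 1, -1, -1]
noise_y = [1, -1, 1, -1, 1, -1, 1, -1]
gene_x = [z + 0.5 * e for z, e in zip(regulator, noise_x)]
gene_y = [z + 0.5 * e for z, e in zip(regulator, noise_y)]

marginal = pearson(gene_x, gene_y)
partial = partial_corr(gene_x, gene_y, regulator)
print(f"marginal correlation: {marginal:.3f}")
print(f"partial correlation given regulator: {partial:.3f}")
```

    Here the two genes look strongly correlated until the shared regulator is controlled for, at which point the association vanishes. A network drawn from partial correlations would therefore omit the direct edge between them.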

  12. Rice functionality, starch structure and the genes

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Through collaborative efforts among USDA scientists at Beaumont, Texas, we have gained in-depth knowledge of how rice functionality, i.e. the texture of the cooked rice, rice processing properties, and starch gelatinization temperature, are associated with starch-synthesis genes and starch structure...

  13. Bayesian modelling of shared gene function

    Microsoft Academic Search

    P. Sykacek; R. Clarkson; C. Print; R. A. Furlong; Gos Micklem

    2007-01-01

    Motivation: Biological assays are often carried out on tissues that contain many cell lineages and active pathways. Microarray data produced using such material therefore reflect superimpositions of biological processes. Analysing such data for shared gene function by means of well matched assays may help to provide a better focus on specific cell types and processes. The identification of genes

  14. A large-scale zebrafish gene knockout resource for the genome-wide study of gene function

    PubMed Central

    Varshney, Gaurav K.; Lu, Jing; Gildea, Derek E.; Huang, Haigen; Pei, Wuhong; Yang, Zhongan; Huang, Sunny C.; Schoenfeld, David; Pho, Nam H.; Casero, David; Hirase, Takashi; Mosbrook-Davis, Deborah; Zhang, Suiyuan; Jao, Li-En; Zhang, Bo; Woods, Ian G.; Zimmerman, Steven; Schier, Alexander F.; Wolfsberg, Tyra G.; Pellegrini, Matteo; Burgess, Shawn M.; Lin, Shuo

    2013-01-01

    With the completion of the zebrafish genome sequencing project, it becomes possible to analyze the function of zebrafish genes in a systematic way. The first step in such an analysis is to inactivate each protein-coding gene by targeted or random mutation. Here we describe a streamlined pipeline using proviral insertions coupled with high-throughput sequencing and mapping technologies to widely mutagenize genes in the zebrafish genome. We also report the first 6144 mutagenized and archived F1's predicted to carry up to 3776 mutations in annotated genes. Using in vitro fertilization, we have rescued and characterized ~0.5% of the predicted mutations, showing mutation efficacy and a variety of phenotypes relevant to both developmental processes and human genetic diseases. Mutagenized fish lines are being made freely available to the public through the Zebrafish International Resource Center. These fish lines establish an important milestone for zebrafish genetics research and should greatly facilitate systematic functional studies of the vertebrate genome. PMID:23382537

  15. A large-scale zebrafish gene knockout resource for the genome-wide study of gene function.

    PubMed

    Varshney, Gaurav K; Lu, Jing; Gildea, Derek E; Huang, Haigen; Pei, Wuhong; Yang, Zhongan; Huang, Sunny C; Schoenfeld, David; Pho, Nam H; Casero, David; Hirase, Takashi; Mosbrook-Davis, Deborah; Zhang, Suiyuan; Jao, Li-En; Zhang, Bo; Woods, Ian G; Zimmerman, Steven; Schier, Alexander F; Wolfsberg, Tyra G; Pellegrini, Matteo; Burgess, Shawn M; Lin, Shuo

    2013-04-01

    With the completion of the zebrafish genome sequencing project, it becomes possible to analyze the function of zebrafish genes in a systematic way. The first step in such an analysis is to inactivate each protein-coding gene by targeted or random mutation. Here we describe a streamlined pipeline using proviral insertions coupled with high-throughput sequencing and mapping technologies to widely mutagenize genes in the zebrafish genome. We also report the first 6144 mutagenized and archived F1's predicted to carry up to 3776 mutations in annotated genes. Using in vitro fertilization, we have rescued and characterized ~0.5% of the predicted mutations, showing mutation efficacy and a variety of phenotypes relevant to both developmental processes and human genetic diseases. Mutagenized fish lines are being made freely available to the public through the Zebrafish International Resource Center. These fish lines establish an important milestone for zebrafish genetics research and should greatly facilitate systematic functional studies of the vertebrate genome. PMID:23382537

  16. Cloning, analysis and functional annotation of expressed sequence tags from the Earthworm Eisenia fetida

    Microsoft Academic Search

    M. Pirooznia; Ping Gong; Xin Guan; Laura S. Inouye; Kuan Yang; Edward J. Perkins; Youping Deng

    2007-01-01

    Eisenia fetida, commonly known as red wiggler or compost worm, belongs to the Lumbricidae family of the Annelida phylum. Little is known about its genome sequence although it has been extensively used as a test organism in terrestrial ecotoxicology. In order to understand its gene expression response to environmental contaminants, we cloned 4032 cDNAs or expressed sequence tags (ESTs) from

  17. Comparative Omics-Driven Genome Annotation Refinement: Application across Yersiniae

    SciTech Connect

    Rutledge, Alexandra C.; Jones, Marcus B.; Chauhan, Sadhana; Purvine, Samuel O.; Sanford, James; Monroe, Matthew E.; Brewer, Heather M.; Payne, Samuel H.; Ansong, Charles; Frank, Bryan C.; Smith, Richard D.; Peterson, Scott; Motin, Vladimir L.; Adkins, Joshua N.

    2012-03-27

    Genome sequencing continues to be a rapidly evolving technology, yet most downstream aspects of genome annotation pipelines remain relatively stable or are even being abandoned. To date, the perceived value of manual curation for genome annotations is not offset by the real cost and time associated with the process. In order to balance the large number of sequences generated, the annotation process is now performed almost exclusively in an automated fashion for most genome sequencing projects. One possible way to reduce errors inherent to automated computational annotations is to apply data from 'omics' measurements (i.e. transcriptional and proteomic) to the un-annotated genome with a proteogenomic-based approach. This approach does require additional experimental and bioinformatics methods to include omics technologies; however, the approach is readily automatable and can benefit from rapid developments occurring in those research domains as well. The annotation process can be improved by experimental validation of transcription and translation and aid in the discovery of annotation errors. Here the concept of annotation refinement has been extended to include a comparative assessment of genomes across closely related species, as is becoming common in sequencing efforts. Transcriptomic and proteomic data derived from three highly similar pathogenic Yersiniae (Y. pestis CO92, Y. pestis pestoides F, and Y. pseudotuberculosis PB1/+) was used to demonstrate a comprehensive comparative omic-based annotation methodology. Peptide and oligo measurements experimentally validated the expression of nearly 40% of each strain's predicted proteome and revealed the identification of 28 novel and 68 previously incorrect protein-coding sequences (e.g., observed frameshifts, extended start sites, and translated pseudogenes) within the three current Yersinia genome annotations. Gene loss is presumed to play a major role in Y. pestis acquiring its niche as a virulent pathogen, thus the discovery of many translated pseudogenes underscores a need for functional analyses to investigate hypotheses related to divergence. Refinements included the discovery of a seemingly essential ribosomal protein, several virulence-associated factors, and a transcriptional regulator, among other proteins, most of which are annotated as hypothetical, that were missed during annotation.

  18. Comparative validation of the D. melanogaster modENCODE transcriptome annotation

    PubMed Central

    Chen, Zhen-Xia; Sturgill, David; Qu, Jiaxin; Jiang, Huaiyang; Park, Soo; Boley, Nathan; Suzuki, Ana Maria; Fletcher, Anthony R.; Plachetzki, David C.; FitzGerald, Peter C.; Artieri, Carlo G.; Atallah, Joel; Barmina, Olga; Brown, James B.; Blankenburg, Kerstin P.; Clough, Emily; Dasgupta, Abhijit; Gubbala, Sai; Han, Yi; Jayaseelan, Joy C.; Kalra, Divya; Kim, Yoo-Ah; Kovar, Christie L.; Lee, Sandra L.; Li, Mingmei; Malley, James D.; Malone, John H.; Mathew, Tittu; Mattiuzzo, Nicolas R.; Munidasa, Mala; Muzny, Donna M.; Ongeri, Fiona; Perales, Lora; Przytycka, Teresa M.; Pu, Ling-Ling; Robinson, Garrett; Thornton, Rebecca L.; Saada, Nehad; Scherer, Steven E.; Smith, Harold E.; Vinson, Charles; Warner, Crystal B.; Worley, Kim C.; Wu, Yuan-Qing; Zou, Xiaoyan; Cherbas, Peter; Kellis, Manolis; Eisen, Michael B.; Piano, Fabio; Kionte, Karin; Fitch, David H.; Sternberg, Paul W.; Cutter, Asher D.; Duff, Michael O.; Hoskins, Roger A.; Graveley, Brenton R.; Gibbs, Richard A.; Bickel, Peter J.; Kopp, Artyom; Carninci, Piero; Celniker, Susan E.; Oliver, Brian; Richards, Stephen

    2014-01-01

    Accurate gene model annotation of reference genomes is critical for making them useful. The modENCODE project has improved the D. melanogaster genome annotation by using deep and diverse high-throughput data. Since transcriptional activity that has been evolutionarily conserved is likely to have an advantageous function, we have performed large-scale interspecific comparisons to increase confidence in predicted annotations. To support comparative genomics, we filled in divergence gaps in the Drosophila phylogeny by generating draft genomes for eight new species. For comparative transcriptome analysis, we generated mRNA expression profiles on 81 samples from multiple tissues and developmental stages of 15 Drosophila species, and we performed cap analysis of gene expression in D. melanogaster and D. pseudoobscura. We also describe conservation of four distinct core promoter structures composed of combinations of elements at three positions. Overall, each type of genomic feature shows a characteristic divergence rate relative to neutral models, highlighting the value of multispecies alignment in annotating a target genome that should prove useful in the annotation of other high priority genomes, especially human and other mammalian genomes that are rich in noncoding sequences. We report that the vast majority of elements in the annotation are evolutionarily conserved, indicating that the annotation will be an important springboard for functional genetic testing by the Drosophila community. PMID:24985915

  19. Current trend of annotating single nucleotide variation in humans - A case study on SNVrap.

    PubMed

    Li, Mulin Jun; Wang, Junwen

    2015-06-01

    As high throughput methods, such as whole genome genotyping arrays, whole exome sequencing (WES) and whole genome sequencing (WGS), have detected huge amounts of genetic variants associated with human diseases, function annotation of these variants is an indispensable step in understanding disease etiology. Large-scale functional genomics projects, such as The ENCODE Project and Roadmap Epigenomics Project, provide genome-wide profiling of functional elements across different human cell types and tissues. With the urgent need to identify disease-causal variants, a comprehensive and easy-to-use annotation tool is in high demand. Here we review and discuss current progress and trends in the variant annotation field. Furthermore, we introduce a comprehensive web portal for annotating human genetic variants. We use gene-based features and the latest functional genomics datasets to annotate single nucleotide variations (SNVs) in humans, at whole genome scale. We further apply several function prediction algorithms to annotate SNVs that might affect different biological processes, including transcriptional gene regulation, alternative splicing, post-transcriptional regulation, translation and post-translational modifications. The SNVrap web portal is freely available at http://jjwanglab.org/snvrap. PMID:25308971

  20. Experimental annotation of post-translational features and translated coding regions in the pathogen Salmonella Typhimurium

    SciTech Connect

    Ansong, Charles; Tolic, Nikola; Purvine, Samuel O.; Porwollik, Steffen; Jones, Marcus B.; Yoon, Hyunjin; Payne, Samuel H.; Martin, Jessica L.; Burnet, Meagan C.; Monroe, Matthew E.; Venepally, Pratap; Smith, Richard D.; Peterson, Scott; Heffron, Fred; Mcclelland, Michael; Adkins, Joshua N.

    2011-08-25

    Complete and accurate genome annotation is crucial for comprehensive and systematic studies of biological systems. For example systems biology-oriented genome scale modeling efforts greatly benefit from accurate annotation of protein-coding genes to develop proper functioning models. However, determining protein-coding genes for most new genomes is almost completely performed by inference, using computational predictions with significant documented error rates (> 15%). Furthermore, gene prediction programs provide no information on biologically important post-translational processing events critical for protein function. With the ability to directly measure peptides arising from expressed proteins, mass spectrometry-based proteomics approaches can be used to augment and verify coding regions of a genomic sequence and importantly detect post-translational processing events. In this study we utilized “shotgun” proteomics to guide accurate primary genome annotation of the bacterial pathogen Salmonella Typhimurium 14028 to facilitate a systems-level understanding of Salmonella biology. The data provides protein-level experimental confirmation for 44% of predicted protein-coding genes, suggests revisions to 48 genes assigned incorrect translational start sites, and uncovers 13 non-annotated genes missed by gene prediction programs. We also present a comprehensive analysis of post-translational processing events in Salmonella, revealing a wide range of complex chemical modifications (70 distinct modifications) and confirming more than 130 signal peptide and N-terminal methionine cleavage events in Salmonella. This study highlights several ways in which proteomics data applied during the primary stages of annotation can improve the quality of genome annotations, especially with regards to the annotation of mature protein products.

  1. Annotated Bibliography

    NSDL National Science Digital Library

    Leslie Davis

    Annotations are short and cannot give detailed information, but they should cover these points: 1. The general contents of the work. What does it discuss and how detailed is it? This is the main portion of the annotation. 2. The author's qualifications. Is the writer a trained scholar? A journalist? Someone relating a personal experience? 3. An evaluation of the reliability. Is the information given reliable? Are facts or opinions stressed? 4. The intended audience. Is it for a general reader or a specialist? How much, if any, background knowledge is needed to understand it? Was it easy or difficult to read?

  2. GENCODE: The reference human genome annotation for The ENCODE Project

    E-print Network

    Lin, Michael

    The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation ...

  3. Integrative bioinformatics for functional genome annotation: trawling for G protein-coupled receptors.

    PubMed

    Flower, Darren R; Attwood, Teresa K

    2004-12-01

    G protein-coupled receptors (GPCRs) are amongst the best studied and most functionally diverse types of cell-surface protein. The importance of GPCRs as mediators of cell function and organismal development underlies their involvement in key physiological roles and their prominence as targets for pharmacological therapeutics. In this review, we highlight the requirement for integrated protocols which underline the different perspectives offered by different sequence analysis methods. BLAST and FastA offer broad brush strokes. Motif-based search methods add the fine detail. Structural modelling offers another perspective which allows us to elucidate the physicochemical properties that underlie ligand binding. Together, these different views provide a more informative and a more detailed picture of GPCR structure and function. Many GPCRs remain orphan receptors with no identified ligand, yet as computer-driven functional genomics starts to elaborate their functions, a new understanding of their roles in cell and developmental biology will follow. PMID:15561589

  4. Function Annotation of Hepatic Retinoid x Receptor α Based on Genome-Wide DNA Binding and Transcriptome Profiling

    PubMed Central

    Zhan, Qi; Fang, Yaping; He, Yuqi; Liu, Hui-Xin; Fang, Jianwen; Wan, Yu-Jui Yvonne

    2012-01-01

    Background: Retinoid x receptor α (RXRα) is abundantly expressed in the liver and is essential for the function of other nuclear receptors. Using chromatin immunoprecipitation sequencing and mRNA profiling data generated from wild type and RXRα-null mouse livers, the current study identifies the bona-fide hepatic RXRα targets and biological pathways. In addition, based on binding and motif analysis, the molecular mechanism by which RXRα regulates hepatic genes is elucidated in a high-throughput manner. Principal Findings: Close to 80% of hepatic expressed genes were bound by RXRα, while 16% were expressed in an RXRα-dependent manner. Motif analysis predicted direct repeat with a spacer of one nucleotide as the most prevalent RXRα binding site. Many of the 500 strongest binding motifs overlapped with the binding motif of specific protein 1. Biological functional analysis of RXRα-dependent genes revealed that hepatic RXRα deficiency mainly resulted in up-regulation of steroid and cholesterol biosynthesis-related genes and down-regulation of translation- as well as anti-apoptosis-related genes. Furthermore, RXRα bound to many genes that encode nuclear receptors and their cofactors suggesting the central role of RXRα in regulating nuclear receptor-mediated pathways. Conclusions: This study establishes the relationship between RXRα DNA binding and hepatic gene expression. RXRα binds extensively to the mouse genome. However, DNA binding does not necessarily affect the basal mRNA level. In addition to metabolism, RXRα dictates the expression of genes that regulate RNA processing, translation, and protein folding illustrating the novel roles of hepatic RXRα in post-transcriptional regulation. PMID:23166811

  5. Cancer Proliferation Gene Discovery Through Functional Genomics

    PubMed Central

    Schlabach, Michael R.; Luo, Ji; Solimini, Nicole L.; Hu, Guang; Xu, Qikai; Li, Mamie Z.; Zhao, Zhenming; Smogorzewska, Agata; Sowa, Mathew E.; Ang, Xiaolu L.; Westbrook, Thomas F.; Liang, Anthony C.; Chang, Kenneth; Hackett, Jennifer A.; Harper, J. Wade; Hannon, Gregory J.; Elledge, Stephen J.

    2010-01-01

    Retroviral short hairpin RNA (shRNA)–mediated genetic screens in mammalian cells are powerful tools for discovering loss-of-function phenotypes. We describe a highly parallel multiplex methodology for screening large pools of shRNAs using half-hairpin barcodes for microarray deconvolution. We carried out dropout screens for shRNAs that affect cell proliferation and viability in cancer cells and normal cells. We identified many shRNAs to be antiproliferative that target core cellular processes, such as the cell cycle and protein translation, in all cells examined. Moreover, we identified genes that are selectively required for proliferation and survival in different cell lines. Our platform enables rapid and cost-effective genome-wide screens to identify cancer proliferation and survival genes for target discovery. Such efforts are complementary to the Cancer Genome Atlas and provide an alternative functional view of cancer cells. PMID:18239126

  6. Cost benefit theory and optimal design of gene regulation functions

    Microsoft Academic Search

    Tomer Kalisky; Erez Dekel; Uri Alon

    2007-01-01

    Cells respond to the environment by regulating the expression of genes according to environmental signals. The relation between the input signal level and the expression of the gene is called the gene regulation function. It is of interest to understand the shape of a gene regulation function in terms of the environment in which it has evolved and the basic

  7. BABELOMICS: a systems biology perspective in the functional annotation of genome-scale experiments

    Microsoft Academic Search

    Fátima Al-shahrour; Pablo Minguez; Joaquín Tárraga; David Montaner; Eva Alloza; Juan M. Vaquerizas; Lucía Conde; Christian Blaschke; Javier Vera; Joaquín Dopazo

    2006-01-01

    We present a new version of Babelomics, a complete suite of web tools for functional analysis of genome-scale experiments, with new and improved tools. New functionally relevant terms have been included such as CisRed motifs or bioentities obtained by text-mining procedures. An improved indexing has considerably speeded up several of the modules. An improved version of the FatiScan method

  8. BG7: a new approach for bacterial genome annotation designed for next generation sequencing data.

    PubMed

    Pareja-Tobes, Pablo; Manrique, Marina; Pareja-Tobes, Eduardo; Pareja, Eduardo; Tobes, Raquel

    2012-01-01

    BG7 is a new system for de novo bacterial, archaeal and viral genome annotation based on a new approach specifically designed for annotating genomes sequenced with next generation sequencing technologies. The system is versatile and able to annotate genes even in the step of preliminary assembly of the genome. It is especially efficient at detecting unexpected genes horizontally acquired from bacterial or archaeal distant genomes, phages, plasmids, and mobile elements. From the initial phases of the gene annotation process, BG7 exploits the massive availability of annotated protein sequences in databases. BG7 predicts ORFs and infers their function based on protein similarity with a wide set of reference proteins, integrating ORF prediction and functional annotation phases in just one step. BG7 is especially tolerant to sequencing errors in start and stop codons, to frameshifts, and to assembly or scaffolding errors. The system is also tolerant to the high level of gene fragmentation which is frequently found in not fully assembled genomes. The current version of BG7, which is developed in Java, takes advantage of Amazon Web Services (AWS) cloud computing features, but it can also be run locally in any operating system. BG7 is a fast, automated and scalable system that can cope with the challenge of analyzing the huge number of genomes that are being sequenced with NGS technologies. Its capabilities and efficiency were demonstrated in the 2011 EHEC Germany outbreak, in which BG7 was used to obtain the first annotations the day after the first entero-hemorrhagic E. coli genome sequences were made publicly available. The suitability of BG7 for genome annotation has been proved for Illumina, 454, Ion Torrent, and PacBio sequencing technologies. Besides, thanks to its plasticity, our system could be very easily adapted to work with new technologies in the future. PMID:23185310

  9. Table 1: Summary of Applications for Protein Functional Annotation Method Advantages/Disadvantages Website

    E-print Network

    Powers, Robert

    limitation Slow (1-2 days) http://ef-site.hgc.jp/eF-seek JAFA [70] Advantages meta-server to sequence function [15] http://jafa.burnham.org PDB-UF [27] Advantages assigns E.C. number to hypothetical proteins

  10. Protein surface analysis for function annotation in high-throughput structural genomics pipeline

    Microsoft Academic Search

    T. Andrew Binkowski; Andrzej Joachimiak; Jie Liang

    2005-01-01

    Structural genomics (SG) initiatives are expanding the universe of protein fold space by rapidly determining structures of proteins that were intentionally selected on the basis of low sequence similarity to proteins of known structure. Often these proteins have no associated biochemical or cellular functions. The SG success has resulted in an accelerated deposition of novel structures. In some cases the

  11. Proteome Analyst Transparent High-throughput Protein Annotation: Function, Localization and Custom Predictors

    E-print Network

    Lu, Paul

    . coli and other microbes. Variations on these functional vocabularies were added as other genomes T6G 2E8 CANADA Abstract Modern sequencing technology permits sequencing of entire genomes, whose-throughput sequencing technology has made it possible for even the smallest laboratory to sequence the genome

  12. Comparison of Protein Active Site Structures for Functional Annotation of Proteins and Drug Design

    E-print Network

    Powers, Robert

    of numerous genome sequencing projects and the vastly expanding list of unannotated proteins. Traditionally, global primary-sequence and structure comparisons have been used to determine putative function of various genomics efforts has been a vast growth in putative protein sequences that lack any experimental

  13. Towards Ontologies of Functionality and Semantic Annotation for Technical Knowledge Management

    Microsoft Academic Search

    Yoshinobu Kitamura; Naoya Washio; Yusuke Koji; Riichiro Mizoguchi

    2005-01-01

    This research aims at promoting sharing of knowledge about functionality of artifacts among engineers, which tends to be implicit in practice. In order to provide a conceptual viewpoint for modeling and a controlled vocabulary, we have developed an ontological framework of functional knowledge. This framework has been successfully deployed in a manufacturing company. This paper firstly discusses an

  14. Management and Analysis of Genomic Functional and Phenotypic Controlled Annotations to Support Biomedical Investigation and Practice

    Microsoft Academic Search

    Marco Masseroli

    2007-01-01

    The growing body of available genomic information provides new opportunities for novel research approaches and original biomedical applications that can provide effective data management and analysis support. In fact, integration and comprehensive evaluation of available controlled data can highlight information patterns that help unveil new biomedical knowledge. Here, we describe Genome Function INtegrated Discover (GFINDer), a Web-accessible three-tier multidatabase system we

  15. Gene polymorphisms associated with functional dyspepsia

    PubMed Central

    Kourikou, Anastasia; Karamanolis, George P; Dimitriadis, George D; Triantafyllou, Konstantinos

    2015-01-01

    Functional dyspepsia (FD) is a constellation of functional upper abdominal complaints with poorly elucidated pathophysiology. However, there is increasing evidence that susceptibility to FD is influenced by hereditary factors. Genetic association studies in FD have examined genotypes related to gastrointestinal motility or sensation, as well as those related to inflammation or immune response. G-protein β3 subunit gene polymorphisms were first reported as being associated with FD. Thereafter, several gene polymorphisms including serotonin transporter promoter, interleukin-17F, migration inhibitory factor, cholecystokinin-1 intron 1, cyclooxygenase-1, catechol-o-methyltransferase, transient receptor potential vanilloid 1 receptor, regulated upon activation normal T cell expressed and secreted, p22PHOX, Toll like receptor 2, SCN10A, CD14 and adrenoreceptors have been investigated in relation to FD; however, the results are contradictory. Several limitations undermine the value of current studies, among them inconsistencies in the definitions of FD and controls, subject composition differences regarding FD subtypes, inadequate samples, geographical and ethnic differences, and unadjusted environmental factors. Further well-designed studies are necessary to determine how targeted gene polymorphisms influence the clinical manifestations and potentially the therapeutic response in FD. PMID:26167069

  16. Assembly, gene annotation and marker development using 454 floral transcriptome sequences in Ziziphus celata (Rhamnaceae), a highly endangered, Florida endemic plant.

    PubMed

    Edwards, Christine E; Parchman, Thomas L; Weekley, Carl W

    2012-01-01

    Large-scale DNA sequence data may enable development of genetic resources in endangered species, thereby facilitating conservation efforts. Ziziphus celata, a federally endangered, self-incompatible plant species occurring in Florida, USA, is one species for which genetic resources are necessary to facilitate new introductions and augmentations essential for recovery of the species. We used 454 pyrosequencing of a Z. celata normalized floral cDNA library to create a genomic resource for gene and marker discovery. A half-plate GS-FLX Titanium run yielded 655 337 reads averaging 250 bp. A total of 474 025 reads were assembled de novo into 84 645 contigs averaging 408 bp, while 181 312 reads remained unassembled. Forty-seven and 43% of contig consensus sequences had BLAST matches to known proteins in the Uniref50 and TAIR9 annotated protein databases, respectively; many contigs fully represented orthologous proteins in TAIR9. A total of 22 707 unique genes were sequenced, indicating substantial coverage of the Z. celata transcriptome. We detected single-nucleotide polymorphisms and simple sequence repeats (SSRs) and developed thousands of SSR primers for use in future genetic studies. As a first step towards understanding self-incompatibility in Z. celata, we identified sequences belonging to the gene family encoding self-incompatibility. This study demonstrates the efficacy of 454 transcriptome sequencing for rapid gene and marker discovery in an endangered plant. PMID:22039173
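    The read counts quoted in this abstract can be cross-checked with a little arithmetic. The sketch below uses only the figures stated above; the variable names are illustrative, and the derived ratios are not values reported by the authors.

```python
# Figures taken from the abstract above; variable names are illustrative.
total_reads = 655_337
assembled_reads = 474_025
unassembled_reads = 181_312
contigs = 84_645

# Assembled and unassembled reads should account for the whole run.
assert assembled_reads + unassembled_reads == total_reads

# Derived quantities (not reported in the abstract).
fraction_assembled = assembled_reads / total_reads
reads_per_contig = assembled_reads / contigs

print(f"fraction of reads assembled: {fraction_assembled:.1%}")
print(f"mean reads per contig: {reads_per_contig:.1f}")
```

    The consistency check passes, so the three read counts agree with one another.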

  18. Identifying pair-wise gene functional similarity by multiplex gene expression maps and supervised learning

    E-print Network

    Obradovic, Zoran

    Identifying pair-wise gene functional similarity by multiplex gene expression maps and supervised and gene expression profiles in the mammalian brain. However, little attention has been paid to the location information of a gene's expressions. Gene expression maps, which contain spatial information

  19. Mining Association Rules among Gene Functions in Clusters of Similar Gene Expression Maps

    E-print Network

    Obradovic, Zoran

    Mining Association Rules among Gene Functions in Clusters of Similar Gene Expression Maps Li An166522@temple.edu Abstract Association rules mining methods have been recently applied to gene expression, not much effort has focused on detecting the relation between gene expression maps and related gene

  20. Comparative genome analysis of PHB gene family reveals deep evolutionary origins and diverse gene function

    Microsoft Academic Search

    Chao Di; Wenying Xu; Zhen Su; Joshua S Yuan

    2010-01-01

    BACKGROUND: PHB (Prohibitin) gene family is involved in a variety of functions important for different biological processes. PHB genes are ubiquitously present in divergent species from prokaryotes to eukaryotes. Human PHB genes have been found to be associated with various diseases. Recent studies by our group and others have shown diverse function of PHB genes in plants for development, senescence,

  1. Neural Networks Approaches for Discovering the Learnable Correlation between Gene Function and Gene

    E-print Network

    Bonner, Anthony

    Neural Networks Approaches for Discovering the Learnable Correlation between Gene Function and Gene novel clustering and Neural Network (NN) approaches for predicting mouse gene functions from gene. Our results show that neural networks can be extremely useful in this area. We present the improved

  2. TreeQ-VISTA: An Interactive Tree Visualization Tool with Functional Annotation Query Capabilities

    SciTech Connect

    Gu, Shengyin; Anderson, Iain; Kunin, Victor; Cipriano, Michael; Minovitsky, Simon; Weber, Gunther; Amenta, Nina; Hamann, Bernd; Dubchak, Inna

    2007-05-07

    Summary: We describe a general multiplatform exploratory tool called TreeQ-Vista, designed for presenting functional annotations in a phylogenetic context. Traits, such as phenotypic and genomic properties, are interactively queried from a relational database with a user-friendly interface which provides a set of tools for users with or without SQL knowledge. The query results are projected onto a phylogenetic tree and can be displayed in multiple color groups. A rich set of browsing, grouping and query tools are provided to facilitate trait exploration, comparison and analysis. Availability: The program, detailed tutorial and examples are available online at http://genome-test.lbl.gov/vista/TreeQVista.

  3. Biological Profiling of Gene Groups utilizing Gene Ontology

    Microsoft Academic Search

    Nils Blüthgen; Karsten Brand; Branka Cajavec; Maciej Swat; Hanspeter Herzel; Dieter Beule

    2004-01-01

    Increasingly used high-throughput experimental techniques, such as DNA or protein microarrays, yield groups of interesting (e.g. differentially regulated) genes that require further biological interpretation. With the systematic functional annotation provided by the Gene Ontology, the information required to automate the interpretation task is now accessible. However, the determination of statistically significant e.g. molecular functions within these groups
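    One common way to ask whether a Gene Ontology term is over-represented in a gene group is a one-sided hypergeometric test. The sketch below is a generic illustration of that test, not necessarily the procedure used in this record; the function name and the example counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, group, hits):
    """Return P(X >= hits) for a hypergeometric draw.

    population: number of genes in the background set
    annotated:  background genes carrying the term of interest
    group:      size of the gene group under study
    hits:       genes in the group that carry the term
    """
    upper = min(annotated, group)
    total = comb(population, group)
    # math.comb returns 0 when k > n, so impossible terms drop out.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, group - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts, for illustration only.
p_value = hypergeom_enrichment_p(population=1000, annotated=50, group=20, hits=5)
print(f"enrichment p-value: {p_value:.4g}")
```

    When many terms are tested against the same group, the resulting p-values would normally need a multiple-testing correction, which this sketch leaves out.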

  4. annotations Vojtech Horky

    E-print Network

    Making SPL practical via Java annotations Vojtech Horký SPL Annotations Prototype Apache Ant Future work Conclusion Making SPL practical via Java annotations [work in progress report] Vojtech Horký 3rd January, 2012 Making SPL practical via Java annotations Vojtech Horký SPL Annotations

  5. Cloning and Functional Characterization of the Maize (Zea mays L.) Carotenoid Epsilon Hydroxylase Gene

    PubMed Central

    Sheng, Yanmin; Wang, Yingdian; Capell, Teresa; Shi, Lianxuan; Ni, Xiuzhen; Sandmann, Gerhard; Christou, Paul; Zhu, Changfu

    2015-01-01

    The assignment of functions to genes in the carotenoid biosynthesis pathway is necessary to understand how the pathway is regulated and to obtain the basic information required for metabolic engineering. Few carotenoid ε-hydroxylases have been functionally characterized in plants although this would provide insight into the hydroxylation steps in the pathway. We therefore isolated mRNA from the endosperm of maize (Zea mays L., inbred line B73) and cloned a full-length cDNA encoding CYP97C19, a putative heme-containing carotenoid ε-hydroxylase and member of the cytochrome P450 family. The corresponding CYP97C19 genomic locus on chromosome 1 was found to comprise a single-copy gene with nine introns. We expressed CYP97C19 cDNA under the control of the constitutive CaMV 35S promoter in the Arabidopsis thaliana lut1 knockout mutant, which lacks a functional CYP97C1 (LUT1) gene. The analysis of carotenoid levels and composition showed that lutein accumulated to high levels in the rosette leaves of the transgenic lines but not in the untransformed lut1 mutants. These results allowed the unambiguous functional annotation of maize CYP97C19 as an enzyme with strong zeinoxanthin ε-ring hydroxylation activity. PMID:26030746

  6. Functional Annotation of Two New Carboxypeptidases from the Amidohydrolase Superfamily of Enzymes

    SciTech Connect

    Xiang, D.; Xu, C; Kumaran, D; Brown, A; Sauder, M; Burley, S; Swaminathan, S; Raushel, F

    2009-01-01

    Two proteins from the amidohydrolase superfamily of enzymes were cloned, expressed, and purified to homogeneity. The first protein, Cc0300, was from Caulobacter crescentus CB-15, while the second one (Sgx9355e) was derived from an environmental DNA sequence originally isolated from the Sargasso Sea (gi|44371129). The catalytic functions and the substrate profiles for the two enzymes were determined with the aid of combinatorial dipeptide libraries. Both enzymes were shown to catalyze the hydrolysis of L-Xaa-L-Xaa dipeptides in which the amino acid at the N-terminus was relatively unimportant. These enzymes were specific for hydrophobic amino acids at the C-terminus. With Cc0300, substrates terminating in isoleucine, leucine, phenylalanine, tyrosine, valine, methionine, and tryptophan were hydrolyzed. The same specificity was observed with Sgx9355e, but this protein was also able to hydrolyze peptides terminating in threonine. Both enzymes were able to hydrolyze N-acetyl and N-formyl derivatives of the hydrophobic amino acids and tripeptides. The best substrates identified for Cc0300 were L-Ala-L-Leu with kcat and kcat/Km values of 37 s^-1 and 1.1 x 10^5 M^-1 s^-1, respectively, and N-formyl-L-Tyr with kcat and kcat/Km values of 33 s^-1 and 3.9 x 10^5 M^-1 s^-1, respectively. The best substrate identified for Sgx9355e was L-Ala-L-Phe with kcat and kcat/Km values of 0.41 s^-1 and 5.8 x 10^3 M^-1 s^-1. The three-dimensional structure of Sgx9355e was determined to a resolution of 2.33 Å with L-methionine bound in the active site. The α-carboxylate of the methionine is ion-paired to His-237 and also hydrogen bonded to the backbone amide groups of Val-201 and Leu-202. The α-amino group of the bound methionine interacts with Asp-328. The structural determinants for substrate recognition were identified and compared with other enzymes in this superfamily that hydrolyze dipeptides with different specificities.
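    The abstract reports kcat and kcat/Km but not Km itself. Under the usual Michaelis-Menten definitions, Km follows as kcat divided by kcat/Km. The sketch below applies that relation to the quoted values, reading the exponents as 10^5 and 10^3; the resulting Km figures are derived here for illustration and are not reported by the authors.

```python
# (kcat in s^-1, kcat/Km in M^-1 s^-1) as quoted in the abstract above.
# Exponents are read as 10^5 and 10^3 (an assumption about the flattened text).
substrates = {
    ("Cc0300", "L-Ala-L-Leu"): (37.0, 1.1e5),
    ("Cc0300", "N-formyl-L-Tyr"): (33.0, 3.9e5),
    ("Sgx9355e", "L-Ala-L-Phe"): (0.41, 5.8e3),
}


def implied_km_molar(kcat, kcat_over_km):
    """Km (in M) implied by kcat and the specificity constant kcat/Km."""
    return kcat / kcat_over_km


for (enzyme, substrate), (kcat, efficiency) in substrates.items():
    km_millimolar = implied_km_molar(kcat, efficiency) * 1e3
    print(f"{enzyme} with {substrate}: implied Km ~ {km_millimolar:.3f} mM")
```

    Because the published constants carry only two significant figures, the implied Km values should be read with the same precision.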

  7. Gene Functional Trade-Offs and the Evolution of Pleiotropy

    E-print Network

    Otto, Sarah

    INVESTIGATION Gene Functional Trade-Offs and the Evolution of Pleiotropy Frédéric Guillaume*,1, Vancouver, British Columbia, Canada V6T 1Z4 ABSTRACT Pleiotropy is the property of genes affecting multiple functions or characters of an organism. Genes vary widely in their degree of pleiotropy, but this variation

  8. Gene function prediction from congruent synthetic lethal interactions in yeast

    Microsoft Academic Search

    Ping Ye; Brian D Peyser; Xuewen Pan; Jef D Boeke; Forrest A Spencer; Joel S Bader

    2005-01-01

    We predicted gene function using synthetic lethal genetic interactions between null alleles in Saccharomyces cerevisiae. Phenotypic and protein interaction data indicate that synthetic lethal gene pairs function in parallel or compensating pathways. Congruent gene pairs, defined as sharing synthetic lethal partners, are in single pathway branches. We predicted benomyl sensitivity and nuclear migration defects using congruence; these phenotypes were uncorrelated

  9. Sebida: a database for the functional and evolutionary analysis of genes with sex-biased expression

    Microsoft Academic Search

    Florian Gnad; John Parsch

    2006-01-01

    Summary: We describe Sebida, a database of genes with sex-biased expression. The database integrates results from multiple, independent microarray studies comparing male and female gene expression in Drosophila melanogaster, Drosophila simulans and Anopheles gambiae. Sebida uses standard nomenclature, which allows individual genes to be compared across different microarray platforms and to be queried by gene name, symbol, or annotation number. In addition to ratios of male/female expression for each gene, Sebida also contains information useful for evolutionary studies, such as local recombination rate, degree

  10. The Adult Mouse Anatomical Dictionary: a tool for annotating and integrating data

    Microsoft Academic Search

    Terry F Hayamizu; Mary Mangan; John P Corradi; James A Kadin; Martin Ringwald

    2005-01-01

    We have developed an ontology to provide standardized nomenclature for anatomical terms in the postnatal mouse. The Adult Mouse Anatomical Dictionary is structured as a directed acyclic graph, and is organized hierarchically both spatially and functionally. The ontology will be used to annotate and integrate different types of data pertinent to anatomy, such as gene expression patterns and phenotype information,

  11. High-resolution chemical dissection of a model eukaryote reveals targets, pathways and gene functions.

    PubMed

    Hoepfner, Dominic; Helliwell, Stephen B; Sadlish, Heather; Schuierer, Sven; Filipuzzi, Ireos; Brachat, Sophie; Bhullar, Bhupinder; Plikat, Uwe; Abraham, Yann; Altorfer, Marc; Aust, Thomas; Baeriswyl, Lukas; Cerino, Raffaele; Chang, Lena; Estoppey, David; Eichenberger, Juerg; Frederiksen, Mathias; Hartmann, Nicole; Hohendahl, Annika; Knapp, Britta; Krastel, Philipp; Melin, Nicolas; Nigsch, Florian; Oakeley, Edward J; Petitjean, Virginie; Petersen, Frank; Riedl, Ralph; Schmitt, Esther K; Staedtler, Frank; Studer, Christian; Tallarico, John A; Wetzel, Stefan; Fishman, Mark C; Porter, Jeffrey A; Movva, N Rao

    2014-01-01

    Due to evolutionary conservation of biology, experimental knowledge captured from genetic studies in eukaryotic model organisms provides insight into human cellular pathways and ultimately physiology. Yeast chemogenomic profiling is a powerful approach for annotating cellular responses to small molecules. Using an optimized platform, we provide the relative sensitivities of the heterozygous and homozygous deletion collections for nearly 1800 biologically active compounds. The data quality enables unique insights into pathways that are sensitive and resistant to a given perturbation, as demonstrated with both known and novel compounds. We present examples of novel compounds that inhibit the therapeutically relevant fatty acid synthase and desaturase (Fas1p and Ole1p), and demonstrate how the individual profiles facilitate hypothesis-driven experiments to delineate compound mechanism of action. Importantly, the scale and diversity of tested compounds yield a dataset where the number of modulated pathways approaches saturation. This resource can be used to map novel biological connections, and also identify functions for unannotated genes. We validated hypotheses generated by global two-way hierarchical clustering of profiles for (i) novel compounds with a similar mechanism of action acting upon microtubules or vacuolar ATPases, and (ii) an unannotated ORF, YIL060w, that plays a role in respiration in the mitochondria. Finally, we identify and characterize background mutations in the widely used yeast deletion collection which should improve the interpretation of past and future screens throughout the community. This comprehensive resource of cellular responses enables the expansion of our understanding of eukaryotic pathway biology. PMID:24360837

  12. Computer-Based Annotation of Putative AraC/XylS-Family Transcription Factors of Known Structure but Unknown Function

    PubMed Central

    Schüller, Andreas; Slater, Alex W.; Norambuena, Tomás; Cifuentes, Juan J.; Almonacid, Leonardo I.; Melo, Francisco

    2012-01-01

    Currently, about 20 crystal structures per day are released and deposited in the Protein Data Bank. A significant fraction of these structures is produced by research groups associated with the structural genomics consortium. The biological function of many of these proteins is generally unknown or not validated by experiment. Therefore, a growing need for functional prediction of protein structures has emerged. Here we present an integrated bioinformatics method that combines sequence-based relationships and three-dimensional (3D) structural similarity of transcriptional regulators with computer prediction of their cognate DNA binding sequences. We applied this method to the AraC/XylS family of transcription factors, which is a large family of transcriptional regulators found in many bacteria controlling the expression of genes involved in diverse biological functions. Three putative new members of this family with known 3D structure but unknown function were identified for which a probable functional classification is provided. Our bioinformatics analyses suggest that they could be involved in plant cell wall degradation (Lin2118 protein from Listeria innocua, PDB code 3oou), symbiotic nitrogen fixation (protein from Chromobacterium violaceum, PDB code 3oio), and either metabolism of plant-derived biomass or nitrogen fixation (protein from Rhodopseudomonas palustris, PDB code 3mn2). PMID:22505803

  13. Discovery and Characterization of Chromatin States for Systematic Annotation of the Human Genome

    NASA Astrophysics Data System (ADS)

    Ernst, Jason; Kellis, Manolis

    A plethora of epigenetic modifications have been described in the human genome and shown to play diverse roles in gene regulation, cellular differentiation and the onset of disease. Although individual modifications have been linked to the activity levels of various genetic functional elements, their combinatorial patterns are still unresolved and their potential for systematic de novo genome annotation remains untapped. Here, we use a multivariate Hidden Markov Model to reveal chromatin states in human T cells, based on recurrent and spatially coherent combinations of chromatin marks. We define 51 distinct chromatin states, including promoter-associated, transcription-associated, active intergenic, large-scale repressed and repeat-associated states. Each chromatin state shows specific enrichments in functional annotations, sequence motifs and specific experimentally observed characteristics, suggesting distinct biological roles. This approach provides a complementary functional annotation of the human genome that reveals the genome-wide locations of diverse classes of epigenetic function.
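    To make the idea of decoding chromatin states from combinations of marks concrete, the sketch below runs Viterbi decoding on a toy two-state hidden Markov model. Everything in it is an assumption made for illustration: the two state names, the two unnamed binary marks, the hand-set probabilities, and the treatment of marks as independent Bernoulli variables within a state. The model in the abstract has 51 states whose parameters are learned from data, which this sketch does not attempt.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most probable state path for a sequence of observations."""
    # best[s] is the log-probability of the best path that ends in state s.
    best = {s: log_start[s] + log_emit(s, observations[0]) for s in states}
    back = []
    for obs in observations[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            p = max(states, key=lambda r: prev[r] + log_trans[r][s])
            pointers[s] = p
            best[s] = prev[p] + log_trans[p][s] + log_emit(s, obs)
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy model: all names and numbers below are hypothetical.
states = ["active", "repressed"]
mark_prob = {"active": (0.9, 0.1), "repressed": (0.1, 0.8)}
log_start = {s: math.log(0.5) for s in states}
log_trans = {
    a: {b: math.log(0.9 if a == b else 0.1) for b in states} for a in states
}


def log_emit(state, obs):
    """Log-probability of a tuple of 0/1 marks, assumed independent."""
    return sum(
        math.log(p if present else 1.0 - p)
        for p, present in zip(mark_prob[state], obs)
    )


# One tuple of (mark 1 present, mark 2 present) per genomic bin.
bins = [(1, 0), (1, 0), (0, 1), (0, 1), (1, 0)]
print(viterbi(bins, states, log_start, log_trans, log_emit))
```

    Working in log space avoids numerical underflow on long sequences, which matters once the bins span a whole genome.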

  14. Functional Identification of a Putative β-Galactosidase Gene in the Special lac Gene Cluster of Lactobacillus acidophilus

    Microsoft Academic Search

    Qu Pan; Junmin Zhu; Lina Liu; Yanguang Cong; Fuquan Hu; Jinchuan Li; Xiaoping Yu

    2010-01-01

    The putative β-galactosidase gene (lacZ) of Lactobacillus acidophilus has a very low degree of homology to the Escherichia coli β-galactosidase gene (lacZ) and is located in a special lac gene cluster which contains two β-galactosidase genes. No functional characteristic of the putative β-galactosidase has been described so far. In this study, the lacZ gene of L. acidophilus was hetero-expressed in E. coli and the recombinant

  15. Reconstruction of a Functional Human Gene Network, with an Application for Prioritizing Positional Candidate Genes

    PubMed Central

    Franke, Lude; Bakel, Harm van; Fokkens, Like; de Jong, Edwin D.; Egmont-Petersen, Michael; Wijmenga, Cisca

    2006-01-01

    Most common genetic disorders have a complex inheritance and may result from variants in many genes, each contributing only weak effects to the disease. Pinpointing these disease genes within the myriad of susceptibility loci identified in linkage studies is difficult because these loci may contain hundreds of genes. However, in any disorder, most of the disease genes will be involved in only a few different molecular pathways. If we know something about the relationships between the genes, we can assess whether some genes (which may reside in different loci) functionally interact with each other, indicating a joint basis for the disease etiology. There are various repositories of information on pathway relationships. To consolidate this information, we developed a functional human gene network that integrates information on genes and the functional relationships between genes, based on data from the Kyoto Encyclopedia of Genes and Genomes, the Biomolecular Interaction Network Database, Reactome, the Human Protein Reference Database, the Gene Ontology database, predicted protein-protein interactions, human yeast two-hybrid interactions, and microarray coexpressions. We applied this network to interrelate positional candidate genes from different disease loci and then tested 96 heritable disorders for which the Online Mendelian Inheritance in Man database reported at least three disease genes. Artificial susceptibility loci, each containing 100 genes, were constructed around each disease gene, and we used the network to rank these genes on the basis of their functional interactions. By following up the top five genes per artificial locus, we were able to detect at least one known disease gene in 54% of the loci studied, representing a 2.8-fold increase over random selection. This suggests that our method can significantly reduce the cost and effort of pinpointing true disease genes in analyses of disorders for which numerous loci have been reported but for which most of the genes are unknown. PMID:16685651

  16. Soybean kinome: functional classification and gene expression patterns

    PubMed Central

    Liu, Jinyi; Chen, Nana; Grant, Joshua N.; Cheng, Zong-Ming (Max); Stewart, C. Neal; Hewezi, Tarek

    2015-01-01

    The protein kinase (PK) gene family is one of the largest and most highly conserved gene families in plants and plays a role in nearly all biological functions. While a large number of genes have been predicted to encode PKs in soybean, a comprehensive functional classification and global analysis of expression patterns of this large gene family is lacking. In this study, we identified the entire soybean PK repertoire or kinome, which comprised 2166 putative PK genes, representing 4.67% of all soybean protein-coding genes. The soybean kinome was classified into 19 groups, 81 families, and 122 subfamilies. The receptor-like kinase (RLK) group was remarkably large, containing 1418 genes. Collinearity analysis indicated that whole-genome segmental duplication events may have played a key role in the expansion of the soybean kinome, whereas tandem duplications might have contributed to the expansion of specific subfamilies. Gene structure, subcellular localization prediction, and gene expression patterns indicated extensive functional divergence of PK subfamilies. Global gene expression analysis of soybean PK subfamilies revealed tissue- and stress-specific expression patterns, implying regulatory functions over a wide range of developmental and physiological processes. In addition, tissue and stress co-expression network analysis uncovered specific subfamilies with narrow or wide interconnected relationships, indicative of their association with particular or broad signalling pathways, respectively. Taken together, our analyses provide a foundation for further functional studies to reveal the biological and molecular functions of PKs in soybean. PMID:25614662
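    The counts in this abstract imply two figures that are not stated outright: the approximate total number of soybean protein-coding genes and the share of the kinome taken up by the RLK group. The sketch below derives both from the quoted numbers; the variable names are illustrative, and the results are back-of-the-envelope estimates limited by the rounding of the 4.67% figure.

```python
# Figures quoted in the abstract above; variable names are illustrative.
pk_genes = 2166
pk_share_of_genome = 0.0467  # 4.67% of all protein-coding genes
rlk_genes = 1418

# Derived estimates (not reported in the abstract).
implied_total_genes = pk_genes / pk_share_of_genome
rlk_share_of_kinome = rlk_genes / pk_genes

print(f"implied protein-coding genes: ~{implied_total_genes:,.0f}")
print(f"RLK share of the kinome: {rlk_share_of_kinome:.1%}")
```

    The RLK group therefore accounts for roughly two thirds of the predicted kinome, which is what the abstract means by "remarkably large".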

  17. The Distributed Annotation System

    Microsoft Academic Search

    Robin D. Dowell; Rodney M. Jokerst; Allen Day; Sean R. Eddy; Lincoln Stein

    2001-01-01

    Background: Currently, most genome annotation is curated by centralized groups with limited resources. Efforts to share annotations transparently among multiple groups have not yet been satisfactory. Results: Here we introduce a concept called the Distributed Annotation System (DAS). DAS allows sequence annotations to be decentralized among multiple third-party annotators and integrated on an as-needed basis by client-side software. The communication between client and servers in DAS is...

  18. Geochip-Based Functional Gene Analysis of Anodophilic

    E-print Network

    Geochip-Based Functional Gene Analysis of Anodophilic Communities in Microbial Electrolysis Cells. A microbial electrolysis cell (MEC) is a bioelectrochemical the microbial community functional structure in MECs initially operated under different conditions. We found

  19. Aldo-keto reductase (AKR) superfamily: Genomics and annotation

    PubMed Central

    2009-01-01

    Aldo-keto reductases (AKRs) are phase I metabolising enzymes that catalyse the reduced nicotinamide adenine dinucleotide (phosphate) (NAD(P)H)-dependent reduction of carbonyl groups to yield primary and secondary alcohols on a wide range of substrates, including aliphatic and aromatic aldehydes and ketones, ketoprostaglandins, ketosteroids and xenobiotics. In so doing they functionalise the carbonyl group for conjugation (phase II enzyme reactions). Although functionally diverse, AKRs form a protein superfamily based on their high sequence identity and common protein fold, the (α/β)8-barrel structure. Well over 150 AKR enzymes, from diverse organisms, have been annotated so far and given systematic names according to a nomenclature that is based on multiple protein sequence alignment and degree of identity. Annotation of non-vertebrate AKRs at the National Center for Biotechnology Information or Vertebrate Genome Annotation (Vega) database does not often include the systematic nomenclature name, so the most comprehensive overview of all annotated AKRs is found on the AKR website (http://www.med.upenn.edu/akr/). This site also hosts links to more detailed and specialised information (e.g. on crystal structures, gene expression and single nucleotide polymorphisms [SNPs]). The protein-based AKR nomenclature allows unambiguous identification of a given enzyme but does not reflect the wealth of genomic and transcriptomic variation that exists in the various databases. In this context, identification of putative new AKRs and their distinction from pseudogenes are challenging. This review provides a short summary of the characteristic features of AKR biochemistry and structure that have been reviewed in great detail elsewhere, and focuses mainly on nomenclature and database entries of human AKRs that so far have not been subject to systematic annotation. Recent developments in the annotation of SNP and transcript variance in AKRs are also summarised. PMID:19706366

  20. Recent Achievement in Gene Cloning and Functional Genomics in Soybean

    PubMed Central

    Zhai, Hong; Lü, Shixiang; Wu, Hongyan; Zhang, Yupeng

    2013-01-01

    Soybean is a model plant for photoperiodism as well as for symbiotic nitrogen fixation. However, a rather low efficiency in soybean transformation hampers functional analysis of genes isolated from soybean. In comparison, rapid development and progress in flowering time and photoperiodic response have been achieved in Arabidopsis and rice. As the soybean genomic information has been released since 2008, gene cloning and functional genomic studies have been revived as indicated by successfully characterizing genes involved in maturity and nematode resistance. Here, we review some major achievements in the cloning of some important genes and some specific features at genetic or genomic levels revealed by the analysis of functional genomics of soybean. PMID:24311973

  1. HLA Immune Function Genes in Autism

    PubMed Central

    Torres, Anthony R.; Westover, Jonna B.; Rosenspire, Allen J.

    2012-01-01

    The human leukocyte antigen (HLA) genes on chromosome 6 are instrumental in many innate and adaptive immune responses. The HLA genes/haplotypes can also be involved in immune dysfunction and autoimmune diseases. It is now becoming apparent that many of the non-antigen-presenting HLA genes make significant contributions to autoimmune diseases. Interestingly, it has been reported that autism subjects often have associations with HLA genes/haplotypes, suggesting an underlying dysregulation of the immune system mediated by HLA genes. Genetic studies have only succeeded in identifying autism-causing genes in a small number of subjects suggesting that the genome has not been adequately interrogated. Close examination of the HLA region in autism has been relatively ignored, largely due to extraordinary genetic complexity. It is our proposition that genetic polymorphisms in the HLA region, especially in the non-antigen-presenting regions, may be important in the etiology of autism in certain subjects. PMID:22928105

  2. Thousands of missed genes found in bacterial genomes and their analysis with COMBREX

    PubMed Central

    2012-01-01

    Background: The dramatic reduction in the cost of sequencing has allowed many researchers to join in the effort of sequencing and annotating prokaryotic genomes. Annotation methods vary considerably and may fail to identify some genes. Here we draw attention to a large number of likely genes missing from annotations using common tools such as Glimmer and BLAST. Results: By analyzing 1,474 prokaryotic genome annotations in GenBank, we identify 13,602 likely missed genes that are homologs to non-hypothetical proteins, and 11,792 likely missed genes that are homologs only to hypothetical proteins, yet have supporting evidence of their protein-coding nature from COMBREX, a newly created gene function database. We also estimate the likelihood that each potential missing gene found is a genuine protein-coding gene using COMBREX. Conclusions: Our analysis of the causes of missed genes suggests that larger annotation centers tend to produce annotations with fewer missed genes than smaller centers, and many of the missed genes are short genes <300 bp. Over 1,000 of the likely missed genes could be associated with phenotype information available in COMBREX. 359 of these genes, found in pathogenic organisms, may be potential targets for pharmaceutical research. The newly identified genes are available on COMBREX’s website. Reviewers: This article was reviewed by Daniel Haft, Arcady Mushegian, and M. Pilar Francino (nominated by David Ardell). PMID:23111013

  3. The Vertebrate Genome Annotation (Vega) database

    PubMed Central

    Ashurst, J. L.; Chen, C.-K.; Gilbert, J. G. R.; Jekosch, K.; Keenan, S.; Meidl, P.; Searle, S. M.; Stalker, J.; Storey, R.; Trevanion, S.; Wilming, L.; Hubbard, T.

    2005-01-01

    The Vertebrate Genome Annotation (Vega) database (http://vega.sanger.ac.uk) has been designed to be a community resource for browsing manual annotation of finished sequences from a variety of vertebrate genomes. Its core database is based on an Ensembl-style schema, extended to incorporate curation-specific metadata. In collaboration with the genome sequencing centres, Vega attempts to present consistent high-quality annotation of the published human chromosome sequences. In addition, it is also possible to view various finished regions from other vertebrates, including mouse and zebrafish. Vega displays only manually annotated gene structures built using transcriptional evidence, which can be examined in the browser. Attempts have been made to standardize the annotation procedure across each vertebrate genome, which should aid comparative analysis of orthologues across the different finished regions. PMID:15608237

  4. The Vertebrate Genome Annotation (Vega) database.

    PubMed

    Ashurst, J L; Chen, C-K; Gilbert, J G R; Jekosch, K; Keenan, S; Meidl, P; Searle, S M; Stalker, J; Storey, R; Trevanion, S; Wilming, L; Hubbard, T

    2005-01-01

    The Vertebrate Genome Annotation (Vega) database (http://vega.sanger.ac.uk) has been designed to be a community resource for browsing manual annotation of finished sequences from a variety of vertebrate genomes. Its core database is based on an Ensembl-style schema, extended to incorporate curation-specific metadata. In collaboration with the genome sequencing centres, Vega attempts to present consistent high-quality annotation of the published human chromosome sequences. In addition, it is also possible to view various finished regions from other vertebrates, including mouse and zebrafish. Vega displays only manually annotated gene structures built using transcriptional evidence, which can be examined in the browser. Attempts have been made to standardize the annotation procedure across each vertebrate genome, which should aid comparative analysis of orthologues across the different finished regions. PMID:15608237

  5. Transcriptional gene silencing as a tool for uncovering gene function in maize.

    PubMed

    Cigan, A Mark; Unger-Wallace, Erica; Haug-Collet, Kristin

    2005-09-01

    Transcriptional gene silencing has broad applications for studying gene function in planta. In maize, a large number of genes have been identified as tassel-preferred in their expression pattern, both by traditional genetic methods and by recent high-throughput expression profiling platforms. Approaches using RNA suppression may provide a rapid alternative means to identify genes directly related to pollen development in maize. The male fertility gene Ms45 and several anther-expressed genes of unknown function were used to evaluate the efficacy of generating male-sterile plants by transcriptional gene silencing. A high frequency of male-sterile plants was obtained by constitutively expressing inverted repeats (IR) of the Ms45 promoter. These sterile plants lacked MS45 mRNA due to transcriptional inactivity of the target promoter. Moreover, fertility was restored to these promoter IR-containing plants by expressing the Ms45 coding region using heterologous promoters. Transcriptional silencing of other anther-expressed genes also significantly affected male fertility phenotypes and led to increased methylation of the target promoter DNA sequences. These studies provide evidence of disruption of gene activity in monocots by RNA interference constructs directed against either native or transformed promoter regions. This approach not only enables the correlation of monocot anther-expressed genes with functions that are important for reproduction in maize, but may also provide a tool for studying gene function and identifying regulatory components unique to transcriptional gene control. PMID:16146530

  6. Using CATH-Gene3D to Analyze the Sequence, Structure, and Function of Proteins.

    PubMed

    Sillitoe, Ian; Lewis, Tony; Orengo, Christine

    2015-01-01

    The CATH database is a classification of protein structures found in the Protein Data Bank (PDB). Protein structures are chopped into individual units of structural domains, and these domains are grouped together into superfamilies if there is sufficient evidence that they have diverged from a common ancestor during the process of evolution. A sister resource, Gene3D, extends this information by scanning sequence profiles of these CATH domain superfamilies against many millions of known proteins to identify related sequences. Thus the combined CATH-Gene3D resource provides confident predictions of the likely structural fold, domain organisation, and evolutionary relatives of these proteins. In addition, this resource incorporates annotations from a large number of external databases such as known enzyme active sites, GO molecular functions, physical interactions, and mutations. This unit details how to access and understand the information contained within the CATH-Gene3D Web pages, the downloadable data files, and the remotely accessible Web services. © 2015 by John Wiley & Sons, Inc. PMID:26087950

  7. Bioinformatic prediction of gene functions regulated by quorum sensing in the bioleaching bacterium Acidithiobacillus ferrooxidans.

    PubMed

    Banderas, Alvaro; Guiliani, Nicolas

    2013-01-01

    The biomining bacterium Acidithiobacillus ferrooxidans oxidizes sulfide ores and promotes metal solubilization. The efficiency of this process depends on the attachment of cells to surfaces, a process regulated by quorum sensing (QS) cell-to-cell signalling in many Gram-negative bacteria. At. ferrooxidans has a functional QS system and the presence of AHLs enhances its attachment to pyrite. However, direct targets of the QS transcription factor AfeR remain unknown. In this study, a bioinformatic approach was used to infer possible AfeR direct targets based on the particular palindromic features of the AfeR binding site. A set of Hidden Markov Models designed to maintain palindromic regions and vary non-palindromic regions was used to screen for putative binding sites. By annotating the context of each predicted binding site (PBS), we classified them according to their positional coherence relative to other putative genomic structures such as start codons, RNA polymerase promoter elements and intergenic regions. We then used the Multiple EM for Motif Elicitation (MEME) algorithm to filter out low-homology PBSs. In summary, 75 target genes were identified, 34 of which have a higher confidence level. Among the identified genes, we found afeR itself, zwf, genes encoding glycosyltransferase activities, metallo-beta lactamases, and active transport-related proteins. Glycosyltransferases and Zwf (Glucose 6-phosphate-1-dehydrogenase) might be directly involved in polysaccharide biosynthesis and attachment to minerals by At. ferrooxidans cells during the bioleaching process. PMID:23959118
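
    The palindromic property that this screen relies on can be illustrated with a minimal Python sketch. It is not the authors' Hidden Markov Model pipeline; it only measures how closely a candidate site matches its own reverse complement. The function names and the test sequences are hypothetical and chosen for illustration.

```python
# Hypothetical helper names; a toy illustration, not the published HMM screen.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def palindrome_score(site: str) -> float:
    """Fraction of positions at which a site equals its reverse complement.

    A score of 1.0 means the site is a perfect inverted repeat.
    """
    rc = reverse_complement(site)
    matches = sum(1 for a, b in zip(site, rc) if a == b)
    return matches / len(site)
```

    A perfect inverted repeat such as GAATTC scores 1.0 under this measure, so a threshold on the score could serve as a crude pre-filter before any model-based screen.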

  8. KSHV 2.0: A Comprehensive Annotation of the Kaposi's Sarcoma-Associated Herpesvirus Genome Using Next-Generation Sequencing Reveals Novel Genomic and Functional Features

    PubMed Central

    Arias, Carolina; Weisburd, Ben; Stern-Ginossar, Noam; Mercier, Alexandre; Madrid, Alexis S.; Bellare, Priya; Holdorf, Meghan; Weissman, Jonathan S.; Ganem, Don

    2014-01-01

    Productive herpesvirus infection requires a profound, time-controlled remodeling of the viral transcriptome and proteome. To gain insights into the genomic architecture and gene expression control in Kaposi's sarcoma-associated herpesvirus (KSHV), we performed a systematic genome-wide survey of viral transcriptional and translational activity throughout the lytic cycle. Using mRNA-sequencing and ribosome profiling, we found that transcripts encoding lytic genes are promptly bound by ribosomes upon lytic reactivation, suggesting their regulation is mainly transcriptional. Our approach also uncovered new genomic features such as ribosome occupancy of viral non-coding RNAs, numerous upstream and small open reading frames (ORFs), and unusual strategies to expand the virus coding repertoire that include alternative splicing, dynamic viral mRNA editing, and the use of alternative translation initiation codons. Furthermore, we provide a refined and expanded annotation of transcription start sites, polyadenylation sites, splice junctions, and initiation/termination codons of known and new viral features in the KSHV genomic space which we have termed KSHV 2.0. Our results represent a comprehensive genome-scale image of gene regulation during lytic KSHV infection that substantially expands our understanding of the genomic architecture and coding capacity of the virus. PMID:24453964

  9. Combined Evidence Annotation of Transposable Elements in Genome Sequences

    Microsoft Academic Search

    Hadi Quesneville; Olivier Andrieu; Delphine Autard; Danielle Nouaud; Michael Ashburner; Dominique Anxolabehere

    2005-01-01

    Transposable elements (TEs) are mobile, repetitive sequences that make up significant fractions of metazoan genomes. Despite their near ubiquity and importance in genome and chromosome biology, most efforts to annotate TEs in genome sequences rely on the results of a single computational program, RepeatMasker. In contrast, recent advances in gene annotation indicate that high-quality gene models can be produced from

  10. SemFunSim: A New Method for Measuring Disease Similarity by Integrating Semantic and Gene Functional Association

    PubMed Central

    Ju, Peng; Peng, Jiajie; Wang, Yadong

    2014-01-01

    Background Measuring similarity between diseases plays an important role in disease-related molecular function research. Functional associations between disease-related genes and semantic associations between diseases are often used to identify pairs of similar diseases from different perspectives. Currently, it is still a challenge to exploit both of them to calculate disease similarity. Therefore, a new method (SemFunSim) that integrates semantic and functional association is proposed to address the issue. Methods SemFunSim is designed as follows. First of all, FunSim (Functional similarity) is proposed to calculate disease similarity using disease-related gene sets in a weighted network of human gene function. Next, SemSim (Semantic Similarity) is devised to calculate disease similarity using the relationship between two diseases from Disease Ontology. Finally, FunSim and SemSim are integrated to measure disease similarity. Results The high average AUC (area under the receiver operating characteristic curve) (96.37%) shows that SemFunSim achieves a high true positive rate and a low false positive rate. 79 of the top 100 pairs of similar diseases identified by SemFunSim are annotated in the Comparative Toxicogenomics Database (CTD) as being targeted by the same therapeutic compounds, while other methods we compared could identify 35 or less such pairs among the top 100. Moreover, when using our method on diseases without annotated compounds in CTD, we could confirm many of our predicted candidate compounds from literature. This indicates that SemFunSim is an effective method for drug repositioning. PMID:24932637

  11. High precision multi-genome scale reannotation of enzyme function by EFICAz

    Microsoft Academic Search

    Adrian K Arakaki; Weidong Tian; Jeffrey Skolnick

    2006-01-01

    BACKGROUND: The functional annotation of most genes in newly sequenced genomes is inferred from similarity to previously characterized sequences, an annotation strategy that often leads to erroneous assignments. We have performed a reannotation of 245 genomes using an updated version of EFICAz, a highly precise method for enzyme function prediction. RESULTS: Based on our three-field EC number predictions, we have

  12. Human Intellectual Disability Genes Form Conserved Functional Modules in Drosophila

    PubMed Central

    Oortveld, Merel A. W.; Keerthikumar, Shivakumar; Oti, Martin; Nijhof, Bonnie; Fernandes, Ana Clara; Kochinke, Korinna; Castells-Nobau, Anna; van Engelen, Eva; Ellenkamp, Thijs; Eshuis, Lilian; Galy, Anne; van Bokhoven, Hans; Habermann, Bianca; Brunner, Han G.; Zweier, Christiane; Verstreken, Patrik; Huynen, Martijn A.; Schenck, Annette

    2013-01-01

    Intellectual Disability (ID) disorders, defined by an IQ below 70, are genetically and phenotypically highly heterogeneous. Identification of common molecular pathways underlying these disorders is crucial for understanding the molecular basis of cognition and for the development of therapeutic intervention strategies. To systematically establish their functional connectivity, we used transgenic RNAi to target 270 ID gene orthologs in the Drosophila eye. Assessment of neuronal function in behavioral and electrophysiological assays and multiparametric morphological analysis identified phenotypes associated with knockdown of 180 ID gene orthologs. Most of these genotype-phenotype associations were novel. For example, we uncovered 16 genes that are required for basal neurotransmission and have not previously been implicated in this process in any system or organism. ID gene orthologs with morphological eye phenotypes, in contrast to genes without phenotypes, are relatively highly expressed in the human nervous system and are enriched for neuronal functions, suggesting that eye phenotyping can distinguish different classes of ID genes. Indeed, grouping genes by Drosophila phenotype uncovered 26 connected functional modules. Novel links between ID genes successfully predicted that MYCN, PIGV and UPF3B regulate synapse development. Drosophila phenotype groups show, in addition to ID, significant phenotypic similarity also in humans, indicating that functional modules are conserved. The combined data indicate that ID disorders, despite their extreme genetic diversity, are caused by disruption of a limited number of highly connected functional modules. PMID:24204314

  13. Combining many interaction networks to predict gene function and analyze gene lists.

    PubMed

    Mostafavi, Sara; Morris, Quaid

    2012-05-01

    In this article, we review how interaction networks can be used alone or in combination in an automated fashion to provide insight into gene and protein function. We describe the concept of a "gene-recommender system" that can be applied to any large collection of interaction networks to make predictions about gene or protein function based on a query list of proteins that share a function of interest. We discuss these systems in general and focus on one specific system, GeneMANIA, that has unique features and uses different algorithms from the majority of other systems. PMID:22589215

  14. Genome annotation errors in pathway databases due to semantic ambiguity in partial EC numbers

    PubMed Central

    Green, M. L.; Karp, P. D.

    2005-01-01

    We report on a new type of systematic annotation error in genome and pathway databases that results from the misinterpretation of partial Enzyme Commission (EC) numbers such as ‘1.1.1.-’. This error results in the assignment of genes annotated with a partial EC number to many or all biochemical reactions that are annotated with the same partial EC number. That inference is faulty because of the ambiguous nature of partial EC numbers. We have observed this type of error in multiple databases, including KEGG, VIMSS and IMG, all of which assign genes to KEGG pathways. The Escherichia coli subset of the KEGG database exhibits this error for 6.8% of its gene-reaction assignments. For example, KEGG contains 17 reactions that are annotated with EC 1.1.1.-. A group of three E.coli genes, b1580 [putative dehydrogenase, NAD(P)-binding, starvation-sensing protein], b3787 (UDP-N-acetyl-d-mannosaminuronic acid dehydrogenase) and b0207 (2,5-diketo-d-gluconate reductase B), is assigned to 15 of those reactions, despite experimental evidence indicating different single functions for two of the three genes. Furthermore, the databases (DBs) are internally inconsistent in that the description of gene functions for genes with partial EC numbers is inconsistent with the activities implied by reactions to which the genes were assigned. We infer that these inconsistencies result from the processing used to match gene products to reactions within KEGG's metabolic pathways. These errors affect scientists who use these DBs as online encyclopedias and they affect bioinformaticists who use these DBs to train and validate newly developed algorithms. PMID:16034025
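
    The faulty inference described in this abstract can be sketched in a few lines of Python. The sketch assumes, for illustration only, that gene-to-reaction assignment is done by comparing EC number strings; the function names are hypothetical and do not correspond to any database's actual code. A partial EC number such as 1.1.1.- names a whole class of activities, so two entries that share it need not share a function.

```python
# Hypothetical function names; a sketch of the pitfall, not KEGG's pipeline.

def is_partial_ec(ec: str) -> bool:
    """True if any field of the EC number is the placeholder '-'."""
    return "-" in ec.split(".")


def naive_assign(gene_ec: str, reaction_ec: str) -> bool:
    """Faulty rule: treat identical strings as identical functions."""
    return gene_ec == reaction_ec


def safe_assign(gene_ec: str, reaction_ec: str) -> bool:
    """Conservative rule: assign only when both EC numbers are fully specified."""
    if is_partial_ec(gene_ec) or is_partial_ec(reaction_ec):
        return False
    return gene_ec == reaction_ec
```

    Under the naive rule a gene annotated with 1.1.1.- is linked to every reaction carrying the same partial number, which is the pattern the abstract reports; the conservative rule leaves such genes unassigned until a complete EC number is available.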

  15. MitoProteome: mitochondrial protein sequence database and annotation system

    Microsoft Academic Search

    Dawn Cotter; Purnima Guda; Eoin Fahy; Shankar Subramaniam

    2004-01-01

    MitoProteome is an object-relational mitochondrial protein sequence database and annotation system. The initial release contains 847 human mitochondrial protein sequences, derived from public sequence databases and mass spectrometric analysis of highly purified human heart mitochondria. Each sequence is manually annotated with primary function, subfunction and subcellular location, and extensively annotated in an automated process with data extracted from

  16. In Silico Functional Profiling of Individual Prostate Cancer Tumors: Many Genes, Few Functions

    PubMed Central

    Gorlov, Ivan P.; Byun, Jinyoung; Logothetis, Christopher J.

    2013-01-01

    Background Identification of genes that are differently expressed is a common approach used to analyze genetic mechanisms underlying cancer development. However, recent study results suggest that many such genes relate to a small number of biological functions. We hypothesized that analysis of these functions provides a better understanding of tumor biology than does actual identification of these genes. Materials and Methods We re-analyzed publicly available gene expression data for paired samples of prostate tumor and adjacent normal tissue from the same patients to identify genes differently expressed in individual tumors and then used them to identify the functions. Results We found significant interindividual variation in the type and the number of functions. After adjusting for redundancy and nonspecificity of the functional terms, we identified seven functions. Several of them showed a strong association with clinical traits, e.g. age at diagnosis, preoperative prostate-specific antigen concentration, Gleason grade, and biochemical recurrence. Actin cytoskeleton was the function most frequently associated with clinical traits. Of note, the association between function and clinical traits was much stronger than that between the genes differently expressed and those traits. Conclusion Different prostate tumors differ in their functional profiles. Functions of differently expressed genes are strongly associated with clinical traits. This suggests that analysis of functions of differently expressed genes may provide a better description of tumor biology than does analysis of the respective genes. PMID:22593245

  17. Evolutionary Persistence of Functional Compensation by Duplicate Genes in Arabidopsis

    PubMed Central

    Kuromori, Takashi; Myouga, Fumiyoshi; Toyoda, Tetsuro; Shinozaki, Kazuo

    2009-01-01

    Knocking out a gene from a genome often causes no phenotypic effect. This phenomenon has been explained in part by the existence of duplicate genes. However, it was found that in mouse knockout data duplicate genes are as essential as singleton genes. Here, we study whether it is also true for the knockout data in Arabidopsis. From the knockout data in Arabidopsis thaliana obtained in our study and in the literature, we find that duplicate genes show a significantly lower proportion of knockout effects than singleton genes. Because the persistence of duplicate genes in evolution tends to be dependent on their phenotypic effect, we compared the ages of duplicate genes whose knockout mutants showed less severe phenotypic effects with those with more severe effects. Interestingly, the latter group of genes tends to be more anciently duplicated than the former group of genes. Moreover, using multiple-gene knockout data, we find that functional compensation by duplicate genes for a more severe phenotypic effect tends to be preserved by natural selection for a longer time than that for a less severe effect. Taken together, we conclude that duplicate genes contribute to genetic robustness mainly by preserving compensation for severe phenotypic effects in A. thaliana. PMID:20333209

  18. Saliva Microbiota Carry Caries-Specific Functional Gene Signatures

    PubMed Central

    Chang, Xingzhi; Yuan, Xiao; Tu, Qichao; Yuan, Tong; Deng, Ye; Hemme, Christopher L.; Van Nostrand, Joy; Cui, Xinping; He, Zhili; Chen, Zhenggang; Guo, Dawei; Yu, Jiangbo; Zhang, Yue; Zhou, Jizhong; Xu, Jian

    2014-01-01

    Human saliva microbiota is phylogenetically divergent among host individuals yet their roles in health and disease are poorly appreciated. We employed a microbial functional gene microarray, HuMiChip 1.0, to reconstruct the global functional profiles of human saliva microbiota from ten healthy and ten caries-active adults. Saliva microbiota in the pilot population featured a vast diversity of functional genes. No significant distinction in gene number or diversity indices was observed between healthy and caries-active microbiota. However, co-presence network analysis of functional genes revealed that caries-active microbiota was more divergent in non-core genes than healthy microbiota, although both groups exhibited a similar degree of conservation at their respective core genes. Furthermore, functional gene structure of saliva microbiota could potentially distinguish caries-active patients from healthy hosts. Microbial functions such as Diaminopimelate epimerase, Prephenate dehydrogenase, Pyruvate-formate lyase and N-acetylmuramoyl-L-alanine amidase were significantly linked to caries. Therefore, saliva microbiota carried disease-associated functional signatures, which could be potentially exploited for caries diagnosis. PMID:24533043

  19. Cloning of the Arabidopsis rwm1 gene for resistance to Watermelon mosaic virus points to a new function for natural virus resistance genes.

    PubMed

    Ouibrahim, Laurence; Mazier, Marianne; Estevan, Joan; Pagny, Gaëlle; Decroocq, Véronique; Desbiez, Cécile; Moretti, André; Gallois, Jean-Luc; Caranta, Carole

    2014-09-01

    Arabidopsis thaliana represents a valuable and efficient model to understand mechanisms underlying plant susceptibility to viral diseases. Here, we describe the identification and molecular cloning of a new gene responsible for recessive resistance to several isolates of Watermelon mosaic virus (WMV, genus Potyvirus) in the Arabidopsis Cvi-0 accession. rwm1 acts at an early stage of infection by impairing viral accumulation in initially infected leaf tissues. Map-based cloning delimited rwm1 on chromosome 1 in a 114-kb region containing 30 annotated genes. Positional and functional candidate gene analysis suggested that rwm1 encodes cPGK2 (At1g56190), an evolutionary conserved nucleus-encoded chloroplast phosphoglycerate kinase with a key role in cell metabolism. Comparative sequence analysis indicates that a single amino acid substitution (S78G) in the N-terminal domain of cPGK2 is involved in rwm1-mediated resistance. This mutation may have functional consequences because it targets a highly conserved residue, affects a putative phosphorylation site and occurs within a predicted nuclear localization signal. Transgenic complementation in Arabidopsis together with virus-induced gene silencing in Nicotiana benthamiana confirmed that cPGK2 corresponds to rwm1 and that the protein is required for efficient WMV infection. This work uncovers new insight into natural plant resistance mechanisms that may provide interesting opportunities for the genetic control of plant virus diseases. PMID:24930633

  20. In silico prioritisation of candidate genes for prokaryotic gene function discovery: an application of phylogenetic profiles

    PubMed Central

    Lin, Frank PY; Coiera, Enrico; Lan, Ruiting; Sintchenko, Vitali

    2009-01-01

    Background In silico candidate gene prioritisation (CGP) aids the discovery of gene functions by ranking genes according to an objective relevance score. While several CGP methods have been described for identifying human disease genes, corresponding methods for prokaryotic gene function discovery are lacking. Here we present two prokaryotic CGP methods, based on phylogenetic profiles, to assist with this task. Results Using gene occurrence patterns in sample genomes, we developed two CGP methods (statistical and inductive CGP) to assist with the discovery of bacterial gene functions. Statistical CGP exploits the differences in gene frequency against phenotypic groups, while inductive CGP applies supervised machine learning to identify gene occurrence patterns across genomes. Three rediscovery experiments were designed to evaluate the CGP frameworks. The first experiment attempted to rediscover peptidoglycan genes with 417 published genome sequences. Both CGP methods achieved best areas under receiver operating characteristic curve (AUC) of 0.911 in Escherichia coli K-12 (EC-K12) and 0.978 in Streptococcus agalactiae 2603 (SA-2603) genomes, with an average improvement in precision of >3.2-fold and a maximum of >27-fold using statistical CGP. A median AUC of >0.95 could still be achieved with as few as 10 genome examples in each group of genome examples in the rediscovery of the peptidoglycan metabolism genes. In the second experiment, a maximum of 109-fold improvement in precision was achieved in the rediscovery of anaerobic fermentation genes in EC-K12. The last experiment attempted to rediscover genes from 31 metabolic pathways in SA-2603, where 14 pathways achieved AUC >0.9 and 28 pathways achieved AUC >0.8 with the best inductive CGP algorithms. Conclusion Our results demonstrate that the two CGP methods can assist with the study of functionally uncategorised genomic regions and discovery of bacterial gene-function relationships. 
Our rediscovery experiments also provide a set of standard tasks against which future methods may be compared. PMID:19292914
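
    The idea behind frequency-based prioritisation over phylogenetic profiles can be sketched in Python as follows. This is an assumption-laden toy, not the published statistical CGP method: the score used here is a plain difference in occurrence frequency between two phenotype groups, and all names (frequency_difference_scores, rank_genes, the genome labels) are hypothetical.

```python
from typing import Dict, List, Set


def frequency_difference_scores(
    profiles: Dict[str, Set[str]],
    positive_genomes: List[str],
    negative_genomes: List[str],
) -> Dict[str, float]:
    """Score each gene by how much more often it occurs in the positive group.

    profiles maps a gene name to the set of genomes in which it occurs.
    The score is an assumed stand-in for the paper's statistic.
    """
    scores = {}
    for gene, genomes in profiles.items():
        pos = sum(g in genomes for g in positive_genomes) / len(positive_genomes)
        neg = sum(g in genomes for g in negative_genomes) / len(negative_genomes)
        scores[gene] = pos - neg
    return scores


def rank_genes(scores: Dict[str, float]) -> List[str]:
    """Order genes from highest to lowest score, breaking ties by name."""
    return sorted(scores, key=lambda gene: (-scores[gene], gene))
```

    Genes found in most genomes that show the phenotype and in few that lack it rise to the top of the ranking, which is the intuition the abstract describes; a real analysis would replace the raw difference with a proper statistical test.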

  1. Using Deep RNA Sequencing for the Structural Annotation of the Laccaria Bicolor Mycorrhizal Transcriptome

    PubMed Central

    Larsen, Peter E.; Trivedi, Geetika; Sreedasyam, Avinash; Lu, Vincent; Podila, Gopi K.; Collart, Frank R.

    2010-01-01

    Background Accurate structural annotation is important for prediction of function and required for in vitro approaches to characterize or validate the gene expression products. Despite significant efforts in the field, determination of the gene structure from genomic data alone is a challenging and inaccurate process. The ease of acquisition of transcriptomic sequence provides a direct route to identify expressed sequences and determine the correct gene structure. Methodology We developed methods to utilize RNA-seq data to correct errors in the structural annotation and extend the boundaries of current gene models using assembly approaches. The methods were validated with a transcriptomic data set derived from the fungus Laccaria bicolor, which develops a mycorrhizal symbiotic association with the roots of many tree species. Our analysis focused on the subset of 1501 gene models that are differentially expressed in the free living vs. mycorrhizal transcriptome and are expected to be important elements related to carbon metabolism, membrane permeability and transport, and intracellular signaling. Of the set of 1501 gene models, 1439 (96%) successfully generated modified gene models in which all error flags were successfully resolved and the sequences aligned to the genomic sequence. The remaining 4% (62 gene models) either had deviations from transcriptomic data that could not be spanned or generated sequence that did not align to genomic sequence. The outcome of this process is a set of high confidence gene models that can be reliably used for experimental characterization of protein function. Conclusions 69% of expressed mycorrhizal JGI “best” gene models deviated from the transcript sequence derived by this method. The transcriptomic sequence enabled correction of a majority of the structural inconsistencies and resulted in a set of validated models for 96% of the mycorrhizal genes. 
The method described here can be applied to improve gene structural annotation in other species, provided that there is a sequenced genome and a set of gene models. PMID:20625404

  2. Using deep RNA sequencing for the structural annotation of the laccaria bicolor mycorrhizal transcriptome.

    SciTech Connect

    Larsen, P. E.; Trivedi, G.; Sreedasyam, A.; Lu, V.; Podila, G. K.; Collart, F. R.; Biosciences Division; Univ. of Alabama

    2010-07-06

    Accurate structural annotation is important for prediction of function and required for in vitro approaches to characterize or validate the gene expression products. Despite significant efforts in the field, determination of the gene structure from genomic data alone is a challenging and inaccurate process. The ease of acquisition of transcriptomic sequence provides a direct route to identify expressed sequences and determine the correct gene structure. We developed methods to utilize RNA-seq data to correct errors in the structural annotation and extend the boundaries of current gene models using assembly approaches. The methods were validated with a transcriptomic data set derived from the fungus Laccaria bicolor, which develops a mycorrhizal symbiotic association with the roots of many tree species. Our analysis focused on the subset of 1501 gene models that are differentially expressed in the free living vs. mycorrhizal transcriptome and are expected to be important elements related to carbon metabolism, membrane permeability and transport, and intracellular signaling. Of the set of 1501 gene models, 1439 (96%) successfully generated modified gene models in which all error flags were successfully resolved and the sequences aligned to the genomic sequence. The remaining 4% (62 gene models) either had deviations from transcriptomic data that could not be spanned or generated sequence that did not align to genomic sequence. The outcome of this process is a set of high confidence gene models that can be reliably used for experimental characterization of protein function. 69% of expressed mycorrhizal JGI 'best' gene models deviated from the transcript sequence derived by this method. The transcriptomic sequence enabled correction of a majority of the structural inconsistencies and resulted in a set of validated models for 96% of the mycorrhizal genes. 
The method described here can be applied to improve gene structural annotation in other species, provided that there is a sequenced genome and a set of gene models.

  3. Video annotation tools

    E-print Network

    Chaudhary, Ahmed

    2008-10-10

    general purpose annotation toolkits that can be used to create domain specific applications. A video annotation toolkit along with toolkits for searching, retrieving, analyzing and presenting videos can help achieve the broader goal of creating integrated...

  4. Using the transcriptome to annotate the genome

    Microsoft Academic Search

    Saurabh Saha; Andrew B. Sparks; Carlo Rago; Viatcheslav Akmaev; Clarence J. Wang; Bert Vogelstein; Kenneth W. Kinzler; Victor E. Velculescu

    2002-01-01

    A remaining challenge for the human genome project involves the identification and annotation of expressed genes. The public and private sequencing efforts have identified ~15,000 sequences that meet stringent criteria for genes, such as correspondence with known genes from humans or other species, and have made another ~10,000–20,000 gene predictions of lower confidence, supported by various types of in silico

  5. The 2008 update of the Aspergillus nidulans genome annotation: a community effort.

    PubMed

    Wortman, Jennifer Russo; Gilsenan, Jane Mabey; Joardar, Vinita; Deegan, Jennifer; Clutterbuck, John; Andersen, Mikael R; Archer, David; Bencina, Mojca; Braus, Gerhard; Coutinho, Pedro; von Döhren, Hans; Doonan, John; Driessen, Arnold J M; Durek, Pawel; Espeso, Eduardo; Fekete, Erzsébet; Flipphi, Michel; Estrada, Carlos Garcia; Geysens, Steven; Goldman, Gustavo; de Groot, Piet W J; Hansen, Kim; Harris, Steven D; Heinekamp, Thorsten; Helmstaedt, Kerstin; Henrissat, Bernard; Hofmann, Gerald; Homan, Tim; Horio, Tetsuya; Horiuchi, Hiroyuki; James, Steve; Jones, Meriel; Karaffa, Levente; Karányi, Zsolt; Kato, Masashi; Keller, Nancy; Kelly, Diane E; Kiel, Jan A K W; Kim, Jung-Mi; van der Klei, Ida J; Klis, Frans M; Kovalchuk, Andriy; Krasevec, Nada; Kubicek, Christian P; Liu, Bo; Maccabe, Andrew; Meyer, Vera; Mirabito, Pete; Miskei, Márton; Mos, Magdalena; Mullins, Jonathan; Nelson, David R; Nielsen, Jens; Oakley, Berl R; Osmani, Stephen A; Pakula, Tiina; Paszewski, Andrzej; Paulsen, Ian; Pilsyk, Sebastian; Pócsi, István; Punt, Peter J; Ram, Arthur F J; Ren, Qinghu; Robellet, Xavier; Robson, Geoff; Seiboth, Bernhard; van Solingen, Piet; Specht, Thomas; Sun, Jibin; Taheri-Talesh, Naimeh; Takeshita, Norio; Ussery, Dave; vanKuyk, Patricia A; Visser, Hans; van de Vondervoort, Peter J I; de Vries, Ronald P; Walton, Jonathan; Xiang, Xin; Xiong, Yi; Zeng, An Ping; Brandt, Bernd W; Cornell, Michael J; van den Hondel, Cees A M J J; Visser, Jacob; Oliver, Stephen G; Turner, Geoffrey

    2009-03-01

    The identification and annotation of protein-coding genes is one of the primary goals of whole-genome sequencing projects, and the accuracy of predicting the primary protein products of gene expression is vital to the interpretation of the available data and the design of downstream functional applications. Nevertheless, the comprehensive annotation of eukaryotic genomes remains a considerable challenge. Many genomes submitted to public databases, including those of major model organisms, contain significant numbers of wrong and incomplete gene predictions. We present a community-based reannotation of the Aspergillus nidulans genome with the primary goal of increasing the number and quality of protein functional assignments through the careful review of experts in the field of fungal biology. PMID:19146970

  6. Function Annotation of Hepatic Retinoid X Receptor α Based on Genome-Wide DNA Binding and Transcriptome Profiling

    E-print Network

    Zhan, Qi; Fang, Yaping; He, Yuqi; Liu, Hui-Xin; Fang, Jianwen; Wan, Yu-Jui Yvonne

    2012-11-15

    of RXRα-dependent genes revealed that hepatic RXRα deficiency mainly resulted in up-regulation of steroid and cholesterol biosynthesis-related genes and down-regulation of translation- as well as anti-apoptosis-related genes. Furthermore, RXRα bound... and their downstream signaling is altered in a variety of diseases including breast cancer [3] and viral hepatitis [4]. Correspondingly, RXR agonists are implicated in cancer prevention, antiviral therapy, dermatological disease, and metabolic syndromes [2...

  7. Teachers Reference: Annotations

    NSDL National Science Digital Library

    This collection of 171 annotations was written to enhance and explain the text of the book 'Stone Wall Secrets'. Each annotation consists of a number that refers specifically to the phrase preceding it. Each annotation number is followed by three indexing elements: subject category, one or more keywords, and one or more sample questions with answers.

  8. Exploration of Essential Gene Functions via Titratable Promoter Alleles

    Microsoft Academic Search

    Sanie Mnaimneh; Armaity P Davierwala; Jennifer Haynes; Jason Moffat; Wen-Tao Peng; Wen Zhang; Xueqi Yang; Jeff Pootoolal; Gordon Chua; Andres Lopez; Miles Trochesset; Darcy Morse; Nevan J Krogan; Shawna L Hiley; Zhijian Li; Quaid Morris; Jörg Grigull; Nicholas Mitsakakis; Christopher J Roberts; Jack F Greenblatt; Charles Boone; Chris A Kaiser; Brenda J Andrews; Timothy R Hughes

    2004-01-01

    Nearly 20% of yeast genes are required for viability, hindering genetic analysis with knockouts. We created promoter-shutoff strains for over two-thirds of all essential yeast genes and subjected them to morphological analysis, size profiling, drug sensitivity screening, and microarray expression profiling. We then used this compendium of data to ask which phenotypic features characterized different functional classes and used these

  9. Complexity of gene circuits, Pfaffian functions and morphogenesis problem

    E-print Network

    Grigoriev, Dima

    with special circuits of the neural type playing a key role in biology [3, 4, 5]. These circuits are dynamical... Complexity of gene circuits, Pfaffian functions and morphogenesis problem Sergey VAKULENKO 1 , Dmitry Rennes, Beaulieu, 35042, Rennes, France Abstract. We consider a model of gene circuits. We show

  10. Genotype and Gene Expression Associations with Immune Function in Drosophila

    E-print Network

    Nachman, Michael

    in genes near the top of the immune system signaling cascade can have a disproportionate effect response to combat pathogens. Unlike vertebrates, the insect immune response consists solely of an innate... Genotype and Gene Expression Associations with Immune Function in Drosophila Timothy B. Sackton1

  11. Comparative validation of the D. melanogaster modENCODE transcriptome annotation

    E-print Network

    Kellis, Manolis

    Accurate gene model annotation of reference genomes is critical for making them useful. The modENCODE project has improved the D. melanogaster genome annotation by using deep and diverse high-throughput data. Since ...

  12. Automated pipeline for atlas-based annotation of gene expression patterns: application to postnatal day 7 mouse brain

    SciTech Connect

    Carson, James P.; Ju, Tao; Bello, Musodiq; Thaller, Christina; Warren, Joe; Kakadiaris, Ioannis; Chiu, Wah; Eichele, Gregor

    2010-02-01

    Abstract As bio-medical images and volumes are being collected at an increasing speed, there is a growing demand for efficient means to organize spatial information for comparative analysis. In many scenarios, such as determining gene expression patterns by in situ hybridization, the images are collected from multiple subjects over a common anatomical region, such as the brain. A fundamental challenge in comparing spatial data from different images is how to account for the shape variations among subjects, which makes direct image-to-image comparison meaningless. In this paper, we describe subdivision meshes as a geometric means to efficiently organize 2D images and 3D volumes collected from different subjects for comparison. The key advantages of a subdivision mesh for this purpose are its light-weight geometric structure and its explicit modeling of anatomical boundaries, which enable efficient and accurate registration. The multi-resolution structure of a subdivision mesh also allows development of fast comparison algorithms among registered images and volumes.

  13. Correction of the Caulobacter crescentus NA1000 Genome Annotation

    PubMed Central

    Ely, Bert; Scott, LaTia Etheredge

    2014-01-01

    Bacterial genome annotations are accumulating rapidly in the GenBank database and the use of automated annotation technologies to create these annotations has become the norm. However, these automated methods commonly result in a small, but significant percentage of genome annotation errors. To improve accuracy and reliability, we analyzed the Caulobacter crescentus NA1000 genome utilizing computer programs Artemis and MICheck to manually examine the third codon position GC content, alignment to a third codon position GC frame plot peak, and matches in the GenBank database. We identified 11 new genes, modified the start site of 113 genes, and changed the reading frame of 38 genes that had been incorrectly annotated. Furthermore, our manual method of identifying protein-coding genes allowed us to remove 112 non-coding regions that had been designated as coding regions. The improved NA1000 genome annotation resulted in a reduction in the use of rare codons since noncoding regions with atypical codon usage were removed from the annotation and 49 new coding regions were added to the annotation. Thus, a more accurate codon usage table was generated as well. These results demonstrate that a comparison of the location of peaks in third codon position GC content to the location of protein coding regions could be used to verify the annotation of any genome that has a GC content that is greater than 60%. PMID:24621776
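
    The third codon position GC content (GC3) signal mentioned in this record can be illustrated with a minimal sketch. This is not the Artemis or MICheck implementation; the function names `gc3_content` and `gc3_by_frame` are hypothetical, and the sketch only computes the raw statistic for each forward reading frame, assuming an uppercase or lowercase DNA string with no ambiguity codes.

```python
from typing import List


def gc3_content(cds: str) -> float:
    """Fraction of G or C bases at the third position of each complete codon."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3      # ignore a trailing partial codon
    third = seq[2:usable:3]               # bases at codon positions 3, 6, 9, ...
    if not third:
        raise ValueError("sequence is shorter than one codon")
    return sum(base in "GC" for base in third) / len(third)


def gc3_by_frame(dna: str) -> List[float]:
    """GC3 for each of the three forward reading frames (offsets 0, 1, 2)."""
    return [gc3_content(dna[offset:]) for offset in range(3)]
```

    In a high-GC genome, the frame whose GC3 value stands out over a candidate region would be expected to match the annotated coding frame; a mismatch is the kind of discrepancy the record describes reviewing by hand.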

  14. Functional analysis of putative restriction-modification system genes in the Helicobacter pylori J99 genome.

    PubMed

    Kong, H; Lin, L F; Porter, N; Stickel, S; Byrd, D; Posfai, J; Roberts, R J

    2000-09-01

    Helicobacter pylori is a gram-negative bacterium, which colonizes the gastric mucosa of humans and is implicated in a wide range of gastroduodenal diseases. The genomic sequences of two H. pylori strains, 26695 and J99, have been published recently. About two dozen potential restriction-modification (R-M) systems have been annotated in both genomes, which is far above the average number of R-M systems in other sequenced genomes. Here we describe a functional analysis of the 16 putative Type II R-M systems in the H. pylori J99 genome. To express potentially toxic endonuclease genes, a unique vector was constructed, which features repression and antisense transcription as dual control elements. To determine the methylation activities of putative DNA methyltransferases, we developed polyclonal antibodies able to detect DNA containing N6-methyladenine or N4-methylcytosine. We found that <30% of the potential Type II R-M systems in H. pylori J99 strain were fully functional, displaying both endonuclease and methyltransferase activities. Helicobacter pylori may maintain a variety of functional R-M systems, which are believed to be a primitive bacterial 'immune' system, by alternatively turning on/off a subset of numerous R-M systems. PMID:10954588

  15. Functions of the gene products of Escherichia coli.

    PubMed Central

    Riley, M

    1993-01-01

    A list of currently identified gene products of Escherichia coli is given, together with a bibliography that provides pointers to the literature on each gene product. A scheme to categorize cellular functions is used to classify the gene products of E. coli so far identified. A count shows that the numbers of genes concerned with small-molecule metabolism are on the same order as the numbers concerned with macromolecule biosynthesis and degradation. One large category is the category of tRNAs and their synthetases. Another is the category of transport elements. The categories of cell structure and cellular processes other than metabolism are smaller. Other subjects discussed are the occurrence in the E. coli genome of redundant pairs and groups of genes of identical or closely similar function, as well as variation in the degree of density of genetic information in different parts of the genome. PMID:7508076

  16. OxyGene: an innovative platform for investigating oxidative-response genes in whole prokaryotic genomes

    PubMed Central

    Thybert, David; Avner, Stéphane; Lucchetti-Miganeh, Céline; Chéron, Angélique; Barloy-Hubler, Frédérique

    2008-01-01

    Background Oxidative stress is a common stress encountered by living organisms and is due to an imbalance between intracellular reactive oxygen and nitrogen species (ROS, RNS) and cellular antioxidant defence. To defend themselves against ROS/RNS, bacteria possess a subsystem of detoxification enzymes, which are classified with regard to their substrates. To identify such enzymes in prokaryotic genomes, different approaches based on similarity, enzyme profiles or patterns exist. Unfortunately, several problems persist in the annotation, classification and naming of these enzymes due mainly to some erroneous entries in databases, mistake propagation, absence of updating and disparity in function description. Description In order to improve the current annotation of oxidative stress subsystems, an innovative platform named OxyGene has been developed. It integrates an original database called OxyDB, holding thoroughly tested anchor-based signatures associated to subfamilies of oxidative stress enzymes, and a new anchor-driven annotator, for ab initio detection of ROS/RNS response genes. All complete Bacterial and Archaeal genomes have been re-annotated, and the results stored in the OxyGene repository can be interrogated via a Graphical User Interface. Conclusion OxyGene enables the exploration and comparative analysis of enzymes belonging to 37 detoxification subclasses in 664 microbial genomes. It proposes a new classification that improves both the ontology and the annotation of the detoxification subsystems in prokaryotic whole genomes, while discovering new ORFs and attributing precise function to hypothetical annotated proteins. OxyGene is freely available at: PMID:19117520

  17. Blast2GO: A Comprehensive Suite for Functional Analysis in Plant Genomics

    PubMed Central

    Conesa, Ana; Götz, Stefan

    2008-01-01

    Functional annotation of novel sequence data is a primary requirement for the utilization of functional genomics approaches in plant research. In this paper, we describe the Blast2GO suite as a comprehensive bioinformatics tool for functional annotation of sequences and data mining on the resulting annotations, primarily based on the gene ontology (GO) vocabulary. Blast2GO optimizes function transfer from homologous sequences through an elaborate algorithm that considers similarity, the extension of the homology, the database of choice, the GO hierarchy, and the quality of the original annotations. The tool includes numerous functions for the visualization, management, and statistical analysis of annotation results, including gene set enrichment analysis. The application supports InterPro, enzyme codes, KEGG pathways, GO directed acyclic graphs (DAGs), and GOSlim. Blast2GO is a suitable tool for plant genomics research because of its versatility, easy installation, and friendly use. PMID:18483572

  18. Fine-scale mergers of chloroplast and mitochondrial genes create functional, transcompartmentally chimeric mitochondrial genes

    PubMed Central

    Hao, Weilong; Palmer, Jeffrey D.

    2009-01-01

    The mitochondrial genomes of flowering plants possess a promiscuous proclivity for taking up sequences from the chloroplast genome. All characterized chloroplast integrants exist apart from native mitochondrial genes, and only a few, involving chloroplast tRNA genes that have functionally supplanted their mitochondrial counterparts, appear to be of functional consequence. We developed a novel computational approach to search for homologous recombination (gene conversion) in a large number of sequences and applied it to 22 mitochondrial and chloroplast gene pairs, which last shared common ancestry some 2 billion years ago. We found evidence of recurrent conversion of short patches of mitochondrial genes by chloroplast homologs during angiosperm evolution, but no evidence of gene conversion in the opposite direction. All 9 putative conversion events involve the atp1/atpA gene encoding the alpha subunit of ATP synthase, which is unusually well conserved between the 2 organelles and the only shared gene that is widely sequenced across plant mitochondria. Moreover, all conversions were limited to the 2 regions of greatest nucleotide and amino acid conservation of atp1/atpA. These observations probably reflect constraints operating on both the occurrence and fixation of recombination between ancient homologs. These findings indicate that recombination between anciently related sequences is more frequent than previously appreciated and creates functional mitochondrial genes of chimeric origin. These results also have implications for the widespread use of mitochondrial atp1 in phylogeny reconstruction. PMID:19805364

  19. Rapid Determination of Gene Function by Virus-induced Gene Silencing in Wheat and Barley

    Technology Transfer Automated Retrieval System (TEKTRAN)

    The cereal crops are essential components to the human and animal food supply. Solutions to many of the problems challenging cereal production will require identification of genes responsible for particular traits. Unfortunately, the process of identifying gene function is very slow and complex in...

  20. Differential Selection on Carotenoid Biosynthesis Genes as a Function of Gene Position in the Metabolic Pathway

    E-print Network

    Paris-Sud XI, Université de

    Differential Selection on Carotenoid Biosynthesis Genes as a Function of Gene Position in controlling metabolic fluxes. This hypothesis was tested in the carotenoid biosynthesis pathway using distributed along the carotenoid biosynthesis pathway, IPI, PDS, CRTISO, LCYB, LCYE, CHXE and ZEP, were

  1. Evolutionary Interrogation of Human Biology in Well-Annotated Genomic Framework of Rhesus Macaque

    PubMed Central

    Zhang, Shi-Jian; Liu, Chu-Jun; Yu, Peng; Zhong, Xiaoming; Chen, Jia-Yu; Yang, Xinzhuang; Peng, Jiguang; Yan, Shouyu; Wang, Chenqu; Zhu, Xiaotong; Xiong, Jingwei; Zhang, Yong E.; Tan, Bertrand Chin-Ming; Li, Chuan-Yun

    2014-01-01

    With genome sequence and composition highly analogous to human, rhesus macaque represents a unique reference for evolutionary studies of human biology. Here, we developed a comprehensive genomic framework of rhesus macaque, the RhesusBase2, for evolutionary interrogation of human genes and the associated regulations. A total of 1,667 next-generation sequencing (NGS) data sets were processed, integrated, and evaluated, generating 51.2 million new functional annotation records. With extensive NGS annotations, RhesusBase2 refined the fine-scale structures in 30% of the macaque Ensembl transcripts, reporting an accurate, up-to-date set of macaque gene models. On the basis of these annotations and accurate macaque gene models, we further developed an NGS-oriented Molecular Evolution Gateway to access and visualize macaque annotations in reference to human orthologous genes and associated regulations (www.rhesusbase.org/molEvo). We highlighted the application of this well-annotated genomic framework in generating hypothetical link of human-biased regulations to human-specific traits, by using mechanistic characterization of the DIEXF gene as an example that provides novel clues to the understanding of digestive system reduction in human evolution. On a global scale, we also identified a catalog of 9,295 human-biased regulatory events, which may represent novel elements that have a substantial impact on shaping human transcriptome and possibly underpin recent human phenotypic evolution. Taken together, we provide an NGS data-driven, information-rich framework that will broadly benefit genomics research in general and serves as an important resource for in-depth evolutionary studies of human biology. PMID:24577841

  2. Duplication and relocation of the functional DPY19L2 gene within low copy repeats

    Microsoft Academic Search

    Andrew R Carson; Joseph Cheung; Stephen W Scherer

    2006-01-01

    BACKGROUND: Low copy repeats (LCRs) are thought to play an important role in recent gene evolution, especially when they facilitate gene duplications. Duplicate genes are fundamental to adaptive evolution, providing substrates for the development of new or shared gene functions. Moreover, silencing of duplicate genes can have an indirect effect on adaptive evolution by causing genomic relocation of functional genes.

  3. Complex genomic rearrangements lead to novel primate gene function

    PubMed Central

    Ciccarelli, Francesca D.; von Mering, Christian; Suyama, Mikita; Harrington, Eoghan D.; Izaurralde, Elisa; Bork, Peer

    2005-01-01

    Orthologous genes that maintain a single-copy status in a broad range of species may indicate a selection against gene duplication. If this is the case, then duplicates of such genes that do survive may have escaped the dosage control by rapid and sizable changes in their function. To test this hypothesis and to develop a strategy for the identification of novel gene functions, we have analyzed 22 primate-specific intrachromosomal duplications of genes with a single-copy ortholog in all other completely sequenced metazoans. When comparing this set to genes not exposed to the single-copy status constraint, we observed a higher tendency of the former to modify their gene structure, often through complex genomic rearrangements. The analysis of the most dramatic of these duplications, affecting ~10% of human Chromosome 2, enabled a detailed reconstruction of the events leading to the appearance of a novel gene family. The eight members of this family originated from the highly conserved nucleoporin RanBP2 by several genetic rearrangements such as segmental duplications, inversions, translocations, exon loss, and domain accretion. We have experimentally verified that at least one of the newly formed proteins has a cellular localization different from RanBP2's, and we show that positive selection did act on specific domains during evolution. PMID:15710750

  4. Molecular and Functional Characterization of Broccoli EMBRYONIC FLOWER 2 Genes

    PubMed Central

    Chen, Long-Fang O.; Lin, Chun-Hung; Lai, Ying-Mi; Huang, Jia-Yuan; Sung, Zinmay Renee

    2012-01-01

    Polycomb group (PcG) proteins regulate major developmental processes in Arabidopsis. EMBRYONIC FLOWER 2 (EMF2), the VEFS domain-containing PcG gene, regulates diverse genetic pathways and is required for vegetative development and plant survival. Despite widespread EMF2-like sequences in plants, little is known about their function other than in Arabidopsis and rice. To study the role of EMF2 in broccoli (Brassica oleracea var. italica cv. Elegance) development, we identified two broccoli EMF2 (BoEMF2) genes with sequence homology to and a similar gene expression pattern to that in Arabidopsis (AtEMF2). Reducing their expression in broccoli resulted in aberrant phenotypes and gene expression patterns. BoEMF2 regulates genes involved in diverse developmental and stress programs similar to AtEMF2 in Arabidopsis. However, BoEMF2 differs from AtEMF2 in the regulation of flower organ identity, cell proliferation and elongation, and death-related genes, which may explain the distinct phenotypes. The expression of BoEMF2.1 in the Arabidopsis emf2 mutant (Rescued emf2) partially rescued the mutant phenotype and restored the gene expression pattern to that of the wild type. Many EMF2-mediated molecular and developmental functions are conserved in broccoli and Arabidopsis. Furthermore, the restored gene expression pattern in Rescued emf2 provides insights into the molecular basis of PcG-mediated growth and development. PMID:22537758

  5. Functional gene diversity of oolitic sands from Great Bahama Bank.

    PubMed

    Diaz, M R; Van Norstrand, J D; Eberli, G P; Piggot, A M; Zhou, J; Klaus, J S

    2014-05-01

    Despite the importance of oolitic depositional systems as indicators of climate and reservoirs of inorganic C, little is known about the microbial functional diversity, structure, composition, and potential metabolic processes leading to precipitation of carbonates. To fill this gap, we assess the metabolic gene carriage and extracellular polymeric substance (EPS) development in microbial communities associated with oolitic carbonate sediments from the Bahamas Archipelago. Oolitic sediments ranging from high-energy 'active' to lower energy 'non-active' and 'microbially stabilized' environments were examined as they represent contrasting depositional settings, mostly influenced by tidal flows and wave-generated currents. Functional gene analysis, which employed a microarray-based gene technology, detected a total of 12,432 of 95,847 distinct gene probes, including a large number of metabolic processes previously linked to mineral precipitation. Among these, genes encoding enzymes for denitrification, sulfate reduction, ammonification, and oxygenic/anoxygenic photosynthesis were abundant. In addition, a broad diversity of genes was related to organic carbon degradation and N2 fixation, implying these communities have metabolic plasticity that enables survival under oligotrophic conditions. Differences in functional genes were detected among the environments, with higher diversity associated with non-active and microbially stabilized environments in comparison with the active environment. EPS showed a gradient increase from active to microbially stabilized communities, and when combined with functional gene analysis, which revealed genes encoding EPS-degrading enzymes (chitinases, glucoamylase, amylases), supports a putative role of EPS-mediated microbial calcium carbonate precipitation. We propose that carbonate precipitation in marine oolitic biofilms is spatially and temporally controlled by a complex consortium of microbes with diverse physiologies, including photosynthesizers, heterotrophs, denitrifiers, sulfate reducers, and ammonifiers. PMID:24612324

  6. CodeAnnotator: digital ink annotation within Eclipse

    Microsoft Academic Search

    Xiaofan Chen; Beryl Plimmer

    2007-01-01

    Programming environments do not support ink annotation. Yet, annotation is the most effective way to actively read and review a document. This paper describes a tool, CodeAnnotator, which integrates annotation support inside an Integrated Development Environment (IDE). This tool is designed and developed to support direct annotation of program code with digital ink in the IDE. Programmers will benefit from

  7. Structure based annotation of Helicobacter pylori strain 26695 proteome.

    PubMed

    Singh, Swati; Guttula, Praveen Kumar; Guruprasad, Lalitha

    2014-01-01

    The availability of complete genome sequences of H. pylori 26695 has provided a wealth of information enabling us to carry out in silico studies to identify new molecular targets for pharmaceutical treatment. In order to construe the structural and functional information of complete proteome, use of computational methods is more relevant since these methods are reliable and provide a solution to the time consuming and expensive experimental methods. Out of 1590 predicted protein coding genes in H. pylori, experimentally determined structures are available for only 145 proteins in the PDB. In the absence of experimental structures, computational studies on the three dimensional (3D) structural organization would help in deciphering the protein fold, structure and active site. Functional annotation of each protein was carried out based on structural fold and binding site based ligand association. Most of these proteins are uncharacterized in this proteome and through our annotation pipeline we were able to annotate most of them. We could assign structural folds to 464 uncharacterized proteins from an initial list of 557 sequences. Of the 1195 known structural folds present in the SCOP database, 411 (34% of all known folds) are observed in the whole H. pylori 26695 proteome, with greater inclination for domains belonging to α/β class (36.63%). Top folds include P-loop containing nucleoside triphosphate hydrolases (22.6%), TIM barrel (16.7%), transmembrane helix hairpin (16.05%), alpha-alpha superhelix (11.1%) and S-adenosyl-L-methionine-dependent methyltransferases (10.7%). PMID:25549250

  8. Functional Characterization of the dRYBP Gene in Drosophila

    PubMed Central

    González, Inma; Aparicio, Ricardo; Busturia, Ana

    2008-01-01

    The Drosophila dRYBP gene has been described to function as a Polycomb-dependent transcriptional repressor. To determine the in vivo function of the dRYBP gene, we have generated mutations and analyzed the associated phenotypes. Homozygous null mutants die progressively throughout development and present phenotypes variable both in their penetrance and in their expressivity, including disrupted oogenesis, a disorganized pattern of the syncytial nuclear divisions, defects in pattern formation, and decreased wing size. Although dRYBP mutations do not show the homeotic-like phenotypes typical of mutations in the PcG and trxG genes, they enhance the phenotypes of mutations of either the Sex comb extra gene (PcG) or the trithorax gene (trxG). Finally, the dRYBP protein interacts physically with the Sex comb extra and the Pleiohomeotic proteins, and the homeotic-like phenotypes produced by the high levels of the dRYBP protein are mediated through its C-terminal domain. Our results indicate that the dRYBP gene functions in the control of cell identity together with the PcG/trxG proteins. Furthermore, they also indicate that dRYBP participates in the control of cell proliferation and cell differentiation and we propose that its functional requirement may well depend on the robustness of the animal. PMID:18562658

  9. Convergence in pigmentation at multiple levels: mutations, genes and function

    PubMed Central

    Manceau, Marie; Domingues, Vera S.; Linnen, Catherine R.; Rosenblum, Erica Bree; Hoekstra, Hopi E.

    2010-01-01

    Convergence—the independent evolution of the same trait by two or more taxa—has long been of interest to evolutionary biologists, but only recently has the molecular basis of phenotypic convergence been identified. Here, we highlight studies of rapid evolution of cryptic coloration in vertebrates to demonstrate that phenotypic convergence can occur at multiple levels: mutations, genes and gene function. We first show that different genes can be responsible for convergent phenotypes even among closely related populations, for example, in the pale beach mice inhabiting Florida's Gulf and Atlantic coasts. By contrast, the exact same mutation can create similar phenotypes in distantly related species such as mice and mammoths. Next, we show that different mutations in the same gene need not be functionally equivalent to produce similar phenotypes. For example, separate mutations produce divergent protein function but convergent pale coloration in two lizard species. Similarly, mutations that alter the expression of a gene in different ways can, nevertheless, result in similar phenotypes, as demonstrated by sister species of deer mice. Together these studies underscore the importance of identifying not only the genes, but also the precise mutations and their effects on protein function, that contribute to adaptation and highlight how convergence can occur at different genetic levels. PMID:20643733

  10. Paralogue annotation identifies novel pathogenic variants in patients with Brugada syndrome and catecholaminergic polymorphic ventricular tachycardia

    PubMed Central

    Walsh, Roddy; Peters, Nicholas S; Cook, Stuart A; Ware, James S

    2014-01-01

    Background Distinguishing genetic variants that cause disease from variants that are rare but benign is one of the principal challenges in contemporary clinical genetics, particularly as variants are identified at a pace exceeding the capacity of researchers to characterise them functionally. Methods We previously developed a novel method, called paralogue annotation, which accurately and specifically identifies disease-causing missense variants by transferring disease-causing annotations across families of related proteins. Here we refine our approach, and apply it to novel variants found in 2266 patients across two large cohorts with inherited sudden death syndromes, namely catecholaminergic polymorphic ventricular tachycardia (CPVT) or Brugada syndrome (BrS). Results Over one third of the novel non-synonymous variants found in these studies, which would otherwise be reported in a clinical diagnostics setting as ‘variants of unknown significance’, are categorised by our method as likely disease causing (positive predictive value 98.7%). This identified more than 500 new disease loci for BrS and CPVT. Conclusions Our methodology is widely transferable across all human disease genes, with an estimated 150,000 potentially informative annotations in more than 1800 genes. We have developed a web resource that allows researchers and clinicians to annotate variants found in individuals with inherited arrhythmias, comprising a referenced compendium of known missense variants in these genes together with a user-friendly implementation of our approach. This tool will facilitate the interpretation of many novel variants that might otherwise remain unclassified. PMID:24136861

  11. Additive functions in boolean models of gene regulatory network modules.

    PubMed

    Darabos, Christian; Di Cunto, Ferdinando; Tomassini, Marco; Moore, Jason H; Provero, Paolo; Giacobini, Mario

    2011-01-01

    Gene-on-gene regulations are key components of every living organism. Dynamical abstract models of genetic regulatory networks help explain the genome's evolvability and robustness. These properties can be attributed to the structural topology of the graph formed by genes, as vertices, and regulatory interactions, as edges. Moreover, the actual gene interaction of each gene is believed to play a key role in the stability of the structure. With advances in biology, some effort was deployed to develop update functions in boolean models that include recent knowledge. We combine real-life gene interaction networks with novel update functions in a boolean model. We use two sub-networks of biological organisms, the yeast cell-cycle and the mouse embryonic stem cell, as topological support for our system. On these structures, we substitute the original random update functions by a novel threshold-based dynamic function in which the promoting and repressing effect of each interaction is considered. We use a third real-life regulatory network, along with its inferred boolean update functions, to validate the proposed update function. Results of this validation hint at increased biological plausibility of the threshold-based function. To investigate the dynamical behavior of this new model, we visualized the phase transition between order and chaos into the critical regime using Derrida plots. We complement the qualitative nature of Derrida plots with an alternative measure, the criticality distance, that also allows discrimination between regimes in a quantitative way. Simulations on both real-life genetic regulatory networks show that there exists a set of parameters that allows the systems to operate in the critical region. This new model includes experimentally derived biological information and recent discoveries, which makes it potentially useful to guide experimental research. The update function confers additional realism to the model, while reducing the complexity and solution space, thus making it easier to investigate. PMID:22132067
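
    A threshold-based Boolean update of the kind this abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation. The abstract does not state the exact rule, so the synchronous update scheme, the tie rule (a gene keeps its state when activating and repressing inputs cancel), and the three-gene network `EDGES` are all assumptions made for the example. The `hamming` helper shows the distance between two states, the quantity a Derrida plot tracks from one time step to the next.

```python
def threshold_update(state, edges):
    """Apply one synchronous threshold update to a Boolean network.

    state: dict mapping gene name -> 0 or 1
    edges: dict mapping target gene -> list of (source gene, sign),
           where sign is +1 for a promoting and -1 for a repressing interaction
    """
    new_state = {}
    for gene, value in state.items():
        # Sum the signed contributions of all currently active regulators.
        total = sum(sign * state[src] for src, sign in edges.get(gene, []))
        if total > 0:
            new_state[gene] = 1
        elif total < 0:
            new_state[gene] = 0
        else:
            # Assumed tie rule: inputs cancel, so the gene keeps its state.
            new_state[gene] = value
    return new_state


def hamming(state_a, state_b):
    """Number of genes whose values differ between two states."""
    return sum(state_a[g] != state_b[g] for g in state_a)


# Hypothetical network: A promotes B, B promotes C, C represses A.
EDGES = {
    "A": [("C", -1)],
    "B": [("A", +1)],
    "C": [("B", +1)],
}

if __name__ == "__main__":
    state = {"A": 1, "B": 0, "C": 0}
    for step in range(5):
        print(step, state)
        state = threshold_update(state, EDGES)
```

    With this toy network the trajectory settles into a fixed point after a few steps. On a real regulatory network the same function would be applied to every gene, and the Hamming distance between two initially close states would be followed over time to distinguish ordered from chaotic behavior.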

  12. Stochastic gene expression modeling with Hill function for switch-like gene responses.

    PubMed

    Kim, Haseong; Gelenbe, Erol

    2012-01-01

    Gene expression models play a key role in understanding the mechanisms of gene regulation, which include graded and switch-like responses. Though many stochastic approaches attempt to explain gene expression mechanisms, the Gillespie algorithm, which is commonly used to simulate stochastic models, requires an additional gene cascade to explain the switch-like behaviors of gene responses. In this study, we propose a stochastic gene expression model describing the switch-like behaviors of a gene by incorporating Hill functions into the conventional Gillespie algorithm. We assume eight processes of gene expression, and their biologically appropriate reaction rates are estimated from the published literature. We observed that the state of the toggle switch model rarely changes, since the Hill function prevents the activation of the involved proteins when their concentrations stay below a criterion. In the ScbA-ScbR system, which can control the antibiotic metabolite production of microorganisms, our modified Gillespie algorithm successfully describes the switch-like behaviors of gene responses and oscillatory expressions, which are consistent with the published experimental study. PMID:22144531
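
    The idea of placing a Hill function inside a Gillespie simulation can be sketched as follows. This is a simplified illustration under stated assumptions, not the eight-process model from the paper. It uses only two reactions (Hill-regulated production and first-order degradation of one protein), and the parameter names and values (`k_max`, `k_half`, `n`, `k_deg`) are hypothetical.

```python
import random


def hill(x, k_half, n):
    """Activating Hill function x^n / (k_half^n + x^n), with values in [0, 1)."""
    return x**n / (k_half**n + x**n)


def gillespie_hill(activator, t_end, k_max=10.0, k_half=20.0, n=4,
                   k_deg=0.1, seed=0):
    """Simulate protein copy number with a Hill-shaped production propensity.

    activator: fixed activator level that drives production (assumed constant)
    Returns a list of (time, protein_count) pairs.
    """
    rng = random.Random(seed)
    t, protein = 0.0, 0
    trajectory = [(t, protein)]
    while t < t_end:
        # Propensities: production is switched on by the Hill term,
        # degradation is proportional to the current copy number.
        a_prod = k_max * hill(activator, k_half, n)
        a_deg = k_deg * protein
        a_total = a_prod + a_deg
        if a_total == 0:
            break  # no reaction can fire
        # The waiting time to the next reaction is exponentially distributed.
        t += rng.expovariate(a_total)
        # Choose which reaction fires, in proportion to its propensity.
        if protein == 0 or rng.random() * a_total < a_prod:
            protein += 1
        else:
            protein -= 1
        trajectory.append((t, protein))
    return trajectory


if __name__ == "__main__":
    for level in (5.0, 20.0, 80.0):
        final_count = gillespie_hill(level, t_end=200.0)[-1][1]
        print(f"activator={level:5.1f}  final protein count={final_count}")
```

    Because the Hill term is close to zero when the activator is well below `k_half` and close to one when it is well above, the production propensity changes sharply around the threshold. That sharp change is the switch-like response the abstract refers to.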

  13. Cross-Ontological Analytics: Combining Associative and Hierarchical Relations in the Gene Ontologies to Assess Gene Product Similarity

    SciTech Connect

    Posse, Christian; Sanfilippo, Antonio P.; Gopalan, Banu; Riensche, Roderick M.; Beagley, Nathaniel; Baddeley, Bob L.

    2006-05-28

    Gene and gene product similarity is a fundamental diagnostic measure in analyzing biological data and constructing predictive models for functional genomics. With the rising influence of the gene ontologies, two complementary approaches have emerged where the similarity between two genes/gene products is obtained by comparing gene ontology (GO) annotations associated with the gene/gene products. One approach captures GO-based similarity in terms of hierarchical relations within each gene ontology. The other approach identifies GO-based similarity in terms of associative relations across the three gene ontologies. We propose a novel methodology where the two approaches can be merged with ensuing benefits in coverage and accuracy.

  14. Drosha Regulates Gene Expression Independently of RNA Cleavage Function

    PubMed Central

    Gromak, Natalia; Dienstbier, Martin; Macias, Sara; Plass, Mireya; Eyras, Eduardo; Cáceres, Javier F.; Proudfoot, Nicholas J.

    2013-01-01

    Summary Drosha is the main RNase III-like enzyme involved in the process of microRNA (miRNA) biogenesis in the nucleus. Using whole-genome ChIP-on-chip analysis, we demonstrate that, in addition to miRNA sequences, Drosha specifically binds promoter-proximal regions of many human genes in a transcription-dependent manner. This binding is not associated with miRNA production or RNA cleavage. Drosha knockdown in HeLa cells downregulated nascent gene transcription, resulting in a reduction of polyadenylated mRNA produced from these gene regions. Furthermore, we show that this function of Drosha is dependent on its N-terminal protein-interaction domain, which associates with the RNA-binding protein CBP80 and RNA Polymerase II. Consequently, we uncover a previously unsuspected RNA cleavage-independent function of Drosha in the regulation of human gene expression. PMID:24360955

  15. PROTEINS: Structure, Function, and Bioinformatics. PFP: Automated prediction of gene ontology

    E-print Network

    Kihara, Daisuke

    PROTEINS: Structure, Function, and Bioinformatics. PFP: Automated prediction of gene ontology functional ... introduced PFP (Protein Function Prediction) as our sequence-based predictor of Gene Ontology (GO) functional

  16. A first insight into the occurrence and expression of functional amoA and accA genes of autotrophic and ammonia-oxidizing bathypelagic Crenarchaeota of Tyrrhenian Sea

    Microsoft Academic Search

    Michail M. Yakimov; Violetta La Cono; Renata Denaro

    2009-01-01

    The autotrophic and ammonia-oxidizing crenarchaeal assemblage at offshore site located in the deep Mediterranean (Tyrrhenian Sea, depth 3000m) water was studied by PCR amplification of the key functional genes involved in energy (ammonia mono-oxygenase alpha subunit, amoA) and central metabolism (acetyl-CoA carboxylase alpha subunit, accA). Using two recently annotated genomes of marine crenarchaeons, an initial set of primers targeting archaeal

  17. Dynamic nature of a wheat centromere with a functional gene

    Microsoft Academic Search

    Jasdeep S. Mutti; Devinder Sandhu; Deepak Sidhu; Kulvinder S. Gill

    2010-01-01

    Centromeric regions of higher eukaryotes are comprised mainly of tandem and non-tandem repeat sequences with variable copy number, spacing, order and orientation; are heterochromatic in nature, and are believed to be devoid of actively transcribing genes. Here, we report an actively transcribing wheat homolog of HSP70 gene that maps in the functional wheat centromere, and copy number of which seems

  18. Functions of rol genes in plant secondary metabolism

    Microsoft Academic Search

    Victor P. Bulgakov

    2008-01-01

    For a long time, the Agrobacterium rhizogenes rolA, rolB and rolC oncogenes have been considered to be modulators of plant growth and cell differentiation. A new function of the rol genes in plant–Agrobacterium interaction became apparent with the discovery that these genes are potential activators of secondary metabolism in transformed cells from the Solanaceae, Araliaceae, Rubiaceae, Vitaceae and Rosaceae families.

  19. Omics data management and annotation.

    PubMed

    Harel, Arye; Dalah, Irina; Pietrokovski, Shmuel; Safran, Marilyn; Lancet, Doron

    2011-01-01

    Technological Omics breakthroughs, including next generation sequencing, bring avalanches of data which need to undergo effective data management to ensure integrity, security, and maximal knowledge-gleaning. Data management system requirements include flexible input formats, diverse data entry mechanisms and views, user friendliness, attention to standards, hardware and software platform definition, as well as robustness. Relevant solutions elaborated by the scientific community include Laboratory Information Management Systems (LIMS) and standardization protocols facilitating data sharing and managing. In project planning, special consideration has to be made when choosing relevant Omics annotation sources, since many of them overlap and require sophisticated integration heuristics. The data modeling step defines and categorizes the data into objects (e.g., genes, articles, disorders) and creates an application flow. A data storage/warehouse mechanism must be selected, such as file-based systems and relational databases, the latter typically used for larger projects. Omics project life cycle considerations must include the definition and deployment of new versions, incorporating either full or partial updates. Finally, quality assurance (QA) procedures must validate data and feature integrity, as well as system performance expectations. We illustrate these data management principles with examples from the life cycle of the GeneCards Omics project (http://www.genecards.org), a comprehensive, widely used compendium of annotative information about human genes. For example, the GeneCards infrastructure has recently been changed from text files to a relational database, enabling better organization and views of the growing data. Omics data handling benefits from the wealth of Web-based information, the vast amount of public domain software, increasingly affordable hardware, and effective use of data management and annotation principles as outlined in this chapter. 
    PMID:21370079

  20. Norrie disease gene: Characterization of deletions and possible function

    SciTech Connect

    Chen, Z.Y.; Battinelli, E.M.; Hendriks, R.W.; Craig, I.W. [Univ. of Oxford (United Kingdom)]; Powell, J.F. [Institute of Psychiatry, London (United Kingdom)]; Middleton-Price, H. [Univ. of London (United Kingdom)]; Sims, K.B.; Breakefield, X.O. [Massachusetts General Hospital, Charlestown, MA (United States)]

    1993-05-01

    Positional cloning experiments have resulted recently in the isolation of a candidate gene for Norrie disease (pseudoglioma; NDP), a severe X-linked neuro-developmental disorder. Here the authors report the isolation and analysis of human genomic DNA clones encompassing the NDP gene. The gene spans 28 kb and consists of 3 exons, the first of which is entirely contained within the 5′ untranslated region. Detailed analysis of genomic deletions in Norrie patients shows that they are heterogeneous, both in size and in position. By PCR analysis, they found that expression of the NDP gene was not confined to the eye or to the brain. An extensive DNA and protein sequence comparison between the human NDP gene and related genes from the database revealed homology with cysteine-rich protein-binding domains of immediate-early genes implicated in the regulation of cell proliferation. They propose that NDP is a molecule related in function to these genes and may be involved in a pathway that regulates neural cell differentiation and proliferation. 19 refs., 2 figs.

  1. RNAi-Mediated Gene Function Analysis in Skin

    PubMed Central

    Beronja, Slobodan; Fuchs, Elaine

    2014-01-01

    We have recently developed a method for RNAi-mediated gene function analysis in skin (Beronja et al., Nat Med 16:821–827, 2010). It employs ultrasound-guided in utero microinjections of lentivirus into the amniotic cavity of embryonic day 9 mice, which result in rapid, efficient, and stable transduction into mouse skin. Our technique greatly extends the available molecular and genetic toolbox for comprehensive functional examination of outstanding problems in epidermal biology. In its simplest form, as a single-gene function analysis via shRNA-mediated gene knockdown, our technique requires no animal mating and may need as little as only a few days between manipulation and phenotypic analysis. PMID:23325656

  2. Genome-Wide and Functional Annotation of Human E3 Ubiquitin Ligases Identifies MULAN, a Mitochondrial E3 that Regulates the Organelle's Dynamics and Signaling

    PubMed Central

    Ulbrich, Axel; Matsuda, Akio; Reddy, Venkateshwar A.; Orth, Anthony; Chanda, Sumit K.; Batalov, Serge; Joazeiro, Claudio A. P.

    2008-01-01

    Specificity of protein ubiquitylation is conferred by E3 ubiquitin (Ub) ligases. We have annotated ~617 putative E3s and substrate-recognition subunits of E3 complexes encoded in the human genome. The limited knowledge of the function of members of the large E3 superfamily prompted us to generate genome-wide E3 cDNA and RNAi expression libraries designed for functional screening. An imaging-based screen using these libraries to identify E3s that regulate mitochondrial dynamics uncovered MULAN/FLJ12875, a RING finger protein whose ectopic expression and knockdown both interfered with mitochondrial trafficking and morphology. We found that MULAN is a mitochondrial protein – two transmembrane domains mediate its localization to the organelle's outer membrane. MULAN is oriented such that its E3-active, C-terminal RING finger is exposed to the cytosol, where it has access to other components of the Ub system. Both an intact RING finger and the correct subcellular localization were required for regulation of mitochondrial dynamics, suggesting that MULAN's downstream effectors are proteins that are either integral to, or associated with, mitochondria and that become modified with Ub. Interestingly, MULAN had previously been identified as an activator of NF-κB, thus providing a link between mitochondrial dynamics and mitochondria-to-nucleus signaling. These findings suggest the existence of a new, Ub-mediated mechanism responsible for integration of mitochondria into the cellular environment. PMID:18213395

  3. Searching for functional gene modules with interaction component models

    PubMed Central

    2010-01-01

    Background Functional gene modules and protein complexes are being sought from combinations of gene expression and protein-protein interaction data with various clustering-type methods. Central features missing from most of these methods are handling of uncertainty in both protein interaction and gene expression measurements, and in particular capability of modeling overlapping clusters. It would make sense to assume that proteins may play different roles in different functional modules, and the roles are evidenced in their interactions. Results We formulate a generative probabilistic model for protein-protein interaction links and introduce two ways for including gene expression data into the model. The model finds interaction components, which can be interpreted as overlapping clusters or functional modules. We demonstrate the performance on two data sets of yeast Saccharomyces cerevisiae. Our methods outperform a representative set of earlier models in the task of finding biologically relevant modules having enriched functional classes. Conclusions Combining protein interaction and gene expression data with a probabilistic generative model improves discovery of modules compared to approaches based on either data source alone. With a fairly simple model we can find biologically relevant modules better than with alternative methods, and in addition the modules may be inherently overlapping in the sense that different interactions may belong to different modules. PMID:20100324

  4. Functional analysis of fungal polyketide biosynthesis genes

    Microsoft Academic Search

    Isao Fujii

    2010-01-01

    Fungal polyketides have huge structural diversity, from simple aromatics to highly modified complex reduced-type compounds. Despite such diversity, single modular iterative type I polyketide synthases (iPKSs) are responsible for their carbon skeleton construction. Using heterologous expression systems, we have studied ATX, a 6-methylsalicylic acid synthase from Aspergillus terreus, as a model iPKS. In addition, iPKS functions involved in fungal

  5. ASAP, a systematic annotation package for community analysis of genomes

    Microsoft Academic Search

    Jeremy D. Glasner; Paul Liss; Guy Plunkett III; Aaron E. Darling; Tejasvini Prasad; Michael Rusch; Alexis Byrnes; Michael K. Gilson; Bryan S. Biehl; Frederick R. Blattner; Nicole T. Perna

    2003-01-01

    ASAP (a systematic annotation package for community analysis of genomes) is a relational database and web interface developed to store, update and distribute genome sequence data and functional characterization (https://asap.ahabs.wisc.edu/annotation/php/ASAP1.htm). ASAP facilitates ongoing community annotation of genomes and tracking of information as genome projects move from preliminary data collection through post-sequencing functional analysis. The ASAP database includes multiple

  6. ASGARD: an open-access database of annotated transcriptomes for emerging model arthropod species.

    PubMed

    Zeng, Victor; Extavour, Cassandra G

    2012-01-01

    The increased throughput and decreased cost of next-generation sequencing (NGS) have shifted the bottleneck in genomic research from sequencing to annotation, analysis and accessibility. This is particularly challenging for research communities working on organisms that lack the basic infrastructure of a sequenced genome, or an efficient way to utilize whatever sequence data may be available. Here we present a new database, the Assembled Searchable Giant Arthropod Read Database (ASGARD). This database is a repository and search engine for transcriptomic data from arthropods that are of high interest to multiple research communities but currently lack sequenced genomes. We demonstrate the functionality and utility of ASGARD using de novo assembled transcriptomes from the milkweed bug Oncopeltus fasciatus, the cricket Gryllus bimaculatus and the amphipod crustacean Parhyale hawaiensis. We have annotated these transcriptomes to assign putative orthology, coding region determination, protein domain identification and Gene Ontology (GO) term annotation to all possible assembly products. ASGARD allows users to search all assemblies by orthology annotation, GO term annotation or Basic Local Alignment Search Tool. User-friendly features of ASGARD include search term auto-completion suggestions based on database content, the ability to download assembly product sequences in FASTA format, direct links to NCBI data for predicted orthologs and graphical representation of the location of protein domains and matches to similar sequences from the NCBI non-redundant database. ASGARD will be a useful repository for transcriptome data from future NGS studies on these and other emerging model arthropods, regardless of sequencing platform, assembly or annotation status. This database thus provides easy, one-stop access to multi-species annotated transcriptome information. We anticipate that this database will be useful for members of multiple research communities, including developmental biology, physiology, evolutionary biology, ecology, comparative genomics and phylogenomics. Database URL: asgard.rc.fas.harvard.edu. PMID:23180770

  7. Core Promoter Functions in the Regulation of Gene Expression of Drosophila Dorsal Target Genes*

    PubMed Central

    Zehavi, Yonathan; Kuznetsov, Olga; Ovadia-Shochat, Avital; Juven-Gershon, Tamar

    2014-01-01

    Developmental processes are highly dependent on transcriptional regulation by RNA polymerase II. The RNA polymerase II core promoter is the ultimate target of a multitude of transcription factors that control transcription initiation. Core promoters consist of core promoter motifs, e.g. the initiator, TATA box, and the downstream core promoter element (DPE), which confer specific properties to the core promoter. Here, we explored the importance of core promoter functions in the dorsal-ventral developmental gene regulatory network. This network includes multiple genes that are activated by different nuclear concentrations of Dorsal, an NF-κB homolog transcription factor, along the dorsal-ventral axis. We show that over two-thirds of Dorsal target genes contain DPE sequence motifs, which is significantly higher than the proportion of DPE-containing promoters in Drosophila genes. We demonstrate that multiple Dorsal target genes are evolutionarily conserved and functionally dependent on the DPE. Furthermore, we have analyzed the activation of key Dorsal target genes by Dorsal, as well as by another Rel family transcription factor, Relish, and the dependence of their activation on the DPE motif. Using hybrid enhancer-promoter constructs in Drosophila cells and embryo extracts, we have demonstrated that the core promoter composition is an important determinant of transcriptional activity of Dorsal target genes. Taken together, our results provide evidence for the importance of core promoter composition in the regulation of Dorsal target genes. PMID:24634215

  8. Gene Profiling of Mta1 Identifies Novel Gene Targets and Functions

    PubMed Central

    Eswaran, Jeyanthy; Kumar, Rakesh

    2011-01-01

    Background Metastasis-associated protein 1 (MTA1), a master dual co-regulatory protein, is an integral part of the NuRD (Nucleosome Remodeling and Histone Deacetylation) complex, which has indispensable transcriptional regulatory functions via histone deacetylation and chromatin remodeling. Emerging literature establishes MTA1 as a valid DNA-damage responsive protein with a significant role in maintaining optimum DNA-repair activity in mammalian cells exposed to genotoxic stress. This DNA-damage responsive function of MTA1 was reported to be both a P53-dependent and a P53-independent function. Here, we investigate the influence of P53 on the gene regulation function of Mta1 to identify novel gene targets and functions of Mta1. Methods Gene expression analysis was performed on five different mouse embryonic fibroblast (MEF) samples: (i) Mta1 wild type, (ii) Mta1 knockout, (iii) Mta1 knockout in which Mta1 was reintroduced, (iv) P53 knockout, and (v) P53 knockout in which Mta1 was overexpressed, using Affymetrix Mouse Exon 1.0 ST arrays. Hierarchical clustering, Gene Ontology analysis with GO terms satisfying a corrected p-value < 0.1, and Ingenuity Pathway Analysis were then performed. Finally, RT-qPCR was carried out on selected candidate genes. Significance/Conclusion This study represents a complete genome-wide screen for possible target genes of a coregulator, Mta1. The comparative gene profiling of the Mta1 wild type, Mta1 knockout and Mta1 re-expression in the Mta1 knockout conditions defines “bona fide” Mta1 target genes. Further extensive analyses of the data highlight the influence of P53 on Mta1 gene regulation. In the presence of P53, the majority of the genes regulated by Mta1 are related to inflammatory and anti-microbial responses, whereas in the absence of P53 the predominant target genes are involved in cancer signaling. Thus, the presented data emphasize the known functions of Mta1 and serve as a rich resource that could help identify novel Mta1 functions. PMID:21364872

  9. Polymorphism Identification and Improved Genome Annotation of Brassica rapa Through Deep RNA Sequencing

    PubMed Central

    Devisetty, Upendra Kumar; Covington, Michael F.; Tat, An V.; Lekkala, Saradadevi; Maloof, Julin N.

    2014-01-01

    The mapping and functional analysis of quantitative traits in Brassica rapa can be greatly improved with the availability of physically positioned, gene-based genetic markers and accurate genome annotation. In this study, deep transcriptome RNA sequencing (RNA-Seq) of Brassica rapa was undertaken with two objectives: SNP detection and improved transcriptome annotation. We performed SNP detection on two varieties that are parents of a mapping population to aid in development of a marker system for this population and subsequent development of high-resolution genetic map. An improved Brassica rapa transcriptome was constructed to detect novel transcripts and to improve the current genome annotation. This is useful for accurate mRNA abundance and detection of expression QTL (eQTLs) in mapping populations. Deep RNA-Seq of two Brassica rapa genotypes—R500 (var. trilocularis, Yellow Sarson) and IMB211 (a rapid cycling variety)—using eight different tissues (root, internode, leaf, petiole, apical meristem, floral meristem, silique, and seedling) grown across three different environments (growth chamber, greenhouse and field) and under two different treatments (simulated sun and simulated shade) generated 2.3 billion high-quality Illumina reads. A total of 330,995 SNPs were identified in transcribed regions between the two genotypes with an average frequency of one SNP in every 200 bases. The deep RNA-Seq reassembled Brassica rapa transcriptome identified 44,239 protein-coding genes. Compared with current gene models of B. rapa, we detected 3537 novel transcripts, 23,754 gene models had structural modifications, and 3655 annotated proteins changed. Gaps in the current genome assembly of B. rapa are highlighted by our identification of 780 unmapped transcripts. All the SNPs, annotations, and predicted transcripts can be viewed at http://phytonetworks.ucdavis.edu/. PMID:25122667

  10. Insights into hepatopancreatic functions for nutrition metabolism and ovarian development in the crab Portunus trituberculatus: gene discovery in the comparative transcriptome of different hepatopancreas stages.

    PubMed

    Wang, Wei; Wu, Xugan; Liu, Zhijun; Zheng, Huajun; Cheng, Yongxu

    2014-01-01

    The crustacean hepatopancreas has different functions including absorption, storage of nutrients and vitellogenesis during growth, and ovarian development. However, genetic information on the biological functions of the crustacean hepatopancreas during such processes is limited. The swimming crab, Portunus trituberculatus, is a commercially important species for both aquaculture and fisheries in the Asia-Pacific region. This study compared the transcriptome in the hepatopancreas of female P. trituberculatus during the growth and ovarian maturation stages by 454 high-throughput pyrosequencing and bioinformatics. The goal was to discover genes in the hepatopancreas involved in food digestion, nutrition metabolism and ovarian development, and to identify patterns of gene expression during growth and ovarian maturation. Our transcriptome produced 303,450 reads with an average length of 351 bp, and the high quality reads were assembled into 21,635 contigs and 31,844 singlets. Based on BLASTP searches of the deduced protein sequences, there were 7,762 contigs and 4,098 singlets with functional annotation. Further analysis revealed 33,427 unigenes with ORFs, including 17,388 contigs and 16,039 singlets in the hepatopancreas, while only 7,954 unigenes (5,691 contigs and 2,263 singlets) with the predicted protein sequences were annotated with biological functions. The deduced protein sequences were assigned to 3,734 GO terms, 25 COG categories and 294 specific pathways. Furthermore, there were 14, 534, and 22 identified unigenes involved in food digestion, nutrition metabolism and ovarian development, respectively. 212 differentially expressed genes (DEGs) were found between the growth and endogenous stage of the hepatopancreas, while there were 382 DEGs between the endogenous and exogenous stage hepatopancreas. 
Our results not only enhance the understanding of crustacean hepatopancreatic functions during growth and ovarian development, but also represent a basis for further research on new genes and functional genomics of P. trituberculatus or closely related species. PMID:24454766

  11. Insights into Hepatopancreatic Functions for Nutrition Metabolism and Ovarian Development in the Crab Portunus trituberculatus: Gene Discovery in the Comparative Transcriptome of Different Hepatopancreas Stages

    PubMed Central

    Liu, Zhijun; Zheng, Huajun; Cheng, Yongxu

    2014-01-01

    The crustacean hepatopancreas has different functions including absorption, storage of nutrients and vitellogenesis during growth, and ovarian development. However, genetic information on the biological functions of the crustacean hepatopancreas during such processes is limited. The swimming crab, Portunus trituberculatus, is a commercially important species for both aquaculture and fisheries in the Asia-Pacific region. This study compared the transcriptome in the hepatopancreas of female P. trituberculatus during the growth and ovarian maturation stages by 454 high-throughput pyrosequencing and bioinformatics. The goal was to discover genes in the hepatopancreas involved in food digestion, nutrition metabolism and ovarian development, and to identify patterns of gene expression during growth and ovarian maturation. Our transcriptome produced 303,450 reads with an average length of 351 bp, and the high quality reads were assembled into 21,635 contigs and 31,844 singlets. Based on BLASTP searches of the deduced protein sequences, there were 7,762 contigs and 4,098 singlets with functional annotation. Further analysis revealed 33,427 unigenes with ORFs, including 17,388 contigs and 16,039 singlets in the hepatopancreas, while only 7,954 unigenes (5,691 contigs and 2,263 singlets) with the predicted protein sequences were annotated with biological functions. The deduced protein sequences were assigned to 3,734 GO terms, 25 COG categories and 294 specific pathways. Furthermore, there were 14, 534, and 22 identified unigenes involved in food digestion, nutrition metabolism and ovarian development, respectively. 212 differentially expressed genes (DEGs) were found between the growth and endogenous stage of the hepatopancreas, while there were 382 DEGs between the endogenous and exogenous stage hepatopancreas. 
Our results not only enhance the understanding of crustacean hepatopancreatic functions during growth and ovarian development, but also represent a basis for further research on new genes and functional genomics of P. trituberculatus or closely related species. PMID:24454766

  12. ArrayPlex: distributed, interactive and programmatic access to genome sequence, annotation, ontology, and analytical toolsets

    PubMed Central

    Killion, Patrick J; Iyer, Vishwanath R

    2008-01-01

    ArrayPlex is a software package that centrally provides a large number of flexible toolsets useful for functional genomics, including microarray data storage, quality assessments, data visualization, gene annotation retrieval, statistical tests, genomic sequence retrieval and motif analysis. It uses a client-server architecture based on open source components, provides graphical, command-line, and programmatic access to all needed resources, and is extensible by virtue of a documented application programming interface. ArrayPlex is available at . PMID:19014503

  13. A portal for rhizobial genomes: RhizoGATE integrates a Sinorhizobium meliloti genome annotation update with postgenome data.

    PubMed

    Becker, Anke; Barnett, Melanie J; Capela, Delphine; Dondrup, Michael; Kamp, Paul-Bertram; Krol, Elizaveta; Linke, Burkhard; Rüberg, Silvia; Runte, Kai; Schroeder, Brenda K; Weidner, Stefan; Yurgel, Svetlana N; Batut, Jacques; Long, Sharon R; Pühler, Alfred; Goesmann, Alexander

    2009-03-10

    Sinorhizobium meliloti is a symbiotic soil bacterium of the alphaproteobacterial subdivision. Like other rhizobia, S. meliloti induces nitrogen-fixing root nodules on leguminous plants. This is an ecologically and economically important interaction, because plants engaged in symbiosis with rhizobia can grow without exogenous nitrogen fertilizers. The S. meliloti-Medicago truncatula (barrel medic) association is an important symbiosis model. The S. meliloti genome was published in 2001, and the M. truncatula genome currently is being sequenced. Many new resources and data have been made available since the original S. meliloti genome annotation and an update was needed. In June 2008, we submitted our annotation update to the EMBL and NCBI databases. Here we describe this new annotation and a new web-based portal RhizoGATE. About 1000 annotation updates were made; these included assigning functions to 313 putative proteins, assigning EC numbers to 431 proteins, and identifying 86 new putative genes. RhizoGATE incorporates the new annotation with the S. meliloti GenDB project, a platform that allows annotation updates in real time. Locations of transposon insertions, plasmid integrations, and array probe sequences are available in the GenDB project. RhizoGATE employs the EMMA platform for management and analysis of transcriptome data and the IGetDB data warehouse to integrate a variety of heterogeneous external data sources. PMID:19103235

  14. Computational prediction of SEG (single exon gene) function in humans.

    PubMed

    Sakharkar, Meena K; Chow, Vincent T K; Ghosh, Kingshuk; Chaturvedi, Iti; Lee, Pern Chern; Bagavathi, Sundara Perumal; Shapshak, Paul; Subbiah, Subramanian; Kangueane, Pandjassarame

    2005-01-01

    Human genes are often interrupted by non-coding, intragenic sequences called introns. Hence, the gene sequence is divided into exons (coding segments) and introns (non-coding segments). Consequently, a majority of them are multi exon genes (MEG). However, a considerable number of single exon genes (SEG) are present in the human genome (approximately 12%). This number is sizeable and it is important to probe their molecular function and cellular role. Hence, we performed a genome wide functional assignment to 3750 SEG sequences using PFAM (protein family database), PROSITE (database of biologically meaningful signatures or motifs) and SUPERFAMILY (a library covering all proteins of known 3 dimensional structure). PFAM assigned 13% SEG to trans-membrane receptor genes of the G-protein coupled receptor (GPCR) family and showed that a majority of SEG proteins have DNA binding function. PROSITE identified 336 unique motif types in them and this accounts for 25% of all known patterns, with a majority having PHOSPHORYLATION and ACETYLATION signals. SUPERFAMILY assigned 33% SEG to the membrane all alpha (proteins containing alpha helix structural elements according to SCOP (structural classification of proteins) definition). Functional assignment of SEG proteins at multiple levels (sequence signals, sequence families, 3D structures) using PFAM, PROSITE and SUPERFAMILY is envisioned to suggest their selective and predominant molecular function in cellular systems. Their function as DNA binding, phosphorylating, acetylating and house-keeping agents is intriguing. The analysis also showed evidence of SEG expression and retro-transposition. However, this information is inadequate to draw a concerted conclusion on the prevalent role played by these proteins in cellular biology. A complete understanding of SEG function will help to explore their role in the cellular environment. The derived datasets from these analyses are available at http://sege.ntu.edu.sg/wester/intronless/human/. PMID:15769633

  15. Discovery of candidate disease genes in ENU-induced mouse mutants by large-scale sequencing, including a splice-site mutation in nucleoredoxin

    Technology Transfer Automated Retrieval System (TEKTRAN)

    An accurate and precisely annotated genome assembly is a fundamental requirement for functional genomic analysis. Here, the complete DNA sequence and gene annotation of mouse Chromosome 11 was used to test the efficacy of large-scale sequencing for mutation identification. We re-sequenced the 14,000...

  16. Functional validation of GWAS gene candidates for abnormal liver function during zebrafish liver development.

    PubMed

    Liu, Leah Y; Fox, Caroline S; North, Trista E; Goessling, Wolfram

    2013-09-01

    Genome-wide association studies (GWAS) have revealed numerous associations between many phenotypes and gene candidates. Frequently, however, further elucidation of gene function has not been achieved. A recent GWAS identified 69 candidate genes associated with elevated liver enzyme concentrations, which are clinical markers of liver disease. To investigate the role of these genes in liver homeostasis, we narrowed down this list to 12 genes based on zebrafish orthology, zebrafish liver expression and disease correlation. To assess the function of gene candidates during liver development, we assayed hepatic progenitors at 48 hours post fertilization (hpf) and hepatocytes at 72 hpf using in situ hybridization following morpholino knockdown in zebrafish embryos. Knockdown of three genes (pnpla3, pklr and mapk10) decreased expression of hepatic progenitor cells, whereas knockdown of eight genes (pnpla3, cpn1, trib1, fads2, slc2a2, pklr, mapk10 and samm50) decreased cell-specific hepatocyte expression. We then induced liver injury in zebrafish embryos using acetaminophen exposure and observed changes in liver toxicity incidence in morphants. Prioritization of GWAS candidates and morpholino knockdown expedites the study of newly identified genes impacting liver development and represents a feasible method for initial assessment of candidate genes to instruct further mechanistic analyses. Our analysis can be extended to GWAS for additional disease-associated phenotypes. PMID:23813869

  17. A functional gene array for detection of bacterial virulence elements.

    PubMed

    Jaing, Crystal; Gardner, Shea; McLoughlin, Kevin; Mulakken, Nisha; Alegria-Hartman, Michelle; Banda, Phillip; Williams, Peter; Gu, Pauline; Wagner, Mark; Manohar, Chitra; Slezak, Tom

    2008-01-01

    Emerging known and unknown pathogens create profound threats to public health. Platforms for rapid detection and characterization of microbial agents are critically needed to prevent and respond to disease outbreaks. Available detection technologies cannot provide broad functional information about known or novel organisms. As a step toward developing such a system, we have produced and tested a series of high-density functional gene arrays to detect elements of virulence and antibiotic resistance mechanisms. Our first generation array targets genes from Escherichia coli strains K12 and CFT073, Enterococcus faecalis and Staphylococcus aureus. We determined optimal probe design parameters for gene family detection and discrimination. When tested with organisms at varying phylogenetic distances from the four target strains, the array detected orthologs for the majority of targeted gene families present in bacteria belonging to the same taxonomic family. In combination with whole-genome amplification, the array detects femtogram concentrations of purified DNA, either spiked in to an aerosol sample background, or in combinations from one or more of the four target organisms. This is the first report of a high density NimbleGen microarray system targeting microbial antibiotic resistance and virulence mechanisms. By targeting virulence gene families as well as genes unique to specific biothreat agents, these arrays will provide important data about the pathogenic potential and drug resistance profiles of unknown organisms in environmental samples. PMID:18478124

  18. New gene functions in megakaryopoiesis and platelet formation

    PubMed Central

    Gieger, Christian; Radhakrishnan, Aparna; Cvejic, Ana; Tang, Weihong; Porcu, Eleonora; Pistis, Giorgio; Serbanovic-Canic, Jovana; Elling, Ulrich; Goodall, Alison H.; Labrune, Yann; Lopez, Lorna M.; Mägi, Reedik; Meacham, Stuart; Okada, Yukinori; Pirastu, Nicola; Sorice, Rossella; Teumer, Alexander; Voss, Katrin; Zhang, Weihua; Ramirez-Solis, Ramiro; Bis, Joshua C.; Ellinghaus, David; Gögele, Martin; Hottenga, Jouke-Jan; Langenberg, Claudia; Kovacs, Peter; O’Reilly, Paul F.; Shin, So-Youn; Esko, Tõnu; Hartiala, Jaana; Kanoni, Stavroula; Murgia, Federico; Parsa, Afshin; Stephens, Jonathan; van der Harst, Pim; van der Schoot, C. Ellen; Allayee, Hooman; Attwood, Antony; Balkau, Beverley; Bastardot, François; Basu, Saonli; Baumeister, Sebastian E.; Biino, Ginevra; Bomba, Lorenzo; Bonnefond, Amélie; Cambien, François; Chambers, John C.; Cucca, Francesco; D’Adamo, Pio; Davies, Gail; de Boer, Rudolf A.; de Geus, Eco J. C.; Döring, Angela; Elliott, Paul; Erdmann, Jeanette; Evans, David M.; Falchi, Mario; Feng, Wei; Folsom, Aaron R.; Frazer, Ian H.; Gibson, Quince D.; Glazer, Nicole L.; Hammond, Chris; Hartikainen, Anna-Liisa; Heckbert, Susan R.; Hengstenberg, Christian; Hersch, Micha; Illig, Thomas; Loos, Ruth J. F.; Jolley, Jennifer; Khaw, Kay Tee; Kühnel, Brigitte; Kyrtsonis, Marie-Christine; Lagou, Vasiliki; Lloyd-Jones, Heather; Lumley, Thomas; Mangino, Massimo; Maschio, Andrea; Leach, Irene Mateo; McKnight, Barbara; Memari, Yasin; Mitchell, Braxton D.; Montgomery, Grant W.; Nakamura, Yusuke; Nauck, Matthias; Navis, Gerjan; Nöthlings, Ute; Nolte, Ilja M.; Porteous, David J.; Pouta, Anneli; Pramstaller, Peter P.; Pullat, Janne; Ring, Susan M.; Rotter, Jerome I.; Ruggiero, Daniela; Ruokonen, Aimo; Sala, Cinzia; Samani, Nilesh J.; Sambrook, Jennifer; Schlessinger, David; Schreiber, Stefan; Schunkert, Heribert; Scott, James; Smith, Nicholas L.; Snieder, Harold; Starr, John M.; Stumvoll, Michael; Takahashi, Atsushi; Tang, W. H. Wilson; Taylor, Kent; Tenesa, Albert; Thein, Swee Lay; Tönjes, Anke; Uda, Manuela; Ulivi, Sheila; van Veldhuisen, Dirk J.; Visscher, Peter M.; Völker, Uwe; Wichmann, H.-Erich; Wiggins, Kerri L.; Willemsen, Gonneke; Yang, Tsun-Po; Zhao, Jing Hua; Zitting, Paavo; Bradley, John R.; Dedoussis, George V.; Gasparini, Paolo; Hazen, Stanley L.; Metspalu, Andres; Pirastu, Mario; Shuldiner, Alan R.; van Pelt, L. Joost; Zwaginga, Jaap-Jan; Boomsma, Dorret I.; Deary, Ian J.; Franke, Andre; Froguel, Philippe; Ganesh, Santhi K.; Jarvelin, Marjo-Riitta; Martin, Nicholas G.; Meisinger, Christa; Psaty, Bruce M.; Spector, Timothy D.; Wareham, Nicholas J.; Akkerman, Jan-Willem N.; Ciullo, Marina; Deloukas, Panos; Greinacher, Andreas; Jupe, Steve; Kamatani, Naoyuki; Khadake, Jyoti; Kooner, Jaspal S.; Penninger, Josef; Prokopenko, Inga; Stemple, Derek; Toniolo, Daniela; Wernisch, Lorenz; Sanna, Serena; Hicks, Andrew A.; Rendon, Augusto; Ferreira, Manuel A.; Ouwehand, Willem H.; Soranzo, Nicole

    2012-01-01

    Platelets are the second most abundant cell type in blood and are essential for maintaining haemostasis. Their count and volume are tightly controlled within narrow physiological ranges, but there is only limited understanding of the molecular processes controlling both traits. Here we carried out a high-powered meta-analysis of genome-wide association studies (GWAS) in up to 66,867 individuals of European ancestry, followed by extensive biological and functional assessment. We identified 68 genomic loci reliably associated with platelet count and volume mapping to established and putative novel regulators of megakaryopoiesis and platelet formation. These genes show megakaryocyte-specific gene expression patterns and extensive network connectivity. Using gene silencing in Danio rerio and Drosophila melanogaster, we identified 11 of the genes as novel regulators of blood cell formation. Taken together, our findings advance understanding of novel gene functions controlling fate-determining events during megakaryopoiesis and platelet formation, providing a new example of successful translation of GWAS to function. PMID:22139419

  19. Gene expression module-based chemical function similarity search

    PubMed Central

    Li, Yun; Hao, Pei; Zheng, Siyuan; Tu, Kang; Fan, Haiwei; Zhu, Ruixin; Ding, Guohui; Dong, Changzheng; Wang, Chuan; Li, Xuan; Thiesen, H.-J.; Chen, Y. Eugene; Jiang, Hualiang; Li, Yixue

    2008-01-01

    Investigation of biological processes using selective chemical interventions is widely applied in biomedical research and drug discovery. Many studies of this kind use gene expression experiments to explore cellular responses to chemical interventions. Recently, some research groups constructed libraries of chemical-related expression profiles and introduced similarity comparison into chemical-induced transcriptome analysis. Resembling sequence similarity alignment, expression pattern comparison among chemical intervention related expression profiles provides a new way for chemical function prediction and chemical–gene relation investigation. However, existing methods place more emphasis on comparing profile patterns globally, which ignores noise and marginal effects. At the same time, although the whole information of expression profiles has been used, it is difficult to uncover the underlying mechanisms that lead to the functional similarity between two molecules. Here a new approach is presented to perform biological effects similarity comparison within small biologically meaningful gene categories. Regarding gene categories as units, a reduced similarity matrix is generated for measuring the biological distances between the query and the profiles in the library and for identifying the modules in which chemical pairs resemble each other. Through the modularization of expression patterns, this method reduces experimental noise and marginal effects and directly correlates chemical molecules with gene function modules. PMID:18842630
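
    The idea of a reduced, per-module similarity matrix can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not state which similarity measure is used, so Pearson correlation stands in for it, and the gene names, module names, and expression values are toy data rather than anything from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def module_similarity(query, library, modules):
    """Return {profile: {module: similarity}}, comparing only genes inside each module."""
    matrix = {}
    for name, profile in library.items():
        row = {}
        for module, genes in modules.items():
            shared = [g for g in genes if g in query and g in profile]
            if len(shared) < 2:
                row[module] = 0.0  # too few genes to compare
            else:
                row[module] = pearson([query[g] for g in shared],
                                      [profile[g] for g in shared])
        matrix[name] = row
    return matrix


# Toy data (hypothetical): log-ratio expression values per gene.
query = {"g1": 1.0, "g2": 2.0, "g3": 3.0, "g4": -1.0, "g5": 0.5, "g6": 2.0}
library = {"cmpdX": {"g1": 2.0, "g2": 4.0, "g3": 6.0,
                     "g4": 1.0, "g5": -0.5, "g6": -2.0}}
modules = {"modA": ["g1", "g2", "g3"], "modB": ["g4", "g5", "g6"]}

similarity = module_similarity(query, library, modules)
print(similarity)
```

    In this toy case the query and cmpdX agree within modA and disagree within modB, a distinction that a single global comparison over all six genes would blur.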

  20. Transient transformation meets gene function discovery: the strawberry fruit case

    PubMed Central

    Guidarelli, Michela; Baraldi, Elena

    2015-01-01

    Besides its well-known nutritional and health benefits, the strawberry (Fragaria X ananassa) crop draws increasing attention as a plant model system for the Rosaceae family, due to its short generation time, its rapid in vitro regeneration, and the availability of the genome sequences of the F. X ananassa and F. vesca species. In recent years, the use of high-throughput sequencing technologies has provided large amounts of molecular information on the genes possibly related to several biological processes of this crop. Nevertheless, the function of most genes or gene products is still poorly understood and needs investigation. Transient transformation technology provides a powerful tool to study gene function in vivo, avoiding the drawbacks that typically affect stable transformation protocols, such as transformation efficiency, transformant selection, and regeneration. In this review we provide an overview of the use of transient expression in the investigation of the function of genes important for strawberry fruit development, defense and nutritional properties. The technical aspects related to an efficient use of this technique are described, and the possible impact and application in strawberry crop improvement are discussed. PMID:26124771

  1. Developmental and functional analysis of Jonah gene expression.

    PubMed

    Carlson, J R; Hogness, D S

    1985-04-01

    The Jonah genes are expressed twice in development: Jonah RNA is detected during all larval stages, disappears at the end of the third larval instar, and then reappears shortly after eclosion, in the adult midgut. Construction and analysis of Jonah cDNA clones reveals that multiple Jonah genes are transcribed; cDNA clones deriving from at least four different clusters of Jonah genes have been identified. In at least one case, multiple genes in a cluster are transcribed, and one cluster is found to be transcribed both in larvae and adults. Evidence that different Jonah genes are under different control with respect to both spatial and temporal patterns of expression has been provided. Jonah RNA encodes a 28-kDa translation product or products for which we consider a possible function. Jonah RNA of constant length is found to be conserved in all strains of Drosophila melanogaster examined, Jonah genes are found at a minimum of three common chromosomal sites in all of seven D. melanogaster strains examined, and multiple Jonah genes are found in other Drosophila species. PMID:2416611

  2. SEED Software Annotations.

    ERIC Educational Resources Information Center

    Bethke, Dee; And Others

    This document provides a composite index of the first five sets of software annotations produced by Project SEED. The software has been indexed by title, subject area, and grade level, and it covers sets of annotations distributed in September 1986, April 1987, September 1987, November 1987, and February 1988. The date column in the index…

  3. Functional Genomic Analysis of Cotton Genes with Agrobacterium-Mediated Virus-Induced Gene Silencing

    PubMed Central

    Gao, Xiquan; Shan, Libo

    2015-01-01

    Cotton (Gossypium spp.) is one of the most agronomically important crops worldwide for its unique textile fiber production and serving as food and feed stock. Molecular breeding and genetic engineering of useful genes into cotton have emerged as advanced approaches to improve cotton yield, fiber quality, and resistance to various stresses. However, the understanding of gene functions and regulations in cotton is largely hindered by the limited molecular and biochemical tools. Here, we describe the method of an Agrobacterium infiltration-based virus-induced gene silencing (VIGS) assay to transiently silence endogenous genes in cotton at 2-week-old seedling stage. The genes of interest could be readily silenced with a consistently high efficiency. To monitor gene silencing efficiency, we have cloned cotton GrCla1 from G. raimondii, a homolog gene of Arabidopsis Cloroplastos alterados 1 (AtCla1) involved in chloroplast development, and inserted into a tobacco rattle virus (TRV) binary vector pYL156. Silencing of GrCla1 results in albino phenotype on the newly emerging leaves, serving as a visual marker for silencing efficiency. To further explore the possibility of using VIGS assay to reveal the essential genes mediating disease resistance to Verticillium dahliae, a fungal pathogen causing severe Verticillium wilt in cotton, we developed a seedling infection assay to inoculate cotton seedlings when the genes of interest are silenced by VIGS. The method we describe here could be further explored for functional genomic analysis of cotton genes involved in development and various biotic and abiotic stresses. PMID:23386302

  4. Interorganellar gene transfer in bryophytes: the functional nad7 gene is nuclear encoded in Marchantia polymorpha

    Microsoft Academic Search

    Y. Kobayashi; V. Knoop; H. Fukuzawa; A. Brennicke; K. Ohyama

    1997-01-01

    The nad7 gene, encoding subunit 7 of NADH dehydrogenase, is mitochondrially encoded in seed plants. In the liverwort, Marchantia polymorpha, only a pseudogene is located in the mitochondrial genome. We have now identified the functional nad7 gene copy in the nuclear genome of Marchantia, coding for a polypeptide of 468 amino acids. The nuclear-encoded nad7 has lost the two group

  5. Functional Overlap Between the mec-8 Gene and Five sym Genes in Caenorhabditis elegans

    Microsoft Academic Search

    Andrew G. Davies; Caroline A. Spike; Jocelyn E. Shaw; Robert K. Herman

    Earlier work showed that the Caenorhabditis elegans gene mec-8 encodes a regulator of alternative RNA splicing and that mec-8 null mutants have defects in sensory neurons and body muscle attachment but are generally viable and fertile. We have used a genetic screen to identify five mutations in four genes, sym- 1-sym-4, that are synthetically lethal with mec-8 loss-of-function mutations. The

  6. Functions of mammalian Smad genes as revealed by targeted gene disruption in mice

    Microsoft Academic Search

    Michael Weinstein; Xiao Yang; Chu-Xia Deng

    2000-01-01

    The Smad genes are the intracellular mediators of TGF-beta signals. Targeted mutagenesis in mice has yielded valuable new insights into the functions of this important gene family. These experiments have shown that Smad2 and Smad4 are needed for gastrulation, Smad5 for angiogenesis, and Smad3 for establishment of the mucosal immune response and proper development of the skeleton. In addition, these

  7. CGKB: an annotation knowledge base for cowpea (Vigna unguiculata L.) methylation filtered genomic genespace sequences

    PubMed Central

    Chen, Xianfeng; Laudeman, Thomas W; Rushton, Paul J; Spraggins, Thomas A; Timko, Michael P

    2007-01-01

    Background: Cowpea [Vigna unguiculata (L.) Walp.] is one of the most important food and forage legumes in the semi-arid tropics because of its ability to tolerate drought and grow on poor soils. It is cultivated mostly by poor farmers in developing countries, with 80% of production taking place in the dry savannah of tropical West and Central Africa. Cowpea is largely an underexploited crop with relatively little genomic information available for use in applied plant breeding. The goal of the Cowpea Genomics Initiative (CGI), funded by the Kirkhouse Trust, a UK-based charitable organization, is to leverage modern molecular genetic tools for gene discovery and cowpea improvement. One aspect of the initiative is the sequencing of the gene-rich region of the cowpea genome (termed the genespace) recovered using methylation filtration technology and providing annotation and analysis of the sequence data. Description: CGKB, Cowpea Genespace/Genomics Knowledge Base, is an annotation knowledge base developed under the CGI. The database is based on information derived from 298,848 cowpea genespace sequences (GSS) isolated by methylation filtering of genomic DNA. The CGKB consists of three knowledge bases: GSS annotation and comparative genomics knowledge base, GSS enzyme and metabolic pathway knowledge base, and GSS simple sequence repeats (SSRs) knowledge base for molecular marker discovery. A homology-based approach was applied for annotations of the GSS, mainly using BLASTX against four public FASTA formatted protein databases (NCBI GenBank Proteins, UniProtKB-Swiss-Prot, UniProtKB-PIR (Protein Information Resource), and UniProtKB-TrEMBL). Comparative genome analysis was done by BLASTX searches of the cowpea GSS against four plant proteomes from Arabidopsis thaliana, Oryza sativa, Medicago truncatula, and Populus trichocarpa. The possible exons and introns on each cowpea GSS were predicted using the HMM-based Genscan gene prediction program and the potential domains on annotated GSS were analyzed using the HMMER package against the Pfam database. The annotated GSS were also assigned with Gene Ontology annotation terms and integrated with 228 curated plant metabolic pathways from the Arabidopsis Information Resource (TAIR) knowledge base. The UniProtKB-Swiss-Prot ENZYME database was used to assign putative enzymatic function to each GSS. Each GSS was also analyzed with the Tandem Repeat Finder (TRF) program in order to identify potential SSRs for molecular marker discovery. The raw sequence data, processed annotation, and SSR results were stored in relational tables designed in key-value pair fashion using a PostgreSQL relational database management system. The biological knowledge derived from the sequence data and processed results are represented as views or materialized views in the relational database management system. All materialized views are indexed for quick data access and retrieval. Data processing and analysis pipelines were implemented using the Perl programming language. The web interface was implemented in JavaScript and Perl CGI running on an Apache web server. The CPU intensive data processing and analysis pipelines were run on a computer cluster of more than 30 dual-processor Apple XServes. A job management system called Vela was created as a robust way to submit large numbers of jobs to the Portable Batch System (PBS). Conclusion: CGKB is an integrated and annotated resource for cowpea GSS with features of homology-based and HMM-based annotations, enzyme and pathway annotations, GO term annotation, toolkits, and a large number of other facilities to perform complex queries. The cowpea GSS, chloroplast sequences, mitochondrial sequences, retroelements, and SSR sequences are available as FASTA formatted files and downloadable at CGKB. This database and web interface are publicly accessible at . PMID:17445272
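
    The storage design described above (key-value style relational tables, with derived knowledge exposed through views) can be sketched as follows. This is not the CGKB schema: the table, column, and identifier names are hypothetical, and SQLite from the Python standard library stands in for PostgreSQL. SQLite has no materialized views, so an ordinary view is used in their place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Key-value layout: one row per (sequence, attribute) pair. All names are hypothetical.
conn.execute("CREATE TABLE gss_annotation (gss_id TEXT, attr TEXT, val TEXT)")
conn.executemany(
    "INSERT INTO gss_annotation VALUES (?, ?, ?)",
    [
        ("GSS_0001", "best_hit", "hypothetical protein"),
        ("GSS_0001", "go_term", "GO:0003677"),
        ("GSS_0002", "best_hit", "example kinase"),
    ],
)

# A view pivots the key-value rows into one summary row per sequence.
conn.execute(
    """
    CREATE VIEW gss_summary AS
    SELECT gss_id,
           MAX(CASE WHEN attr = 'best_hit' THEN val END) AS best_hit,
           MAX(CASE WHEN attr = 'go_term' THEN val END) AS go_term
    FROM gss_annotation
    GROUP BY gss_id
    """
)

rows = conn.execute("SELECT * FROM gss_summary ORDER BY gss_id").fetchall()
for row in rows:
    print(row)
```

    A key-value table accepts new annotation types without schema changes, while the view gives queries a conventional one-row-per-sequence shape.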

  8. Interpretation Errors related to the GO Annotation File Format

    PubMed Central

    Moreira, Dilvan A.; Shah, Nigam H.; Musen, Mark A.

    2007-01-01

    The Gene Ontology (GO) is the most widely used ontology for creating biomedical annotations. GO annotations are statements associating a biological entity with a GO term. These statements comprise a large dataset of biological knowledge that is used widely in biomedical research. GO Annotations are available as “gene association files” from the GO website in a tab-delimited file format (GO Annotation File Format) composed of rows of 15 tab-delimited fields. This simple format lacks the knowledge representation (KR) capabilities to represent unambiguously semantic relationships between each field. This paper demonstrates that this KR shortcoming leads users to interpret the files in ways that can be erroneous. We propose a complementary format to represent GO annotation files as knowledge bases using the W3C recommended Web Ontology Language (OWL). PMID:18693894
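
    As a concrete illustration of the 15-field tab-delimited layout, the sketch below parses one row into named fields. The field names follow the GAF 1.0 column order; the sample row is entirely hypothetical. The NOT qualifier check shows one commonly cited way a row can be misread if the qualifier column is ignored; the abstract does not say whether this is among the errors the paper analyzes.

```python
# Column names for the 15-field layout, in GAF 1.0 order.
GAF_FIELDS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by",
]


def parse_gaf_line(line):
    """Split one annotation row into a dict; return None for '!' comment lines."""
    if line.startswith("!"):
        return None
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GAF_FIELDS):
        raise ValueError(f"expected {len(GAF_FIELDS)} fields, got {len(values)}")
    return dict(zip(GAF_FIELDS, values))


def is_negated(record):
    """True when the qualifier column carries NOT (entity is NOT associated with the term)."""
    return "NOT" in record["qualifier"].split("|")


# Hypothetical row, for illustration only.
SAMPLE_ROW = "\t".join([
    "ExampleDB", "X0001", "geneX", "NOT", "GO:0005515", "PMID:0000000",
    "IDA", "", "F", "example protein", "", "protein", "taxon:9606",
    "20070101", "ExampleDB",
])

record = parse_gaf_line(SAMPLE_ROW)
print(record["db_object_symbol"], record["go_id"], is_negated(record))
```

    Reading only the symbol and GO ID columns of this row would suggest a positive association, which is the opposite of what the row states.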

  9. Influence of CHIEF pathway genes on gene expression: a pathway approach to functionality

    PubMed Central

    Slattery, Martha L; Lundgreen, Abbie; Mullany, Lila E; Penney, Rosalind B; Wolff, Roger K

    2014-01-01

    Background: Candidate pathway approaches in disease association studies often utilize a tagSNP approach to capture genetic variation. In this paper we assess gene expression patterns with SNPs in genes in the CHIEF pathway to help determine their potential functionality. Methods: Quantitative real-time RT-PCR was run to determine gene expression of 13 genes in normal colon tissue samples from 82 individuals. TagSNP genotype data were obtained from a GoldenGate Illumina multiplex bead array platform. Age, sex, and genetic ancestry adjusted general linear models were used to estimate beta coefficients and p values. Results: Genetic variation in mTOR (1 SNP), NFKB1 (4 SNPs), PRKAG2 (3 SNPs), and TSC2 (1 SNP) significantly influenced their expression. After adjustment for multiple comparisons several associations between pathway genes and expression of other genes were significant. These included AKT1 rs1130214 associated with expression of PDK1; NFκB1 rs13117745 and rs4648110 with STK11 expression; PRKAG2 rs6965771 with expression of NFκB1, PIK3CA, and RPS6KB2; RPS6KB1 rs80711475 with STK11 expression; STK11 rs741765 with PIK3CA and PRKAG2 expression; and TSC2 rs3087631 with AKT1, IkBκB, NFκB1, PDK1, PIK3CA, PRKAG2, and PTEN expression. The higher levels of differential expression were noted for TSC2 rs3087631 (percent difference ranges from 108% to 198% across genes). Many of these SNPs and genes also were associated with colon and rectal cancer risk. Conclusions: Our results suggest that pathway genes may regulate expression of other genes in the pathway. The convergence of these genes in several biological pathways involved in cancer further supports their importance to the carcinogenic process. PMID:24959314

  10. Gene Perturbation Atlas (GPA): a single-gene perturbation repository for characterizing functional mechanisms of coding and non-coding genes.

    PubMed

    Xiao, Yun; Gong, Yonghui; Lv, Yanling; Lan, Yujia; Hu, Jing; Li, Feng; Xu, Jinyuan; Bai, Jing; Deng, Yulan; Liu, Ling; Zhang, Guanxiong; Yu, Fulong; Li, Xia

    2015-01-01

    Genome-wide transcriptome profiling after gene perturbation is a powerful means of elucidating gene functional mechanisms in diverse contexts. The comprehensive collection and analysis of the resulting transcriptome profiles would help to systematically characterize context-dependent gene functional mechanisms and conduct experiments in biomedical research. To this end, we collected and curated over 3000 transcriptome profiles in human and mouse from diverse gene perturbation experiments, which involved 1585 different perturbed genes (microRNAs, lncRNAs and protein-coding genes) across 1170 different cell lines/tissues. For each profile, we identified differential genes and their associated functions and pathways, constructed perturbation networks, predicted transcription regulation and cancer/drug associations, and assessed cooperative perturbed genes. Based on these transcriptome analyses, the Gene Perturbation Atlas (GPA) can be used to detect (i) novel or cell-specific functions and pathways affected by perturbed genes, (ii) protein interactions and regulatory cascades affected by perturbed genes, and (iii) perturbed gene-mediated cooperative effects. The GPA is a user-friendly database to support the rapid searching and exploration of gene perturbations. Particularly, we visualized functional effects of perturbed genes from multiple perspectives. In summary, the GPA is a valuable resource for characterizing gene functions and regulatory mechanisms after single-gene perturbations. The GPA is freely accessible at http://biocc.hrbmu.edu.cn/GPA/. PMID:26039571

  11. High throughput generation of promoter reporter (GFP) transgenic lines of low expressing genes in Arabidopsis and analysis of their expression patterns

    Microsoft Academic Search

    Yong-Li Xiao; Julia C Redman; Erin L Monaghan; Jun Zhuang; Beverly A Underwood; William A Moskal; Wei Wang; Hank C Wu; Christopher D Town

    2010-01-01

    BACKGROUND: Although the complete genome sequence and annotation of Arabidopsis were released at the end of year 2000, it is still a great challenge to understand the function of each gene in the Arabidopsis genome. One way to understand the function of genes on a genome-wide scale is expression profiling by microarrays. However, the expression level of many genes in

  12. Inference of gene function based on gene fusion events: the rosetta-stone method.

    PubMed

    Suhre, Karsten

    2007-01-01

    The method described in this chapter can be used to infer putative functional links between two proteins. The basic idea is based on the principle of "guilt by association." It is assumed that two proteins that are found to be encoded by a single transcript in one (or several) genomes are likely to be functionally linked, for example by acting in the same metabolic pathway or by forming a multiprotein complex. This method is of particular interest for studying genes that exhibit no, or only remote, homologies with already well-characterized proteins. Combined with other non-homology based methods, gene fusion events may yield valuable information for hypothesis building on protein function, and may guide experimental characterization of the target protein, for example by suggesting potential ligands or binding partners. This chapter uses the FusionDB database (http://www.igs.cnrs-mrs.fr/FusionDB/) as its source of information. FusionDB provides a characterization of a large number of gene fusion events by means of multiple sequence alignments. Orthologous genes are included to yield a comprehensive view of the structure of a gene fusion event. Phylogenetic tree reconstruction is provided to evaluate the history of a gene fusion event, and three-dimensional protein structure information is used, where available, to further characterize the nature of the gene fusion. For genes that are not included in FusionDB, some instructions are given on how to generate a similar type of information, based solely on publicly available web tools that are listed here. PMID:18025684
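
    The core inference of the rosetta-stone idea can be sketched in a few lines. This is a simplified illustration, not the FusionDB procedure: it assumes homology has already been reduced to a family label per gene, and every gene, family, and protein name below is hypothetical.

```python
from itertools import combinations


def rosetta_links(query_genes, reference_proteins):
    """Find pairs of separate query genes whose families co-occur in one reference protein.

    query_genes: dict mapping gene name -> family label (assumed precomputed).
    reference_proteins: dict mapping protein name -> set of family labels it contains.
    Returns a set of (gene1, gene2, fused_protein) tuples.
    """
    links = set()
    for (gene1, fam1), (gene2, fam2) in combinations(sorted(query_genes.items()), 2):
        if fam1 == fam2:
            continue  # paralogs of one family are not evidence of a fusion link
        for fused, families in reference_proteins.items():
            if fam1 in families and fam2 in families:
                links.add((gene1, gene2, fused))
    return links


# Toy data (hypothetical): three separate genes in the query genome,
# and a reference genome in which families famA and famB are fused.
query_genes = {"geneA": "famA", "geneB": "famB", "geneC": "famC"}
reference_proteins = {"fusedAB": {"famA", "famB"}, "soloC": {"famC"}}

links = rosetta_links(query_genes, reference_proteins)
print(links)
```

    Here geneA and geneB are predicted to be functionally linked because their families appear together in fusedAB, while geneC receives no link.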

  13. Functional analysis of yersiniabactin transport genes of Yersinia enterocolitica.

    PubMed

    Brem, D; Pelludat, C; Rakin, A; Jacobi, C A; Heesemann, J

    2001-05-01

    Yersinia enterocolitica O:8, biogroup (BG) IB, strain WA-C carries a high-pathogenicity island (HPI) including iron-repressible genes (irp1-9, fyuA) for biosynthesis and uptake of the siderophore yersiniabactin (Ybt). The authors report the functional analysis of irp6,7,8, which show 98-99% similarity to the corresponding genes ybtP,Q,X on the HPI of Yersinia pestis. It was demonstrated that irp6,7 are involved in ferric (Fe)-Ybt utilization and mouse virulence of Y. enterocolitica, thus confirming corresponding results for Y. pestis. Additionally it was shown that inactivation of the ampG-like gene irp8 did not affect either Fe-Ybt utilization or mouse virulence. To determine whether irp6, irp7 and fyuA (encoding the outer-membrane Fe-Ybt/pesticin receptor FyuA) are sufficient to mediate Fe-Ybt transport/utilization, these genes were transferred into Escherichia coli entD,F and into non-pathogenic Y. enterocolitica, BG IA, strain NF-O. Surprisingly, E. coli entD,F but not Y. enterocolitica NF-O gained the capability to utilize exogenous Fe-Ybt as a result of this gene transfer, although both strains expressed functional FyuA (pesticin sensitivity). These results suggest that besides irp6, irp7 and fyuA, additional genes are required for sufficient Fe-Ybt transport/utilization. Finally, it was shown that irp6, irp7 and fyuA but not irp8 are involved in controlling Ybt biosynthesis and fyuA gene expression: irp6 and/or irp7 mutation leads to upregulation whereas fyuA mutation leads to downregulation. However, fyuA-dependent control of Ybt biosynthesis could be bypassed in a fyuA mutant by ingredients of chrome azurol S (CAS) siderophore indicator agar. PMID:11320115

  14. Functional optimization of gene clusters by combinatorial design and assembly.

    PubMed

    Smanski, Michael J; Bhatia, Swapnil; Zhao, Dehua; Park, YongJin; B A Woodruff, Lauren; Giannoukos, Georgia; Ciulla, Dawn; Busby, Michele; Calderon, Johnathan; Nicol, Robert; Gordon, D Benjamin; Densmore, Douglas; Voigt, Christopher A

    2014-12-01

    Large microbial gene clusters encode useful functions, including energy utilization and natural product biosynthesis, but genetic manipulation of such systems is slow, difficult and complicated by complex regulation. We exploit the modularity of a refactored Klebsiella oxytoca nitrogen fixation (nif) gene cluster (16 genes, 103 parts) to build genetic permutations that could not be achieved by starting from the wild-type cluster. Constraint-based combinatorial design and DNA assembly are used to build libraries of radically different cluster architectures by varying part choice, gene order, gene orientation and operon occupancy. We construct 84 variants of the nifUSVWZM operon, 145 variants of the nifHDKY operon, 155 variants of the nifHDKYENJ operon and 122 variants of the complete 16-gene pathway. The performance and behavior of these variants are characterized by nitrogenase assay and strand-specific RNA sequencing (RNA-seq), and the results are incorporated into subsequent design cycles. We have produced a fully synthetic cluster that recovers 57% of wild-type activity. Our approach allows the performance of genetic parts to be quantified simultaneously in hundreds of genetic contexts. This parallelized design-build-test-learn cycle, which can access previously unattainable regions of genetic space, should provide a useful, fast tool for genetic optimization and hypothesis testing. PMID:25419741
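
    Constraint-based combinatorial enumeration of cluster architectures can be sketched as below. This is only an illustration of the general technique: the promoter names, the three-gene subset, and the constraint are hypothetical and are not the rules used in the study, which the abstract does not describe.

```python
from itertools import permutations, product


def enumerate_designs(genes, promoters, allowed):
    """List every combination of gene order, orientation, and promoter that passes `allowed`."""
    designs = []
    for order in permutations(genes):
        for orientations in product("+-", repeat=len(genes)):
            for promoter in promoters:
                design = {
                    "promoter": promoter,
                    "genes": list(zip(order, orientations)),
                }
                if allowed(design):
                    designs.append(design)
    return designs


def first_gene_forward(design):
    """Hypothetical constraint: the gene next to the promoter must be in forward orientation."""
    return design["genes"][0][1] == "+"


# Illustrative inputs: three genes and two hypothetical promoter parts.
designs = enumerate_designs(["nifH", "nifD", "nifK"], ["P1", "P2"], first_gene_forward)
print(len(designs))
```

    Without the constraint there are 6 orders × 8 orientation patterns × 2 promoters = 96 candidates; requiring the first gene to face forward keeps half of them.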

  15. Gene-function studies in systemic lupus erythematosus.

    PubMed

    Crispín, José C; Hedrich, Christian M; Tsokos, George C

    2013-08-01

    The aetiology of systemic lupus erythematosus (SLE) is complex and is known to involve both genetic and environmental factors. In a small number of patients, single-gene defects can lead to the development of SLE. Such genes include those encoding early components of the complement cascade and the 3'-5' DNA exonuclease TREX1. In addition, genome-wide association studies have identified single-nucleotide polymorphisms that confer some susceptibility to SLE. In this Review, we discuss selected examples of genes whose products have distinctly altered function in SLE and contribute to the pathogenic process. Specifically, we focus on the genes encoding integrin αM (ITGAM), IgG Fc receptors, sialic acid O-acetyl esterase (SIAE), the catalytic subunit of protein phosphatase PP2A (PPP2CA) and signalling lymphocytic activation molecule (SLAM) family members. Moreover, we highlight the changes in epigenetic signatures that occur in SLE. Such epigenetic modifications, which are abundantly present and might alter gene expression in the presence or absence of susceptibility variants, should be carefully considered when deconstructing the contribution of individual genes to the complex pathogenesis of SLE. PMID:23732569

  16. Nucleotide substitutions revealing specific functions of Polycomb group genes.

    PubMed

    Bajusz, Izabella; Sipos, László; Pirity, Melinda K

    2015-04-01

    POLYCOMB group (PCG) proteins belong to the family of epigenetic regulators of genes playing important roles in differentiation and development. Mutants of PcG genes were isolated first in the fruit fly, Drosophila melanogaster, resulting in spectacular segmental transformations due to the ectopic expression of homeotic genes. Homologs of Drosophila PcG genes were also identified in plants and in vertebrates and subsequent experiments revealed the general role of PCG proteins in the maintenance of the repressed state of chromatin through cell divisions. The past decades of gene targeting experiments have allowed us to make significant strides towards understanding how the network of PCG proteins influences multiple aspects of cellular fate determination during development. Being involved in the transmission of specific expression profiles of different cell lineages, PCG proteins were found to control wide spectra of unrelated epigenetic processes in vertebrates, such as stem cell plasticity and renewal, genomic imprinting and inactivation of X-chromosome. PCG proteins also affect regulation of metabolic genes being important for switching programs between pluripotency and differentiation. Insight into the precise roles of PCG proteins in normal physiological processes has emerged from studies employing cell culture-based systems and genetically modified animals. Here we summarize the findings obtained from PcG mutant fruit flies and mice generated to date with a focus on PRC1 and PRC2 members altered by nucleotide substitutions resulting in specific alleles. We also include a compilation of lessons learned from these models about the in vivo functions of this complex protein family. With multiple knockout lines, sophisticated approaches to study the consequences of peculiar missense point mutations, and insights from complementary gain-of-function systems in hand, we are now in a unique position to significantly advance our understanding of the molecular basis of in vivo functions of PcG proteins. PMID:25669595

  17. Use of functional gene arrays for elucidating in situ biodegradation

    PubMed Central

    Nostrand, Joy D. Van; He, Zhili; Zhou, Jizhong

    2012-01-01

    Microarrays have revolutionized the study of microbiology by providing a high-throughput method for examining thousands of genes with a single test and overcome the limitations of many culture-independent approaches. Functional gene arrays (FGA) probe a wide range of genes involved in a variety of functions of interest to microbial ecology (e.g., carbon degradation, N fixation, metal resistance) from many different microorganisms, cultured and uncultured. The most comprehensive FGA to date is the GeoChip array, which targets tens of thousands of genes involved in the geochemical cycling of carbon, nitrogen, phosphorus, and sulfur, metal resistance and reduction, energy processing, antibiotic resistance and contaminant degradation as well as phylogenetic information (gyrB). Since the development of GeoChips, many studies have been performed using this FGA and have shown it to be a powerful tool for rapid, sensitive, and specific examination of microbial communities in a high-throughput manner. As such, the GeoChip is well-suited for linking geochemical processes with microbial community function and structure. This technology has been used successfully to examine microbial communities before, during, and after in situ bioremediation at a variety of contaminated sites. These studies have expanded our understanding of biodegradation and bioremediation processes and the associated microorganisms and environmental conditions responsible. This review provides an overview of FGA development with a focus on the GeoChip and highlights specific GeoChip studies involving in situ bioremediation. PMID:23049526

  18. Annotation of hypothetical proteins orthologous in Pongo abelii and Sus scrofa

    PubMed Central

    Jitendra, Singh; Narula, Ranjana; Agnihotri, Shefali; Singh, Maneet

    2011-01-01

    A hypothetical protein is predicted to be expressed from an open reading frame without known experimental evidence of translation. They constitute a substantial fraction of proteomes. Domain extraction from these hypothetical sequences helps to search for protein coding genes for protein structural and functional annotation. We describe the analysis of prediction data in a sequence dataset of hypothetical protein orthologs of Pongo abelii (orangutan) and Sus scrofa (pig). It should be noted that these orangutan-pig orthologs are also non-homologous to human proteins. These predicted data find application in the genome wide annotation of proteins in poorly understood genomes. Abbreviations PDB - Protein Data Bank, DEG - Database of Essential Genes, CDD - Conserved Domain Database, IUCN - International Union for Conservation of Nature. PMID:21769189

  19. Functional Analysis of Prognostic Gene Expression Network Genes in Metastatic Breast Cancer Models

    PubMed Central

    Geiger, Thomas R.; Ha, Ngoc-Han; Faraji, Farhoud; Michael, Helen T.; Rodriguez, Loren; Walker, Renard C.; Green, Jeffery E.; Simpson, R. Mark; Hunter, Kent W.

    2014-01-01

    Identification of conserved co-expression networks is a useful tool for clustering groups of genes enriched for common molecular or cellular functions [1]. The relative importance of genes within networks can frequently be inferred by the degree of connectivity, with those displaying high connectivity being significantly more likely to be associated with specific molecular functions [2]. Previously we utilized cross-species network analysis to identify two network modules that were significantly associated with distant metastasis free survival in breast cancer. Here, we validate one of the highly connected genes as a metastasis associated gene. Tpx2, the most highly connected gene within a proliferation network specifically prognostic for estrogen receptor positive (ER+) breast cancers, enhances metastatic disease, but in a tumor autonomous, proliferation-independent manner. Histologic analysis suggests instead that variation of TPX2 levels within disseminated tumor cells may influence the transition between dormant to actively proliferating cells in the secondary site. These results support the co-expression network approach for identification of new metastasis-associated genes to provide new information regarding the etiology of breast cancer progression and metastatic disease. PMID:25368990
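
    The notion of connectivity used in this abstract can be sketched as node degree in an undirected co-expression network. The edge list and gene labels below are hypothetical placeholders, and degree is only one of several possible connectivity measures; the abstract does not say which one the authors used.

```python
from collections import Counter

def degree_ranking(edges):
    """Rank genes by degree (number of distinct neighbours) in an undirected network."""
    unique_edges = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    degree = Counter()
    for edge in unique_edges:
        for gene in edge:
            degree[gene] += 1
    # Highest degree first; ties are broken alphabetically for a stable order.
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical co-expression edges, for illustration only.
toy_edges = [
    ("geneA", "geneB"),
    ("geneA", "geneC"),
    ("geneA", "geneD"),
    ("geneB", "geneC"),
    ("geneC", "geneA"),  # duplicate of geneA-geneC in the other direction
]
ranking = degree_ranking(toy_edges)
print(ranking[0])
```

    In this toy network the first entry of the ranking plays the role of the most highly connected gene.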

  20. Apollo: a sequence annotation editor

    Microsoft Academic Search

    SE Lewis; SMJ Searle; N Harris; M Gibson; V Iyer; J Richter; C Wiel; L Bayraktaroglu; E Birney; MA Crosby; JS Kaminker; BB Matthews; SE Prochnik; CD Smith; JL Tupy; GM Rubin; S Misra; CJ Mungall; ME Clamp

    2002-01-01

    The well-established inaccuracy of purely computational methods for annotating genome sequences necessitates an interactive tool to allow biological experts to refine these approximations by viewing and independently evaluating the data supporting each annotation. Apollo was developed to meet this need, enabling curators to inspect genome annotations closely and edit them. FlyBase biologists successfully used Apollo to annotate the Drosophila melanogaster

  1. Toward Coalescing Gene Expression and Function with QTLs of Water-Deficit Stress in Cotton

    PubMed Central

    Kebede, Hirut; Payton, Paxton; Pham, Hanh Thi My; Allen, Randy D.; Wright, Robert J.

    2015-01-01

    Cotton exhibits moderately high vegetative tolerance to water-deficit stress but lint production is restricted by the available rainfed and irrigation capacity. We have described the impact of water-deficit stress on the genetic and metabolic control of fiber quality and production. Here we examine the association of tentative consensus sequences (TCs) derived from various cotton tissues under irrigated and water-limited conditions with stress-responsive QTLs. Three thousand sixteen mapped sequence-tagged-sites were used as anchored targets to examine sequence homology with 15,784 TCs to test the hypothesis that putative stress-responsive genes will map within QTLs associated with stress-related phenotypic variation more frequently than with other genomic regions not associated with these QTLs. Approximately 1,906 of 15,784 TCs were mapped to the consensus map. About 35% of the annotated TCs that mapped within QTL regions were genes involved in an abiotic stress response. By comparison, only 14.5% of the annotated TCs mapped outside these QTLs were classified as abiotic stress genes. A simple binomial probability calculation of this degree of bias being observed if QTL and non-QTL regions are equally likely to contain stress genes was P(x ≥ 85) = 7.99 × 10^-15. These results suggest that the QTL regions have a higher propensity to contain stress genes.
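
    The binomial calculation mentioned in this abstract can be sketched as an upper-tail probability. The abstract does not report the number of annotated TCs inside QTL regions, so `n_qtl` below is a placeholder, and using the non-QTL rate of 14.5% as the null probability is an assumption about how the test was set up. The output is therefore not expected to reproduce the published value.

```python
from math import comb

def binomial_upper_tail(n, k, p):
    """Return P(X >= k) for X ~ Binomial(n, p), summed exactly term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_qtl = 243      # placeholder: NOT given in the abstract
k_stress = 85    # threshold quoted in the abstract
p_null = 0.145   # assumed null rate (non-QTL fraction quoted in the abstract)

print(binomial_upper_tail(n_qtl, k_stress, p_null))
```

    A very small tail probability under the null would indicate that observing so many stress genes inside QTL regions is unlikely if both kinds of region were equally enriched.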

  2. A functional update of the Escherichia coli K-12 genome

    Microsoft Academic Search

    Margrethe H Serres; Shuba Gopal; Laila A Nahum; Ping Liang; Terry Gaasterland; Monica Riley

    2001-01-01

    BACKGROUND: Since the genome of Escherichia coli K-12 was initially annotated in 1997, additional functional information based on biological characterization and functions of sequence-similar proteins has become available. On the basis of this new information, an updated version of the annotated chromosome has been generated. RESULTS: The E. coli K-12 chromosome is currently represented by 4,401 genes encoding 116 RNAs

  3. Exploring laccase-like multicopper oxidase genes from the ascomycete Trichoderma reesei: a functional, phylogenetic and evolutionary study

    PubMed Central

    2010-01-01

    Background: The diversity and function of ligninolytic genes in soil-inhabiting ascomycetes has not yet been elucidated, despite their possible role in plant litter decay processes. Among ascomycetes, Trichoderma reesei is a model organism of cellulose and hemicellulose degradation, used for its unique secretion ability especially for cellulase production. T. reesei has only been reported as a cellulolytic and hemicellulolytic organism although genome annotation revealed 6 laccase-like multicopper oxidase (LMCO) genes. The purpose of this work was i) to validate the function of a candidate LMCO gene from T. reesei, and ii) to reconstruct LMCO phylogeny and perform evolutionary analysis testing for positive selection. Results: After homologous overproduction of a candidate LMCO gene, extracellular laccase activity was detected when ABTS or SRG were used as substrates, and the recombinant protein was purified to homogeneity followed by biochemical characterization. The recombinant protein, called TrLAC1, has a molecular mass of 104 kDa. Optimal temperature and pH were respectively 40-45°C and 4, by using ABTS as substrate. TrLAC1 showed broad pH stability range of 3 to 7. Temperature stability revealed that TrLAC1 is not a thermostable enzyme, which was also confirmed by unfolding studies monitored by circular dichroism. Evolutionary studies were performed to shed light on the LMCO family, and the phylogenetic tree was reconstructed using maximum-likelihood method. LMCO and classical laccases were clearly divided into two distinct groups. Finally, Darwinian selection was tested, and the results showed that positive selection drove the evolution of sequences leading to well-known laccases involved in ligninolysis. Positively-selected sites were observed that could be used as targets for mutagenesis and functional studies between classical laccases and LMCO from T. reesei. Conclusions: Homologous production and evolutionary studies of the first LMCO from the biomass-degrading fungus T. reesei give new insights into the physicochemical parameters and biodiversity in this family. PMID:20735824

  4. Molecular cloning and functional analysis of the goose FSHβ gene.

    PubMed

    Huang, Z; Li, X; Li, Y; Liu, R; Chen, Y; Wu, N; Wang, M; Song, Y; Yuan, X; Lan, L; Xu, Q; Chen, G; Zhao, W

    2015-06-01

    The objective of this investigation was to clone goose FSHβ-subunit cDNA and to construct a FSH fusion gene to identify the function of FSHβ mRNA during stages of the breeding cycle. The FSHβ gene was obtained by reverse transcription-PCR, and the full-length FSHβ mRNA sequence was amplified by rapid-amplification of cDNA ends. FSHβ mRNA expression was detected in reproductive tissues at different stages (pre-laying, laying period, and broody period). Additionally, the expression of 4 genes known to be involved in reproduction (FSHβ, GnRH, GH, and BMP) were evaluated in COS-7 cells expressing the fusion gene (pVITRO2-FSHαβ-CTP). The results show that the FSHβ gene consists of a 16 base pair (bp) 5'-untranslated region (UTR), 396 bp open reading frame, and alternative 3'-UTRs at 518 bp and 780 bp, respectively. qPCR analyses revealed that FSHβ mRNA is highly transcribed in reproductive tissues, including the pituitary, hypothalamus, ovaries, and oviduct. FSHβ mRNA expression increased and subsequently decreased in the pituitary, ovaries, and oviduct during the reproductive stages. Stable FSH expression was confirmed using enzyme-linked immunosorbent assays after transfection with the pVITRO2-FSHαβ-CTP plasmid. FSHβ, GnRH, and BMP expression increased significantly 36 h and 48 h after transfection with the fusion gene in COS-7 cells. The results demonstrate that the FSHβ subunit functions in the goose reproductive cycle and provides a theoretical basis for future breeding work. PMID:25719958

  5. DAS Writeback: A Collaborative Annotation System

    PubMed Central

    2011-01-01

    Background: Centralised resources such as GenBank and UniProt are perfect examples of the major international efforts that have been made to integrate and share biological information. However, additional data that adds value to these resources needs a simple and rapid route to public access. The Distributed Annotation System (DAS) provides an adequate environment to integrate genomic and proteomic information from multiple sources, making this information accessible to the community. DAS offers a way to distribute and access information but it does not provide domain experts with the mechanisms to participate in the curation process of the available biological entities and their annotations. Results: We designed and developed a Collaborative Annotation System for proteins called DAS Writeback. DAS Writeback is a protocol extension of DAS to provide the functionalities of adding, editing and deleting annotations. We implemented this new specification as extensions of both a DAS server and a DAS client. The architecture was designed with the involvement of the DAS community and it was improved after performing usability experiments emulating a real annotation task. Conclusions: We demonstrate that DAS Writeback is effective, usable and will provide the appropriate environment for the creation and evolution of community protein annotation. PMID:21569281

  6. Conserved codon composition of ribosomal protein coding genes in Escherichia coli, Mycobacterium tuberculosis and Saccharomyces cerevisiae: lessons from supervised machine learning in functional genomics.

    PubMed

    Lin, Kui; Kuang, Yuyu; Joseph, Jeremiah S; Kolatkar, Prasanna R

    2002-06-01

    Genomics projects have resulted in a flood of sequence data. Functional annotation currently relies almost exclusively on inter-species sequence comparison and is restricted in cases of limited data from related species and widely divergent sequences with no known homologs. Here, we demonstrate that codon composition, a fusion of codon usage bias and amino acid composition signals, can accurately discriminate, in the absence of sequence homology information, cytoplasmic ribosomal protein genes from all other genes of known function in Saccharomyces cerevisiae, Escherichia coli and Mycobacterium tuberculosis using an implementation of support vector machines, SVM(light). Analysis of these codon composition signals is instructive in determining features that confer individuality to ribosomal protein genes. Each of the sets of positively charged, negatively charged and small hydrophobic residues, as well as codon bias, contribute to their distinctive codon composition profile. The representation of all these signals is sensitively detected, combined and augmented by the SVMs to perform an accurate classification. Of special mention is an obvious outlier, yeast gene RPL22B, highly homologous to RPL22A but employing very different codon usage, perhaps indicating a non-ribosomal function. Finally, we propose that codon composition be used in combination with other attributes in gene/protein classification by supervised machine learning algorithms. PMID:12034849
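
    To make the idea of a codon composition feature concrete, the sketch below turns a coding sequence into a vector of 64 relative codon frequencies. Treating codon composition as plain in-frame codon frequencies is an assumption made for illustration; the abstract does not spell out the exact feature encoding, and the toy sequence is hypothetical.

```python
from collections import Counter
from itertools import product

# All 64 codons in a fixed order, so every gene maps to a comparable vector.
CODONS = ["".join(bases) for bases in product("ACGT", repeat=3)]

def codon_composition(cds):
    """Return relative frequencies of the 64 codons in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts[codon] for codon in CODONS)  # skips codons with ambiguous bases
    if total == 0:
        return [0.0] * len(CODONS)
    return [counts[codon] / total for codon in CODONS]

# Hypothetical toy sequence: ATG AAA AAA TAA
vector = codon_composition("ATGAAAAAATAA")
print(vector[CODONS.index("AAA")])
```

    Vectors of this kind, one per gene, are the sort of input a supervised classifier such as a support vector machine could be trained on.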

  7. Conserved codon composition of ribosomal protein coding genes in Escherichia coli, Mycobacterium tuberculosis and Saccharomyces cerevisiae: lessons from supervised machine learning in functional genomics

    PubMed Central

    Lin, Kui; Kuang, Yuyu; Joseph, Jeremiah S.; Kolatkar, Prasanna R.

    2002-01-01

    Genomics projects have resulted in a flood of sequence data. Functional annotation currently relies almost exclusively on inter-species sequence comparison and is restricted in cases of limited data from related species and widely divergent sequences with no known homologs. Here, we demonstrate that codon composition, a fusion of codon usage bias and amino acid composition signals, can accurately discriminate, in the absence of sequence homology information, cytoplasmic ribosomal protein genes from all other genes of known function in Saccharomyces cerevisiae, Escherichia coli and Mycobacterium tuberculosis using an implementation of support vector machines, SVMlight. Analysis of these codon composition signals is instructive in determining features that confer individuality to ribosomal protein genes. Each of the sets of positively charged, negatively charged and small hydrophobic residues, as well as codon bias, contribute to their distinctive codon composition profile. The representation of all these signals is sensitively detected, combined and augmented by the SVMs to perform an accurate classification. Of special mention is an obvious outlier, yeast gene RPL22B, highly homologous to RPL22A but employing very different codon usage, perhaps indicating a non-ribosomal function. Finally, we propose that codon composition be used in combination with other attributes in gene/protein classification by supervised machine learning algorithms. PMID:12034849

  8. Empirical evidence of the applicability of functional clustering through gene expression classification.

    PubMed

    Krejník, Milos; Kléma, Jirí

    2012-01-01

    The availability of a great range of prior biological knowledge about the roles and functions of genes and gene-gene interactions allows us to simplify the analysis of gene expression data to make it more robust, compact, and interpretable. Here, we objectively analyze the applicability of functional clustering for the identification of groups of functionally related genes. The analysis is performed in terms of gene expression classification and uses predictive accuracy as an unbiased performance measure. Features of biological samples that originally corresponded to genes are replaced by features that correspond to the centroids of the gene clusters and are then used for classifier learning. Using 10 benchmark data sets, we demonstrate that functional clustering significantly outperforms random clustering without biological relevance. We also show that functional clustering performs comparably to gene expression clustering, which groups genes according to the similarity of their expression profiles. Finally, the suitability of functional clustering as a feature extraction technique is evaluated and discussed. PMID:22291159
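
    The feature replacement described in this abstract can be sketched as follows: per-gene expression values of one sample are collapsed into one value per gene cluster. Taking the centroid to be the arithmetic mean, and the gene and cluster names used here, are assumptions for illustration only.

```python
def centroid_features(sample, clusters):
    """Replace per-gene features with one mean-expression feature per cluster.

    sample:   dict mapping gene name -> expression value for one biological sample
    clusters: dict mapping cluster name -> list of member gene names
    """
    features = {}
    for name, genes in clusters.items():
        values = [sample[gene] for gene in genes if gene in sample]
        if values:  # skip clusters with no measured member gene
            features[name] = sum(values) / len(values)
    return features

# Hypothetical sample and functional clusters.
sample = {"g1": 1.0, "g2": 3.0, "g3": 10.0}
clusters = {"cluster_a": ["g1", "g2"], "cluster_b": ["g3", "g4"], "cluster_c": ["g5"]}
reduced = centroid_features(sample, clusters)
print(reduced)
```

    The reduced feature dictionary, computed for every sample, would then be passed to a classifier in place of the original gene-level features.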

  9. A new rhesus macaque assembly and annotation for next-generation sequencing analyses

    PubMed Central

    2014-01-01

    Background: The rhesus macaque (Macaca mulatta) is a key species for advancing biomedical research. Like all draft mammalian genomes, the draft rhesus assembly (rheMac2) has gaps, sequencing errors and misassemblies that have prevented automated annotation pipelines from functioning correctly. Another rhesus macaque assembly, CR_1.0, is also available but is substantially more fragmented than rheMac2 with smaller contigs and scaffolds. Annotations for these two assemblies are limited in completeness and accuracy. High quality assembly and annotation files are required for a wide range of studies including expression, genetic and evolutionary analyses. Results: We report a new de novo assembly of the rhesus macaque genome (MacaM) that incorporates both the original Sanger sequences used to assemble rheMac2 and new Illumina sequences from the same animal. MacaM has a weighted average (N50) contig size of 64 kilobases, more than twice the size of the rheMac2 assembly and almost five times the size of the CR_1.0 assembly. The MacaM chromosome assembly incorporates information from previously unutilized mapping data and preliminary annotation of scaffolds. Independent assessment of the assemblies using Ion Torrent read alignments indicates that MacaM is more complete and accurate than rheMac2 and CR_1.0. We assembled messenger RNA sequences from several rhesus tissues into transcripts which allowed us to identify a total of 11,712 complete proteins representing 9,524 distinct genes. Using a combination of our assembled rhesus macaque transcripts and human transcripts, we annotated 18,757 transcripts and 16,050 genes with complete coding sequences in the MacaM assembly. Further, we demonstrate that the new annotations provide greatly improved accuracy as compared to the current annotations of rheMac2. Finally, we show that the MacaM genome provides an accurate resource for alignment of reads produced by RNA sequence expression studies. Conclusions: The MacaM assembly and annotation files provide a substantially more complete and accurate representation of the rhesus macaque genome than rheMac2 or CR_1.0 and will serve as an important resource for investigators conducting next-generation sequencing studies with nonhuman primates. Reviewers: This article was reviewed by Dr. Lutz Walter, Dr. Soojin Yi and Dr. Kateryna Makova. PMID:25319552
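
    The N50 statistic quoted in this abstract can be computed with a few lines. The sketch below uses the common definition (the length of the contig at which the running total of contigs, sorted from longest to shortest, first reaches half of the assembly size); the contig lengths are made-up toy values, not data from the MacaM assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")

# Toy contig lengths in kilobases, for illustration only.
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(n50(toy_contigs))
```

    Because long contigs dominate the running total, N50 rewards contiguity more than a plain average of contig lengths does.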

  10. Gastrointestinal hormone research - with a Scandinavian annotation.

    PubMed

    Rehfeld, Jens F

    2015-06-01

    Gastrointestinal hormones are peptides released from neuroendocrine cells in the digestive tract. More than 30 hormone genes are currently known to be expressed in the gut, which makes it the largest hormone-producing organ in the body. Modern biology makes it feasible to conceive the hormones under five headings: The structural homology groups a majority of the hormones into nine families, each of which is assumed to originate from one ancestral gene. The individual hormone gene often has multiple phenotypes due to alternative splicing, tandem organization or differentiated posttranslational maturation of the prohormone. By a combination of these mechanisms, more than 100 different hormonally active peptides are released from the gut. Gut hormone genes are also widely expressed outside the gut, some only in extraintestinal endocrine cells and cerebral or peripheral neurons but others also in other cell types. The extraintestinal cells may release different bioactive fragments of the same prohormone due to cell-specific processing pathways. Moreover, endocrine cells, neurons, cancer cells and, for instance, spermatozoa secrete gut peptides in different ways, so the same peptide may act as a blood-borne hormone, a neurotransmitter, a local growth factor or a fertility factor. The targets of gastrointestinal hormones are specific G-protein-coupled receptors that are expressed in the cell membranes also outside the digestive tract. Thus, gut hormones not only regulate digestive functions, but also constitute regulatory systems operating in the whole organism. This overview of gut hormone biology is supplemented with an annotation on some Scandinavian contributions to gastrointestinal hormone research. PMID:25786560

  11. Identification of functional apple scab resistance gene promoters.

    PubMed

    Silfverberg-Dilworth, E; Besse, S; Paris, R; Belfanti, E; Tartarini, S; Sansavini, S; Patocchi, A; Gessler, C

    2005-04-01

    Apple scab (Venturia inaequalis) is one of the most damaging diseases affecting commercial apple production. Some wild Malus species possess resistance against apple scab. One gene, HcrVf2, from a cluster of three genes derived from the wild apple Malus floribunda clone 821, has recently been shown to confer resistance to apple scab when transferred into a scab-susceptible apple variety. For this proof-of-function experiment, the use of the 35S promoter from Cauliflower mosaic virus was reliable and appropriate. However, in order to reduce the amount of non-plant DNA in genetically modified apple to a minimum, with the aim of increasing genetically modified organism acceptability, these genes would ideally be regulated by their own promoters. In this study, sequences from the promoter region of the three members of the HcrVf gene family were compared. Promoter constructs containing progressive 5' deletions were prepared and used for functional analyses. Qualitative assessment confirmed promoter activity in apple. Quantitative promoter comparison was carried out in tobacco (Nicotiana glutinosa) and led to the identification of several promoter regions with different strengths from a basal level to half the strength of the 35S promoter from Cauliflower mosaic virus. PMID:15726316

  12. Functional hierarchy and phenotypic suppression among Drosophila homeotic genes: the labial and empty spiracles genes.

    PubMed Central

    Macías, A; Morata, G

    1996-01-01

    The Drosophila homeotic cluster (HOM-C) is made up of eight genes, which specify the identity of cephalic, thoracic and abdominal segments. These genes can be ordered in a hierarchy which correlates with their position along the 5'-3' transcriptional direction. When they are absent, thoracic and abdominal body segments develop the same 'ground' pattern, which is thoracic-like but also includes cephalic structures (sclerotic plates). We find that these plates are specified by the homeobox gene empty spiracles (ems) which is not a member of the HOM-C and which is expressed in all body segments. ems mutations, however, only produce defects in anterior head structures and the posterior spiracles. The ems product has the potential to induce sclerotic plates but this potential is suppressed by any of the HOM-C genes, including the labial gene, which we show to be the lowest ranking of the HOM-C hierarchy. This suppression does not occur at the transcriptional or translational level because the ems function is suppressed in cells containing the ems product. Thus, this appears to be the first case of phenotypic suppression operating in normal development. We propose that ems was originally a member of the HOM-C which escaped from the complex and has also acquired new functions during evolution. PMID:8617208

  13. The other side of comparative genomics: genes with no orthologs between the cow and other mammalian species

    Microsoft Academic Search

    Raffaele Mazza; Francesco Strozzi; Andrea Caprera; Paolo Ajmone-Marsan; John L Williams

    2009-01-01

    BACKGROUND: With the rapid growth in the availability of genome sequence data, the automated identification of orthologous genes between species (orthologs) is of fundamental importance to facilitate functional annotation and studies on comparative and evolutionary genomics. Genes with no apparent orthologs between the bovine and human genome may be responsible for major differences between the species, however, such genes are

  14. Semantic annotation of mutable data.

    PubMed

    Morris, Robert A; Dou, Lei; Hanken, James; Kelly, Maureen; Lowery, David B; Ludäscher, Bertram; Macklin, James A; Morris, Paul J

    2013-01-01

    Electronic annotation of scientific data is very similar to annotation of documents. Both types of annotation amplify the original object, add related knowledge to it, and dispute or support assertions in it. In each case, annotation is a framework for discourse about the original object, and, in each case, an annotation needs to clearly identify its scope and its own terminology. However, electronic annotation of data differs from annotation of documents: the content of the annotations, including expectations and supporting evidence, is more often shared among members of networks. Any consequent actions taken by the holders of the annotated data could be shared as well. But even those current annotation systems that admit data as their subject often make it difficult or impossible to annotate at fine-enough granularity to use the results in this way for data quality control. We address these kinds of issues by offering simple extensions to an existing annotation ontology and describe how the results support an interest-based distribution of annotations. We are using the result to design and deploy a platform that supports annotation services overlaid on networks of distributed data, with particular application to data quality control. Our initial instance supports a set of natural science collection metadata services. An important application is the support for data quality control and provision of missing data. A previous proof of concept demonstrated such use based on data annotations modeled with XML-Schema. PMID:24223697

  15. Genome Holography: Deciphering Function-Form Motifs from Gene Expression Data

    E-print Network

    Jacob, Eshel Ben

    Asaf Madi [...] information from the vast amount of raw gene-expression data obtained from the microarray measurements [...] here is based on investigating the gene-gene correlations. We analyze a database of gene expression

  16. Efficient assembly and annotation of the transcriptome of catfish by RNA-Seq analysis of a doubled haploid homozygote

    PubMed Central

    2012-01-01

    Background: Upon the completion of whole genome sequencing, thorough genome annotation that associates genome sequences with biological meanings is essential. Genome annotation depends on the availability of transcript information as well as orthology information. In teleost fish, genome annotation is seriously hindered by genome duplication. Because of gene duplications, one cannot establish orthologies simply by homology comparisons. Rather, intensive phylogenetic analysis or structural analysis of orthologies is required for the identification of genes. To conduct phylogenetic analysis and orthology analysis, full-length transcripts are essential. Generation of large numbers of full-length transcripts using traditional transcript sequencing is very difficult and extremely costly. Results: In this work, we took advantage of a doubled haploid catfish, which has two sets of identical chromosomes and should therefore, in theory, carry no allelic variation. As such, transcript sequences generated from next-generation sequencing can be favorably assembled into full-length transcripts. Deep sequencing of the doubled haploid channel catfish transcriptome was performed using the Illumina HiSeq 2000 platform, yielding over 300 million high-quality trimmed reads totaling 27 Gbp. Assembly of these reads generated 370,798 non-redundant transcript-derived contigs. Functional annotation of the assembly allowed identification of 25,144 unique protein-encoding genes. A total of 2,659 unique genes were identified as putative duplicated genes in the catfish genome because the assembly of the corresponding transcripts harbored PSVs or MSVs (in the form of pseudo-SNPs in the assembly). Of the 25,144 contigs with unique protein hits, around 20,000 contigs matched 50% of the length of reference proteins, and over 14,000 transcripts were identified as full-length with complete open reading frames. The characterization of consensus sequences surrounding the start codon and the stop codon confirmed the correct assembly of the full-length transcripts. Conclusions: The large set of transcripts assembled in this study is the most comprehensive set of genome resources ever developed from catfish. It will provide much-needed resources for functional genome research in catfish, serving as a reference transcriptome for genome annotation, analysis of gene duplication, gene family structures, and digital gene expression analysis. The putative set of duplicated genes provides a starting point for genome-scale analysis of gene duplication in the catfish genome, and should be a valuable resource for comparative genome analysis, genome evolution, and genome function studies. PMID:23127152
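
    The abstract above describes transcripts as full-length when they carry a complete open reading frame. As a rough illustration of that idea only, the following Python sketch scans the forward strand of a contig for the longest ATG-to-stop ORF. It is not the pipeline used in the study, which relied on protein hits; the function name `longest_complete_orf` and the `min_codons` threshold are hypothetical.

```python
# Hypothetical helper for illustration; not the method used in the study.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_complete_orf(seq, min_codons=30):
    """Return (start, end) of the longest ATG...stop ORF on the forward
    strand, or None. `end` is exclusive and includes the stop codon;
    `min_codons` counts the stop codon as well."""
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                long_enough = (end - start) // 3 >= min_codons
                if long_enough and (best is None or end - start > best[1] - best[0]):
                    best = (start, end)
                start = None  # look for the next ORF in this frame
    return best
```

    A contig for which this returns None lacks either a start or an in-frame stop codon and would not count as complete under this simplified definition. The reverse strand is ignored here for brevity.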

  17. Remote control of gene function by local translation.

    PubMed

    Jung, Hosung; Gkogkas, Christos G; Sonenberg, Nahum; Holt, Christine E

    2014-03-27

    The subcellular position of a protein is a key determinant of its function. Mounting evidence indicates that RNA localization, where specific mRNAs are transported subcellularly and subsequently translated in response to localized signals, is an evolutionarily conserved mechanism to control protein localization. On-site synthesis confers novel signaling properties to a protein and helps to maintain local proteome homeostasis. Local translation plays particularly important roles in distal neuronal compartments, and dysregulated RNA localization and translation cause defects in neuronal wiring and survival. Here, we discuss key findings in this area and possible implications of this adaptable and swift mechanism for spatial control of gene function. PMID:24679524

  18. Functional gene variants of CYP3A4.

    PubMed

    Werk, A N; Cascorbi, I

    2014-09-01

    Cytochrome P450 3A4 (CYP3A4) is involved in the metabolism of more drugs in clinical use than any other foreign compound-metabolizing enzyme in humans. Recently, increasing evidence has been found showing that variants in the CYP3A4 gene have functional significance and, in rare cases, lead to loss of activity, implying tremendous consequences for patients. This review article highlights the functional consequences of all CYP3A4 variants recognized by the Human Cytochrome P450 (CYP) Allele Nomenclature Database. PMID:24926778

  19. Comparative analyses of stress-responsive genes in Arabidopsis thaliana: insight from genomic data mining, functional enrichment, pathway analysis and phenomics.

    PubMed

    Naika, Mahantesha; Shameer, Khader; Sowdhamini, Ramanathan

    2013-07-01

    Biotic and abiotic stresses adversely affect agriculture by reducing crop growth and productivity worldwide. To investigate the abiotic stress-responsive genes in Arabidopsis thaliana, we compiled a dataset of stress signals and differentially upregulated genes (>= 2.5-fold change) from the Stress-responsive transcription Factors DataBase (STIFDB), with an additional set of stress signals and genes curated from PubMed and Gene Expression Omnibus. A dataset of 3091 genes differentially upregulated due to 14 different stress signals (abscisic acid, aluminum, cold, cold-drought-salt, dehydration, drought, heat, iron, light, NaCl, osmotic stress, oxidative stress, UV-B and wounding) was curated and used for the analysis. Details about stress-responsive enriched genes and their association with stress signals can be obtained from the STIFDB2 database. The gene-stress-signal data were analyzed using an enrichment-based meta-analysis framework consisting of two different ontologies (Gene Ontology and Plant Ontology), biological pathway and functional domain annotations. We found several shared and distinct biological processes, cellular components and molecular functions associated with stress-responsive genes. Pathway analysis revealed that stress-responsive genes perturbed the pathways under the "Metabolic pathways" category. We also found several shared and stress-signal-specific protein domains, suggesting functional mechanisms regulating stress response. Phenomic characteristics of abiotic stress-responsive genes were ascertained for several stresses and found to be shared by multiple stresses in both the anatomy and temporal categories of Plant Ontology. We found several constitutive stress-responsive genes that are differentially upregulated due to perturbation of different stress signals, for example, a gene (AT1G68440) involved in phenylpropanoid metabolism and polyamine catabolism that is responsive to seven different stress signals. We also performed structure-function prediction of five genes responsive to multiple abiotic stress signals. We envisage that the results of our analysis, which provide insight into the functional repertoire, metabolic pathways and phenomic characteristics that are common to or specifically associated with stress signals, will help to understand the abiotic stress regulome in Arabidopsis thaliana and may also help to develop improved plant varieties, using molecular breeding and genetic engineering techniques, that are rapidly stress-responsive and tolerant. PMID:23645342
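
    The abstract above does not state which statistic its enrichment-based framework uses. A common choice for term enrichment is the one-sided hypergeometric test, sketched below in Python as an assumption rather than as the authors' method. The function name `enrichment_p_value` and its parameter names are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """One-sided hypergeometric test.

    population: number of genes in the background set
    annotated:  background genes carrying the term of interest
    sample:     genes in the stress-responsive list
    hits:       genes in that list carrying the term
    Returns P(X >= hits) when `sample` genes are drawn without replacement.
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability mass of every outcome at least as extreme as `hits`.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total
```

    With made-up counts, `enrichment_p_value(20000, 150, 300, 12)` would give the probability of seeing 12 or more annotated genes among 300 by chance. In practice the p-values for many terms would also need a multiple-testing correction.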

  20. The Vertebrate Genome Annotation browser 10 years on

    PubMed Central

    Harrow, Jennifer L.; Steward, Charles A.; Frankish, Adam; Gilbert, James G.; Gonzalez, Jose M.; Loveland, Jane E.; Mudge, Jonathan; Sheppard, Dan; Thomas, Mark; Trevanion, Stephen; Wilming, Laurens G.

    2014-01-01

    The Vertebrate Genome Annotation (VEGA) database (http://vega.sanger.ac.uk), initially designed as a community resource for browsing manual annotation of the human genome project, now contains five reference genomes (human, mouse, zebrafish, pig and rat). Its introduction pages have been redesigned to enable the user to easily navigate between whole genomes and smaller multi-species haplotypic regions of interest such as the major histocompatibility complex. The VEGA browser is unique in that annotation is updated via the Human And Vertebrate Analysis aNd Annotation (HAVANA) update track every 2 weeks, allowing single gene updates to be made publicly available to the research community quickly. The user can now access different haplotypic subregions more easily, such as those from the non-obese diabetic mouse, and display them in a more intuitive way using the comparative tools. We also highlight how the user can browse manually annotated updated patches from the Genome Reference Consortium (GRC). PMID:24316575

  1. Semantic annotation for biological information retrieval system.

    PubMed

    Oshaiba, Mohamed Marouf Z; El Houby, Enas M F; Salah, Akram

    2015-01-01

    Online literature is growing at a tremendous rate, and biology is one of the fastest-growing domains. Biological researchers have difficulty finding what they are searching for effectively and efficiently. The aim of this research is to find documents that contain any combination of biological process and/or molecular function and/or cellular component. This research proposes a framework that helps researchers retrieve meaningful documents related to their asserted terms based on Gene Ontology (GO). The system utilizes GO by semantically decomposing it into three subontologies (cellular component, biological process, and molecular function). The researcher has the flexibility to choose search terms from any combination of the three subontologies. Document annotation is used in this research to create an index of biological terms in documents and so speed up the search process. Query expansion is used to infer terms semantically related to the asserted terms; it increases the number of meaningful search results by using term synonyms and term relationships. The system uses a ranking method to order the retrieved documents based on the ranking weights. The proposed system meets researchers' need to find documents that fit the asserted terms semantically. PMID:25737720
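
    The three steps named in the abstract above (a term index built by document annotation, query expansion through synonyms and term relationships, and weighted ranking) can be sketched in Python as follows. This is a minimal illustration, not the published system: the ontology fragment, the 1.0 and 0.5 weights, and the function names are all hypothetical.

```python
from collections import defaultdict

# Toy, hypothetical ontology fragment (not taken from GO release files).
SYNONYMS = {"apoptotic process": ["apoptosis", "programmed cell death"]}
CHILDREN = {"cell death": ["apoptotic process"]}


def expand(term):
    """Return the term plus its synonyms and descendant terms."""
    seen, stack = [], [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        stack.extend(SYNONYMS.get(current, []))
        stack.extend(CHILDREN.get(current, []))
    return seen


def build_index(docs, vocabulary):
    """Annotate documents: map each term to {doc_id: occurrence count}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        lowered = text.lower()
        for term in vocabulary:
            count = lowered.count(term)  # naive substring matching
            if count:
                index[term][doc_id] = count
    return index


def search(query, index):
    """Rank documents; exact query matches weigh more than expanded terms."""
    scores = defaultdict(float)
    for term in expand(query):
        weight = 1.0 if term == query else 0.5  # assumed weights
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += weight * count
    return sorted(scores, key=lambda d: (-scores[d], d))
```

    Building the index once up front is what makes later searches fast, since a query only has to look up its expanded terms. A document that mentions only "apoptosis" is still retrieved for the query "cell death", but ranks below documents that contain the query term itself.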

  2. Sugarcane functional genomics: gene discovery for agronomic trait development.

    PubMed

    Menossi, M; Silva-Filho, M C; Vincentz, M; Van-Sluys, M-A; Souza, G M

    2008-01-01

    Sugarcane is a highly productive crop used for centuries as the main source of sugar and recently to produce ethanol, a renewable bio-fuel energy source. There is increased interest in this crop due to the impending need to decrease fossil fuel usage. Sugarcane has a highly polyploid genome. Expressed sequence tag (EST) sequencing has significantly contributed to gene discovery and expression studies used to associate function with sugarcane genes. A significant amount of data exists on regulatory events controlling responses to herbivory, drought, and phosphate deficiency, which are important constraints on yield, and to endophytic bacteria, which are highly beneficial. The means to reduce drought, phosphate deficiency, and herbivory by the sugarcane borer have a negative impact on the environment. Improved tolerance for these constraints is being sought. Sugarcane's ability to accumulate sucrose up to 16% of its culm dry weight is a challenge for genetic manipulation. Genome-based technology such as cDNA microarray data indicates genes associated with sugar content that may be used to develop new varieties improved for sucrose content or for traits that restrict the expansion of the cultivated land. The genes can also be used as molecular markers of agronomic traits in traditional breeding programs. PMID:18273390

  3. Tmc gene therapy restores auditory function in deaf mice.

    PubMed

    Askew, Charles; Rochat, Cylia; Pan, Bifeng; Asai, Yukako; Ahmed, Hena; Child, Erin; Schneider, Bernard L; Aebischer, Patrick; Holt, Jeffrey R

    2015-07-01

    Genetic hearing loss accounts for up to 50% of prelingual deafness worldwide, yet there are no biologic treatments currently available. To investigate gene therapy as a potential biologic strategy for restoration of auditory function in patients with genetic hearing loss, we tested a gene augmentation approach in mouse models of genetic deafness. We focused on DFNB7/11 and DFNA36, which are autosomal recessive and dominant deafnesses, respectively, caused by mutations in transmembrane channel-like 1 (TMC1). Mice that carry targeted deletion of Tmc1 or a dominant Tmc1 point mutation, known as Beethoven, are good models for human DFNB7/11 and DFNA36. We screened several adeno-associated viral (AAV) serotypes and promoters and identified AAV2/1 and the chicken β-actin (Cba) promoter as an efficient combination for driving the expression of exogenous Tmc1 in inner hair cells in vivo. Exogenous Tmc1 or its closely related ortholog, Tmc2, were capable of restoring sensory transduction, auditory brainstem responses, and acoustic startle reflexes in otherwise deaf mice, suggesting that gene augmentation with Tmc1 or Tmc2 is well suited for further development as a strategy for restoration of auditory function in deaf patients who carry TMC1 mutations. PMID:26157030