Note: This page contains sample records for the topic functional gene annotation from Science.gov.
While these samples are representative of the content of Science.gov,
they are not comprehensive nor are they the most current set.
We encourage you to perform a real-time search of Science.gov
to obtain the most current and comprehensive results.
Last update: November 12, 2013.
1

Experience report: issues in comparing gene function annotation in text  

Microsoft Academic Search

Accurately annotating the function of genes is one of the most important tasks in molecular biology and medical sciences. Next-generation sequencing technology has made whole-genome sequencing possible at a fraction of the cost of traditional sequencing technology. As a result, the amount of sequence data has been growing very rapidly,

Youngik Yang; Sun Kim

2009-01-01

2

Gene fusions and gene duplications: relevance to genomic annotation and functional analysis  

Microsoft Academic Search

BACKGROUND: Escherichia coli, a model organism, provides information for annotation of other genomes. Our analysis of its genome has shown that proteins encoded by fused genes need special attention. Such composite (multimodular) proteins consist of two or more components (modules) encoding distinct functions. Multimodular proteins have been found to complicate both annotation and generation of sequence-similar groups. Previous work

Margrethe H Serres; Monica Riley

2005-01-01

3

The Evolution of Gene Annotation  

PubMed Central

Complete and accurate annotation of gene function is an essential starting point for genome interpretation and a host of systems and synthetic biology endeavors. Detecting errors in existing annotation now has an important new tool.

Kasif, Simon; Steffen, Martin

2010-01-01

4

Combining evidence, biomedical literature and statistical dependence: new insights for functional annotation of gene sets  

PubMed Central

Background Large-scale genomic studies based on transcriptome technologies provide clusters of genes that need to be functionally annotated. The Gene Ontology (GO) implements a controlled vocabulary organised into three hierarchies: cellular components, molecular functions and biological processes. This terminology allows a coherent and consistent description of the knowledge about gene functions. The GO terms related to genes come primarily from semi-automatic annotations made by trained biologists (annotation based on evidence) or text-mining of the published scientific literature (literature profiling). Results We report an original functional annotation method based on a combination of evidence and literature that overcomes the weaknesses and the limitations of each approach. It relies on the Gene Ontology Annotation database (GOA Human) and the PubGene biomedical literature index. We support these annotations with statistically associated GO terms and retrieve associative relations across the three GO hierarchies to emphasise the major pathways involved by a gene cluster. Both annotation methods and associative relations were quantitatively evaluated with a reference set of 7397 genes and a multi-cluster study of 14 clusters. We also validated the biological appropriateness of our hybrid method with the annotation of a single gene (cdc2) and that of a down-regulated cluster of 37 genes identified by a transcriptome study of an in vitro enterocyte differentiation model (CaCo-2 cells). Conclusion The combination of both approaches is more informative than either separate approach: literature mining can enrich an annotation based only on evidence. Text-mining of the literature can also find valuable associated MEDLINE references that confirm the relevance of the annotation. Eventually, GO terms networks can be built with associative relations in order to highlight cooperative and competitive pathways and their connected molecular functions.

Aubry, Marc; Monnier, Annabelle; Chicault, Celine; de Tayrac, Marie; Galibert, Marie-Dominique; Burgun, Anita; Mosser, Jean

2006-01-01

5

The Duplicated Genes Database: Identification and Functional Annotation of Co-Localised Duplicated Genes across Genomes  

PubMed Central

Background There has been a surge in studies linking genome structure and gene expression, with special focus on duplicated genes. Although initially duplicated from the same sequence, duplicated genes can diverge strongly over evolution and take on different functions or regulated expression. However, information on the function and expression of duplicated genes remains sparse. Identifying groups of duplicated genes in different genomes and characterizing their expression and function would therefore be of great interest to the research community. The ‘Duplicated Genes Database’ (DGD) was developed for this purpose. Methodology Nine species were included in the DGD. For each species, BLAST analyses were conducted on peptide sequences corresponding to the genes mapped on the same chromosome. Groups of duplicated genes were defined based on these pairwise BLAST comparisons and the genomic location of the genes. For each group, Pearson correlations between gene expression data and semantic similarities between functional GO annotations were also computed when the relevant information was available. Conclusions The Duplicated Genes Database provides a list of co-localised and duplicated genes for several species with the available gene co-expression level and semantic similarity value of functional annotation. Adding these data to the groups of duplicated genes provides biological information that can prove useful to gene expression analyses. The Duplicated Genes Database can be freely accessed through the DGD website at http://dgd.genouest.org.

Bretaudeau, Anthony; Sallou, Olivier; Diot, Christian; Demeure, Olivier; Lecerf, Frederic

2012-01-01
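
The co-expression measure mentioned in the Duplicated Genes Database abstract above can be illustrated with a minimal sketch. It assumes that each gene in a group has an expression profile measured over the same conditions, and that a group is summarised by the mean of its pairwise Pearson correlations. The abstract does not say how the DGD aggregates correlations, so the averaging step, the function names and the expression values are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(vx * vy)


def group_coexpression(profiles):
    """Mean pairwise Pearson correlation within one group of duplicated genes.

    profiles maps a gene identifier to its list of expression values.
    Averaging over pairs is an assumption of this sketch, not the DGD's
    documented procedure.
    """
    pairs = list(combinations(sorted(profiles), 2))
    if not pairs:
        raise ValueError("a group needs at least two genes")
    return sum(pearson(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)


# Hypothetical expression values for a three-gene group.
group = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
score = group_coexpression(group)
```

Here geneA and geneB are perfectly correlated while geneC is anti-correlated with both, so the group score is negative even though one pair is tightly co-expressed. That is one reason per-pair values may be more informative than a single group average.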

6

Combining evidence, biomedical literature and statistical dependence: new insights for functional annotation of gene sets  

Microsoft Academic Search

BACKGROUND: Large-scale genomic studies based on transcriptome technologies provide clusters of genes that need to be functionally annotated. The Gene Ontology (GO) implements a controlled vocabulary organised into three hierarchies: cellular components, molecular functions and biological processes. This terminology allows a coherent and consistent description of the knowledge about gene functions. The GO terms related to genes come primarily from

Marc Aubry; Annabelle Monnier; Celine Chicault; Marie De Tayrac; Marie-dominique Galibert; Anita Burgun; Jean Mosser

2006-01-01

7

Expression profiling of hypothetical genes in Desulfovibrio vulgaris leads to improved functional annotation  

PubMed Central

Hypothetical (HyP) and conserved HyP genes account for >30% of sequenced bacterial genomes. For the sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough, 347 of the 3634 genes were annotated as conserved HyP (9.5%) along with 887 HyP genes (24.4%). Given the large fraction of the genome, it is plausible that some of these genes serve critical cellular roles. The study goals were to determine which genes were expressed and provide a more functionally based annotation. To accomplish this, expression profiles of 1234 HyP and conserved genes were used from transcriptomic datasets of 11 environmental stresses, complemented with shotgun LC–MS/MS and AMT tag proteomic data. Genes were divided into putatively polycistronic operons and those predicted to be monocistronic, then classified by basal expression levels and grouped according to changes in expression for one or multiple stresses. One thousand two hundred and twelve of these genes were transcribed with 786 producing detectable proteins. There was no evidence for expression of 17 predicted genes. Except for the latter, monocistronic gene annotation was expanded using the above criteria along with matching Clusters of Orthologous Groups. Polycistronic genes were annotated in the same manner with inferences from their proximity to more confidently annotated genes. Two targeted deletion mutants were used as test cases to determine the relevance of the inferred functional annotations.

Elias, Dwayne A.; Mukhopadhyay, Aindrila; Joachimiak, Marcin P.; Drury, Elliott C.; Redding, Alyssa M.; Yen, Huei-Che B.; Fields, Matthew W.; Hazen, Terry C.; Arkin, Adam P.; Keasling, Jay D.; Wall, Judy D.

2009-01-01

8

GeneFarm, structural and functional annotation of Arabidopsis gene and protein families by a network of experts  

PubMed Central

Genomic projects heavily depend on genome annotations and are limited by the current deficiencies in the published predictions of gene structure and function. It follows that improved annotation will allow better data mining of genomes, and more secure planning and design of experiments. The purpose of the GeneFarm project is to obtain homogeneous, reliable, documented and traceable annotations for Arabidopsis nuclear genes and gene products, and to enter them into an added-value database. This re-annotation project is being performed exhaustively on every member of each gene family. Performing a family-wide annotation makes the task easier and more efficient than a gene-by-gene approach since many features obtained for one gene can be extrapolated to some or all the other genes of a family. A complete annotation procedure based on the most efficient prediction tools available is being used by 16 partner laboratories, each contributing annotated families from its field of expertise. A database, named GeneFarm, and an associated user-friendly interface to query the annotations have been developed. More than 3000 genes distributed over 300 families have been annotated and are available at http://genoplante-info.infobiogen.fr/Genefarm/. Furthermore, collaboration with the Swiss Institute of Bioinformatics is underway to integrate the GeneFarm data into the protein knowledgebase Swiss-Prot.

Aubourg, Sebastien; Brunaud, Veronique; Bruyere, Clemence; Cock, Mark; Cooke, Richard; Cottet, Annick; Couloux, Arnaud; Dehais, Patrice; Deleage, Gilbert; Duclert, Aymeric; Echeverria, Manuel; Eschbach, Aimee; Falconet, Denis; Filippi, Ghislain; Gaspin, Christine; Geourjon, Christophe; Grienenberger, Jean-Michel; Houlne, Guy; Jamet, Elisabeth; Lechauve, Frederic; Leleu, Olivier; Leroy, Philippe; Mache, Regis; Meyer, Christian; Nedjari, Hafed; Negrutiu, Ioan; Orsini, Valerie; Peyretaillade, Eric; Pommier, Cyril; Raes, Jeroen; Risler, Jean-Loup; Riviere, Stephane; Rombauts, Stephane; Rouze, Pierre; Schneider, Michel; Schwob, Philippe; Small, Ian; Soumayet-Kampetenga, Ghislain; Stankovski, Darko; Toffano, Claire; Tognolli, Michael; Caboche, Michel; Lecharny, Alain

2005-01-01

9

GeneFarm, structural and functional annotation of Arabidopsis gene and protein families by a network of experts.  

PubMed

Genomic projects heavily depend on genome annotations and are limited by the current deficiencies in the published predictions of gene structure and function. It follows that improved annotation will allow better data mining of genomes, and more secure planning and design of experiments. The purpose of the GeneFarm project is to obtain homogeneous, reliable, documented and traceable annotations for Arabidopsis nuclear genes and gene products, and to enter them into an added-value database. This re-annotation project is being performed exhaustively on every member of each gene family. Performing a family-wide annotation makes the task easier and more efficient than a gene-by-gene approach since many features obtained for one gene can be extrapolated to some or all the other genes of a family. A complete annotation procedure based on the most efficient prediction tools available is being used by 16 partner laboratories, each contributing annotated families from its field of expertise. A database, named GeneFarm, and an associated user-friendly interface to query the annotations have been developed. More than 3000 genes distributed over 300 families have been annotated and are available at http://genoplante-info.infobiogen.fr/Genefarm/. Furthermore, collaboration with the Swiss Institute of Bioinformatics is underway to integrate the GeneFarm data into the protein knowledgebase Swiss-Prot. PMID:15608279

Aubourg, Sébastien; Brunaud, Véronique; Bruyère, Clémence; Cock, Mark; Cooke, Richard; Cottet, Annick; Couloux, Arnaud; Déhais, Patrice; Deléage, Gilbert; Duclert, Aymeric; Echeverria, Manuel; Eschbach, Aimée; Falconet, Denis; Filippi, Ghislain; Gaspin, Christine; Geourjon, Christophe; Grienenberger, Jean-Michel; Houlné, Guy; Jamet, Elisabeth; Lechauve, Frédéric; Leleu, Olivier; Leroy, Philippe; Mache, Régis; Meyer, Christian; Nedjari, Hafed; Negrutiu, Ioan; Orsini, Valérie; Peyretaillade, Eric; Pommier, Cyril; Raes, Jeroen; Risler, Jean-Loup; Rivière, Stéphane; Rombauts, Stéphane; Rouzé, Pierre; Schneider, Michel; Schwob, Philippe; Small, Ian; Soumayet-Kampetenga, Ghislain; Stankovski, Darko; Toffano, Claire; Tognolli, Michael; Caboche, Michel; Lecharny, Alain

2005-01-01

10

DAVID Knowledgebase: a gene-centered database integrating heterogeneous gene annotation resources to facilitate high-throughput gene functional analysis  

PubMed Central

Background Due to the complex and distributed nature of biological research, our current biological knowledge is spread over many redundant annotation databases maintained by many independent groups. Analysts usually need to visit many of these bioinformatics databases in order to integrate comprehensive annotation information for their genes, which becomes one of the bottlenecks, particularly for the analytic task associated with a large gene list. Thus, a highly centralized and ready-to-use gene-annotation knowledgebase is in demand for high throughput gene functional analysis. Description The DAVID Knowledgebase is built around the DAVID Gene Concept, a single-linkage method to agglomerate tens of millions of gene/protein identifiers from a variety of public genomic resources into DAVID gene clusters. The grouping of such identifiers improves the cross-reference capability, particularly across NCBI and UniProt systems, enabling more than 40 publicly available functional annotation sources to be comprehensively integrated and centralized by the DAVID gene clusters. The simple, pair-wise, text format files which make up the DAVID Knowledgebase are freely downloadable for various data analysis uses. In addition, a well organized web interface allows users to query different types of heterogeneous annotations in a high-throughput manner. Conclusion The DAVID Knowledgebase is designed to facilitate high throughput gene functional analysis. For a given gene list, it not only provides the quick accessibility to a wide range of heterogeneous annotation data in a centralized location, but also enriches the level of biological information for an individual gene. Moreover, the entire DAVID Knowledgebase is freely downloadable or searchable at .

Sherman, Brad T; Huang, Da Wei; Tan, Qina; Guo, Yongjian; Bour, Stephan; Liu, David; Stephens, Robert; Baseler, Michael W; Lane, H Clifford; Lempicki, Richard A

2007-01-01
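
The single-linkage grouping of identifiers described in the DAVID Knowledgebase abstract above can be sketched with a union-find structure: two identifiers end up in the same cluster whenever a chain of cross-references connects them. This is only an illustration of the general technique. The function name, the input format (pairs of cross-referenced identifiers) and the example identifiers are assumptions, and the actual DAVID Gene Concept procedure may differ.

```python
def cluster_identifiers(pairs):
    """Group identifiers by single linkage over cross-reference pairs.

    pairs is an iterable of (id_a, id_b) tuples, each stating that the two
    identifiers refer to the same gene. Returns a sorted list of sorted
    clusters.
    """
    parent = {}

    def find(x):
        # Register unseen identifiers as their own root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for ident in list(parent):
        clusters.setdefault(find(ident), set()).add(ident)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical cross-references between identifier systems.
xrefs = [
    ("NCBI:1", "UP:A"),
    ("UP:A", "ENS:X"),
    ("NCBI:2", "UP:B"),
]
gene_clusters = cluster_identifiers(xrefs)
```

In this example "NCBI:1" and "ENS:X" are never paired directly, yet they share a cluster because both are linked to "UP:A". This transitive behaviour is what makes single linkage convenient for cross-referencing, and it is also why a single wrong cross-reference can merge two unrelated clusters.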

11

Algal Functional Annotation Tool: a web-based analysis suite to functionally interpret large gene lists using integrated annotation and expression data  

PubMed Central

Background Progress in genome sequencing is proceeding at an exponential pace, and several new algal genomes are becoming available every year. One of the challenges facing the community is the association of protein sequences encoded in the genomes with biological function. While most genome assembly projects generate annotations for predicted protein sequences, they are usually limited and integrate functional terms from a limited number of databases. Another challenge is the use of annotations to interpret large lists of 'interesting' genes generated by genome-scale datasets. Previously, these gene lists had to be analyzed across several independent biological databases, often on a gene-by-gene basis. In contrast, several annotation databases, such as DAVID, integrate data from multiple functional databases and reveal underlying biological themes of large gene lists. While several such databases have been constructed for animals, none is currently available for the study of algae. Due to renewed interest in algae as potential sources of biofuels and the emergence of multiple algal genome sequences, a significant need has arisen for such a database to process the growing compendiums of algal genomic data. Description The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of genes on KEGG pathway maps and batch gene identifier conversion. Conclusions The Algal Functional Annotation Tool aims to provide an integrated data-mining environment for algal genomics by combining data from multiple annotation databases into a centralized tool. This site is designed to expedite the process of functional annotation and the interpretation of gene lists, such as those derived from high-throughput RNA-seq experiments. The tool is publicly available at http://pathways.mcdb.ucla.edu.

2011-01-01
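
The term-enrichment step mentioned in the Algal Functional Annotation Tool abstract above is commonly computed with a one-sided hypergeometric test. The abstract does not name the statistic the tool uses, so the sketch below is an assumption about one reasonable choice rather than a description of the tool itself. The function name, parameters and counts are hypothetical.

```python
from math import comb


def enrichment_pvalue(background, annotated, list_size, hits):
    """One-sided hypergeometric p-value P(X >= hits).

    background: number of genes in the reference genome
    annotated:  number of background genes carrying the functional term
    list_size:  number of genes in the user's gene list
    hits:       number of genes in the list carrying the term
    """
    if not (0 <= annotated <= background and 0 <= list_size <= background):
        raise ValueError("annotated and list_size must lie in [0, background]")
    upper = min(annotated, list_size)
    if not 0 <= hits <= upper:
        raise ValueError("hits must lie in [0, min(annotated, list_size)]")
    # math.comb returns 0 when more items are requested than are available,
    # so impossible configurations contribute nothing to the sum.
    tail = sum(
        comb(annotated, i) * comb(background - annotated, list_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / comb(background, list_size)


# Hypothetical counts: 3 of 10 background genes carry the term,
# and all 3 genes in the submitted list carry it.
p = enrichment_pvalue(background=10, annotated=3, list_size=3, hits=3)
```

In practice a tool of this kind tests many terms at once, so the raw p-values would normally be corrected for multiple testing before terms are reported as enriched.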

12

Enhanced function annotations for Drosophila serine proteases: A case study for systematic annotation of multi-member gene families  

Microsoft Academic Search

Systematically annotating function of enzymes that belong to large protein families encoded in a single eukaryotic genome is a very challenging task. We carried out such an exercise to annotate function for serine-protease family of the trypsin fold in Drosophila melanogaster, with an emphasis on annotating serine-protease homologues (SPHs) that may have lost their catalytic function. Our approach involves data mining and data integration to provide function annotations for 190 Drosophila gene products containing serine-protease-like domains, of which 35 are SPHs. This was accomplished by analysis of structure-function relationships,

Parantu K. Shah; Lokesh P. Tripathi; Lars Juhl Jensen; Murad Gahnim; Christopher Mason; Eileen E. Furlong; Veronica Rodrigues; Kevin P. White; Peer Bork; R. Sowdhamini

2007-01-01

13

Global profiling of Shewanella oneidensis MR-1: Expression of hypothetical genes and improved functional annotations  

SciTech Connect

The gamma-proteobacterium Shewanella oneidensis strain MR-1 is a metabolically versatile organism that can reduce a wide range of organic compounds, metal ions, and radionuclides. Similar to most other sequenced organisms, approximately 40% of the predicted ORFs in the S. oneidensis genome were annotated as uncharacterized "hypothetical" genes. We implemented an integrative approach by using experimental and computational analyses to provide more detailed insight into gene function. Global expression profiles were determined for cells after UV irradiation and under aerobic and suboxic growth conditions. Transcriptomic and proteomic analyses confidently identified 538 hypothetical genes as expressed in S. oneidensis cells both as mRNAs and proteins (33% of all predicted hypothetical proteins). Publicly available analysis tools and databases and the expression data were applied to improve the annotation of these genes. The annotation results were scored by using a seven-category schema that ranked both confidence and precision of the functional assignment. We were able to identify homologs for nearly all of these hypothetical proteins (97%), but could confidently assign exact biochemical functions for only 16 proteins (category 1; 3%). Altogether, computational and experimental evidence provided functional assignments or insights for 240 more genes (categories 2-5; 45%). These functional annotations advance our understanding of genes involved in vital cellular processes, including energy conversion, ion transport, secondary metabolism, and signal transduction. We propose that this integrative approach offers a valuable means to undertake the enormous challenge of characterizing the rapidly growing number of hypothetical proteins with each newly sequenced genome.

Picone, Alex F. [Biatech, Bothell WA; Galperin, Michael Y. [National Center for Biotechnology Information; Romine, Margaret [Pacific Northwest National Laboratory (PNNL); Higdon, Roger [Biatech, Bothell WA; Makarova, Kira S. [National Center for Biotechnology Information; Kolker, Natali [Biatech, Bothell WA; Anderson, Gordon A [ORNL; Qiu, Xiaoyun [ORNL; Babnigg, Gyorgy [Oak Ridge National Laboratory (ORNL); Beliaev, Alexander S [ORNL; Edlefsen, Paul [Biatech, Bothell WA; Elias, Dwayne A. [Pacific Northwest National Laboratory (PNNL); Gorby, Dr. Yuri A. [J. Craig Venter Institute; Holzman, Ted [Biatech, Bothell WA; Klappenbach, Joel [Michigan State University, East Lansing; Konstantinidis, Konstantinos T [Michigan State University, East Lansing; Land, Miriam L [ORNL; Lipton, Mary S. [Pacific Northwest National Laboratory (PNNL); McCue, Lee Ann [Pacific Northwest National Laboratory (PNNL); Monroe, Matthew [Pacific Northwest National Laboratory (PNNL); Pasa-Tolic, Ljiljana [Pacific Northwest National Laboratory (PNNL); Pinchuk, Grigoriy [Pacific Northwest National Laboratory (PNNL); Purvine, Samuel [Pacific Northwest National Laboratory (PNNL); Serres, Margrethe H. [Woods Hole Oceanographic Institution (WHOI), Woods Hole, MA; Tsapin, Sasha [University of Southern California; Zakrajsek, Brian A. [Pacific Northwest National Laboratory (PNNL); Zhu, Wenguang [Harvard University; Zhou, Jizhong [University of Oklahoma; Larimer, Frank W [ORNL; Lawrence, Charles E. [Wadsworth Center, Albany, NY; Riley, Monica [Woods Hole Oceanographic Institution (WHOI), Woods Hole, MA; Collart, Frank [Argonne National Laboratory (ANL); Yates III, John R. [Scripps Research Institute, The, La Jolla, CA; Smith, Richard D. [Pacific Northwest National Laboratory (PNNL); Nealson, Kenneth H. [University of Southern California; Fredrickson, James K [Pacific Northwest National Laboratory (PNNL); Tiedje, James M. [Michigan State University, East Lansing]

2005-01-01

14

Cellular functions of genetically imprinted genes in human and mouse as annotated in the gene ontology.  

PubMed

By analyzing the cellular functions of genetically imprinted genes as annotated in the Gene Ontology for human and mouse, we found that imprinted genes are often involved in developmental, transport and regulatory processes. In the human, paternally expressed genes are enriched in GO terms related to the development of organs and of anatomical structures. In the mouse, maternally expressed genes regulate cation transport as well as G-protein signaling processes. Furthermore, we investigated if imprinted genes are regulated by common transcription factors. We identified 25 TF families that showed an enrichment of binding sites in the set of imprinted genes in human and 40 TF families in mouse. In general, maternally and paternally expressed genes are not regulated by different transcription factors. The genes Nnat, Klf14, Blcap, Gnas and Ube3a contribute most to the enrichment of TF families. In the mouse, genes that are maternally expressed in placenta are enriched for AP1 binding sites. In the human, we found that these genes possessed binding sites for both AP1 and SP1. PMID:23226257

Hamed, Mohamed; Ismael, Siba; Paulsen, Martina; Helms, Volkhard

2012-11-30

15

Cellular Functions of Genetically Imprinted Genes in Human and Mouse as Annotated in the Gene Ontology  

PubMed Central

By analyzing the cellular functions of genetically imprinted genes as annotated in the Gene Ontology for human and mouse, we found that imprinted genes are often involved in developmental, transport and regulatory processes. In the human, paternally expressed genes are enriched in GO terms related to the development of organs and of anatomical structures. In the mouse, maternally expressed genes regulate cation transport as well as G-protein signaling processes. Furthermore, we investigated if imprinted genes are regulated by common transcription factors. We identified 25 TF families that showed an enrichment of binding sites in the set of imprinted genes in human and 40 TF families in mouse. In general, maternally and paternally expressed genes are not regulated by different transcription factors. The genes Nnat, Klf14, Blcap, Gnas and Ube3a contribute most to the enrichment of TF families. In the mouse, genes that are maternally expressed in placenta are enriched for AP1 binding sites. In the human, we found that these genes possessed binding sites for both AP1 and SP1.

Hamed, Mohamed; Ismael, Siba; Paulsen, Martina; Helms, Volkhard

2012-01-01

16

Validation and functional annotation of expression-based clusters based on gene ontology  

PubMed Central

Background The biological interpretation of large-scale gene expression data is one of the paramount challenges in current bioinformatics. In particular, placing the results in the context of other available functional genomics data, such as existing bio-ontologies, has already provided substantial improvement for detecting and categorizing genes of interest. One common approach is to look for functional annotations that are significantly enriched within a group or cluster of genes, as compared to a reference group. Results In this work, we suggest the information-theoretic concept of mutual information to investigate the relationship between groups of genes, as given by data-driven clustering, and their respective functional categories. Drawing upon related approaches (Gibbons and Roth, Genome Research 12:1574-1581, 2002), we seek to quantify to what extent individual attributes are sufficient to characterize a given group or cluster of genes. Conclusion We show that the mutual information provides a systematic framework to assess the relationship between groups or clusters of genes and their functional annotations in a quantitative way. Within this framework, the mutual information allows us to address and incorporate several important issues, such as the interdependence of functional annotations and combinatorial combinations of attributes. It thus supplements and extends the conventional search for overrepresented attributes within a group or cluster of genes. In particular taking combinations of attributes into account, the mutual information opens the way to uncover specific functional descriptions of a group of genes or clustering result. All datasets and functional annotations used in this study are publicly available. All scripts used in the analysis are provided as additional files.

Steuer, Ralf; Humburg, Peter; Selbig, Joachim

2006-01-01
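
The quantity at the centre of the abstract above, the mutual information between a clustering and a functional attribute, can be computed directly from co-occurrence counts. The following is a minimal sketch of the standard plug-in estimate for discrete labels. It is not the authors' released scripts, and the function name and example labels are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(clusters, attributes):
    """Plug-in estimate of I(C; A) in bits from paired labels.

    clusters[i] is the cluster assigned to gene i and attributes[i] is its
    functional attribute (for example, whether it carries a given GO term).
    """
    if len(clusters) != len(attributes) or not clusters:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(clusters)
    joint = Counter(zip(clusters, attributes))
    cluster_counts = Counter(clusters)
    attribute_counts = Counter(attributes)
    # p(c,a) * log2( p(c,a) / (p(c) p(a)) ), written in terms of counts.
    return sum(
        (n_ca / n) * log2(n_ca * n / (cluster_counts[c] * attribute_counts[a]))
        for (c, a), n_ca in joint.items()
    )


# Hypothetical labels for four genes in two clusters.
clusters = [0, 0, 1, 1]
informative = [True, True, False, False]    # attribute tracks the clustering
uninformative = [True, False, True, False]  # attribute independent of it

mi_informative = mutual_information(clusters, informative)
mi_uninformative = mutual_information(clusters, uninformative)
```

An attribute that perfectly separates two equally sized clusters carries one bit of information about cluster membership, while an attribute distributed identically across clusters carries none. With real data the plug-in estimate is biased upward for small samples, so it would normally be compared against values obtained from shuffled labels.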

17

Comparative analysis of grapevine whole-genome gene predictions, functional annotation, categorization and integration of the predicted gene sequences  

PubMed Central

Background The first draft assembly and gene prediction of the grapevine genome (8X base coverage) was made available to the scientific community in 2007, and functional annotation was developed on this gene prediction. Since then additional Sanger sequences were added to the 8X sequences pool and a new version of the genomic sequence with superior base coverage (12X) was produced. Results In order to more efficiently annotate the function of the genes predicted in the new assembly, it is important to build on as much of the previous work as possible, by transferring 8X annotation of the genome to the 12X version. The 8X and 12X assemblies and gene predictions of the grapevine genome were compared to answer the question, “Can we uniquely map 8X predicted genes to 12X predicted genes?” The results show that while the assemblies and gene structure predictions are too different to make a complete mapping between them, most genes (18,725) showed a one-to-one relationship between 8X predicted genes and the last version of 12X predicted genes. In addition, reshuffled genomic sequence structures appeared. These highlight regions of the genome where the gene predictions need to be taken with caution. Based on the new grapevine gene functional annotation and in-depth functional categorization, twenty eight new molecular networks have been created for VitisNet while the existing networks were updated. Conclusions The outcomes of this study provide a functional annotation of the 12X genes, an update of VitisNet, the system of the grapevine molecular networks, and a new functional categorization of genes. Data are available at the VitisNet website (http://www.sdstate.edu/ps/research/vitis/pathways.cfm).

2012-01-01
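
The question posed in the abstract above, whether 8X predicted genes map uniquely to 12X predicted genes, amounts to finding one-to-one relationships in a set of candidate matches. The sketch below assumes the matches are already available as (8X gene, 12X gene) pairs. How those matches were produced is not stated in the abstract, and the function name and gene identifiers are hypothetical.

```python
from collections import Counter


def one_to_one_pairs(matches):
    """Return the (old, new) pairs in which both genes occur in exactly one match."""
    unique = set(matches)  # ignore repeated reports of the same pair
    old_counts = Counter(old for old, _ in unique)
    new_counts = Counter(new for _, new in unique)
    return sorted(
        (old, new)
        for old, new in unique
        if old_counts[old] == 1 and new_counts[new] == 1
    )


# Hypothetical matches between 8X and 12X gene predictions.
matches = [
    ("8X_g1", "12X_g1"),  # one-to-one
    ("8X_g2", "12X_g2"),  # 8X_g2 maps to two 12X genes (possible split)
    ("8X_g2", "12X_g3"),
    ("8X_g3", "12X_g4"),  # two 8X genes map to 12X_g4 (possible merge)
    ("8X_g4", "12X_g4"),
]
unambiguous = one_to_one_pairs(matches)
```

Only annotations attached to pairs in the one-to-one set can be transferred without further inspection. The split and merge cases are the ones that need manual review or a more detailed comparison of gene structures.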

18

Quality of Computationally Inferred Gene Ontology Annotations  

PubMed Central

Gene Ontology (GO) has established itself as the undisputed standard for protein function annotation. Most annotations are inferred electronically, i.e. without individual curator supervision, but they are widely considered unreliable. At the same time, we crucially depend on those automated annotations, as most newly sequenced genomes are non-model organisms. Here, we introduce a methodology to systematically and quantitatively evaluate electronic annotations. By exploiting changes in successive releases of the UniProt Gene Ontology Annotation database, we assessed the quality of electronic annotations in terms of specificity, reliability, and coverage. Overall, we not only found that electronic annotations have significantly improved in recent years, but also that their reliability now rivals that of annotations inferred by curators when they use evidence other than experiments from primary literature. This work provides the means to identify the subset of electronic annotations that can be relied upon—an important outcome given that >98% of all annotations are inferred without direct curation.

Skunca, Nives; Altenhoff, Adrian; Dessimoz, Christophe

2012-01-01

19

Global Profiling of Shewanella oneidensis MR-1: Expression of Hypothetical Genes and Improved functional annotations  

SciTech Connect

The γ-proteobacterium Shewanella oneidensis strain MR-1 is a respiratorily versatile organism that can reduce a wide range of organics, metals, and radionuclides. Similar to most other sequenced organisms, approximately 40% of the predicted ORFs in the MR-1 genome were annotated as uncharacterized "hypothetical" genes. We implemented an integrative approach using experimental and computational analyses to provide more detailed insight into their function. Global expression studies were conducted using RNA and protein expression profiling of cells cultivated under aerobic, suboxic, and fumarate reducing conditions, phosphate limitation and UV irradiation. Transcriptomic and proteomic analyses confidently identified 538 "hypothetical" genes as expressed in S. oneidensis cells both as mRNAs and proteins (33% of all "hypothetical" proteins). Publicly available analysis tools and databases and our own expression data were applied to improve the annotation of these genes. The annotation results were scored using a seven-category schema that ranked both confidence and precision of the functional assignment. We identified homologs for nearly all of these "hypothetical" proteins (96%), thus allowing us to minimally classify them as "conserved proteins". Computational and/or experimental evidence provided more precise functional assignments for 297 genes (categories 1-4; 55%). These improved functional annotations will significantly widen our understanding of vital cellular processes including signal transduction, ion transport, secondary metabolism, and transcription, as well as structural elements, such as cellular membranes. We propose that this integrative approach offers a viable means to undertake the enormous challenge of characterizing the rapidly growing number of "hypothetical" proteins with each newly sequenced genome.

Kolker, Eugene; Picone, Alessandro F.; Galperin, Michael Y.; Romine, Margaret F.; Higdon, Roger; Makarova, Kira S.; Kolker, Natali; Anderson, Gordon A.; Qiu, Xiaoyun; Auberry, Kenneth J.; Babnigg, Gyorgy; Beliaev, Alex S.; Edlefsen, Paul; Elias, Dwayne A.; Gorby, Yuri A.; Holzman, Ted; Klappenbach, Joel; Konstantinidis, Kostas; Land, Miriam L.; Lipton, Mary S.; McCue, Lee-Ann; Monroe, Matthew E.; Pasa-Tolic, Liljiana; Pinchuk, Grigoriy E.; Purvine, Samuel O.; Serres, Margaret; Tsapin, Sasha; Zakrajsek, Brian A.; Zhu, Wenhong; Zhou, Jizhong; Larimer, Frank; Lawrence, Charles; Riley, Monica; Collart, Frank R.; Yates, III, John R.; Smith, Richard D.; Giometti, Carol S.; Nealson, Kenneth; Fredrickson, Jim K.; Tiedje, James M.

2005-02-08

20

Genome, Functional Gene Annotation, and Nuclear Transformation of the Heterokont Oleaginous Alga Nannochloropsis oceanica CCMP1779  

PubMed Central

Unicellular marine algae have promise for providing sustainable and scalable biofuel feedstocks, although no single species has emerged as a preferred organism. Moreover, adequate molecular and genetic resources prerequisite for the rational engineering of marine algal feedstocks are lacking for most candidate species. Heterokonts of the genus Nannochloropsis naturally have high cellular oil content and are already in use for industrial production of high-value lipid products. First success in applying reverse genetics by targeted gene replacement makes Nannochloropsis oceanica an attractive model to investigate the cell and molecular biology and biochemistry of this fascinating organism group. Here we present the assembly of the 28.7 Mb genome of N. oceanica CCMP1779. RNA sequencing data from nitrogen-replete and nitrogen-depleted growth conditions support a total of 11,973 genes, of which in addition to automatic annotation some were manually inspected to predict the biochemical repertoire for this organism. Among others, more than 100 genes putatively related to lipid metabolism, 114 predicted transcription factors, and 109 transcriptional regulators were annotated. Comparison of the N. oceanica CCMP1779 gene repertoire with the recently published N. gaditana genome identified 2,649 genes likely specific to N. oceanica CCMP1779. Many of these N. oceanica–specific genes have putative orthologs in other species or are supported by transcriptional evidence. However, because similarity-based annotations are limited, functions of most of these species-specific genes remain unknown. Aside from the genome sequence and its analysis, protocols for the transformation of N. oceanica CCMP1779 are provided. 
The availability of genomic and transcriptomic data for Nannochloropsis oceanica CCMP1779, along with efficient transformation protocols, provides a blueprint for future detailed gene functional analysis and genetic engineering of Nannochloropsis species by a growing academic community focused on this genus.

Tsai, Chia-Hong; Bullard, Blair; Cornish, Adam J.; Harvey, Christopher; Reca, Ida-Barbara; Thornburg, Chelsea; Achawanantakun, Rujira; Buehl, Christopher J.; Campbell, Michael S.; Cavalier, David; Childs, Kevin L.; Clark, Teresa J.; Deshpande, Rahul; Erickson, Erika; Armenia Ferguson, Ann; Handee, Witawas; Kong, Que; Li, Xiaobo; Liu, Bensheng; Lundback, Steven; Peng, Cheng; Roston, Rebecca L.; Sanjaya; Simpson, Jeffrey P.; TerBush, Allan; Warakanont, Jaruswan; Zauner, Simone; Farre, Eva M.; Hegg, Eric L.; Jiang, Ning; Kuo, Min-Hao; Lu, Yan; Niyogi, Krishna K.; Ohlrogge, John; Osteryoung, Katherine W.; Shachar-Hill, Yair; Sears, Barbara B.; Sun, Yanni; Takahashi, Hideki; Yandell, Mark; Shiu, Shin-Han; Benning, Christoph

2012-01-01

21

GeneFarm, structural and functional annotation of Arabidopsis gene and protein families by a network of experts  

Microsoft Academic Search

Genomic projects heavily depend on genome annotations and are limited by the current deficiencies in the published predictions of gene structure and function. It follows that improved annotation will allow better data mining of genomes, and more secure planning and design of experiments. The purpose of the GeneFarm project is to obtain homogeneous, reliable, documented and traceable annotations for Arabidopsis nuclear genes and gene products, and to enter them into an added-value database. This re-annotation project

Sébastien Aubourg; Véronique Brunaud; Clémence Bruyère; Mark Cock; Richard Cooke; Annick Cottet; Arnaud Couloux; Patrice Déhais; Gilbert Deléage; Aymeric Duclert; Manuel Echeverria; Aimée Eschbach; Denis Falconet; Ghislain Filippi; Christine Gaspin; Christophe Geourjon; Jean-michel Grienenberger; Guy Houlné; Elisabeth Jamet; Frédéric Lechauve; Olivier Leleu; Philippe Leroy; Régis Mache; Christian Meyer; Hafed Nedjari; Ioan Negrutiu; Valérie Orsini; Eric Peyretaillade; Cyril Pommier; Jeroen Raes; Jean-loup Risler; Stéphane Rivière; Stephane Rombauts; Pierre Rouzé; Michel Schneider; Philippe Schwob; Ian Small; Ghislain Soumayet-kampetenga; Darko Stankovski; Claire Toffano; Michael Tognolli; Michel Caboche; Alain Lecharny

2005-01-01

22

Gene Expression and Functional Annotation of the Human Ciliary Body Epithelia  

PubMed Central

Purpose The ciliary body (CB) of the human eye consists of the non-pigmented (NPE) and pigmented (PE) neuro-epithelia. We investigated the gene expression of NPE and PE, to shed light on the molecular mechanisms underlying the most important functions of the CB. We also developed molecular signatures for the NPE and PE and studied possible new clues for glaucoma. Methods We isolated NPE and PE cells from seven healthy human donor eyes using laser dissection microscopy. Next, we performed RNA isolation, amplification, labeling and hybridization against 44×k Agilent microarrays. For microarray confirmation, we used a literature study, RT-PCRs, and immunohistochemical stainings. We analyzed the gene expression data with R and with the knowledge database Ingenuity. Results The gene expression profiles and functional annotations of the NPE and PE were highly similar. We found that the most important functionalities of the NPE and PE were related to developmental processes, neural nature of the tissue, endocrine and metabolic signaling, and immunological functions. In total 1576 genes differed statistically significantly between NPE and PE. From these genes, at least 3 were cell-specific for the NPE and 143 for the PE. Finally, we observed high expression in the (N)PE of 35 genes previously implicated in molecular mechanisms related to glaucoma. Conclusion Our gene expression analysis suggested that the NPE and PE of the CB were quite similar. Nonetheless, cell-type specific differences were found. The molecular machineries of the human NPE and PE are involved in a range of neuro-endocrinological, developmental and immunological functions, and perhaps glaucoma.

Janssen, Sarah F.; Gorgels, Theo G. M. F.; Bossers, Koen; ten Brink, Jacoline B.; Essing, Anke H. W.; Nagtegaal, Martijn; van der Spek, Peter J.; Jansonius, Nomdo M.; Bergen, Arthur A. B.

2012-01-01

23

Relating gene expression data on two-component systems to functional annotations in Escherichia coli  

PubMed Central

Background Obtaining physiological insights from microarray experiments requires computational techniques that relate gene expression data to functional information. Traditionally, this has been done in two consecutive steps. The first step identifies important genes through clustering or statistical techniques, while the second step assigns biological functions to the identified groups. Recently, techniques have been developed that identify such relationships in a single step. Results We have developed an algorithm that relates patterns of gene expression in a set of microarray experiments to functional groups in one step. Our only assumption is that patterns co-occur frequently. The effectiveness of the algorithm is demonstrated as part of a study of regulation by two-component systems in Escherichia coli. The significance of the relationships between expression data and functional annotations is evaluated based on density histograms that are constructed using product similarity among expression vectors. We present a biological analysis of three of the resulting functional groups of proteins, develop hypotheses for further biological studies, and test one of these hypotheses experimentally. A comparison with other algorithms and a different data set is presented. Conclusion Our new algorithm is able to find interesting and biologically meaningful relationships, not found by other algorithms, in previously analyzed data sets. Scaling of the algorithm to large data sets can be achieved based on a theoretical model.

Denton, Anne M; Wu, Jianfei; Townsend, Megan K; Sule, Preeti; Pruss, Birgit M

2008-01-01

24

Applying the Gene Ontology in microbial annotation.  

PubMed

The ever-increasing number of microbial sequencing projects necessitates a standardized system for the capture of genomic data to ensure that the flood of information produced can be effectively utilized. The Gene Ontology (GO) provides the standard for gene product annotations in the areas of molecular function, biological process and cellular component. A recent effort by the Plant-Associated Microbe Gene Ontology (PAMGO) Consortium has produced more than 800 new GO terms specific for annotating interactions between microbes and their hosts and other symbiotic interactions. In addition, there have been changes and additions to the GO annotation format and evidence storage system to reflect the needs of the microbial annotation community. The capture of annotation information with systems like the GO is absolutely essential to enable the efficient mining of annotation information across diverse genomes and thus to further biological research in meaningful ways. PMID:19577473

Giglio, Michelle G; Collmer, Candace W; Lomax, Jane; Ireland, Amelia

2009-07-02

25

Community gene annotation in practice  

PubMed Central

Manual annotation of genomic data is extremely valuable to produce an accurate reference gene set but is expensive compared with automatic methods and so has been limited to model organisms. Annotation tools that have been developed at the Wellcome Trust Sanger Institute (WTSI, http://www.sanger.ac.uk/) are being used to fill that gap, as they can be used remotely and so open up viable community annotation collaborations. We introduce the ‘Blessed’ annotator and ‘Gatekeeper’ approach to Community Annotation using the Otterlace/ZMap genome annotation tool. We also describe the strategies adopted for annotation consistency, quality control and viewing of the annotation. Database URL: http://vega.sanger.ac.uk/index.html

Loveland, Jane E.; Gilbert, James G.R.; Griffiths, Ed; Harrow, Jennifer L.

2012-01-01

26

Expression profiling of hypothetical genes in Desulfovibrio vulgaris leads to improved functional annotation  

Microsoft Academic Search

Hypothetical (HyP) and conserved HyP genes account for >30% of sequenced bacterial genomes. For the sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough, 347 of the 3634 genes were annotated as conserved HyP (9.5%) along with 887 HyP genes (24.4%). Given the large fraction of the genome, it is plausible that some of these genes serve critical cellular roles. The study goals were

Dwayne A. Elias; Aindrila Mukhopadhyay; Marcine P. Joachimiak; Elliott C. Drury; Alyssa M. Redding; Huei-Che B. Yen; Matthew W. Fields; Terry C. Hazen; Adam P. Arkin; Jay D. Keasling; Judy D. Wall

2009-01-01

27

Facilitating functional annotation of chicken microarray data  

PubMed Central

Background Modeling results from chicken microarray studies is challenging for researchers because little functional annotation is associated with these arrays. The Affymetrix GeneChip chicken genome array, one of the biggest arrays that serve as a key research tool for the study of chicken functional genomics, is among the few arrays that link gene products to Gene Ontology (GO). However, the GO annotation data presented by Affymetrix are incomplete; for example, they do not show references linked to manually annotated functions. In addition, there is no tool that allows microarray researchers to directly retrieve functional annotations for their datasets from the annotated arrays. This costs researchers time in searching multiple GO databases for functional information. Results We have improved the breadth of functional annotations of the gene products associated with probesets on the Affymetrix chicken genome array by 45% and the quality of annotation by 14%. We have also identified the most significant diseases and disorders, different types of genes, and known drug targets represented on the Affymetrix chicken genome array. To facilitate functional annotation of other arrays and microarray experimental datasets we developed an Array GO Mapper (AGOM) tool to help researchers quickly retrieve corresponding functional information for their dataset. Conclusion Results from this study will directly facilitate annotation of other chicken arrays and microarray experimental datasets. Researchers will be able to quickly model their microarray dataset into more reliable biological functional information by using the AGOM tool. The diseases, disorders, gene types and drug targets revealed in the study will allow researchers to learn more about how genes function in complex biological systems and may lead to new drug discovery and development of therapies.
The GO annotation data generated will be available for public use via the AgBase website and will be updated on a regular basis.

2009-01-01

28

Algal Functional Annotation Tool: a web-based analysis suite to functionally interpret large gene lists using integrated annotation and expression data  

Microsoft Academic Search

Background Progress in genome sequencing is proceeding at an exponential pace, and several new algal genomes are becoming available every year. One of the challenges facing the community is the association of protein sequences encoded in the genomes with biological function. While most genome assembly projects generate annotations for predicted protein sequences, they are usually limited and integrate functional terms from

David Lopez; David Casero; Shawn J Cokus; Sabeeha S Merchant; Matteo Pellegrini

2011-01-01

29

Gene Ontology annotation quality analysis in model eukaryotes  

Microsoft Academic Search

Functional analysis using the Gene Ontology (GO) is crucial for array analysis, but it is often difficult for researchers to assess the amount and quality of GO annotations associated with different sets of gene products. In many cases the source of the GO annotations and the date the GO annotations were last updated is not apparent, further complicating a researchers'

Teresia J. Buza; Fiona M. McCarthy; Nan Wang; Susan M. Bridges; Shane C. Burgess

2008-01-01

30

CORNET 2.0: integrating plant coexpression, protein-protein interactions, regulatory interactions, gene associations and functional annotations.  

PubMed

To enable easy access and interpretation of heterogeneous and scattered data, we have developed a user-friendly tool for data mining and integration in Arabidopsis, named CORNET. This tool allows the browsing of microarray data, the construction of coexpression and protein-protein interaction (PPI) networks and the exploration of diverse functional annotations. Here, we present the new functionalities of CORNET 2.0 for data integration in plants. First of all, CORNET allows the integration of regulatory interaction datasets accessible through the new transcription factor (TF) tool that can be used in combination with the coexpression tool or the PPI tool. In addition, we have extended the PPI tool to enable the analysis of gene-gene associations from AraNet as well as newly identified PPIs. Different search options are implemented to enable the construction of networks centered around multiple input genes or proteins. New functional annotation resources are included to retrieve relevant literature, phenotypes, plant ontology and biological pathways. We have also extended CORNET to attain the construction of coexpression and PPI networks in the crop species maize. Networks and associated evidence of the majority of currently available data types are visualized in Cytoscape. CORNET is available at https://bioinformatics.psb.ugent.be/cornet. PMID:22651224

De Bodt, Stefanie; Hollunder, Jens; Nelissen, Hilde; Meulemeester, Nick; Inzé, Dirk

2012-05-31

31

Gene and alternative splicing annotation with AIR  

PubMed Central

Designing effective and accurate tools for identifying the functional and structural elements in a genome remains at the frontier of genome annotation owing to incompleteness and inaccuracy of the data, limitations in the computational models, and shifting paradigms in genomics, such as alternative splicing. We present a methodology for the automated annotation of genes and their alternatively spliced mRNA transcripts based on existing cDNA and protein sequence evidence from the same species or projected from a related species using syntenic mapping information. At the core of the method is the splice graph, a compact representation of a gene, its exons, introns, and alternatively spliced isoforms. The putative transcripts are enumerated from the graph and assigned confidence scores based on the strength of sequence evidence, and a subset of the high-scoring candidates are selected and promoted into the annotation. The method is highly selective, eliminating the unlikely candidates while retaining 98% of the high-quality mRNA evidence in well-formed transcripts, and produces annotation that is measurably more accurate than some evidence-based gene sets. The process is fast, accurate, and fully automated, and combines the traditionally distinct gene annotation and alternative splicing detection processes in a comprehensive and systematic way, thus considerably aiding in the ensuing manual curation efforts.

Florea, Liliana; Di Francesco, Valentina; Miller, Jason; Turner, Russell; Yao, Alison; Harris, Michael; Walenz, Brian; Mobarry, Clark; Merkulov, Gennady V.; Charlab, Rosane; Dew, Ian; Deng, Zuoming; Istrail, Sorin; Li, Peter; Sutton, Granger

2005-01-01
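
The splice graph described in the AIR abstract above can be illustrated with a minimal Python sketch. It is not the AIR implementation: the toy graph, the junction weights, the weakest-junction scoring rule, the threshold, and the names `enumerate_transcripts` and `select_candidates` are all assumptions made for illustration. It only shows the general idea of enumerating putative transcripts as paths through an acyclic exon graph, scoring them, and keeping the high-scoring candidates.

```python
from typing import Dict, List, Tuple

# Hypothetical toy splice graph: nodes are exons, edges are splice junctions,
# each carrying a made-up evidence weight between 0 and 1.
SpliceGraph = Dict[str, List[Tuple[str, float]]]

TOY_GRAPH: SpliceGraph = {
    "E1": [("E2", 0.9), ("E3", 0.4)],  # E1->E3 skips exon E2
    "E2": [("E3", 0.8)],
    "E3": [],
}


def enumerate_transcripts(graph, sources, sinks):
    """Return every source-to-sink exon path with a confidence score.

    The score is the weakest junction weight along the path. This is an
    illustrative choice, not the scoring used by AIR. The graph is assumed
    to be acyclic.
    """
    results = []

    def walk(node, path, score):
        if node in sinks:
            results.append((path, score))
        for nxt, weight in graph.get(node, []):
            walk(nxt, path + [nxt], min(score, weight))

    for src in sources:
        walk(src, [src], 1.0)
    return results


def select_candidates(transcripts, threshold):
    """Keep transcripts scoring at or above the threshold, best first."""
    kept = [t for t in transcripts if t[1] >= threshold]
    return sorted(kept, key=lambda t: -t[1])


all_paths = enumerate_transcripts(TOY_GRAPH, sources=["E1"], sinks={"E3"})
promoted = select_candidates(all_paths, threshold=0.5)
print(all_paths)
print(promoted)
```

In this toy case the exon-skipping isoform E1-E3 is enumerated but falls below the assumed threshold, so only the E1-E2-E3 isoform would be promoted.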

32

Gene and alternative splicing annotation with AIR.  

PubMed

Designing effective and accurate tools for identifying the functional and structural elements in a genome remains at the frontier of genome annotation owing to incompleteness and inaccuracy of the data, limitations in the computational models, and shifting paradigms in genomics, such as alternative splicing. We present a methodology for the automated annotation of genes and their alternatively spliced mRNA transcripts based on existing cDNA and protein sequence evidence from the same species or projected from a related species using syntenic mapping information. At the core of the method is the splice graph, a compact representation of a gene, its exons, introns, and alternatively spliced isoforms. The putative transcripts are enumerated from the graph and assigned confidence scores based on the strength of sequence evidence, and a subset of the high-scoring candidates are selected and promoted into the annotation. The method is highly selective, eliminating the unlikely candidates while retaining 98% of the high-quality mRNA evidence in well-formed transcripts, and produces annotation that is measurably more accurate than some evidence-based gene sets. The process is fast, accurate, and fully automated, and combines the traditionally distinct gene annotation and alternative splicing detection processes in a comprehensive and systematic way, thus considerably aiding in the ensuing manual curation efforts. PMID:15632090

Florea, Liliana; Di Francesco, Valentina; Miller, Jason; Turner, Russell; Yao, Alison; Harris, Michael; Walenz, Brian; Mobarry, Clark; Merkulov, Gennady V; Charlab, Rosane; Dew, Ian; Deng, Zuoming; Istrail, Sorin; Li, Peter; Sutton, Granger

2005-01-01

33

PIPA: A High-Throughput Pipeline for Protein Function Annotation  

Microsoft Academic Search

We developed Pipeline for Protein Annotation (PIPA), a genome-wide protein function annotation pipeline that runs in a high performance computing environment. PIPA integrates different tools and employs the Gene Ontology (GO) to provide consistent annotation and resolve prediction conflicts. PIPA has three modules that allow for easy development of specialized databases and integration of various bioinformatics tools. The first module,

Chenggang Yu; Valmik Desai; Nela Zavaljevski; Jaques Reifman

2008-01-01

34

Prosecutor: parameter-free inference of gene function for prokaryotes using DNA microarray data, genomic context and multiple gene annotation sources  

PubMed Central

Background Despite a plethora of functional genomic efforts, the function of many genes in sequenced genomes remains unknown. The increasing amount of microarray data for many species allows employing the guilt-by-association principle to predict function on a large scale: genes exhibiting similar expression patterns are more likely to participate in shared biological processes. Results We developed Prosecutor, an application that enables researchers to rapidly infer gene function based on available gene expression data and functional annotations. Our parameter-free functional prediction method uses a sensitive algorithm to achieve a high association rate of linking genes with unknown function to annotated genes. Furthermore, Prosecutor utilizes additional biological information such as genomic context and known regulatory mechanisms that are specific for prokaryotes. We analyzed publicly available transcriptome data sets and used literature sources to validate putative functions suggested by Prosecutor. We supply the complete results of our analysis for 11 prokaryotic organisms on a dedicated website. Conclusion The Prosecutor software and supplementary datasets available at allow researchers working on any of the analyzed organisms to quickly identify the putative functions of their genes of interest. A de novo analysis allows new organisms to be studied.

Blom, Evert Jan; Breitling, Rainer; Hofstede, Klaas Jan; Roerdink, Jos BTM; van Hijum, Sacha AFT; Kuipers, Oscar P

2008-01-01
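
The guilt-by-association principle mentioned in the Prosecutor abstract above can be sketched in a few lines of Python. This is not the Prosecutor algorithm, which the abstract describes as parameter-free; the sketch below uses Pearson correlation and a top-k vote, and the gene names, expression profiles, function labels, and the parameter `k` are hypothetical values chosen only to show how similar expression patterns can suggest a function for an unannotated gene.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / (sx * sy)


def predict_function(query_profile, annotated, k=3):
    """Suggest a function by majority vote among the k most correlated genes.

    `annotated` maps gene name -> (expression profile, function label).
    The top-k vote is an assumption of this sketch.
    """
    ranked = sorted(
        annotated.items(),
        key=lambda item: pearson(query_profile, item[1][0]),
        reverse=True,
    )
    votes = Counter(function for _, (_, function) in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: four annotated genes measured in four experiments.
ANNOTATED = {
    "geneA": ([1.0, 2.0, 3.0, 4.0], "ion transport"),
    "geneB": ([2.0, 4.1, 6.0, 8.2], "ion transport"),
    "geneC": ([4.0, 3.0, 2.0, 1.0], "DNA repair"),
    "geneD": ([1.0, 1.0, 2.0, 1.0], "DNA repair"),
}
QUERY = [0.5, 1.1, 1.4, 2.1]  # profile of a gene with unknown function

prediction = predict_function(QUERY, ANNOTATED, k=3)
print(prediction)
```

A real analysis would also need to assess the significance of each association, for example against a background distribution of correlations, before trusting a suggested function.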

35

Logical Gene Ontology Annotations (GOAL): exploring gene ontology annotations with OWL  

PubMed Central

Motivation Ontologies such as the Gene Ontology (GO) and their use in annotations make cross-species comparisons of genes possible, along with a wide range of other analytical activities. The bio-ontologies community, in particular the Open Biomedical Ontologies (OBO) community, have provided many other ontologies and an increasingly large volume of annotations of gene products that can be exploited in query and analysis. As many annotations with different ontologies centre upon gene products, there is a possibility to explore gene products through multiple ontological perspectives at the same time. Questions could be asked that link a gene product’s function, process, cellular location, phenotype and disease. Current tools, such as AmiGO, allow exploration of genes based on their GO annotations, but not through multiple ontological perspectives. In addition, the semantics of these ontologies’ representations should be able to, through automated reasoning, afford richer query opportunities of the gene product annotations than is currently possible. Results To do this multi-perspective, richer querying of gene product annotations, we have created the Logical Gene Ontology, or GOAL ontology, in OWL that combines the Gene Ontology, Human Disease Ontology and the Mammalian Phenotype Ontology, together with classes that represent the annotations with these ontologies for mouse gene products. Each mouse gene product is represented as a class, with the appropriate relationships to the GO aspects, phenotype and disease with which it has been annotated. We then use defined classes to query these protein classes through automated reasoning, and to build a complex hierarchy of gene products. We have presented this through a Web interface that allows arbitrary queries to be constructed and the results displayed.
Conclusion This standard use of OWL affords a rich interaction with Gene Ontology, Human Disease Ontology and Mammalian Phenotype Ontology annotations for the mouse, to give a fine partitioning of the gene products in the GOAL ontology. OWL in combination with automated reasoning can be effectively used to query across ontologies to ask biologically rich questions. We have demonstrated that automated reasoning can be used to deliver practical on-line querying support for the ontology annotations available for the mouse. Availability The GOAL Web page is to be found at http://owl.cs.manchester.ac.uk/goal.

2012-01-01

36

A towards-multidimensional screening approach to predict candidate genes of rheumatoid arthritis based on SNP, structural and functional annotations  

PubMed Central

Background According to the Genetic Analysis Workshops (GAW), hundreds of thousands of SNPs have been tested for association with rheumatoid arthritis. Traditional genome-wide association studies (GWAS) have been developed to identify susceptibility genes using a "most significant SNPs/genes" model. However, many minor- or modest-risk genes are likely to be missed after adjustment for multiple testing. This screening process uses a strict selection of statistical thresholds that aim to identify susceptibility genes based only on a statistical model, without considering multi-dimensional biological similarities in sequence arrangement, crystal structure, or functional categories/biological pathways between candidate and known disease genes. Methods Multidimensional screening approaches combined with traditional statistical genetics methods can consider multiple biological backgrounds of genetic mutation, structural, and functional annotations. Here we introduce a newly developed multidimensional screening approach for rheumatoid arthritis candidate genes that considers all SNPs with nominal evidence of Bayesian association (BFLn > 0), and structural and functional similarities of corresponding genes or proteins. Results Our multidimensional screening approach extracted all risk genes (BFLn > 0) by odds ratios of hypothesis H1 to H0, and determined whether a particular group of genes shared underlying biological similarities with known disease genes. Using this method, we found 6614 risk SNPs in our Bayesian screen result set. Finally, we identified 146 likely causal genes for rheumatoid arthritis, including CD4, FGFR1, and KDR, which have been reported as high risk factors by recent studies. We note that 790 (96.1%) of genes identified by GWAS could not easily be classified into related functional categories or biological processes associated with the disease, while our candidate genes shared underlying biological similarities (e.g. were in the same pathway or GO term) and contributed to disease etiology, even though common variations in each of these genes make modest contributions to disease risk. We also found 6141 risk SNPs that were too minor to be detected by conventional approaches, and associations between 58 candidate genes and rheumatoid arthritis were verified by literature retrieved from the NCBI PubMed module. Conclusions Our proposed approach to the analysis of GAW16 data for rheumatoid arthritis was based on an underlying biological similarities-based method applied to candidate and known disease genes. Application of our method could identify likely causal candidate disease genes of rheumatoid arthritis, and could yield biological insights that are not detected when focusing only on genes that give the strongest evidence by multiple testing. We hope that our proposed method complements the "most significant SNPs/genes" model, and provides additional insights into the pathogenesis of rheumatoid arthritis and other diseases, when searching datasets for hundreds of genetic variants.

2010-01-01

37

An atlas of tissue-specific conserved coexpression for functional annotation and disease gene prediction  

PubMed Central

Gene coexpression relationships that are phylogenetically conserved between human and mouse have been shown to provide important clues about gene function that can be efficiently used to identify promising candidate genes for human hereditary disorders. In the past, such approaches have considered mostly generic gene expression profiles that cover multiple tissues and organs. The individual genes of multicellular organisms, however, can participate in different transcriptional programs, operating at scales as different as single-cell types, tissues, organs, body regions or the entire organism. Therefore, systematic analysis of tissue-specific coexpression could be, in principle, a very powerful strategy to dissect those functional relationships among genes that emerge only in particular tissues or organs. In this report, we show that, in fact, conserved coexpression as determined from tissue-specific and condition-specific data sets can predict many functional relationships that are not detected by analyzing heterogeneous microarray data sets. More importantly, we find that, when combined with disease networks, the simultaneous use of both generic (multi-tissue) and tissue-specific conserved coexpression allows a more efficient prediction of human disease genes than the use of generic conserved coexpression alone. Using this strategy, we were able to identify high-probability candidates for 238 orphan disease loci. We provide proof of concept that this combined use of generic and tissue-specific conserved coexpression can be very useful to prioritize the mutational candidates obtained from deep-sequencing projects, even in the case of genetic disorders as heterogeneous as XLMR.

Piro, Rosario Michael; Ala, Ugo; Molineris, Ivan; Grassi, Elena; Bracco, Chiara; Perego, Gian Paolo; Provero, Paolo; Di Cunto, Ferdinando

2011-01-01

38

GIFtS: annotation landscape analysis with GeneCards  

PubMed Central

Background Gene annotation is a pivotal component in computational genomics, encompassing prediction of gene function, expression analysis, and sequence scrutiny. Hence, quantitative measures of the annotation landscape constitute a pertinent bioinformatics tool. GeneCards® is a gene-centric compendium of rich annotative information for over 50,000 human gene entries, building upon 68 data sources, including Gene Ontology (GO), pathways, interactions, phenotypes, publications and many more. Results We present the GeneCards Inferred Functionality Score (GIFtS) which allows a quantitative assessment of a gene's annotation status, by exploiting the unique wealth and diversity of GeneCards information. The GIFtS tool, linked from the GeneCards home page, facilitates browsing the human genome by searching for the annotation level of a specified gene, retrieving a list of genes within a specified range of GIFtS value, obtaining random genes with a specific GIFtS value, and experimenting with the GIFtS weighting algorithm for a variety of annotation categories. The bimodal shape of the GIFtS distribution suggests a division of the human gene repertoire into two main groups: the high-GIFtS peak consists almost entirely of protein-coding genes; the low-GIFtS peak consists of genes from all of the categories. Cluster analysis of GIFtS annotation vectors provides the classification of gene groups by detailed positioning in the annotation arena. GIFtS also provides measures which enable the evaluation of the databases that serve as GeneCards sources. An inverse correlation is found (for GIFtS>25) between the number of genes annotated by each source, and the average GIFtS value of genes associated with that source. Three typical source prototypes are revealed by their GIFtS distribution: genome-wide sources, sources comprising mainly highly annotated genes, and sources comprising mainly poorly annotated genes.
The degree of accumulated knowledge for a given gene measured by GIFtS was correlated (for GIFtS>30) with the number of publications for a gene, and with the seniority of this entry in the HGNC database. Conclusion GIFtS can be a valuable tool for computational procedures which analyze large sets of genes resulting from wet-lab or computational research. GIFtS may also assist the scientific community with identification of groups of uncharacterized genes for diverse applications, such as delineation of novel functions and charting unexplored areas of the human genome.

Harel, Arye; Inger, Aron; Stelzer, Gil; Strichman-Almashanu, Liora; Dalah, Irina; Safran, Marilyn; Lancet, Doron

2009-01-01

39

SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data  

Microsoft Academic Search

The explosion in the number of functional genomic datasets generated with tools such as DNA microarrays has created a critical need for resources that facilitate the interpretation of large-scale biological data. SOURCE is a web-based database that brings together information from a broad range of resources, and provides it in a manner particularly useful for genome-scale analyses. SOURCE's GeneReports include

Maximilian Diehn; Gavin Sherlock; Gail Binkley; Heng Jin; John C. Matese; Tina Hernandez-boussard; Christian A. Rees; J. Michael Cherry; David Botstein; Patrick O. Brown; Ash A. Alizadeh

2003-01-01

40

Improving gene annotation of complete viral genomes  

PubMed Central

Gene annotation in viruses often relies upon similarity search methods. These methods possess high specificity but some genes may be missed, either those unique to a particular genome or those highly divergent from known homologs. To identify potentially missing viral genes we have analyzed all complete viral genomes currently available in GenBank with a specialized and augmented version of the gene finding program GeneMarkS. In particular, by implementing genome-specific self-training protocols we have better adjusted the GeneMarkS statistical models to sequences of viral genomes. Hundreds of new genes were identified, some in well studied viral genomes. For example, a new gene predicted in the genome of the Epstein–Barr virus was shown to encode a protein similar to α-herpesvirus minor tegument protein UL14 with heat shock functions. Convincing evidence of this similarity was obtained after only 12 PSI-BLAST iterations. In another example, several iterations of PSI-BLAST were required to demonstrate that a gene predicted in the genome of Alcelaphine herpesvirus 1 encodes a BALF1-like protein which is thought to be involved in apoptosis regulation and, potentially, carcinogenesis. New predictions were used to refine annotations of viral genomes in the RefSeq collection curated by the National Center for Biotechnology Information. Importantly, even in those cases where no sequence similarities were detected, GeneMarkS significantly reduced the number of primary targets for experimental characterization by identifying the most probable candidate genes. The new genome annotations were stored in VIOLIN, an interactive database which provides access to similarity search tools for up-to-date analysis of predicted viral proteins.

Mills, Ryan; Rozanov, Michael; Lomsadze, Alexandre; Tatusova, Tatiana; Borodovsky, Mark

2003-01-01

41

Functional annotation of the Arabidopsis genome using controlled vocabularies.  

PubMed

Controlled vocabularies are increasingly used by databases to describe genes and gene products because they facilitate identification of similar genes within an organism or among different organisms. One of The Arabidopsis Information Resource's goals is to associate all Arabidopsis genes with terms developed by the Gene Ontology Consortium that describe the molecular function, biological process, and subcellular location of a gene product. We have also developed terms describing Arabidopsis anatomy and developmental stages and use these to annotate published gene expression data. As of March 2004, we used computational and manual annotation methods to make 85,666 annotations representing 26,624 unique loci. We focus on associating genes to controlled vocabulary terms based on experimental data from the literature and use The Arabidopsis Information Resource-developed PubSearch software to facilitate this process. Each annotation is tagged with a combination of evidence codes, evidence descriptions, and references that provide a robust means to assess data quality. Annotation of all Arabidopsis genes will allow quantitative comparisons between sets of genes derived from sources such as microarray experiments. The Arabidopsis annotation data will also facilitate annotation of newly sequenced plant genomes by using sequence similarity to transfer annotations to homologous genes. In addition, complete and up-to-date annotations will make unknown genes easy to identify and target for experimentation. Here, we describe the process of Arabidopsis functional annotation using a variety of data sources and illustrate several ways in which this information can be accessed and used to infer knowledge about Arabidopsis and other plant species. PMID:15173566

Berardini, Tanya Z; Mundodi, Suparna; Reiser, Leonore; Huala, Eva; Garcia-Hernandez, Margarita; Zhang, Peifen; Mueller, Lukas A; Yoon, Jungwoon; Doyle, Aisling; Lander, Gabriel; Moseyko, Nick; Yoo, Danny; Xu, Iris; Zoeckler, Brandon; Montoya, Mary; Miller, Neil; Weems, Dan; Rhee, Seung Y

2004-06-01

42

A Framework for Comparing Phenotype Annotations of Orthologous Genes  

Microsoft Academic Search

Objectives: Animal models are a key resource for the investigation of human diseases. In contrast to functional annotation, phenotype annotation is less standard, and comparing phenotypes across species remains challenging. The objective of this paper is to propose a framework for comparing phenotype annotations of orthologous genes based on the MeSH indexing of biomedical articles in which

Olivier Bodenreider; Anita Burgun

2010-01-01

43

Gene Ontology annotation quality analysis in model eukaryotes  

PubMed Central

Functional analysis using the Gene Ontology (GO) is crucial for array analysis, but it is often difficult for researchers to assess the amount and quality of GO annotations associated with different sets of gene products. In many cases the source of the GO annotations and the date the GO annotations were last updated is not apparent, further complicating a researcher’s ability to assess the quality of the GO data provided. Moreover, GO biocurators need to ensure that the GO quality is maintained and optimal for the functional processes that are most relevant for their research community. We report the GO Annotation Quality (GAQ) score, a quantitative measure of GO quality that includes breadth of GO annotation, the level of detail of annotation and the type of evidence used to make the annotation. As a case study, we apply the GAQ scoring method to a set of diverse eukaryotes and demonstrate how the GAQ score can be used to track changes in GO annotations over time and to assess the quality of GO annotations available for specific biological processes. The GAQ score also allows researchers to quantitatively assess the functional data available for their experimental systems (arrays or databases).

Buza, Teresia J.; McCarthy, Fiona M.; Wang, Nan; Bridges, Susan M.; Burgess, Shane C.

2008-01-01
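
The abstract above names three ingredients of the GAQ score (breadth of annotation, level of detail, and type of evidence) but does not give the formula. The following Python sketch shows one way such ingredients could be combined, purely as an illustration. The evidence weights, the use of GO-graph depth as "level of detail", the product-and-sum scoring rule, and all names (`EVIDENCE_WEIGHT`, `Annotation`, `quality_summary`) are assumptions and are not taken from the GAQ publication.

```python
from dataclasses import dataclass

# Hypothetical evidence-code weights (assumption): experimental codes
# are weighted above electronically inferred ones.
EVIDENCE_WEIGHT = {"IDA": 5, "IMP": 5, "ISS": 3, "IEA": 2, "ND": 1}


@dataclass(frozen=True)
class Annotation:
    gene: str
    go_term: str
    depth: int      # assumed proxy for "level of detail": depth in the GO graph
    evidence: str   # GO evidence code


def annotation_score(a: Annotation) -> int:
    """Score one annotation as depth times evidence weight (assumed rule)."""
    return a.depth * EVIDENCE_WEIGHT.get(a.evidence, 1)


def quality_summary(annotations):
    """Summarize breadth (annotation count) and summed score for a gene set."""
    annotations = list(annotations)
    total = sum(annotation_score(a) for a in annotations)
    genes = {a.gene for a in annotations}
    return {
        "breadth": len(annotations),
        "genes": len(genes),
        "total": total,
        "mean_per_gene": total / len(genes) if genes else 0.0,
    }


example = [
    Annotation("g1", "GO:0000001", 4, "IDA"),
    Annotation("g1", "GO:0000002", 2, "IEA"),
    Annotation("g2", "GO:0000003", 3, "ISS"),
]
summary = quality_summary(example)
```

Computing the same summary for two releases of an annotation file would give a rough way to follow changes over time, which is the kind of use the abstract describes.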

44

Gene Ontology annotation quality analysis in model eukaryotes.  

PubMed

Functional analysis using the Gene Ontology (GO) is crucial for array analysis, but it is often difficult for researchers to assess the amount and quality of GO annotations associated with different sets of gene products. In many cases the source of the GO annotations and the date the GO annotations were last updated is not apparent, further complicating a researcher's ability to assess the quality of the GO data provided. Moreover, GO biocurators need to ensure that the GO quality is maintained and optimal for the functional processes that are most relevant for their research community. We report the GO Annotation Quality (GAQ) score, a quantitative measure of GO quality that includes breadth of GO annotation, the level of detail of annotation and the type of evidence used to make the annotation. As a case study, we apply the GAQ scoring method to a set of diverse eukaryotes and demonstrate how the GAQ score can be used to track changes in GO annotations over time and to assess the quality of GO annotations available for specific biological processes. The GAQ score also allows researchers to quantitatively assess the functional data available for their experimental systems (arrays or databases). PMID:18187504

Buza, Teresia J; McCarthy, Fiona M; Wang, Nan; Bridges, Susan M; Burgess, Shane C

2008-01-10

45

A prioritization analysis of disease association by data-mining of functional annotation of human genes.  

PubMed

Complex diseases result from contributions of multiple genes that act in concert through pathways. Here we present a method to prioritize novel candidates of disease-susceptibility genes depending on the biological similarities to the known disease-related genes. The extent of disease-susceptibility of a gene is prioritized by analyzing seven features of human genes captured in H-InvDB. Taking rheumatoid arthritis (RA) and prostate cancer (PC) as two examples, we evaluated the efficiency of our method. Highly scored genes obtained included TNFSF12 and OSM as candidate disease genes for RA and PC, respectively. Subsequent characterization of these genes based upon an extensive literature survey reinforced the validity of these highly scored genes as possible disease-susceptibility genes. Our approach, Prioritization ANalysis of Disease Association (PANDA), is an efficient and cost-effective method to narrow down a large set of genes into smaller subsets that are most likely to be involved in the disease pathogenesis. PMID:22019378

Taniya, Takayuki; Tanaka, Susumu; Yamaguchi-Kabata, Yumi; Hanaoka, Hideki; Yamasaki, Chisato; Maekawa, Harutoshi; Barrero, Roberto A; Lenhard, Boris; Datta, Milton W; Shimoyama, Mary; Bumgarner, Roger; Chakraborty, Ranajit; Hopkinson, Ian; Jia, Libin; Hide, Winston; Auffray, Charles; Minoshima, Shinsei; Imanishi, Tadashi; Gojobori, Takashi

2011-10-14

46

Metagenomic gene annotation by a homology-independent approach  

SciTech Connect

Fully understanding the genetic potential of a microbial community requires functional annotation of all the genes it encodes. The recently developed deep metagenome sequencing approach has enabled rapid identification of millions of genes from a complex microbial community without cultivation. Current homology-based gene annotation fails to detect distantly-related or structural homologs. Furthermore, homology searches with millions of genes are very computationally intensive. To overcome these limitations, we developed rhModeller, a homology-independent software pipeline to efficiently annotate genes from metagenomic sequencing projects. Using cellulases and carbonic anhydrases as two independent test cases, we demonstrated that rhModeller is much faster than HMMER but with comparable accuracy, at 94.5% and 99.9% accuracy, respectively. More importantly, rhModeller has the ability to detect novel proteins that do not share significant homology to any known protein families. As approximately 50% of the 2 million genes derived from the cow rumen metagenome failed to be annotated based on sequence homology, we tested whether rhModeller could be used to annotate these genes. Preliminary results suggest that rhModeller is robust in the presence of missense and frameshift mutations, two common errors in metagenomic genes. Applying the pipeline to the cow rumen genes identified 4,990 novel cellulase candidates and 8,196 novel carbonic anhydrase candidates. In summary, we expect rhModeller to dramatically increase the speed and quality of metagenomic gene annotation.

Froula, Jeff; Zhang, Tao; Salmeen, Annette; Hess, Matthias; Kerfeld, Cheryl A.; Wang, Zhong; Du, Changbin

2011-06-02

47

Automatic annotation of eukaryotic genes, pseudogenes and promoters  

PubMed Central

Background The ENCODE gene prediction workshop (EGASP) has been organized to evaluate how well state-of-the-art automatic gene finding methods are able to reproduce the manual and experimental gene annotation of the human genome. We have used Softberry gene finding software to predict genes, pseudogenes and promoters in 44 selected ENCODE sequences representing approximately 1% (30 Mb) of the human genome. Predictions of gene finding programs were evaluated in terms of their ability to reproduce the ENCODE-HAVANA annotation. Results The Fgenesh++ gene prediction pipeline can identify 91% of coding nucleotides with a specificity of 90%. Our automatic pseudogene finder (PSF program) found 90% of the manually annotated pseudogenes and some new ones. The Fprom promoter prediction program identifies 80% of TATA promoter sequences with one false positive prediction per 2,000 base-pairs (bp) and 50% of TATA-less promoters with one false positive prediction per 650 bp. It can be used to identify transcription start sites upstream of annotated coding parts of genes found by gene prediction software. Conclusion We review our software and underlying methods for identifying these three important structural and functional genome components and discuss the accuracy of predictions, recent advances and open problems in annotating genomic sequences. We have demonstrated that our methods can be effectively used for initial automatic annotation of the eukaryotic genome.

Solovyev, Victor; Kosarev, Peter; Seledsov, Igor; Vorobyev, Denis

2006-01-01
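
The nucleotide-level figures quoted in the abstract above (91% of coding nucleotides identified, with a specificity of 90%) can be illustrated with a short Python sketch. It assumes the convention common in gene-prediction evaluation, where sensitivity is TP / (TP + FN) and specificity is TP / (TP + FP); the abstract does not state which definitions were used. The function name `nucleotide_accuracy` and the half-open interval representation are hypothetical.

```python
def nucleotide_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the nucleotide level.

    Both arguments are iterables of (start, end) half-open coding intervals.
    Specificity here is TP / (TP + FP), an assumed convention.
    """
    def to_positions(intervals):
        positions = set()
        for start, end in intervals:
            positions.update(range(start, end))
        return positions

    reference = to_positions(annotated)
    prediction = to_positions(predicted)
    true_positives = len(reference & prediction)

    sensitivity = true_positives / len(reference) if reference else 0.0
    specificity = true_positives / len(prediction) if prediction else 0.0
    return sensitivity, specificity


# Toy example: a 100-nt annotated exon and a prediction shifted by 10 nt.
sn, sp = nucleotide_accuracy([(0, 100)], [(10, 110)])
```

Expanding intervals into position sets keeps the sketch short; for sequences on the scale of the 30 Mb ENCODE regions, an interval-sweep implementation would be a more practical choice.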

48

Using reasoning to guide annotation with gene ontology terms in GOAT  

Microsoft Academic Search

High-quality annotation of biological data is central to bioinformatics. Annotation using terms from ontologies provides reliable computational access to data. The Gene Ontology (GO), a structured controlled vocabulary of nearly 17,000 terms, is becoming the de facto standard for describing the functionality of gene products. Many prominent biomedical databases use GO as a source of terms for functional annotation of

Michael Bada; Daniele Turi; Robin McEntire; Robert Stevens

2004-01-01

49

The GATO gene annotation tool for research laboratories.  

PubMed

Large-scale genome projects have generated a rapidly increasing number of DNA sequences. Therefore, development of computational methods to rapidly analyze these sequences is essential for progress in genomic research. Here we present an automatic annotation system for preliminary analysis of DNA sequences. The gene annotation tool (GATO) is a Bioinformatics pipeline designed to facilitate routine functional annotation and easy access to annotated genes. It was designed in view of the frequent need of genomic researchers to access data pertaining to a common set of genes. In the GATO system, annotation is generated by querying some of the Web-accessible resources and the information is stored in a local database, which keeps a record of all previous annotation results. GATO may be accessed from everywhere through the internet or may be run locally if a large number of sequences are going to be annotated. It is implemented in PHP and Perl and may be run on any suitable Web server. Usually, installation and application of annotation systems require experience and are time consuming, but GATO is simple and practical, allowing anyone with basic skills in informatics to access it without any special training. GATO can be downloaded at [http://mariwork.iq.usp.br/gato/]. Minimum computer free space required is 2 MB. PMID:16258624

Fujita, A; Massirer, K B; Durham, A M; Ferreira, C E; Sogayar, M C

2005-10-26

50

Functional Annotation of Small Noncoding RNAs Target Genes Provides Evidence for a Deregulated Ubiquitin-Proteasome Pathway in Spinocerebellar Ataxia Type 1  

PubMed Central

Spinocerebellar ataxia type 1 (SCA1) is a neurodegenerative disorder caused by the expansion of CAG repeats in the ataxin 1 (ATXN1) gene. In affected cerebellar neurons of patients, mutant ATXN1 accumulates in ubiquitin-positive nuclear inclusions, indicating that protein misfolding is involved in SCA1 pathogenesis. In this study, we functionally annotated the target genes of the small noncoding RNAs (ncRNAs) that were selectively activated in the affected brain compartments. The primary targets of these RNAs, which exhibited a significant enrichment in the cerebellum and cortex of SCA1 patients, were members of the ubiquitin-proteasome system. Thus, we identified and functionally annotated a plausible regulatory pathway that may serve as a potential target to modulate the outcome of neurodegenerative diseases.

Persengiev, Stephan; Kondova, Ivanela; Bontrop, Ronald E.

2012-01-01

51

FunnyBase: a systems level functional annotation of Fundulus ESTs for the analysis of gene expression  

Microsoft Academic Search

BACKGROUND: While studies of non-model organisms are critical for many research areas, such as evolution, development, and environmental biology, they present particular challenges for both experimental and computational genomic level research. Resources such as mass-produced microarrays and the computational tools linking these data to functional annotation at the system and pathway level are rarely available for non-model species. This type

Justin E Paschall; Marjorie F Oleksiak; Jeffrey D VanWye; Jennifer L Roach; J Andrew Whitehead; Gerald J Wyckoff; Kevin J Kolell; Douglas L Crawford

2004-01-01

52

FUNCTIONAL ANNOTATION OF OIL PALM GENES USING AN AUTOMATED BIOINFORMATICS APPROACH  

Microsoft Academic Search

Recent advances in DNA sequencing technologies have led to a tremendous increase in the amount of sequence information available in public databases. To address the need for automated methods of assigning a putative function to each sequence, we have developed bioinformatics tools that can be run on a desktop computer and save significant time and effort. Elaeis guineensis and Elaeis

LAURA B WILLIS; PHILIP A LESSARD; JEFFERSON A PARKER; XIAN M O'BRIEN; ANTHONY J SINSKEY

53

Functional genomics tools applied to plant metabolism: a survey on plant respiration, its connections and the annotation of complex gene functions  

PubMed Central

The application of post-genomic techniques in plant respiration studies has greatly improved our ability to assign functions to gene products. In addition it has also revealed previously unappreciated interactions between distal elements of metabolism. Such results have reinforced the need to consider plant respiratory metabolism as part of a complex network and making sense of such interactions will ultimately require the construction of predictive and mechanistic models. Transcriptomics, proteomics, metabolomics, and the quantification of metabolic flux will be of great value in creating such models both by facilitating the annotation of complex gene function, determining their structure and by furnishing the quantitative data required to test them. In this review, we highlight how these experimental approaches have contributed to our current understanding of plant respiratory metabolism and its interplay with associated processes (e.g., photosynthesis, photorespiration, and nitrogen metabolism). We also discuss how data from these techniques may be integrated, with the ultimate aim of identifying mechanisms that control and regulate plant respiration and discovering novel gene functions with potential biotechnological implications.

Araujo, Wagner L.; Nunes-Nesi, Adriano; Williams, Thomas C. R.

2012-01-01

54

The Gene Wiki: community intelligence applied to human gene annotation.  

PubMed

Annotating the function of all human genes is a critical, yet formidable, challenge. Current gene annotation efforts focus on centralized curation resources, but it is increasingly clear that this approach does not scale with the rapid growth of the biomedical literature. The Gene Wiki utilizes an alternative and complementary model based on the principle of community intelligence. Directly integrated within the online encyclopedia, Wikipedia, the goal of this effort is to build a gene-specific review article for every gene in the human genome, where each article is collaboratively written, continuously updated and community reviewed. Previously, we described the creation of Gene Wiki 'stubs' for approximately 9000 human genes. Here, we describe ongoing systematic improvements to these articles to increase their utility. Moreover, we retrospectively examine the community usage and improvement of the Gene Wiki, providing evidence of a critical mass of users and editors. Gene Wiki articles are freely accessible within the Wikipedia web site, and additional links and information are available at http://en.wikipedia.org/wiki/Portal:Gene_Wiki. PMID:19755503

Huss, Jon W; Lindenbaum, Pierre; Martone, Michael; Roberts, Donabel; Pizarro, Angel; Valafar, Faramarz; Hogenesch, John B; Su, Andrew I

2009-09-15

55

Re-Annotation Is an Essential Step in Systems Biology Modeling of Functional Genomics Data  

PubMed Central

One motivation of systems biology research is to understand gene functions and interactions from functional genomics data such as that derived from microarrays. Up-to-date structural and functional annotations of genes are an essential foundation of systems biology modeling. We propose that the first essential step in any systems biology modeling of functional genomics data, especially for species with recently sequenced genomes, is gene structural and functional re-annotation. To demonstrate the impact of such re-annotation, we structurally and functionally re-annotated a microarray developed, and previously used, as a tool for disease research. We quantified the impact of this re-annotation on the array based on the total numbers of structural- and functional-annotations, the Gene Annotation Quality (GAQ) score, and canonical pathway coverage. We next quantified the impact of re-annotation on systems biology modeling using a previously published experiment that used this microarray. We show that re-annotation improves the quantity and quality of structural- and functional-annotations, allows a more comprehensive Gene Ontology based modeling, and improves pathway coverage for both the whole array and a differentially expressed mRNA subset. Our results also demonstrate that re-annotation can result in a different knowledge outcome derived from previous published research findings. We propose that, because of this, re-annotation should be considered to be an essential first step for deriving value from functional genomics data.

van den Berg, Bart H. J.; McCarthy, Fiona M.; Lamont, Susan J.; Burgess, Shane C.

2010-01-01

56

Protein family classification and functional annotation  

Microsoft Academic Search

With the accelerated accumulation of genomic sequence data, there is a pressing need to develop computational methods and advanced bioinformatics infrastructure for reliable and large-scale protein annotation and biological knowledge discovery. The Protein Information Resource (PIR) provides an integrated public resource of protein informatics to support genomic and proteomic research. PIR produces the Protein Sequence Database of functionally annotated protein

Cathy H. Wu; Hongzhan Huang; Lai-su L. Yeh; Winona C. Barker

2003-01-01

57

Genetic Annotation of Gain-Of-Function Screens Using RNA Interference and in Situ Hybridization of Candidate Genes in the Drosophila Wing  

PubMed Central

Gain-of-function screens in Drosophila are an effective method with which to identify genes that affect the development of particular structures or cell types. It has been found that a fraction of 2–10% of the genes tested, depending on the particularities of the screen, results in a discernible phenotype when overexpressed. However, it is not clear to what extent a gain-of-function phenotype generated by overexpression is informative about the normal function of the gene. Thus, very few reports attempt to correlate the loss- and overexpression phenotype for collections of genes identified in gain-of-function screens. In this work we use RNA interference and in situ hybridization to annotate a collection of 123 P-GS insertions that in combination with different Gal4 drivers affect the size and/or patterning of the wing. We identify the gene causing the overexpression phenotype by expressing, in a background of overexpression, RNA interference for the genes affected by each P-GS insertion. Then, we compare the loss and gain-of-function phenotypes obtained for each gene and relate them to its expression pattern in the wing disc. We find that 52% of genes identified by their overexpression phenotype are required during normal development. However, only in 9% of the cases analyzed was there some complementarity between the gain- and loss-of-function phenotype, suggesting that, in general, the overexpression phenotypes would not be indicative of the normal requirements of the gene.

Molnar, Cristina; Casado, Mar; Lopez-Varea, Ana; Cruz, Cristina; de Celis, Jose F.

2012-01-01

58

Dependence Relationships between Gene Ontology Terms based on TIGR Gene Product Annotations  

Microsoft Academic Search

The Gene Ontology is an important tool for the representation and processing of information about gene products and functions. It provides controlled vocabularies for the designations of cellular components, molecular functions, and biological processes used in the annotation of genes and gene products. These constitute three separate ontologies, of cellular components, molecular functions and biological processes, respectively. The question

Anand KUMAR; Barry SMITH; Christian BORGELT

59

POCUS: mining genomic sequence annotation to predict disease genes  

Microsoft Academic Search

Here we present POCUS (prioritization of candidate genes using statistics), a novel computational approach to prioritize candidate disease genes that is based on over-representation of functional annotation between loci for the same disease. We show that POCUS can provide high (up to 81-fold) enrichment of real disease genes in the candidate-gene shortlists it produces compared with the original large sets

Frances S Turner; Daniel R Clutterbuck; Colin AM Semple

2003-01-01

60

Metabolomics as a Hypothesis-Generating Functional Genomics Tool for the Annotation of Arabidopsis thaliana Genes of "Unknown Function"  

PubMed Central

Metabolomics is the methodology that identifies and measures global pools of small molecules (of less than about 1,000 Da) of a biological sample, which are collectively called the metabolome. Metabolomics can therefore reveal the metabolic outcome of a genetic or environmental perturbation of a metabolic regulatory network, and thus provide insights into the structure and regulation of that network. Because of the chemical complexity of the metabolome and limitations associated with individual analytical platforms for determining the metabolome, it is currently difficult to capture the complete metabolome of an organism or tissue, which is in contrast to genomics and transcriptomics. This paper describes the analysis of Arabidopsis metabolomics data sets acquired by a consortium that includes five analytical laboratories, bioinformaticists, and biostatisticians, which aims to develop and validate metabolomics as a hypothesis-generating functional genomics tool. The consortium is determining the metabolomes of Arabidopsis T-DNA mutant stocks, grown in standardized controlled environment optimized to minimize environmental impacts on the metabolomes. Metabolomics data were generated with seven analytical platforms, and the combined data is being provided to the research community to formulate initial hypotheses about genes of unknown function (GUFs). A public database (www.PlantMetabolomics.org) has been developed to provide the scientific community with access to the data along with tools to allow for its interactive analysis. Exemplary datasets are discussed to validate the approach, which illustrate how initial hypotheses can be generated from the consortium-produced metabolomics data, integrated with prior knowledge to provide a testable hypothesis concerning the functionality of GUFs.

Quanbeck, Stephanie M.; Brachova, Libuse; Campbell, Alexis A.; Guan, Xin; Perera, Ann; He, Kun; Rhee, Seung Y.; Bais, Preeti; Dickerson, Julie A.; Dixon, Philip; Wohlgemuth, Gert; Fiehn, Oliver; Barkan, Lenore; Lange, Iris; Lange, B. Markus; Lee, Insuk; Cortes, Diego; Salazar, Carolina; Shuman, Joel; Shulaev, Vladimir; Huhman, David V.; Sumner, Lloyd W.; Roth, Mary R.; Welti, Ruth; Ilarslan, Hilal; Wurtele, Eve S.; Nikolau, Basil J.

2012-01-01

61

Functional genome annotation through phylogenomic mapping  

Microsoft Academic Search

Accurate determination of functional interactions among proteins at the genome level remains a challenge for genomic research. Here we introduce a genome-scale approach to functional protein annotation—phylogenomic mapping—that requires only sequence data, can be applied equally well to both finished and unfinished genomes, and can be extended beyond single genomes to annotate multiple genomes simultaneously. We have developed and applied

Balaji S Srinivasan; Nora B Caberoy; Garret Suen; Rion G Taylor; Radhika Shah; Farah Tengra; Barry S Goldman; Anthony G Garza; Roy D Welch

2005-01-01

62

The disease and gene annotations (DGA): an annotation resource for human disease  

PubMed Central

Disease and Gene Annotations database (DGA, http://dga.nubic.northwestern.edu) is a collaborative effort aiming to provide a comprehensive and integrative annotation of the human genes in disease network context by integrating computable controlled vocabulary of the Disease Ontology (DO version 3 revision 2510, which has 8043 inherited, developmental and acquired human diseases), NCBI Gene Reference Into Function (GeneRIF) and molecular interaction network (MIN). DGA integrates these resources together using semantic mappings to build an integrative set of disease-to-gene and gene-to-gene relationships with excellent coverage based on current knowledge. DGA is kept current by periodically reparsing DO, GeneRIF, and MINs. DGA provides a user-friendly and interactive web interface system enabling users to efficiently query, download and visualize the DO tree structure and annotations as a tree, a network graph or a tabular list. To facilitate integrative analysis, DGA provides a web service Application Programming Interface for integration with external analytic tools.

Peng, Kai; Xu, Wei; Zheng, Jianyong; Huang, Kegui; Wang, Huisong; Tong, Jiansong; Lin, Zhifeng; Liu, Jun; Cheng, Wenqing; Fu, Dong; Du, Pan; Kibbe, Warren A.; Lin, Simon M.; Xia, Tian

2013-01-01

63

The Disease and Gene Annotations (DGA): an annotation resource for human disease.  

PubMed

Disease and Gene Annotations database (DGA, http://dga.nubic.northwestern.edu) is a collaborative effort aiming to provide a comprehensive and integrative annotation of the human genes in disease network context by integrating computable controlled vocabulary of the Disease Ontology (DO version 3 revision 2510, which has 8043 inherited, developmental and acquired human diseases), NCBI Gene Reference Into Function (GeneRIF) and molecular interaction network (MIN). DGA integrates these resources together using semantic mappings to build an integrative set of disease-to-gene and gene-to-gene relationships with excellent coverage based on current knowledge. DGA is kept current by periodically reparsing DO, GeneRIF, and MINs. DGA provides a user-friendly and interactive web interface system enabling users to efficiently query, download and visualize the DO tree structure and annotations as a tree, a network graph or a tabular list. To facilitate integrative analysis, DGA provides a web service Application Programming Interface for integration with external analytic tools. PMID:23197658

Peng, Kai; Xu, Wei; Zheng, Jianyong; Huang, Kegui; Wang, Huisong; Tong, Jiansong; Lin, Zhifeng; Liu, Jun; Cheng, Wenqing; Fu, Dong; Du, Pan; Kibbe, Warren A; Lin, Simon M; Xia, Tian

2012-11-28

64

GFam: a platform for automatic annotation of gene families  

PubMed Central

We have developed GFam, a platform for automatic annotation of gene/protein families. GFam provides a framework for genome initiatives and model organism resources to build domain-based families and derive meaningful functional labels, and offers a seamless approach to propagate functional annotation across periodic genome updates. GFam is a hybrid approach that uses a greedy algorithm to chain component domains from InterPro annotation provided by its 12 member resources, followed by a sequence-based connected component analysis of un-annotated sequence regions, to derive a consensus domain architecture for each sequence and subsequently generate families based on common architectures. For the proteome of Arabidopsis, our integrated approach increases sequence coverage by 7.2 percentage points and residue coverage by 14.6 percentage points relative to the best single-constituent database within InterPro. The true power of GFam lies in maximizing annotation provided by the different InterPro data sources that offer resource-specific coverage for different regions of a sequence. GFam’s capability to capture higher sequence and residue coverage can be useful for genome annotation, comparative genomics and functional studies. GFam is general-purpose software and can be used for any collection of protein sequences. The software is open source and can be obtained from http://www.paccanarolab.org/software/gfam/.

Sasidharan, Rajkumar; Nepusz, Tamas; Swarbreck, David; Huala, Eva; Paccanaro, Alberto

2012-01-01
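
The greedy domain chaining mentioned in the GFam abstract above can be illustrated with a minimal Python sketch. This is not GFam's implementation: the Hit record, the "longest hit first" ranking and the domain names are assumptions made only for illustration, and the connected component analysis of un-annotated regions is not shown.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    start: int  # 1-based inclusive start on the protein
    end: int    # 1-based inclusive end on the protein
    name: str   # domain label (hypothetical names below)


def overlaps(a: Hit, b: Hit) -> bool:
    """True when two hits share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def greedy_architecture(hits: List[Hit]) -> List[str]:
    """Pick non-overlapping hits, longest first, and list them in sequence order."""
    chosen: List[Hit] = []
    # Assumption: longer hits are preferred; ties are broken by start position.
    for hit in sorted(hits, key=lambda h: (-(h.end - h.start + 1), h.start)):
        if not any(overlaps(hit, c) for c in chosen):
            chosen.append(hit)
    return [h.name for h in sorted(chosen, key=lambda h: h.start)]


hits = [
    Hit(1, 120, "domA"),
    Hit(100, 150, "domB"),  # overlaps domA and is shorter, so it is dropped
    Hit(200, 320, "domC"),
    Hit(210, 250, "domD"),  # nested inside domC, so it is dropped
]
architecture = greedy_architecture(hits)
print(architecture)
```

Sequences that end up with the same architecture list could then be grouped into one family, which is the grouping step the abstract describes in general terms.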

65

A dictionary based approach for gene annotation  

Microsoft Academic Search

This paper describes a fast and fully automated dictionary based approach to gene annotation and exon prediction. Two dictionaries are constructed, one from the nonredundant protein OWL database and the other from the dbEST database. These dictionaries are used to obtain O(1) time lookups of tuples in the dictionaries (4 tuples for the OWL database and 11 tuples

Lior Pachter; Serafim Batzoglou; Valentin I. Spitkovsky; William S. Beebee Jr.; Eric S. Lander; Bonnie Berger; Daniel J. Kleitman

1999-01-01
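
The constant-time tuple lookup described in the abstract above can be sketched with a hash set of fixed-length substrings. The sequences, function names and the choice of a Python set are assumptions for illustration; the paper's own data structures are not described in the truncated abstract.

```python
def build_dictionary(sequences, k):
    """Collect every length-k substring (tuple) of the reference sequences in a hash set."""
    table = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            table.add(seq[i:i + k])
    return table


def matching_positions(query, table, k):
    """Return start offsets in `query` whose k-tuple occurs in the dictionary."""
    return [i for i in range(len(query) - k + 1) if query[i:i + k] in table]


# Toy protein sequences (hypothetical), indexed with 4-tuples as in the abstract.
reference_proteins = ["MKTAYIAKQR", "GAVLIMKTAY"]
protein_dict = build_dictionary(reference_proteins, k=4)
hits = matching_positions("QQMKTAYQQ", protein_dict, k=4)
print(hits)
```

Each membership test is an expected constant-time hash lookup for a fixed k, so scanning a query is linear in its length.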

66

Modularization of Protein Interaction Networks by Incorporating Gene Ontology Annotations  

Microsoft Academic Search

Recent computational analyses of protein interaction networks have attempted to understand cellular organizations, processes and functions. However, they have encountered difficulties due to unreliable interaction data and the complexity of the networks. In this paper, we propose the integration of protein interaction networks with gene ontology annotations for assessing the reliability of current protein-protein interaction data. The interaction reliability can

Young-rae Cho; Woochang Hwang; Aidong Zhang

2007-01-01

67

Taxonomic Precision of Different Hypervariable Regions of 16S rRNA Gene and Annotation Methods for Functional Bacterial Groups in Biological Wastewater Treatment  

PubMed Central

High-throughput sequencing of the 16S rRNA gene gives a deeper understanding of bacterial diversity in complex environmental samples, but introduces blurring due to the relatively low taxonomic resolution of short reads. For wastewater treatment plants, only those functional bacterial genera categorized as nutrient remediators, bulk/foaming species, and potential pathogens are significant to biological wastewater treatment and environmental impacts. Precise taxonomic assignment of these bacteria, at least at genus level, is important for microbial ecological research and routine wastewater treatment monitoring. Therefore, the focus of this study was to evaluate the taxonomic precision of different ribosomal RNA (rRNA) gene hypervariable regions generated from a mixed activated sludge sample. In addition, three commonly used classification methods, namely RDP Classifier, BLAST-based best-hit annotation, and the lowest common ancestor annotation by MEGAN, were evaluated by comparing their consistency. In an unsupervised analysis, the consistency among different classification methods suggests that no hypervariable region has good taxonomic coverage for all genera. Taxonomic assignments based on certain regions of the 16S rRNA gene, e.g. the V1&V2 regions, are fairly consistent for a relatively wide range of genera. Hence, it is recommended to use these regions for studying functional groups in activated sludge. Moreover, the inconsistency among methods also demonstrated that a specific method might not be suitable for identification of some bacterial genera using certain 16S rRNA gene regions. As a general rule, drawing conclusions based on only one sequencing region and one classification method should be avoided due to potential false negative results.

Guo, Feng; Ju, Feng; Cai, Lin; Zhang, Tong

2013-01-01
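
The lowest common ancestor idea named in the abstract above can be shown with a short sketch: a read is assigned to the deepest taxon shared by all of its candidate hits. This is a simplification and not MEGAN's actual procedure (which also applies score-based filtering of hits); the two lineages are illustrative inputs.

```python
def lowest_common_ancestor(lineages):
    """Return the longest shared prefix of the given root-to-leaf lineages."""
    if not lineages:
        return []
    shared = []
    for ranks in zip(*lineages):
        if all(r == ranks[0] for r in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


# Illustrative lineages for two candidate hits of one read.
hit_lineages = [
    ["Bacteria", "Proteobacteria", "Betaproteobacteria",
     "Nitrosomonadales", "Nitrosomonadaceae", "Nitrosomonas"],
    ["Bacteria", "Proteobacteria", "Betaproteobacteria",
     "Nitrosomonadales", "Nitrosomonadaceae", "Nitrosospira"],
]
lca = lowest_common_ancestor(hit_lineages)
print(lca[-1])
```

When the hits disagree at genus level, as here, the read is only assigned at family level, which is one way a short read can fail to resolve a functional genus.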

68

Genome Annotation in Plants and Fungi: EuGene as a Model Platform  

Microsoft Academic Search

In this era of whole genome sequencing, reliable genome annotations (identification of functional regions) are the cornerstones for many subsequent analyses. Not only is careful annotation important for studying the gene and gene family content of a genome and its host, but also for wide-scale transcriptome and proteome analyses attempting to describe a certain biological process or to get

Sylvain Foissac; Jerome Gouzy; Stephane Rombauts; Catherine Mathe; Joelle Amselem; Lieven Sterck; Yves Van de Peer; Pierre Rouze; Thomas Schiex

2008-01-01

69

Improving gene annotation using peptide mass spectrometry  

PubMed Central

Annotation of protein-coding genes is a key goal of genome sequencing projects. In spite of tremendous recent advances in computational gene finding, comprehensive annotation remains a challenge. Peptide mass spectrometry is a powerful tool for researching the dynamic proteome and suggests an attractive approach to discover and validate protein-coding genes. We present algorithms to construct and efficiently search spectra against a genomic database, with no prior knowledge of encoded proteins. By searching a corpus of 18.5 million tandem mass spectra (MS/MS) from human proteomic samples, we validate 39,000 exons and 11,000 introns at the level of translation. We present translation-level evidence for novel or extended exons in 16 genes, confirm translation of 224 hypothetical proteins, and discover or confirm over 40 alternative splicing events. Polymorphisms are efficiently encoded in our database, allowing us to observe variant alleles for 308 coding SNPs. Finally, we demonstrate the use of mass spectrometry to improve automated gene prediction, adding 800 correct exons to our predictions using a simple rescoring strategy. Our results demonstrate that proteomic profiling should play a role in any genome sequencing project.

Tanner, Stephen; Shen, Zhouxin; Ng, Julio; Florea, Liliana; Guigo, Roderic; Briggs, Steven P.; Bafna, Vineet

2007-01-01
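
Searching spectra against a genome with no prior knowledge of the encoded proteins implies that candidate peptides come from the DNA itself. A naive way to picture this is six-frame translation, sketched below. The authors' database is described as more compact (it also encodes splicing and polymorphisms), so this sketch is only an assumption-laden illustration; the input sequence and function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame(dna):
    """Translate both strands in all three reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame("ATGGCCAAGTAA")
for name, protein in frames.items():
    print(name, protein)
```

Peptides derived from such translations could then be compared with observed spectra; that scoring step is outside the scope of this sketch.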

70

Gene Ontology annotations at SGD: new data sources and annotation methods  

PubMed Central

The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) collects and organizes biological information about the chromosomal features and gene products of the budding yeast Saccharomyces cerevisiae. Although published data from traditional experimental methods are the primary sources of evidence supporting Gene Ontology (GO) annotations for a gene product, high-throughput experiments and computational predictions can also provide valuable insights in the absence of an extensive body of literature. Therefore, GO annotations available at SGD now include high-throughput data as well as computational predictions provided by the GO Annotation Project (GOA UniProt; http://www.ebi.ac.uk/GOA/). Because the annotation method used to assign GO annotations varies by data source, GO resources at SGD have been modified to distinguish data sources and annotation methods. In addition to providing information for genes that have not been experimentally characterized, GO annotations from independent sources can be compared to those made by SGD to help keep the literature-based GO annotations current.

Hong, Eurie L.; Balakrishnan, Rama; Dong, Qing; Christie, Karen R.; Park, Julie; Binkley, Gail; Costanzo, Maria C.; Dwight, Selina S.; Engel, Stacia R.; Fisk, Dianna G.; Hirschman, Jodi E.; Hitz, Benjamin C.; Krieger, Cynthia J.; Livstone, Michael S.; Miyasato, Stuart R.; Nash, Robert S.; Oughtred, Rose; Skrzypek, Marek S.; Weng, Shuai; Wong, Edith D.; Zhu, Kathy K.; Dolinski, Kara; Botstein, David; Cherry, J. Michael

2008-01-01

71

GeneAnnot: Interfacing GeneCards with high-throughput gene expression compendia  

Microsoft Academic Search

The interpretation of microarray expression results often includes extensive efforts to identify and annotate the gene representatives immobilised on the arrays. In this paper we describe the usage of our automatic GeneAnnot system, which links between Affymetrix arrays and the rich human gene annotations available in GeneCards. We explain GeneCards search options and results display; elaborate on the presentation of

Vered Chalifa-caspi; Orit Shmueli; Hila Benjamin-rodrig; Naomi Rosen; Michael Shmoish; Itai Yanai; Ron Ophir; Pavel Kats; Marilyn Safran; Doron Lancet

2003-01-01

72

Functional annotation of the human retinal pigment epithelium transcriptome  

PubMed Central

Background To determine level, variability and functional annotation of gene expression of the human retinal pigment epithelium (RPE), the key tissue involved in retinal diseases like age-related macular degeneration and retinitis pigmentosa. Macular RPE cells from six selected healthy human donor eyes (aged 63–78 years) were laser dissected and used for 22k microarray studies (Agilent technologies). Data were analyzed with Rosetta Resolver, the web tool DAVID and Ingenuity software. Results In total, we identified 19,746 array entries with significant expression in the RPE. Gene expression was analyzed according to expression levels, interindividual variability and functionality. A group of highly (n = 2,194) expressed RPE genes showed an overrepresentation of genes of the oxidative phosphorylation, ATP synthesis and ribosome pathways. In the group of moderately expressed genes (n = 8,776) genes of the phosphatidylinositol signaling system and aminosugars metabolism were overrepresented. As expected, the top 10 percent (n = 2,194) of genes with the highest interindividual differences in expression showed functional overrepresentation of the complement cascade, essential in inflammation in age-related macular degeneration, and other signaling pathways. Surprisingly, this same category also includes the genes involved in Bruch's membrane (BM) composition. Among the top 10 percent of genes with low interindividual differences, there was an overrepresentation of genes involved in local glycosaminoglycan turnover. Conclusion Our study expands current knowledge of the RPE transcriptome by assigning new genes, and adding data about expression level and interindividual variation. Functional annotation suggests that the RPE has high levels of protein synthesis, strong energy demands, and is exposed to high levels of oxidative stress and a variable degree of inflammation. 
Our data sheds new light on the molecular composition of BM, adjacent to the RPE, and is useful for candidate retinal disease gene identification or gene dose-dependent therapeutic studies.

Booij, Judith C; van Soest, Simone; Swagemakers, Sigrid MA; Essing, Anke HW; Verkerk, Annemieke JMH; van der Spek, Peter J; Gorgels, Theo GMF; Bergen, Arthur AB

2009-01-01
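
The overrepresentation statements in the abstract above rest on an enrichment test. A minimal sketch of a plain hypergeometric upper-tail test follows. The tools named in the abstract may use variants of this test, and the numbers in the example are toy values, not data from the study.

```python
from math import comb


def hypergeom_enrichment_p(total, pathway, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `total` genes, `pathway` of which belong to the category."""
    upper = min(pathway, selected)
    numerator = sum(
        comb(pathway, k) * comb(total - pathway, selected - k)
        for k in range(overlap, upper + 1)
    )
    return numerator / comb(total, selected)


# Toy example: 20 genes on the array, 5 in a pathway,
# 5 genes in the group of interest, 3 of them in the pathway.
p_value = hypergeom_enrichment_p(total=20, pathway=5, selected=5, overlap=3)
print(p_value)
```

In practice one such test is run per pathway, so the resulting p-values would normally be corrected for multiple testing.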

73

dbWFA: a web-based database for functional annotation of Triticum aestivum transcripts  

PubMed Central

The functional annotation of genes based on sequence homology with genes from model species genomes is time-consuming because it is necessary to mine several unrelated databases. The aim of the present work was to develop a functional annotation database for common wheat Triticum aestivum (L.). The database, named dbWFA, is based on the reference NCBI UniGene set, an expressed gene catalogue built by expressed sequence tag clustering, and on full-length coding sequences retrieved from the TriFLDB database. Information from good-quality heterogeneous sources, including annotations for model plant species Arabidopsis thaliana (L.) Heynh. and Oryza sativa L., was gathered and linked to T. aestivum sequences through BLAST-based homology searches. Even though the complexity of the transcriptome cannot yet be fully appreciated, we developed a tool to easily and promptly obtain information from multiple functional annotation systems (Gene Ontology, MapMan bin codes, MIPS Functional Categories, PlantCyc pathway reactions and TAIR gene families). The use of dbWFA is illustrated here with several query examples. We were able to assign a putative function to 45% of the UniGenes and 81% of the full-length coding sequences from TriFLDB. Moreover, comparison of the annotation of the whole T. aestivum UniGene set along with curated annotations of the two model species assessed the accuracy of the annotation provided by dbWFA. To further illustrate the use of dbWFA, genes specifically expressed during the early cell division or late storage polymer accumulation phases of T. aestivum grain development were identified using a clustering analysis and then annotated using dbWFA. The annotation of these two sets of genes was consistent with previous analyses of T. aestivum grain transcriptomes and proteomes. Database URL: urgi.versailles.inra.fr/dbWFA/

Vincent, Jonathan; Dai, Zhanwu; Ravel, Catherine; Choulet, Frederic; Mouzeyar, Said; Bouzidi, M. Fouad; Agier, Marie; Martre, Pierre

2013-01-01
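
The BLAST-based linking of wheat sequences to model-species annotations described above can be sketched as a best-hit transfer. The sketch assumes the standard 12-column tabular BLAST output (query, subject, ..., e-value, bit score); the identifiers, the e-value cutoff and the annotation table are hypothetical, and dbWFA's own pipeline may differ.

```python
def best_hits(blast_lines, max_evalue=1e-5):
    """Keep, for each query, the subject with the highest bit score under the e-value cutoff."""
    best = {}
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {q: s for q, (s, _) in best.items()}


def transfer(best, reference_annotations):
    """Copy the annotation of the best-hit reference gene to each query."""
    return {q: reference_annotations[s] for q, s in best.items() if s in reference_annotations}


# Toy tabular BLAST rows (hypothetical identifiers and scores).
rows = [
    "\t".join(["wheat_1", "ref_A", "92.0", "200", "16", "0", "1", "200", "1", "200", "1e-80", "300"]),
    "\t".join(["wheat_1", "ref_B", "80.0", "150", "30", "0", "1", "150", "1", "150", "1e-30", "120"]),
    "\t".join(["wheat_2", "ref_C", "35.0", "60", "39", "0", "1", "60", "1", "60", "0.5", "25"]),
]
reference_annotations = {"ref_A": ["GO:0003700"], "ref_B": ["GO:0005515"]}
putative = transfer(best_hits(rows), reference_annotations)
print(putative)
```

Queries with no hit under the cutoff (wheat_2 here) stay unannotated, which is why only a fraction of sequences receives a putative function.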

74

dbWFA: a web-based database for functional annotation of Triticum aestivum transcripts.  

PubMed

The functional annotation of genes based on sequence homology with genes from model species genomes is time-consuming because it is necessary to mine several unrelated databases. The aim of the present work was to develop a functional annotation database for common wheat Triticum aestivum (L.). The database, named dbWFA, is based on the reference NCBI UniGene set, an expressed gene catalogue built by expressed sequence tag clustering, and on full-length coding sequences retrieved from the TriFLDB database. Information from good-quality heterogeneous sources, including annotations for model plant species Arabidopsis thaliana (L.) Heynh. and Oryza sativa L., was gathered and linked to T. aestivum sequences through BLAST-based homology searches. Even though the complexity of the transcriptome cannot yet be fully appreciated, we developed a tool to easily and promptly obtain information from multiple functional annotation systems (Gene Ontology, MapMan bin codes, MIPS Functional Categories, PlantCyc pathway reactions and TAIR gene families). The use of dbWFA is illustrated here with several query examples. We were able to assign a putative function to 45% of the UniGenes and 81% of the full-length coding sequences from TriFLDB. Moreover, comparison of the annotation of the whole T. aestivum UniGene set along with curated annotations of the two model species assessed the accuracy of the annotation provided by dbWFA. To further illustrate the use of dbWFA, genes specifically expressed during the early cell division or late storage polymer accumulation phases of T. aestivum grain development were identified using a clustering analysis and then annotated using dbWFA. The annotation of these two sets of genes was consistent with previous analyses of T. aestivum grain transcriptomes and proteomes. Database URL: urgi.versailles.inra.fr/dbWFA/ PMID:23660284

Vincent, Jonathan; Dai, Zhanwu; Ravel, Catherine; Choulet, Frédéric; Mouzeyar, Said; Bouzidi, M Fouad; Agier, Marie; Martre, Pierre

2013-05-09

75

Construction of coffee transcriptome networks based on gene annotation semantics.  

PubMed

Gene annotation is a process that encompasses multiple approaches on the analysis of nucleic acids or protein sequences in order to assign structural and functional characteristics to gene models. When thousands of gene models are being described in an organism genome, construction and visualization of gene networks impose novel challenges in the understanding of complex expression patterns and the generation of new knowledge in genomics research. In order to take advantage of accumulated text data after conventional gene sequence analysis, this work applied semantics in combination with visualization tools to build transcriptome networks from a set of coffee gene annotations. A set of selected coffee transcriptome sequences, chosen by the quality of the sequence comparison reported by Basic Local Alignment Search Tool (BLAST) and Interproscan, were filtered out by coverage, identity, length of the query, and e-values. Meanwhile, term descriptors for molecular biology and biochemistry were obtained along the Wordnet dictionary in order to construct a Resource Description Framework (RDF) using Ruby scripts and Methontology to find associations between concepts. Relationships between sequence annotations and semantic concepts were graphically represented through a total of 6845 oriented vectors, which were reduced to 745 non-redundant associations. A large gene network connecting transcripts by way of relational concepts was created where detailed connections remain to be validated for biological significance based on current biochemical and genetics frameworks. Besides reusing text information in the generation of gene connections and for data mining purposes, this tool development opens the possibility to visualize complex and abundant transcriptome data, and triggers the formulation of new hypotheses in metabolic pathways analysis. PMID:22829576

Castillo, Luis F; Galeano, Narmer; Isaza, Gustavo A; Gaitán, Alvaro

2012-07-24

76

Use of Adenovirus-Mediated Gene Transfer to Facilitate Biological Annotation of Novel Genes  

Microsoft Academic Search

As part of a large program of gene annotation, use of adenovirus-mediated gene transfer facilitated rapid progress in the functional evaluation of more than 100 genes. Localized or systemic exposure to gene products expressed by adenovirus-transduced cells led to the discovery of several novel activities through analysis of resulting physiochemical or histological changes. In this summary of the work, we

Jeff L. Ellsworth; Andrew Feldhaus; Steven D. Hughes

77

Protein Annotation from Protein Interaction Networks and Gene Ontology  

PubMed Central

We introduce a novel method for annotating protein function that combines Naïve Bayes and association rules, and takes advantage of the underlying topology in protein interaction networks and the structure of graphs in the Gene Ontology. We apply our method to proteins from the Human Protein Reference Database (HPRD) and show that, in comparison with other approaches, it predicts protein functions with significantly higher recall with no loss of precision. Specifically, it achieves 51% precision and 60% recall versus 45% and 26% for Majority and 24% and 61% for χ2-Statistics, respectively.

Gardiner, Katheleen J.; Cios, Krzysztof J.

2011-01-01
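
The abstract above compares its method with a "Majority" baseline using precision and recall. The sketch below shows one common reading of such a baseline (predict the functions most frequent among a protein's interaction partners) together with the two metrics. The network, the labels and the top_k parameter are hypothetical, and the paper's exact baseline definition may differ.

```python
from collections import Counter


def majority_predict(protein, neighbors, annotations, top_k=1):
    """Predict the top_k functions that occur most often among interaction partners."""
    counts = Counter()
    for partner in neighbors.get(protein, ()):
        counts.update(annotations.get(partner, ()))
    return {term for term, _ in counts.most_common(top_k)}


def precision_recall(predicted, actual):
    """Precision = TP / predicted, recall = TP / actual (0.0 when undefined)."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Toy interaction network and annotations (hypothetical).
neighbors = {"P1": ["P2", "P3", "P4"]}
annotations = {"P2": {"F1", "F2"}, "P3": {"F1"}, "P4": {"F3"}}
predicted = majority_predict("P1", neighbors, annotations)
precision, recall = precision_recall(predicted, actual={"F1", "F2"})
print(predicted, precision, recall)
```

The example shows the trade-off the abstract reports for this kind of baseline: a single confident prediction can be precise while missing other true functions.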

78

A towards-multidimensional screening approach to predict candidate genes of rheumatoid arthritis based on SNP, structural and functional annotations  

Microsoft Academic Search

BACKGROUND: According to the Genetic Analysis Workshops (GAW), hundreds of thousands of SNPs have been tested for association with rheumatoid arthritis. Traditional genome-wide association studies (GWAS) have been developed to identify susceptibility genes using a

Liangcai Zhang; Wan Li; Leilei Song; Lina Chen

2010-01-01

79

Experimental-confirmation and functional-annotation of predicted proteins in the chicken genome  

PubMed Central

Background The chicken genome was sequenced because of its phylogenetic position as a non-mammalian vertebrate, its use as a biomedical model especially to study embryology and development, its role as a source of human disease organisms and its importance as the major source of animal derived food protein. However, genomic sequence data is, in itself, of limited value; generally it is not equivalent to understanding biological function. The benefit of having a genome sequence is that it provides a basis for functional genomics. However, the sequence data currently available is poorly structurally and functionally annotated and many genes do not have standard nomenclature assigned. Results We analysed eight chicken tissues and improved the chicken genome structural annotation by providing experimental support for the in vivo expression of 7,809 computationally predicted proteins, including 30 chicken proteins that were only electronically predicted or hypothetical translations in human. To improve functional annotation (based on Gene Ontology), we mapped these identified proteins to their human and mouse orthologs and used this orthology to transfer Gene Ontology (GO) functional annotations to the chicken proteins. The 8,213 orthology-based GO annotations that we produced represent an 8% increase in currently available chicken GO annotations. Orthologous chicken products were also assigned standardized nomenclature based on current chicken nomenclature guidelines. Conclusion We demonstrate the utility of high-throughput expression proteomics for rapid experimental structural annotation of a newly sequenced eukaryote genome. These experimentally-supported predicted proteins were further annotated by assigning the proteins with standardized nomenclature and functional annotation. This method is widely applicable to a diverse range of species. 
Moreover, information from one genome can be used to improve the annotation of other genomes and inform gene prediction algorithms.

Buza, Teresia J; McCarthy, Fiona M; Burgess, Shane C

2007-01-01

80

HMM-Based Gene Annotation Methods  

SciTech Connect

Development of new statistical methods and computational tools to identify genes in human genomic DNA, and to provide clues to their functions by identifying features such as transcription factor binding sites, tissue-specific expression and splicing patterns, and remote homologies at the protein level with genes of known function.

Haussler, David; Hughey, Richard; Karplus, Keven

1999-09-20

81

Large-Scale Protein Annotation through Gene Ontology  

Microsoft Academic Search

Recent progress in genomic sequencing, computational biology, and ontology development has presented an opportunity to investigate biological systems from a unique perspective, that is, examining genomes and transcriptomes through the multiple and hierarchical structure of Gene Ontology (GO). We report here our development of GO Engine, a computational platform for GO annotation, and analysis of the resultant GO annotations of

Hanqing Xie; Alon Wasserman; Zurit Levine; Amit Novik; Vladimir Grebinskiy; Avi Shoshan; Liat Mintz

2002-01-01

82

Assessing functional annotation transfers with inter-species conserved coexpression: application to Plasmodium falciparum  

PubMed Central

Background Plasmodium falciparum is the main causative agent of malaria. Of the 5 484 predicted genes of P. falciparum, about 57% do not have sufficient sequence similarity to characterized genes in other species to warrant functional assignments. Non-homology methods are thus needed to obtain functional clues for these uncharacterized genes. Gene expression data have been widely used in recent years to help functional annotation in an intra-species way via the so-called Guilt By Association (GBA) principle. Results We propose a new method that uses gene expression data to assess inter-species annotation transfers. Our approach starts from a set of likely orthologs between a reference species (here S. cerevisiae and D. melanogaster) and a query species (P. falciparum). It aims at identifying clusters of coexpressed genes in the query species whose coexpression has been conserved in the reference species. These conserved clusters of coexpressed genes are then used to assess annotation transfers between genes with low sequence similarity, enabling reliable transfers of annotations from the reference to the query species. The approach was used with transcriptomic data sets of P. falciparum, S. cerevisiae and D. melanogaster, and enabled us to propose with high confidence new/refined annotations for several dozen hypothetical/putative P. falciparum genes. Notably, we revised the annotation of genes involved in ribosomal proteins and ribosome biogenesis and assembly, thus highlighting several potential drug targets. Conclusions Our approach uses both sequence similarity and gene expression data to help inter-species gene annotation transfers. Experiments show that this strategy improves the accuracy achieved when using solely sequence similarity and outperforms the accuracy of the GBA approach. In addition, our experiments with P. falciparum show that it can infer a function for numerous hypothetical genes.

2010-01-01
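
The notion of coexpression that is conserved between a query and a reference species, as described in the abstract above, can be sketched pairwise: a gene pair is kept only if its expression profiles correlate in the query species and the profiles of the orthologous pair correlate in the reference species. The published method works with clusters rather than pairs; the Pearson threshold, gene names and profiles below are assumptions for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (profiles must not be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def conserved_pairs(query_expr, ref_expr, orthologs, threshold=0.8):
    """Return query gene pairs coexpressed in both species (via their orthologs)."""
    genes = sorted(g for g in query_expr if g in orthologs)
    kept = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            r_query = pearson(query_expr[g1], query_expr[g2])
            r_ref = pearson(ref_expr[orthologs[g1]], ref_expr[orthologs[g2]])
            if r_query >= threshold and r_ref >= threshold:
                kept.append((g1, g2))
    return kept


# Toy expression profiles (hypothetical genes and values).
query_expr = {"q1": [1, 2, 3, 4], "q2": [2, 4, 6, 8], "q3": [4, 3, 2, 1]}
ref_expr = {"r1": [1, 3, 5, 7], "r2": [2, 3, 4, 5], "r3": [1, 2, 3, 4]}
orthologs = {"q1": "r1", "q2": "r2", "q3": "r3"}
pairs = conserved_pairs(query_expr, ref_expr, orthologs)
print(pairs)
```

In the toy data, r1 and r3 are correlated in the reference species but q1 and q3 are not in the query species, so that pair is not treated as conserved.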

83

Evolutionary trace annotation of protein function in the structural proteome.  

PubMed

By design, structural genomics (SG) solves many structures that cannot be assigned function based on homology to known proteins. Alternative function annotation methods are therefore needed and this study focuses on function prediction with three-dimensional (3D) templates: small structural motifs built of just a few functionally critical residues. Although experimentally proven functional residues are scarce, we show here that Evolutionary Trace (ET) rankings of residue importance are sufficient to build 3D templates, match them, and then assign Gene Ontology (GO) functions in enzymes and non-enzymes alike. In a high-specificity mode, this Evolutionary Trace Annotation (ETA) method covered half (53%) of the 2384 annotated SG protein controls. Three-quarters (76%) of predictions were both correct and complete. The positive predictive value for all GO depths (all-depth PPV) was 84%, and it rose to 94% over GO depths 1-3 (depth 3 PPV). In a high-sensitivity mode, coverage rose significantly (84%), while accuracy fell moderately: 68% of predictions were both correct and complete, all-depth PPV was 75%, and depth 3 PPV was 86%. These data concur with prior mutational experiments showing that ET rank information identifies key functional determinants in proteins. In practice, ETA predicted functions in 42% of 3461 unannotated SG proteins. In 529 cases--including 280 non-enzymes and 21 for metal ion ligands--the expected accuracy is 84% at any GO depth and 94% down to GO depth 3, while for the remaining 931 the expected accuracies are 60% and 71%, respectively. Thus, local structural comparisons of evolutionarily important residues can help decipher protein functions to known reliability levels and without prior assumption on functional mechanisms. ETA is available at http://mammoth.bcm.tmc.edu/eta. PMID:20036248

Erdin, Serkan; Ward, R Matthew; Venner, Eric; Lichtarge, Olivier

2009-12-28

84

Visual Presentation as a Welcome Alternative to Textual Presentation of Gene Annotation Information  

PubMed Central

The functions of a gene are traditionally annotated textually using either free text (Gene Reference Into Function or GeneRIF) or controlled vocabularies (e.g., Gene Ontology or Disease Ontology). Inspired by the latest word cloud tools developed by the Information Visualization Group at IBM Research, we have prototyped a visual system for capturing gene annotations, which we named Gene Graph Into Function or GeneGIF. Fully developing the GeneGIF system would be a significant effort. To justify the necessity and to specify the design requirements of GeneGIF, we first surveyed the end-user preferences. From 53 responses, we found that a majority (64%, p < 0.05) of the users were either positive or neutral toward using GeneGIF in their daily work (acceptance); in terms of preference, a slight majority (51%, p > 0.05) of the users favored visual presentation of information (GeneGIF) compared to textual (GeneRIF) information. The results of this study indicate that a visual presentation tool, such as GeneGIF, can complement standard textual presentation of gene annotations. Moreover, the survey participants provided many constructive comments that will specify the development of a phase-two project (http://128.248.174.241/) to visually annotate each gene in the human genome.

Desai, Jairav; Flatow, Jared M.; Song, Jie; Zhu, Lihua J.; Du, Pan; Huang, Chiang-Ching; Lu, Hui; Lin, Simon M.

2010-01-01
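
The survey percentages and p-values in the abstract above can be related to each other with a one-sided exact binomial test against an even split. The abstract does not state which test the authors used, so the test choice is an assumption. The counts 34 and 27 are inferred from the reported 64% and 51% of 53 responses and are likewise an assumption.

```python
from math import comb


def binomial_upper_tail(successes, n, p=0.5):
    """P(X >= successes) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(successes, n + 1))


n_responses = 53
accepting = 34   # inferred: 34/53 rounds to 64%
preferring = 27  # inferred: 27/53 rounds to 51%

p_accept = binomial_upper_tail(accepting, n_responses)
p_prefer = binomial_upper_tail(preferring, n_responses)
print(p_accept, p_prefer)
```

Under these assumptions the acceptance result falls below the 0.05 level while the slight preference majority does not, matching the direction of the reported inequalities.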

85

HMM-Based Gene Annotation Methods.  

National Technical Information Service (NTIS)

Development of new statistical methods and computational tools to identify genes in human genomic DNA, and to provide clues to their functions by identifying features such as transcription factor binding sites, tissue-specific expression and splicing pat...

D. Haussler; R. Hughey; K. Karplus

1999-01-01

86

Functional Annotation of Class I Lysyl-tRNA Synthetase Phylogeny Indicates a Limited Role for Gene Transfer  

Microsoft Academic Search

Functional and comparative genomic studies have previously shown that the essential protein lysyl-tRNA synthetase (LysRS) exists in two unrelated forms. Most prokaryotes and all eukaryotes contain a class II LysRS, whereas most archaea and a few bacteria contain a less common class I LysRS. In bacteria the class I LysRS is only found in the α-proteobacteria and a scattering of

Alexandre Ambrogelly; Dragana Korencic; Michael Ibba

2002-01-01

87

Functional annotation of a full-length mouse cDNA collection  

Microsoft Academic Search

The RIKEN Mouse Gene Encyclopaedia Project, a systematic approach to determining the full coding potential of the mouse genome, involves collection and sequencing of full-length complementary DNAs and physical mapping of the corresponding genes to the mouse genome. We organized an international functional annotation meeting (FANTOM) to annotate the first 21,076 cDNAs to be analysed in this project. Here we

J. Kawai; A. Shinagawa; K. Shibata; M. Yoshino; M. Itoh; Y. Ishii; T. Arakawa; A. Hara; Y. Fukunishi; H. Konno; J. Adachi; S. Fukuda; K. Aizawa; M. Izawa; K. Nishi; H. Kiyosawa; S. Kondo; I. Yamanaka; T. Saito; Y. Okazaki; T. Gojobori; H. Bono; T. Kasukawa; R. Saito; K. Kadota; H. Matsuda; M. Ashburner; S. Batalov; T. Casavant; W. Fleischmann; T. Gaasterland; C. Gissi; B. King; H. Kochiwa; P. Kuehl; S. Lewis; Y. Matsuo; I. Nikaido; G. Pesole; J. Quackenbush; L. M. Schriml; F. Staubli; R. Suzuki; M. Tomita; L. Wagner; T. Washio; K. Sakai; T. Okido; M. Furuno; H. Aono; R. Baldarelli; G. Barsh; J. Blake; D. Boffelli; N. Bojunga; P. Carninci; M. F. de Bonaldo; M. J. Brownstein; C. Bult; C. Fletcher; M. Fujita; M. Gariboldi; S. Gustincich; D. Hill; M. Hofmann; D. A. Hume; M. Kamiya; N. H. Lee; P. Lyons; L. Marchionni; J. Mashima; J. Mazzarelli; P. Mombaerts; P. Nordone; B. Ring; M. Ringwald; I. Rodriguez; N. Sakamoto; H. Sasaki; K. Sato; C. Schönbach; T. Seya; Y. Shibata; K.-F. Storch; H. Suzuki; K. Toyo-oka; K. H. Wang; C. Weitz; C. Whittaker; L. Wilming; A. Wynshaw-Boris; K. Yoshida; Y. Hasegawa; H. Kawaji; S. Kohtsuki; Y. Hayashizaki

2001-01-01

88

CvManGO, a method for leveraging computational predictions to improve literature-based Gene Ontology annotations.  

PubMed

The set of annotations at the Saccharomyces Genome Database (SGD) that classifies the cellular function of S. cerevisiae gene products using Gene Ontology (GO) terms has become an important resource for facilitating experimental analysis. In addition to capturing and summarizing experimental results, the structured nature of GO annotations allows for functional comparison across organisms as well as propagation of functional predictions between related gene products. Due to their relevance to many areas of research, ensuring the accuracy and quality of these annotations is a priority at SGD. GO annotations are assigned either manually, by biocurators extracting experimental evidence from the scientific literature, or through automated methods that leverage computational algorithms to predict functional information. Here, we discuss the relationship between literature-based and computationally predicted GO annotations in SGD and extend a strategy whereby comparison of these two types of annotation identifies genes whose annotations need review. Our method, CvManGO (Computational versus Manual GO annotations), pairs literature-based GO annotations with computational GO predictions and evaluates the relationship of the two terms within GO, looking for instances of discrepancy. We found that this method will identify genes that require annotation updates, taking an important step towards finding ways to prioritize literature review. Additionally, we explored factors that may influence the effectiveness of CvManGO in identifying relevant gene targets to find in particular those genes that are missing literature-supported annotations, but our survey found that there are no immediately identifiable criteria by which one could enrich for these under-annotated genes. Finally, we discuss possible ways to improve this strategy, and the applicability of this method to other projects that use the GO for curation. DATABASE URL: http://www.yeastgenome.org. PMID:22434836

Park, Julie; Costanzo, Maria C; Balakrishnan, Rama; Cherry, J Michael; Hong, Eurie L

2012-03-20
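
The pairing step described in the CvManGO abstract can be pictured with a small sketch. It assumes a toy GO fragment with placeholder term names and a hypothetical flagging rule (flag a gene when a prediction is neither identical to nor an ancestor of any manual term); the abstract does not spell out the exact criteria the authors used, so this is an illustration only.

```python
# Toy GO fragment: child term -> set of direct parent terms (is_a edges).
# Term names are illustrative placeholders, not a real GO release.
PARENTS = {
    "protein kinase activity": {"kinase activity"},
    "kinase activity": {"catalytic activity"},
    "catalytic activity": {"molecular_function"},
    "DNA binding": {"binding"},
    "binding": {"molecular_function"},
    "molecular_function": set(),
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent edges upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def relation(manual, predicted):
    """Classify how a predicted term relates to a manual term in the graph."""
    if manual == predicted:
        return "same"
    if predicted in ancestors(manual):
        return "predicted_is_ancestor"    # manual term is already more specific
    if manual in ancestors(predicted):
        return "predicted_is_descendant"  # prediction is more specific
    return "unrelated"


def genes_to_review(manual_ann, predicted_ann):
    """Flag predictions not already covered by a manual term (assumed rule)."""
    flagged = {}
    for gene, predicted_terms in predicted_ann.items():
        manual_terms = manual_ann.get(gene, set())
        for predicted in predicted_terms:
            relations = {relation(m, predicted) for m in manual_terms}
            if not relations & {"same", "predicted_is_ancestor"}:
                flagged.setdefault(gene, set()).add(predicted)
    return flagged
```

A gene with no manual terms at all is flagged for every prediction, which loosely corresponds to the "missing literature-supported annotations" case mentioned in the abstract.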

89

High-throughput functional annotation and data mining with the Blast2GO suite.  

PubMed

Functional genomics technologies have been widely adopted in the biological research of both model and non-model species. An efficient functional annotation of DNA or protein sequences is a major requirement for the successful application of these approaches as functional information on gene products is often the key to the interpretation of experimental results. Therefore, there is an increasing need for bioinformatics resources which are able to cope with large amount of sequence data, produce valuable annotation results and are easily accessible to laboratories where functional genomics projects are being undertaken. We present the Blast2GO suite as an integrated and biologist-oriented solution for the high-throughput and automatic functional annotation of DNA or protein sequences based on the Gene Ontology vocabulary. The most outstanding Blast2GO features are: (i) the combination of various annotation strategies and tools controlling type and intensity of annotation, (ii) the numerous graphical features such as the interactive GO-graph visualization for gene-set function profiling or descriptive charts, (iii) the general sequence management features and (iv) high-throughput capabilities. We used the Blast2GO framework to carry out a detailed analysis of annotation behaviour through homology transfer and its impact in functional genomics research. Our aim is to offer biologists useful information to take into account when addressing the task of functionally characterizing their sequence data. PMID:18445632

Götz, Stefan; García-Gómez, Juan Miguel; Terol, Javier; Williams, Tim D; Nagaraj, Shivashankar H; Nueda, María José; Robles, Montserrat; Talón, Manuel; Dopazo, Joaquín; Conesa, Ana

2008-04-29

90

Missing genes in the annotation of prokaryotic genomes  

PubMed Central

Background: Protein-coding gene detection in prokaryotic genomes is considered a much simpler problem than in intron-containing eukaryotic genomes. However, there have been reports that prokaryotic gene finder programs have problems with small genes (either over-predicting or under-predicting). Therefore, the question arises as to whether current genome annotations are systematically missing small genes. Results: We have developed a high-performance computing methodology to investigate this problem. In this methodology we compare all ORFs larger than or equal to 33 aa from all fully-sequenced prokaryotic replicons. Based on that comparison, and using conservative criteria requiring a minimum taxonomic diversity between conserved ORFs in different genomes, we have discovered 1,153 candidate genes that are missing from current genome annotations. These missing genes are similar only to each other and do not have any strong similarity to gene sequences in public databases, with the implication that these ORFs belong to missing gene families. We also uncovered 38,895 intergenic ORFs, readily identified as putative genes by similarity to currently annotated genes (we call these absent annotations). The vast majority of the missing genes found are small (less than 100 aa). A comparison of select examples with GeneMark, EasyGene and Glimmer predictions yields evidence that some of these genes are escaping detection by these programs. Conclusions: Prokaryotic gene finders and prokaryotic genome annotations require improvement for accurate prediction of small genes. The number of missing gene families found is likely a lower bound on the actual number, due to the conservative criteria used to determine whether an ORF corresponds to a real gene.

2010-01-01
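
As a rough illustration of the ORF enumeration step in the record above, the following sketch scans both strands in all three frames and keeps ORFs of at least 33 codons. It assumes a simplified ORF definition (ATG start to the first in-frame stop, standard stop codons only); the abstract does not give the authors' exact definition, and real prokaryotic gene finders also consider alternative start codons such as GTG and TTG.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_aa):
    """Yield (start, end) for ATG..stop ORFs with >= min_aa codons before the stop.

    Coordinates are 0-based, end-exclusive, include the stop codon,
    and are relative to the strand passed in.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_aa:
                    yield (start, i + 3)
                start = None                   # look for the next ORF in this frame


def find_orfs(seq, min_aa=33):
    """Return (strand, start, end) tuples for ORFs on both strands."""
    seq = seq.upper()
    hits = [("+", s, e) for s, e in orfs_in_strand(seq, min_aa)]
    hits += [("-", s, e)
             for s, e in orfs_in_strand(reverse_complement(seq), min_aa)]
    return hits
```

Minus-strand coordinates here refer to positions on the reverse-complemented sequence; converting them back to forward-strand coordinates is left out to keep the sketch short.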

91

The Renal Gene Ontology Annotation Initiative  

PubMed Central

The Gene Ontology (GO) resource provides dynamic controlled vocabularies to aid in the description of the functional attributes and subcellular locations of gene products from all taxonomic groups (www.geneontology.org). A renal-focused curation initiative, funded by Kidney Research UK and supported by the GO Consortium, has started at the European Bioinformatics Institute and aims to provide a detailed GO resource for mammalian proteins implicated in renal development and function. This report outlines the aims of this initiative and explains how the renal community can become involved to help improve the availability, quality and quantity of GO terms and their association to specific proteins.

Dimmer, Emily C; Huntley, Rachael P; O'Donovan, Claire; Scambler, Peter; Apweiler, Rolf

2010-01-01

92

Saccharomyces cerevisiae: gene annotation and genome variability, state of the art through comparative genomics.  

PubMed

In the early days of the yeast genome sequencing project, gene annotation was in its infancy and suffered the problem of many false positive annotations as well as missed genes. The lack of other sequences for comparison also prevented the annotation of conserved, functional sequences that were not coding. We are now in an era of comparative genomics where many closely related as well as more distantly related genomes are available for direct sequence and synteny comparisons allowing for more probable predictions of genes and other functional sequences due to conservation. We also have a plethora of functional genomics data which helps inform gene annotation for previously uncharacterised open reading frames (ORFs)/genes. For Saccharomyces cerevisiae this has resulted in a continuous updating of the gene and functional sequence annotations in the reference genome helping it retain its position as the best characterized eukaryotic organism's genome. A single reference genome for a species does not accurately describe the species and this is quite clear in the case of S. cerevisiae where the reference strain is not ideal for brewing or baking due to missing genes. Recent surveys of numerous isolates, from a variety of sources, using a variety of technologies have revealed a great deal of variation amongst isolates with genome sequence surveys providing information on novel genes, undetectable by other means. We now have a better understanding of the extant variation in S. cerevisiae as a species as well as some idea of how much we are missing from this understanding. As with gene annotation, comparative genomics enhances the discovery and description of genome variation and is providing us with the tools for understanding genome evolution, adaptation and selection, and underlying genetics of complex traits. PMID:21863479

Louis, Ed

2011-01-01

93

Overcoming function annotation errors in the Gram-positive pathogen Streptococcus suis by a proteomics-driven approach  

Microsoft Academic Search

BACKGROUND: Annotation of protein-coding genes is a key step in sequencing projects. Protein functions are mainly assigned on the basis of the amino acid sequence alone by searching of homologous proteins. However, fully automated annotation processes often lead to wrong prediction of protein functions, and therefore time-intensive manual curation is often essential. Here we describe a fast and reliable way

Manuel J Rodríguez-Ortega; Inmaculada Luque; Carmen Tarradas; José A Bárcena

2008-01-01

94

SUS-BAR: a database of pig proteins with statistically validated structural and functional annotation  

PubMed Central

Given the relevance of the pig proteome in different studies, including human complex maladies, a statistical validation of the annotation is required for a better understanding of the role of specific genes and proteins in the complex networks underlying biological processes in the animal. Presently, approximately 80% of the pig proteome is still poorly annotated, and the existence of protein sequences is routinely inferred automatically by sequence alignment to preexisting sequences. In this article, we introduce SUS-BAR, a database that derives information mainly from UniProt Knowledgebase and that includes 26 206 pig protein sequences. In SUS-BAR, 16 675 of the pig protein sequences are endowed with statistically validated functional and structural annotation. Our statistical validation is determined by adopting a cluster-centric annotation procedure that allows transfer of different types of annotation, including structure and function. Each sequence in the database can be associated with a set of statistically validated Gene Ontology (GO) terms from the three main sub-ontologies (Molecular Function, Biological Process and Cellular Component), with Pfam functional domains, and when possible, with a cluster Hidden Markov Model that allows modelling the 3D structure of the protein. A database search provides statistics demonstrating the enrichment in both GO and Pfam annotations of the pig proteins as compared with UniProt Knowledgebase annotation. Searching in SUS-BAR allows retrieval of the pig protein annotation for further analysis. The search is also possible on the basis of specific GO terms, which allows retrieval of all the pig sequences participating in a given biological process, after annotation with our system. Alternatively, the search is possible on the basis of structural information, allowing retrieval of all the pig sequences with the same structural characteristics. Database URL: http://bar.biocomp.unibo.it/pig/

Piovesan, Damiano; Profiti, Giuseppe; Martelli, Pier Luigi; Fariselli, Piero; Fontanesi, Luca; Casadio, Rita

2013-01-01

95

High-throughput functional annotation and data mining with the Blast2GO suite  

Microsoft Academic Search

Functional genomics technologies have been widely adopted in the biological research of both model and non-model species. An efficient functional annotation of DNA or protein sequences is a major requirement for the successful application of these approaches as functional information on gene products is often the key to the interpretation of experimental results. Therefore, there is an increasing need

Stefan Gotz; J. M. Garcia-Gomez; Javier Terol; Tim D. Williams; Shivashankar H. Nagaraj; M. J. Nueda; M. Robles; M. Talon; J. Dopazo; A. Conesa

2008-01-01

96

Integration of sequence and functional information to identify and annotate neuropeptides in the pig genome  

Microsoft Academic Search

Neuropeptides support inter-cell communication and have a role in many diverse biological processes. For pig, a biomedical model, few prohormones from which neuropeptides result after convertase processing are listed in the UniProt database. Therefore, our goals are to obtain an in silico library of pig prohormone and convertase genes and to functionally annotate these genes based on a large number

Kenneth I. Porter; Bruce R. Southey; Malik N. Akhtar; Jonathan V. Sweedler; Sandra L. Rodriguez-Zas

2011-01-01

97

Drosophila gene expression pattern annotation through multi-instance multi-label learning.  

PubMed

In the studies of Drosophila embryogenesis, a large number of two-dimensional digital images of gene expression patterns have been produced to build an atlas of spatio-temporal gene expression dynamics across developmental time. Gene expressions captured in these images have been manually annotated with anatomical and developmental ontology terms using a controlled vocabulary (CV), which are useful in research aimed at understanding gene functions, interactions, and networks. With the rapid accumulation of images, the process of manual annotation has become increasingly cumbersome, and computational methods to automate this task are urgently needed. However, the automated annotation of embryo images is challenging. This is because the annotation terms spatially correspond to local expression patterns of images, yet they are assigned collectively to groups of images and it is unknown which term corresponds to which region of which image in the group. In this paper, we address this problem using a new machine learning framework, Multi-Instance Multi-Label (MIML) learning. We first show that the underlying nature of the annotation task is a typical MIML learning problem. Then, we propose two support vector machine algorithms under the MIML framework for the task. Experimental results on the FlyExpress database (a digital library of standardized Drosophila gene expression pattern images) reveal that the exploitation of MIML framework leads to significant performance improvement over state-of-the-art approaches. PMID:21519115

Li, Ying-Xin; Ji, Shuiwang; Kumar, Sudhir; Ye, Jieping; Zhou, Zhi-Hua

2011-04-15

98

Gaining confidence in cross-species annotation transfer: from simple molecular function to complex phenotypic traits  

PubMed Central

Summary Cross-species annotation transfer is a widely used approach for transferring information about simple molecular functions or pathways from one protein in one species to its ortholog in another species. In crop species, the phenotypic traits of interest, such as grain yield, are very complex and are often related to multiple biological processes and systems. It is still unclear to what extent the high level annotations describing phenotypic traits can also be reliably transferred across species. In this work, we have developed a procedure to measure precisely the transferability of these functional annotations from one species to another and demonstrate its application to Arabidopsis and several crop species. This comparative analysis is a step towards assigning higher level biological function to genes and gene networks as part of the wider genotype to phenotype challenge.

Defoin-Platel, Michael; Hassani-Pak, Keywan; Rawlings, Chris

2011-01-01

99

Proteomic Detection of Non-Annotated Protein-Coding Genes in Pseudomonas fluorescens Pf0-1  

PubMed Central

Genome sequences are annotated by computational prediction of coding sequences, followed by similarity searches such as BLAST, which provide a layer of possible functional information. While the existence of processes such as alternative splicing complicates matters for eukaryote genomes, the view of bacterial genomes as a linear series of closely spaced genes leads to the assumption that computational annotations that predict such arrangements completely describe the coding capacity of bacterial genomes. We undertook a proteomic study to identify proteins expressed by Pseudomonas fluorescens Pf0-1 from genes that were not predicted during the genome annotation. Mapping peptides to the Pf0-1 genome sequence identified sixteen non-annotated protein-coding regions, of which nine were antisense to predicted genes, six were intergenic, and one read in the same direction as an annotated gene but in a different frame. The expression of all but one of the newly discovered genes was verified by RT-PCR. Few clues as to the function of the new genes were gleaned from informatic analyses, but potential orthologs in other Pseudomonas genomes were identified for eight of the new genes. The 16 newly identified genes improve the quality of the Pf0-1 genome annotation, and the detection of antisense protein-coding genes indicates the under-appreciated complexity of bacterial genome organization.

Kim, Wook; Silby, Mark W.; Purvine, Sam O.; Nicoll, Julie S.; Hixson, Kim K.; Monroe, Matt; Nicora, Carrie D.; Lipton, Mary S.; Levy, Stuart B.

2009-01-01

100

Proteomic Detection of Non-Annotated Protein-Coding Genes in Pseudomonas fluorescens Pf0-1  

SciTech Connect

Genome sequences are annotated by computational prediction of coding sequences, followed by similarity searches such as BLAST, which provide a layer of (possible) functional information. While the existence of processes such as alternative splicing complicates matters for eukaryote genomes, the view of bacterial genomes as a linear series of closely spaced genes leads to the assumption that computational annotations which predict such arrangements completely describe the coding capacity of bacterial genomes. We undertook a proteomic study to identify proteins expressed by Pseudomonas fluorescens Pf0-1 from genes which were not predicted during the genome annotation. Mapping peptides to the Pf0-1 genome sequence identified sixteen non-annotated protein-coding regions, of which nine were antisense to predicted genes, six were intergenic, and one read in the same direction as an annotated gene but in a different frame. The expression of all but one of the newly discovered genes was verified by RT-PCR. Few clues as to the function of the new genes were gleaned from informatic analyses, but potential orthologues in other Pseudomonas genomes were identified for eight of the new genes. The 16 newly identified genes improve the quality of the Pf0-1 genome annotation, and the detection of antisense protein-coding genes indicates the under-appreciated complexity of bacterial genome organization.

Kim, Wook; Silby, Mark W.; Purvine, Samuel O.; Nicoll, Julie S.; Hixson, Kim K.; Monroe, Matthew E.; Nicora, Carrie D.; Lipton, Mary S.; Levy, Stuart B.

2009-12-24

101

Functional Annotation and Comparative Analysis of a Zygopteran Transcriptome  

PubMed Central

In this paper we present a de novo assembly of the transcriptome of the damselfly (Enallagma hageni) through the use of 454 pyrosequencing. E. hageni is a member of the suborder Zygoptera, in the order Odonata, and Odonata organisms form the basal lineage of the winged insects (Pterygota). To date, sequence data used in phylogenetic analysis of Enallagma species have been derived from either mitochondrial DNA or ribosomal nuclear DNA. This Enallagma transcriptome contained 31,661 contigs that were assembled and translated into 14,813 individual open reading frames. Using these data, we constructed an extensive dataset of 634 orthologous nuclear protein-encoding genes across 11 species of Arthropoda and used Bayesian techniques to elucidate the position of Enallagma in the arthropod phylogenetic tree. Additionally, we demonstrated that the Enallagma transcriptome contains 169 genes that are evolving at rates that differ relative to those of the rest of the transcriptome (29 accelerated and 140 decreased), and, through multiple Gene Ontology searches and clustering methods, we present the first functional annotation of any palaeopteran’s transcriptome in the literature.

Shanku, Alexander G.; McPeek, Mark A.; Kern, Andrew D.

2013-01-01

102

AIGO: Towards a unified framework for the Analysis and the Inter-comparison of GO functional annotations  

PubMed Central

Background: In response to the rapid growth of available genome sequences, efforts have been made to develop automatic inference methods to functionally characterize them. Pipelines that infer functional annotation are now routinely used to produce new annotations at a genome scale and for a broad variety of species. These pipelines differ widely in their inference algorithms, confidence thresholds and data sources for reasoning. This heterogeneity makes a comparison of the relative merits of each approach extremely complex. The evaluation of the quality of the resultant annotations is also challenging given there is often no existing gold-standard against which to evaluate precision and recall. Results: In this paper, we present a pragmatic approach to the study of functional annotations. An ensemble of 12 metrics, describing various aspects of functional annotations, is defined and implemented in a unified framework, which facilitates their systematic analysis and inter-comparison. The use of this framework is demonstrated on three illustrative examples: analysing the outputs of state-of-the-art inference pipelines, comparing electronic versus manual annotation methods, and monitoring the evolution of publicly available functional annotations. The framework is part of the AIGO library (http://code.google.com/p/aigo) for the Analysis and the Inter-comparison of the products of Gene Ontology (GO) annotation pipelines. The AIGO library also provides functionalities to easily load, analyse, manipulate and compare functional annotations and also to plot and export the results of the analysis in various formats. Conclusions: This work is a step toward developing a unified framework for the systematic study of GO functional annotations. This framework has been designed so that new metrics on GO functional annotations can be added in a very straightforward way.

2011-01-01
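
The AIGO abstract does not list its 12 metrics, so the sketch below is not the AIGO API and does not reproduce any of them by name. It only illustrates, under that caveat, the kind of gold-standard-free descriptive measures one might compute when comparing two annotation sets: coverage, mean number of terms per annotated gene, and mean per-gene Jaccard overlap. All function names are hypothetical.

```python
def coverage(annotations, all_genes):
    """Fraction of genes in all_genes that carry at least one GO term."""
    annotated = sum(1 for gene in all_genes if annotations.get(gene))
    return annotated / len(all_genes)


def mean_terms_per_annotated_gene(annotations):
    """Average number of GO terms over genes that have any annotation."""
    sizes = [len(terms) for terms in annotations.values() if terms]
    return sum(sizes) / len(sizes) if sizes else 0.0


def mean_jaccard(ann_a, ann_b):
    """Mean per-gene Jaccard overlap of term sets, over genes present in both."""
    shared = [g for g in ann_a if g in ann_b and (ann_a[g] or ann_b[g])]
    if not shared:
        return 0.0
    scores = [len(ann_a[g] & ann_b[g]) / len(ann_a[g] | ann_b[g]) for g in shared]
    return sum(scores) / len(scores)
```

Each annotation set is assumed to be a dict mapping a gene identifier to a set of GO term identifiers; the measures ignore the GO hierarchy, which a fuller comparison would need to take into account.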

103

Suppression subtractive hybridization (SSH) combined with bioinformatics method: an integrated functional annotation approach for analysis of differentially expressed immune-genes in insects  

PubMed Central

The suppression subtractive hybridization (SSH) approach, a PCR-based approach that amplifies differentially expressed cDNAs (complementary DNAs) while simultaneously suppressing amplification of common cDNAs, was employed to identify immune-inducible genes in insects. This technique has been used as a suitable tool for experimental identification of novel genes in eukaryotes as well as prokaryotes, both in species whose genomes have been sequenced and in species whose genomes have yet to be sequenced. In this article, I propose a method for in silico functional characterization of immune-inducible genes from insects. Apart from immune-inducible genes from insects, this method can be applied to the analysis of genes from other species, from bacteria to plants and animals. This article provides background on the SSH-based method, taking specific examples from innate immune-inducible genes in insects, and subsequently proposes a bioinformatics pipeline for functional characterization of newly sequenced genes. The proposed workflow can also be applied to data from any newly sequenced species generated on Next Generation Sequencing (NGS) platforms.

Badapanda, Chandan

2013-01-01

104

Annotator: Post-processing Software for generating function-based signatures from quantitative mass spectrometry  

PubMed Central

Mass spectrometry is used to investigate global changes in protein abundance in cell lysates. Increasingly powerful methods of data collection have emerged over the past decade, but this has left researchers with the task of sifting through mountains of data for biologically significant results. Often, the end result is a list of proteins with no obvious quantitative relationships to define the larger context of changes in cell behavior. Researchers are often forced to perform a manual analysis from this list or to fall back on a range of disparate tools, which can hinder the communication of results and their reproducibility. To address these methodological problems we developed Annotator, an application that filters validated mass spectrometry data and applies a battery of standardized heuristic and statistical tests to determine significance. To address systems-level interpretations we incorporated UniProt and Gene Ontology keywords as statistical units of analysis, yielding quantitative information about changes in abundance for an entire functional category. This provides a consistent and quantitative method for formulating conclusions about cellular behavior, independent of network models or standard enrichment analyses. Annotator allows for “bottom-up” annotations that are based on experimental data and not inferred by comparison to external or hypothetical models. Annotator was developed as an independent post-processing platform that runs on all common operating systems, thereby providing a useful tool for establishing the inherently dynamic nature of functional annotations, which depend on results from on-going proteomic experiments. Annotator is available for download at http://people.cs.uchicago.edu/~tyler/annotator/annotator_desktop_0.1.tar.gz.

Sylvester, Juliesta E.; Bray, Tyler S.; Kron, Stephen J.

2012-01-01

105

The Mouse Functional Genome Database (MfunGD): functional annotation of proteins in the light of their cellular context.  

PubMed

MfunGD (http://mips.gsf.de/genre/proj/mfungd/) provides a resource for annotated mouse proteins and their occurrence in protein networks. Manual annotation concentrates on proteins which are found to interact physically with other proteins. Accordingly, manually curated information from a protein-protein interaction database (MPPI) and a database of mammalian protein complexes is interconnected with MfunGD. Protein function annotation is performed using the Functional Catalogue (FunCat) annotation scheme which is widely used for the analysis of protein networks. The dataset is also supplemented with information about the literature that was used in the annotation process as well as links to the SIMAP Fasta database, the Pedant protein analysis system and cross-references to external resources. Proteins that so far were not manually inspected are annotated automatically by a graphical probabilistic model and/or superparamagnetic clustering. The database is continuously expanding to include the rapidly growing amount of functional information about gene products from mouse. MfunGD is implemented in GenRE, a J2EE-based component-oriented multi-tier architecture following the separation of concern principle. PMID:16381934

Ruepp, Andreas; Doudieu, Octave Noubibou; van den Oever, Jos; Brauner, Barbara; Dunger-Kaltenbach, Irmtraud; Fobo, Gisela; Frishman, Goar; Montrone, Corinna; Skornia, Christine; Wanka, Steffi; Rattei, Thomas; Pagel, Philipp; Riley, Louise; Frishman, Dmitrij; Surmeli, Dimitrij; Tetko, Igor V; Oesterheld, Matthias; Stümpflen, Volker; Mewes, H Werner

2006-01-01

106

Learning Sparse Representations for Fruit-Fly Gene Expression Pattern Image Annotation and Retrieval  

PubMed Central

Background: Fruit fly embryogenesis is one of the best understood animal development systems, and the spatiotemporal gene expression dynamics in this process are captured by digital images. Analysis of these high-throughput images will provide novel insights into the functions, interactions, and networks of animal genes governing development. To facilitate comparative analysis, web-based interfaces have been developed to conduct image retrieval based on body part keywords and images. Currently, the keyword annotation of spatiotemporal gene expression patterns is conducted manually. However, this manual practice does not scale with the continuously expanding collection of images. In addition, existing image retrieval systems based on the expression patterns may be made more accurate using keywords. Results: In this article, we adapt advanced data mining and computer vision techniques to address the key challenges in annotating and retrieving fruit fly gene expression pattern images. To boost the performance of image annotation and retrieval, we propose representations integrating spatial information and sparse features, overcoming the limitations of prior schemes. Conclusions: We perform systematic experimental studies to evaluate the proposed schemes in comparison with current methods. Experimental results indicate that the integration of spatial information and sparse features leads to consistent performance improvement in image annotation, while for the task of retrieval, sparse features alone yield better results.

2012-01-01

107

Improved Gene Ontology Annotation for Biofilm Formation, Filamentous Growth, and Phenotypic Switching in Candida albicans  

PubMed Central

The opportunistic fungal pathogen Candida albicans is a significant medical threat, especially for immunocompromised patients. Experimental research has focused on specific areas of C. albicans biology, with the goal of understanding the multiple factors that contribute to its pathogenic potential. Some of these factors include cell adhesion, invasive or filamentous growth, and the formation of drug-resistant biofilms. The Gene Ontology (GO) (www.geneontology.org) is a standardized vocabulary that the Candida Genome Database (CGD) (www.candidagenome.org) and other groups use to describe the functions of gene products. To improve the breadth and accuracy of pathogenicity-related gene product descriptions and to facilitate the description of as yet uncharacterized but potentially pathogenicity-related genes in Candida species, CGD undertook a three-part project: first, the addition of terms to the biological process branch of the GO to improve the description of fungus-related processes; second, manual recuration of gene product annotations in CGD to use the improved GO vocabulary; and third, computational ortholog-based transfer of GO annotations from experimentally characterized gene products, using these new terms, to uncharacterized orthologs in other Candida species. Through genome annotation and analysis, we identified candidate pathogenicity genes in seven non-C. albicans Candida species and in one additional C. albicans strain, WO-1. We also defined a set of C. albicans genes at the intersection of biofilm formation, filamentous growth, pathogenesis, and phenotypic switching of this opportunistic fungal pathogen, which provides a compelling list of candidates for further experimentation.

Inglis, Diane O.; Skrzypek, Marek S.; Arnaud, Martha B.; Binkley, Jonathan; Shah, Prachi; Wymore, Farrell

2013-01-01
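
The third part of the CGD project above, ortholog-based transfer of GO annotations, might be sketched as follows. The data layout, gene identifiers and function name are hypothetical; the only detail taken from the abstract is that annotations from experimentally characterized gene products are propagated to uncharacterized orthologs. The evidence codes listed are the GO Consortium's experimental codes.

```python
# GO evidence codes that denote direct experimental support.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def transfer_annotations(annotations, orthologs):
    """Propagate experimentally supported GO terms to unannotated orthologs.

    annotations: gene -> list of (go_term, evidence_code)
    orthologs:   characterized gene -> list of ortholog gene ids (other species)
    Returns:     target gene -> list of (go_term, source_gene)
    """
    transferred = {}
    for source_gene, targets in orthologs.items():
        experimental_terms = [
            term
            for term, code in annotations.get(source_gene, [])
            if code in EXPERIMENTAL_CODES
        ]
        for target in targets:
            if annotations.get(target):
                continue  # target already has annotations; leave it to curators
            for term in experimental_terms:
                transferred.setdefault(target, []).append((term, source_gene))
    return transferred
```

Recording the source gene alongside each transferred term keeps the prediction traceable, so it can be revisited if the source annotation is later revised.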

108

Bioinformatics approaches for functional annotation of membrane proteins.  

PubMed

Membrane proteins perform diverse functions in living organisms, acting as transporters, receptors and channels. The functions of membrane proteins have been investigated with several computational approaches, such as developing databases, analyzing the structure-function relationship and establishing algorithms to discriminate different types of membrane proteins. However, compilation of bioinformatics resources for the functions of membrane proteins is not well documented compared with their structural aspects. In this comprehensive review, we focus in detail on three aspects of membrane protein functions: (i) databases for different types of membrane proteins based on their functions including transporters, receptors and ion channels, annotated functional data for genomes, as well as functionally important amino acid residues in membrane proteins obtained from experimental data, (ii) analysis of membrane protein functions based on their structures, motifs, amino acid properties and other features and (iii) algorithms for discriminating different types of membrane proteins and annotating them in genomic sequences. In addition, we provide a list of online resources for the databases and web servers for functional annotation of membrane proteins. PMID:23524979

Gromiha, M Michael; Ou, Yu-Yen

2013-03-23

109

Functional Annotation of a Full-Length Arabidopsis cDNA Collection  

Microsoft Academic Search

Full-length complementary DNAs (cDNAs) are essential for the correct annotation of genomic sequences and for the functional analysis of genes and their products. We isolated 155,144 RIKEN Arabidopsis full-length (RAFL) cDNA clones. The 3'-end expressed sequence tags (ESTs) of 155,144 RAFL cDNAs were clustered into 14,668 nonredundant cDNA groups, about 60% of predicted genes. We also obtained 5' ESTs from

Motoaki Seki; Mari Narusaka; Asako Kamiya; Junko Ishida; Masakazu Satou; Tetsuya Sakurai; Maiko Nakajima; Akiko Enju; Kenji Akiyama; Youko Oono; Masami Muramatsu; Yoshihide Hayashizaki; Jun Kawai; Piero Carninci; Masayoshi Itoh; Yoshiyuki Ishii; Takahiro Arakawa; Kazuhiro Shibata; Akira Shinagawa; Kazuo Shinozaki

2002-01-01

110

DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists  

PubMed Central

All tools in the DAVID Bioinformatics Resources aim to provide functional interpretation of large lists of genes derived from genomic studies. The newly updated DAVID Bioinformatics Resources consists of the DAVID Knowledgebase and five integrated, web-based functional annotation tool suites: the DAVID Gene Functional Classification Tool, the DAVID Functional Annotation Tool, the DAVID Gene ID Conversion Tool, the DAVID Gene Name Viewer and the DAVID NIAID Pathogen Genome Browser. The expanded DAVID Knowledgebase now integrates almost all major and well-known public bioinformatics resources centralized by the DAVID Gene Concept, a single-linkage method to agglomerate tens of millions of diverse gene/protein identifiers and annotation terms from a variety of public bioinformatics databases. For any uploaded gene list, the DAVID Resources now provides not only the typical gene-term enrichment analysis, but also new tools and functions that allow users to condense large gene lists into gene functional groups, convert between gene/protein identifiers, visualize many-genes-to-many-terms relationships, cluster redundant and heterogeneous terms into groups, search for interesting and related genes or terms, dynamically view genes from their lists on bio-pathways and more. With DAVID (http://david.niaid.nih.gov), investigators gain more power to interpret the biological mechanisms associated with large gene lists.

Huang, Da Wei; Sherman, Brad T.; Tan, Qina; Kir, Joseph; Liu, David; Bryant, David; Guo, Yongjian; Stephens, Robert; Baseler, Michael W.; Lane, H. Clifford; Lempicki, Richard A.

2007-01-01
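
The "typical gene-term enrichment analysis" mentioned in the DAVID record above can be illustrated with a plain hypergeometric tail probability. This is only a minimal sketch: DAVID's own scoring is not described in this abstract, and the function name `hypergeom_enrichment_p` and the toy numbers are assumptions made for illustration.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int,
                           sample: int, hits: int) -> float:
    """Return P(X >= hits) for a hypergeometric draw.

    population: number of genes in the background
    annotated:  background genes that carry the annotation term
    sample:     size of the uploaded gene list
    hits:       genes in the list that carry the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability mass for k = hits .. upper annotated genes in the list.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 10 background genes, 4 carry the term,
# and a list of 3 genes contains 2 of them.
p_value = hypergeom_enrichment_p(population=10, annotated=4, sample=3, hits=2)
print(f"enrichment p-value: {p_value:.4f}")
```

A small p-value suggests the term appears in the list more often than expected by chance; with many terms tested, a multiple-testing correction would normally follow.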

111

OryzaExpress: an integrated database of gene expression networks and omics annotations in rice.  

PubMed

Similarity of gene expression profiles provides important clues for understanding the biological functions of genes, biological processes and metabolic pathways related to genes. A gene expression network (GEN) is an ideal choice to grasp such expression profile similarities among genes simultaneously. For GEN construction, the Pearson correlation coefficient (PCC) has been widely used as an index to evaluate the similarities of expression profiles for gene pairs. However, calculation of PCCs for all gene pairs requires large amounts of both time and computer resources. Based on correspondence analysis, we developed a new method for GEN construction, which takes minimal time even for large-scale expression data with general computational circumstances. Moreover, our method requires no prior parameters to remove sample redundancies in the data set. Using the new method, we constructed rice GENs from large-scale microarray data stored in a public database. We then collected and integrated various principal rice omics annotations in public and distinct databases. The integrated information contains annotations of genome, transcriptome and metabolic pathways. We thus developed the integrated database OryzaExpress for browsing GENs with an interactive and graphical viewer and principal omics annotations (http://riceball.lab.nig.ac.jp/oryzaexpress/). With integration of Arabidopsis GEN data from ATTED-II, OryzaExpress also allows us to compare GENs between rice and Arabidopsis. Thus, OryzaExpress is a comprehensive rice database that exploits powerful omics approaches from all perspectives in plant science and leads to systems biology. PMID:21186175

Hamada, Kazuki; Hongo, Kohei; Suwabe, Keita; Shimizu, Akifumi; Nagayama, Taishi; Abe, Reina; Kikuchi, Shunsuke; Yamamoto, Naoki; Fujii, Takaaki; Yokoyama, Koji; Tsuchida, Hiroko; Sano, Kazumi; Mochizuki, Takako; Oki, Nobuhiko; Horiuchi, Youko; Fujita, Masahiro; Watanabe, Masao; Matsuoka, Makoto; Kurata, Nori; Yano, Kentaro

2010-12-23
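
The OryzaExpress record above contrasts its method with the conventional approach of scoring every gene pair by the Pearson correlation coefficient (PCC). The sketch below shows that conventional baseline only, not the authors' correspondence-analysis method; the gene names, the profiles, and the 0.9 threshold are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (>= 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("PCC is undefined for a constant profile")
    return sxy / sqrt(sxx * syy)


def coexpression_edges(profiles, threshold=0.9):
    """Return (gene1, gene2, r) for every pair with r >= threshold.

    Every pair is visited, so the cost grows quadratically with the
    number of genes, which is the expense the abstract refers to.
    """
    edges = []
    for (g1, p1), (g2, p2) in combinations(sorted(profiles.items()), 2):
        r = pearson(p1, p2)
        if r >= threshold:
            edges.append((g1, g2, r))
    return edges


# Hypothetical expression profiles across four samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
print(coexpression_edges(profiles))
```

For n genes the loop evaluates n(n-1)/2 pairs, which is why faster construction methods become attractive for genome-scale microarray data.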

112

Automated Eukaryotic Gene Structure Annotation Using EVidenceModeler and the Program to Assemble Spliced Alignments  

SciTech Connect

EVidenceModeler (EVM) is presented as an automated eukaryotic gene structure annotation tool that reports eukaryotic gene structures as a weighted consensus of all available evidence. EVM, when combined with the Program to Assemble Spliced Alignments (PASA), yields a comprehensive, configurable annotation system that predicts protein-coding genes and alternatively spliced isoforms. Our experiments on both rice and human genome sequences demonstrate that EVM produces automated gene structure annotation approaching the quality of manual curation.

Haas, B J; Salzberg, S L; Zhu, W; Pertea, M; Allen, J E; Orvis, J; White, O; Buell, C R; Wortman, J R

2007-12-10

113

Synergistic use of plant-prokaryote comparative genomics for functional annotations  

Microsoft Academic Search

Background  Identifying functions for all gene products in all sequenced organisms is a central challenge of the post-genomic era. However, at least 30-50% of the proteins encoded by any given genome are of unknown or vaguely known function, and a large number are wrongly annotated. Many of these ‘unknown’ proteins are common to prokaryotes and plants. We set out to predict

Svetlana Gerdes; Basma El Yacoubi; Marc Bailly; Ian K Blaby; Crysten E Blaby-Haas; Linda Jeanguenin; Aurora Lara-Núñez; Anne Pribat; Jeffrey C Waller; Andreas Wilke; Ross Overbeek; Andrew D Hanson; Valérie de Crécy-Lagard

2011-01-01

114

Function annotation for pseudoknot using structure similarity.  

PubMed

A large amount of raw biological sequence data has been generated by the human genome project and related efforts. Understanding the structural information encoded by biological sequences is important for acquiring knowledge of their biochemical functions but remains a fundamental challenge. Recent interest in RNA regulation has resulted in a rapid growth of deposited RNA secondary structures in varied databases. However, the functional classification and characterization of RNA structures have only been partially addressed. This article aims to introduce a novel interval-based distance metric for structure-based RNA function assignment. The characterization of RNA structures relies on distance vectors learned from a collection of predicted structures. The distance measure considers intersection, disjointness, and inclusion between intervals. A set of RNA pseudoknotted structures with known function is applied, and the function of the query structure is determined by measuring structure similarity. This not only offers sequence distance criteria to measure the similarity of secondary structures but also aids the functional classification of RNA structures with pseudoknots. PMID:21383413

Chen, Qingfeng; Chen, Yi-Ping Phoebe

115

Automated annotation of gene expression image sequences via non-parametric factor analysis and conditional random fields  

PubMed Central

Motivation: Computational approaches for the annotation of phenotypes from image data have shown promising results across many applications, and provide rich and valuable information for studying gene function and interactions. While data are often available both at high spatial resolution and across multiple time points, phenotypes are frequently annotated independently, for individual time points only. In particular, for the analysis of developmental gene expression patterns, it is biologically sensible when images across multiple time points are jointly accounted for, such that spatial and temporal dependencies are captured simultaneously. Methods: We describe a discriminative undirected graphical model to label gene-expression time-series image data, with an efficient training and decoding method based on the junction tree algorithm. The approach is based on an effective feature selection technique, consisting of a non-parametric sparse Bayesian factor analysis model. The result is a flexible framework, which can handle large-scale data with noisy incomplete samples, i.e. it can tolerate data missing from individual time points. Results: Using the annotation of gene expression patterns across stages of Drosophila embryonic development as an example, we demonstrate that our method achieves superior accuracy, gained by jointly annotating phenotype sequences, when compared with previous models that annotate each stage in isolation. The experimental results on missing data indicate that our joint learning method successfully annotates genes for which no expression data are available for one or more stages. Contact: uwe.ohler@duke.edu

Pruteanu-Malinici, Iulian; Majoros, William H.; Ohler, Uwe

2013-01-01

116

GOToolBox: functional analysis of gene datasets based on Gene Ontology  

Microsoft Academic Search

We have developed methods and tools based on the Gene Ontology (GO) resource allowing the identification of statistically over- or under-represented terms in a gene dataset; the clustering of functionally related genes within a set; and the retrieval of genes sharing annotations with a query gene. GO annotations can also be constrained to a slim hierarchy or a given level

David Martin; Christine Brun; Elisabeth Remy; Pierre Mouren; Denis Thieffry; Bernard Jacq

2004-01-01

117

Manual Gene Ontology annotation workflow at the Mouse Genome Informatics Database  

PubMed Central

The Mouse Genome Database, the Gene Expression Database and the Mouse Tumor Biology database are integrated components of the Mouse Genome Informatics (MGI) resource (http://www.informatics.jax.org). The MGI system presents both a consensus view and an experimental view of the knowledge concerning the genetics and genomics of the laboratory mouse. From genotype to phenotype, this information resource integrates information about genes, sequences, maps, expression analyses, alleles, strains and mutant phenotypes. Comparative mammalian data are also presented, particularly with regard to the use of the mouse as a model for the investigation of molecular and genetic components of human diseases. These data are collected from literature curation as well as downloads of large datasets (SwissProt, LocusLink, etc.). MGI is one of the founding members of the Gene Ontology (GO) and uses the GO for functional annotation of genes. Here, we discuss the workflow associated with manual GO annotation at MGI, from literature collection to display of the annotations. Peer-reviewed literature is collected mostly from a set of journals available electronically. Selected articles are entered into a master bibliography and indexed to one of eight areas of interest such as ‘GO’ or ‘homology’ or ‘phenotype’. Each article is then either indexed to a gene already contained in the database or funneled through a separate nomenclature database to add genes. The master bibliography and associated indexing provide information for various curator-reports such as ‘papers selected for GO that refer to genes with NO GO annotation’. Once articles are indexed, curators who have expertise in appropriate disciplines enter pertinent information. MGI makes use of several controlled vocabularies that ensure uniform data encoding, enable robust analysis and support the construction of complex queries. These vocabularies range from pick-lists to structured vocabularies such as the GO. All data associations are supported with statements of evidence as well as access to source publications.

Drabkin, Harold J.; Blake, Judith A.

2012-01-01

118

Use of Gene Ontology Annotation to understand the peroxisome proteome in humans  

PubMed Central

The Gene Ontology (GO) is the de facto standard for the functional description of gene products, providing a consistent, information-rich terminology applicable across species and information repositories. The UniProt Consortium uses both manual and automatic GO annotation approaches to curate UniProt Knowledgebase (UniProtKB) entries. The selection of a protein set prioritized for manual annotation has implications for the characteristics of the information provided to users working in a specific field or interested in particular pathways or processes. In this article, we describe an organelle-focused, manual curation initiative targeting proteins from the human peroxisome. We discuss the steps taken to define the peroxisome proteome and the challenges encountered in defining the boundaries of this protein set. We illustrate with the use of examples how GO annotations now capture cell and tissue type information and the advantages that such an annotation approach provides to users. Database URL: http://www.ebi.ac.uk/GOA/ and http://www.uniprot.org

Mutowo-Meullenet, Prudence; Huntley, Rachael P.; Dimmer, Emily C.; Alam-Faruque, Yasmin; Sawford, Tony; Jesus Martin, Maria; O'Donovan, Claire; Apweiler, Rolf

2013-01-01

119

Functional annotation of 19,841 Populus nigra full-length enriched cDNA clones  

PubMed Central

Background Populus is one of the favorable model plants because of its small genome. Structural genomics of Populus has reached a milestone, as the nucleotide sequence of the entire genome has been determined. With the arrival of the post-genome era, functional genomics of Populus is becoming more important for a thorough understanding of plant science. Development of bioresources serving functional genomics is making rapid progress. Huge efforts have led to deposits of expressed sequence tags (ESTs) from various plant species, consequently accelerating functional analysis of genes. ESTs from full-length cDNA clones are especially powerful for accurate molecular annotation. We promoted collection and annotation of the ESTs from Populus full-length enriched cDNA clones as part of functional genomics of tree species. Results We have been collecting the full-length enriched cDNA of the female poplar (Populus nigra var. italica) for years. By sequencing P. nigra full-length (PnFL) cDNA libraries, we generated about 116,000 5'-end or 3'-end ESTs corresponding to 19,841 nonredundant PnFL clones. The population of PnFL cDNA clones represents 44% of the predicted genes in the Populus genome. Conclusion Our resource of P. nigra full-length enriched clones is expected to provide valuable tools to gain further insight into genome annotation and functional genomics in Populus.

Nanjo, Tokihiko; Sakurai, Tetsuya; Totoki, Yasushi; Toyoda, Atsushi; Nishiguchi, Mitsuru; Kado, Tomoyuki; Igasaki, Tomohiro; Futamura, Norihiro; Seki, Motoaki; Sakaki, Yoshiyuki; Shinozaki, Kazuo; Shinohara, Kenji

2007-01-01

120

Structural and Functional Analysis of Rv3214 from Mycobacterium tuberculosis, a Protein with Conflicting Functional Annotations, Leads to Its Characterization as a Phosphatase  

Microsoft Academic Search

The availability of complete genome sequences has highlighted the problems of functional annotation of the many gene products that have only limited sequence similarity with proteins of known function. The predicted protein encoded by open reading frame Rv3214 from the Mycobacterium tuberculosis H37Rv genome was originally annotated as EntD through sequence similarity with the Escherichia coli EntD, a 4-phosphopantetheinyl

Harriet A. Watkins; Edward N. Baker

2006-01-01

121

A rigorous method for multigenic families' functional annotation: the peptidyl arginine deiminase (PADs) proteins family example  

Microsoft Academic Search

BACKGROUND: Large-scale and reliable functional annotation of proteins is a major challenge in modern biology. Phylogenetic analyses have been shown to be important for such tasks. However, up to now, phylogenetic annotation has not taken into account expression data (i.e. ESTs, Microarrays, SAGE, ...). Therefore, integrating such data, like ESTs in phylogenetic annotation could be a major advance in post

N Balandraud; P Gouret; EGJ Danchin; M Blanc; D Zinn; J Roudier; P Pontarotti

2005-01-01

122

Functional annotation of the human chromosome 7 "missing" proteins: a bioinformatics approach.  

PubMed

The chromosome-centric human proteome project aims to systematically map all human proteins, chromosome by chromosome, in a gene-centric manner through dedicated efforts from national and international teams. This mapping will lead to a knowledge-based resource defining the full set of proteins encoded in each chromosome and laying the foundation for the development of a standardized approach to analyze the massive proteomic data sets currently being generated. The neXtProt database lists 946 proteins as the human proteome of chromosome 7. However, 170 (18%) proteins of human chromosome 7 have no evidence at the proteomic, antibody, or structural levels and are considered "missing" in this study as they lack experimental support. We have developed a protocol for the functional annotation of these "missing" proteins by integrating several bioinformatics analysis and annotation tools, sequential BLAST homology searches, protein domain/motif and gene ontology (GO) mapping, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis. Using the BLAST search strategy, homologues for reviewed non-human mammalian proteins with protein evidence were identified for 90 "missing" proteins while another 38 had reviewed non-human mammalian homologues. Putative functional annotations were assigned to 27 of the remaining 43 novel proteins. Proteotypic peptides have been computationally generated to facilitate rapid identification of these proteins. Four of the "missing" chromosome 7 proteins have been substantiated by the ENCODE proteogenomic peptide data. PMID:23308364

Ranganathan, Shoba; Khan, Javed M; Garg, Gagan; Baker, Mark S

2013-01-11

123

Drosophila Gene Expression Pattern Annotation Using Sparse Features and Term-Term Interactions.  

PubMed

The Drosophila gene expression pattern images document the spatial and temporal dynamics of gene expression and they are valuable tools for explicating the gene functions, interaction, and networks during Drosophila embryogenesis. To provide text-based pattern searching, the images in the Berkeley Drosophila Genome Project (BDGP) study are annotated with ontology terms manually by human curators. We present a systematic approach for automating this task, because the number of images needing text descriptions is now rapidly increasing. We consider both improved feature representation and novel learning formulation to boost the annotation performance. For feature representation, we adapt the bag-of-words scheme commonly used in visual recognition problems so that the image group information in the BDGP study is retained. Moreover, images from multiple views can be integrated naturally in this representation. To reduce the quantization error caused by the bag-of-words representation, we propose an improved feature representation scheme based on the sparse learning technique. In the design of learning formulation, we propose a local regularization framework that can incorporate the correlations among terms explicitly. We further show that the resulting optimization problem admits an analytical solution. Experimental results show that the representation based on sparse learning outperforms the bag-of-words representation significantly. Results also show that incorporation of the term-term correlations improves the annotation performance consistently. PMID:21614142

Ji, Shuiwang; Yuan, Lei; Li, Ying-Xin; Zhou, Zhi-Hua; Kumar, Sudhir; Ye, Jieping

2009-06-28

124

Mining GO Annotations for Improving Annotation Consistency  

PubMed Central

Despite the structure and objectivity provided by the Gene Ontology (GO), the annotation of proteins is a complex task that is subject to errors and inconsistencies. Electronically inferred annotations in particular are widely considered unreliable. However, given that manual curation of all GO annotations is unfeasible, it is imperative to improve the quality of electronically inferred annotations. In this work, we analyze the full GO molecular function annotation of UniProtKB proteins, and discuss some of the issues that affect their quality, focusing particularly on the lack of annotation consistency. Based on our analysis, we estimate that 64% of the UniProtKB proteins are incompletely annotated, and that inconsistent annotations affect 83% of the protein functions and at least 23% of the proteins. Additionally, we present and evaluate a data mining algorithm, based on the association rule learning methodology, for identifying implicit relationships between molecular function terms. The goal of this algorithm is to assist GO curators in updating GO and correcting and preventing inconsistent annotations. Our algorithm predicted 501 relationships with an estimated precision of 94%, whereas the basic association rule learning methodology predicted 12,352 relationships with a precision below 9%.

Faria, Daniel; Schlicker, Andreas; Pesquita, Catia; Bastos, Hugo; Ferreira, Antonio E. N.; Albrecht, Mario; Falcao, Andre O.

2012-01-01
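
The record above compares its algorithm against basic association rule learning. As a rough illustration of that baseline only (the authors' refined algorithm is not specified in the abstract), the sketch below derives rules of the form "proteins annotated with term A are also annotated with term B" from support and confidence. The term labels, the thresholds, and the function name `mine_rules` are hypothetical.

```python
from itertools import permutations


def mine_rules(annotations, min_support=2, min_confidence=0.8):
    """Return sorted (antecedent, consequent, support, confidence) tuples.

    annotations maps a protein identifier to its set of function terms.
    support    = number of proteins annotated with both terms
    confidence = support / number of proteins annotated with the antecedent
    """
    term_counts = {}
    pair_counts = {}
    for terms in annotations.values():
        for term in terms:
            term_counts[term] = term_counts.get(term, 0) + 1
        # Ordered pairs, because A -> B and B -> A are different rules.
        for a, b in permutations(sorted(terms), 2):
            pair_counts[(a, b)] = pair_counts.get((a, b), 0) + 1

    rules = []
    for (a, b), support in pair_counts.items():
        confidence = support / term_counts[a]
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return sorted(rules)


# Hypothetical annotations; T1..T3 stand in for molecular function terms.
annotations = {
    "p1": {"T1", "T2"},
    "p2": {"T1", "T2"},
    "p3": {"T1", "T2", "T3"},
    "p4": {"T2"},
    "p5": {"T3"},
}
print(mine_rules(annotations))
```

In this toy data every protein with T1 also has T2, so T1 -> T2 passes, while T2 -> T1 has confidence 3/4 and is rejected at the 0.8 threshold. A protein annotated with T1 but not T2 would then be a candidate for curator review.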

125

BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources  

PubMed Central

Online gene annotation resources are indispensable for analysis of genomics data. However, the landscape of these online resources is highly fragmented, and scientists often visit dozens of these sites for each gene in a candidate gene list. Here, we introduce BioGPS (http://biogps.gnf.org), a centralized gene portal for aggregating distributed gene annotation resources. Moreover, BioGPS embraces the principle of community intelligence, enabling any user to easily and directly contribute to the BioGPS platform.

2009-01-01

126

Mining phenotypes for gene function prediction  

PubMed Central

Background Health and disease of organisms are reflected in their phenotypes. Often, a genetic component to a disease is discovered only after clearly defining its phenotype. In the past years, many technologies to systematically generate phenotypes in a high-throughput manner, such as RNA interference or gene knock-out, have been developed and used to decipher functions for genes. However, there have been relatively few efforts to make use of phenotype data beyond the single genotype-phenotype relationships. Results We present results on a study where we use a large set of phenotype data – in textual form – to predict gene annotation. To this end, we use text clustering to group genes based on their phenotype descriptions. We show that these clusters correlate well with several indicators for biological coherence in gene groups, such as functional annotations from the Gene Ontology (GO) and protein-protein interactions. We exploit these clusters for predicting gene function by carrying over annotations from well-annotated genes to other, less-characterized genes in the same cluster. For a subset of groups selected by applying objective criteria, we can predict GO-term annotations from the biological process sub-ontology with up to 72.6% precision and 16.7% recall, as evaluated by cross-validation. We manually verified some of these clusters and found them to exhibit high biological coherence, e.g. a group containing all available antennal Drosophila odorant receptors despite inconsistent GO-annotations. Conclusion The intrinsic nature of phenotypes to visibly reflect genetic activity underlines their usefulness in inferring new gene functions. Thus, systematically analyzing these data on a large scale offers many possibilities for inferring functional annotation of genes. We show that text clustering can play an important role in this process.

Groth, Philip; Weiss, Bertram; Pohlenz, Hans-Dieter; Leser, Ulf

2008-01-01
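
The record above predicts gene function by carrying annotations over within a cluster and reports precision and recall. The sketch below shows one simple way such a transfer and its evaluation could look; the transfer rule used here (union of the terms of the other cluster members) is an assumption, as are the gene names and term labels.

```python
def transfer_annotations(clusters, known):
    """Predict terms for each gene from the other genes in its cluster.

    The gene's own annotations are held out, so the predictions can be
    compared against them afterwards.
    """
    predictions = {}
    for members in clusters:
        for gene in members:
            predicted = set()
            for other in members:
                if other != gene:
                    predicted |= known.get(other, set())
            predictions[gene] = predicted
    return predictions


def precision_recall(predicted, truth):
    """Micro-averaged precision and recall over all (gene, term) pairs."""
    true_pos = sum(len(predicted.get(g, set()) & t) for g, t in truth.items())
    n_predicted = sum(len(predicted.get(g, set())) for g in truth)
    n_true = sum(len(t) for t in truth.values())
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_true if n_true else 0.0
    return precision, recall


# Hypothetical clusters and annotations.
clusters = [["g1", "g2", "g3"], ["g4", "g5"]]
known = {
    "g1": {"A", "B"},
    "g2": {"A"},
    "g3": {"A", "C"},
    "g4": {"D"},
    "g5": {"E"},
}
predictions = transfer_annotations(clusters, known)
precision, recall = precision_recall(predictions, known)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Restricting the evaluation to clusters that pass some quality criterion, as the abstract describes, would typically raise precision at the cost of recall.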

127

FastAnnotator- an efficient transcript annotation web tool  

PubMed Central

Background Recent developments in high-throughput sequencing (HTS) technologies have made it feasible to sequence the complete transcriptomes of non-model organisms or metatranscriptomes from environmental samples. The challenge after generating hundreds of millions of sequences is to annotate these transcripts and classify the transcripts based on their putative functions. Because many biological scientists lack the knowledge to install Linux-based software packages or maintain databases used for transcript annotation, we developed an automatic annotation tool with an easy-to-use interface. Methods To elucidate the potential functions of gene transcripts, we integrated well-established annotation tools: Blast2GO, PRIAM and RPS BLAST in a web-based service, FastAnnotator, which can assign Gene Ontology (GO) terms, Enzyme Commission numbers (EC numbers) and functional domains to query sequences. Results Using six transcriptome sequence datasets as examples, we demonstrated the ability of FastAnnotator to assign functional annotations. FastAnnotator annotated 88.1% and 81.3% of the transcripts from the well-studied organisms Caenorhabditis elegans and Streptococcus parasanguinis, respectively. Furthermore, FastAnnotator annotated 62.9%, 20.4%, 53.1% and 42.0% of the sequences from the transcriptomes of sweet potato, clam, amoeba, and Trichomonas vaginalis, respectively, which lack reference genomes. We demonstrated that FastAnnotator can complete the annotation process in a reasonable amount of time and is suitable for the annotation of transcriptomes from model organisms or organisms for which annotated reference genomes are not available. Conclusions The sequencing process no longer represents the bottleneck in the study of genomics, and automatic annotation tools have become invaluable as the annotation procedure has become the limiting step. We present FastAnnotator, an automated annotation web tool designed to efficiently annotate sequences with their gene functions, enzyme functions or domains. FastAnnotator is useful in transcriptome studies and especially for those focusing on non-model organisms or metatranscriptomes. FastAnnotator does not require local installation and is freely available at http://fastannotator.cgu.edu.tw.

2012-01-01

128

Combining heterogeneous data sources for accurate functional annotation of proteins.  

PubMed

Combining heterogeneous sources of data is essential for accurate prediction of protein function. The task is complicated by the fact that while sequence-based features can be readily compared across species, most other data are species-specific. In this paper, we present a multi-view extension to GOstruct, a structured-output framework for function annotation of proteins. The extended framework can learn from disparate data sources, with each data source provided to the framework in the form of a kernel. Our empirical results demonstrate that the multi-view framework is able to utilize all available information, yielding better performance than sequence-based models trained across species and models trained from collections of data within a given species. This version of GOstruct participated in the recent Critical Assessment of Functional Annotations (CAFA) challenge; since then we have significantly improved the natural language processing component of the method, which now provides performance that is on par with that provided by sequence information. The GOstruct framework is available for download at http://strut.sourceforge.net. PMID:23514123

Sokolov, Artem; Funk, Christopher; Graim, Kiley; Verspoor, Karin; Ben-Hur, Asa

2013-02-28

129

Gene discovery in the hamster: a comparative genomics approach for gene annotation by sequencing of hamster testis cDNAs  

PubMed Central

Background Complete genome annotation will likely be achieved through a combination of computer-based analysis of available genome sequences combined with direct experimental characterization of expressed regions of individual genomes. We have utilized a comparative genomics approach involving the sequencing of randomly selected hamster testis cDNAs to begin to identify genes not previously annotated on the human, mouse, rat and Fugu (pufferfish) genomes. Results 735 distinct sequences were analyzed for their relatedness to known sequences in public databases. Eight of these sequences were derived from previously unidentified genes and expression of these genes in testis was confirmed by Northern blotting. The genomic locations of each sequence were mapped in human, mouse, rat and pufferfish, where applicable, and the structure of their cognate genes was derived using computer-based predictions, genomic comparisons and analysis of uncharacterized cDNA sequences from human and macaque. Conclusion The use of a comparative genomics approach resulted in the identification of eight cDNAs that correspond to previously uncharacterized genes in the human genome. The proteins encoded by these genes included a new member of the kinesin superfamily, a SET/MYND-domain protein, and six proteins for which no specific function could be predicted. Each gene was expressed primarily in testis, suggesting that they may play roles in the development and/or function of testicular cells.

Oduru, Sreedhar; Campbell, Janee L; Karri, SriTulasi; Hendry, William J; Khan, Shafiq A; Williams, Simon C

2003-01-01

130

Computational analysis of transcriptome of Indian major carp, Labeo rohita (Hamilton-Buchanan, 1822) for functional annotation  

PubMed Central

A total of 1671 ESTs of Labeo rohita were retrieved from the dbEST database and analysed for functional annotation using various computational approaches. The result indicated 1387 non-redundant (184 contigs and 1203 singletons) putative transcripts with an average length of 542 bp. These 1387 transcript sequences were matched against Refseq_RNA, UniGene and Swiss-Prot with a high threshold cut-off for functional annotation, with the help of gene ontology and SSR markers. We developed extensive Perl-based modules for processing all alignment files, comparing and extracting common hits from all files above a threshold, evaluating statistics for alignment results and assigning gene ontology terms. In this study, 92 putative transcripts were predicted as orthologous genes and, among those, 44 putative transcripts were annotated with gene ontology terms. The annotated orthologous genes in our results were associated with some very important proteins of L. rohita involved in biotic and abiotic stresses, glucose metabolism of spermatogenic cells, etc. The unidentified transcripts, if found important in expression profiling, can be a vital resource after re-sequencing. The predicted genes can further be used for enhancing productivity and controlling disease of L. rohita.

Nagpure, Naresh Sahebrao; Rashid, Iliyas; Pathak, Ajey Kumar; Singh, Mahender; Singh, Shri Prakash; Sarkar, Uttam Kumar

2012-01-01

131

Cloning, analysis and functional annotation of expressed sequence tags from the Earthworm Eisenia fetida  

PubMed Central

Background Eisenia fetida, commonly known as red wiggler or compost worm, belongs to the Lumbricidae family of the Annelida phylum. Little is known about its genome sequence although it has been extensively used as a test organism in terrestrial ecotoxicology. In order to understand its gene expression response to environmental contaminants, we cloned 4032 cDNAs or expressed sequence tags (ESTs) from two E. fetida libraries enriched with genes responsive to ten ordnance related compounds using suppressive subtractive hybridization-PCR. Results A total of 3144 good quality ESTs (GenBank dbEST accession number EH669363–EH672369 and EL515444–EL515580) were obtained from the raw clone sequences after cleaning. Clustering analysis yielded 2231 unique sequences including 448 contigs (from 1361 ESTs) and 1783 singletons. Comparative genomic analysis showed that 743 or 33% of the unique sequences shared high similarity with existing genes in the GenBank nr database. Provisional function annotation assigned 830 Gene Ontology terms to 517 unique sequences based on their homology with the annotated genomes of four model organisms Drosophila melanogaster, Mus musculus, Saccharomyces cerevisiae, and Caenorhabditis elegans. Seven percent of the unique sequences were further mapped to 99 Kyoto Encyclopedia of Genes and Genomes pathways based on their matching Enzyme Commission numbers. All the information is stored in and retrievable from a high-performance, web-based and user-friendly relational database called EST model database or ESTMD version 2. Conclusion The ESTMD containing the sequence and annotation information of 4032 E. fetida ESTs is publicly accessible at .

Pirooznia, Mehdi; Gong, Ping; Guan, Xin; Inouye, Laura S; Yang, Kuan; Perkins, Edward J; Deng, Youping

2007-01-01

132

Synergistic use of plant-prokaryote comparative genomics for functional annotations  

PubMed Central

Background Identifying functions for all gene products in all sequenced organisms is a central challenge of the post-genomic era. However, at least 30-50% of the proteins encoded by any given genome are of unknown or vaguely known function, and a large number are wrongly annotated. Many of these ‘unknown’ proteins are common to prokaryotes and plants. We set out to predict and experimentally test the functions of such proteins. Our approach to functional prediction integrates comparative genomics based mainly on microbial genomes with functional genomic data from model microorganisms and post-genomic data from plants. This approach bridges the gap between automated homology-based annotations and the classical gene discovery efforts of experimentalists, and is more powerful than purely computational approaches to identifying gene-function associations. Results Among Arabidopsis genes, we focused on those (2,325 in total) that (i) are unique or belong to families with no more than three members, (ii) occur in prokaryotes, and (iii) have unknown or poorly known functions. Computer-assisted selection of promising targets for deeper analysis was based on homology-independent characteristics associated in the SEED database with the prokaryotic members of each family. In-depth comparative genomic analysis was performed for 360 top candidate families. From this pool, 78 families were connected to general areas of metabolism and, of these families, specific functional predictions were made for 41. Twenty-one predicted functions have been experimentally tested or are currently under investigation by our group in at least one prokaryotic organism (nine of them have been validated, four invalidated, and eight are in progress). Ten additional predictions have been independently validated by other groups. Discovering the function of very widespread but hitherto enigmatic proteins such as the YrdC or YgfZ families illustrates the power of our approach. Conclusions Our approach correctly predicted functions for 19 uncharacterized protein families from plants and prokaryotes; none of these functions had previously been correctly predicted by computational methods. The resulting annotations could be propagated with confidence to over six thousand homologous proteins encoded in over 900 bacterial, archaeal, and eukaryotic genomes currently available in public databases.

2011-01-01

133

A New Strategy to Identify and Annotate Human RPE-Specific Gene Expression  

PubMed Central

Background To identify and functionally annotate cell type-specific gene expression in the human retinal pigment epithelium (RPE), a key tissue involved in age-related macular degeneration and retinitis pigmentosa. Methodology RPE, photoreceptor and choroidal cells were isolated from selected freshly frozen healthy human donor eyes using laser microdissection. RNA isolation, amplification and hybridization to 44 k microarrays were carried out according to Agilent specifications. Bioinformatics was carried out using Rosetta Resolver, DAVID and Ingenuity software. Principal Findings Our previous 22 k analysis of the RPE transcriptome showed that the RPE has high levels of protein synthesis and strong energy demands, and is exposed to high levels of oxidative stress and a variable degree of inflammation. We currently use a complementary new strategy aimed at the identification and functional annotation of RPE-specific expressed transcripts. This strategy takes advantage of the multilayered cellular structure of the retina and overcomes a number of limitations of previous studies. In triplicate, we compared the transcriptomes of RPE, photoreceptor and choroidal cells and we deduced RPE-specific expression. We identified at least 114 entries with RPE-specific gene expression. Thirty-nine of these 114 genes also show high expression in the RPE; comparison with the literature showed that 85% of these 39 were previously identified to be expressed in the RPE. In the group of 114 RPE-specific genes there was an overrepresentation of genes involved in (membrane) transport, vision and ophthalmic disease. More fundamentally, we found RPE-specific involvement in the RAR-activation, retinol metabolism and GABA receptor signaling pathways. Conclusions In this study we provide a further specification and understanding of the RPE transcriptome by identifying and analyzing genes that are specifically expressed in the RPE.

Booij, Judith C.; ten Brink, Jacoline B.; Swagemakers, Sigrid M. A.; Verkerk, Annemieke J. M. H.; Essing, Anke H. W.; van der Spek, Peter J.; Bergen, Arthur A. B.

2010-01-01

134

Transcript-level annotation of Affymetrix probesets improves the interpretation of gene expression data  

PubMed Central

Background The wide use of Affymetrix microarrays across a broadening range of biological research fields has made probeset annotation an important issue. Standard Affymetrix probeset annotation is at the gene level, i.e. a probeset is linked to exactly one gene, and probeset intensity is interpreted as gene expression. The growing recognition that one gene may have multiple transcript variants makes it necessary to refine this gene-level annotation to the transcript level. Results By performing rigorous alignments of the Affymetrix probe sequences against a comprehensive pool of currently available transcript sequences, and further linking the probesets to the International Protein Index, we generated transcript-level or protein-level annotation tables for two popular Affymetrix expression arrays, Mouse Genome 430A 2.0 Array and Human Genome U133A Array. Application of our new annotations in re-examining existing expression data sets shows increased expression consistency among synonymous probesets and strengthened expression correlation between interacting proteins. Conclusion By refining the standard Affymetrix annotation of microarray probesets from the gene level to the transcript level and protein level, one can achieve a more reliable interpretation of experimental data, which may lead to the discovery of more profound regulatory mechanisms.

Yu, Hui; Wang, Feng; Tu, Kang; Xie, Lu; Li, Yuan-Yuan; Li, Yi-Xue

2007-01-01

135

A nuclear magnetic resonance based approach to accurate functional annotation of putative enzymes in the methanogen Methanosarcina acetivorans  

PubMed Central

Background Correct annotation of function is essential if one is to take full advantage of the vast amounts of genomic sequence data. The accuracy of sequence-based functional annotations is often variable, particularly if the sequence homology to a known function is low. Indeed recent work has shown that even proteins with very high sequence identity can have different folds and functions, and therefore caution is needed in assigning functions by sequence homology in the absence of experimental validation. Experimental methods are therefore needed to efficiently evaluate annotations in a way that complements current high throughput technologies. Here, we describe the use of nuclear magnetic resonance (NMR)-based ligand screening as a tool for testing functional assignments of putative enzymes that may be of variable reliability. Results The target genes for this study are putative enzymes from the methanogenic archaeon Methanosarcina acetivorans (MA) that have been selected after manual genome re-annotation and demonstrate detectable in vivo expression at the level of the transcriptome. The experimental approach begins with heterologous E. coli expression and purification of individual MA gene products. An NMR-based ligand screen of the purified protein then identifies possible substrates or products from a library of candidate compounds chosen from the putative pathway and other related pathways. These data are used to determine if the current sequence-based annotation is likely to be correct. For a number of case studies, additional experiments (such as in vivo genetic complementation) were performed to determine function so that the reliability of the NMR screen could be independently assessed. Conclusions In all examples studied, the NMR screen was indicative of whether the functional annotation was correct. 
Thus, the case studies described demonstrate that NMR-based ligand screening is an effective and rapid tool for confirming or negating the annotated gene function of putative enzymes. In particular, no protein-specific assay needs to be developed, which makes the approach broadly applicable for validating putative functions using an automated pipeline strategy.

2011-01-01

136

Initiating the Mollusk Genomics Annotation Community: Toward Creating the Complete Curated Gene-Set of the Japanese Pearl Oyster, Pinctada fucata.  

PubMed

The genome sequence of the Japanese pearl oyster, the first draft genome from a mollusk, was published in February 2012. In order to curate the draft genome assemblies and annotate the predicted gene models, two annotation Jamborees were held in Okinawa and Tokyo. To date, 761 genes have been surveyed and curated. A preparatory meeting and a debriefing were held at the Misaki Marine Biological Station before and after the Jamborees. These four events, in conjunction with the sequence-decoding project, have facilitated the first series of gene annotations. Genome annotators among the Jamboree participants added 22 functional categories to the annotation system to date. Of these, 17 are included in Generic Gene Ontology. The other five categories are specific to molluskan biology, such as "Byssus Formation" and "Shell Formation", including Biomineralization and Acidic Proteins. A total of 731 genes from our latest version of gene models are annotated and classified into these 22 categories. The resulting data will serve as a useful reference for future genomic analyses of this species as well as comparative analyses among mollusks. PMID:24125643

Kawashima, Takeshi; Takeuchi, Takeshi; Koyanagi, Ryo; Kinoshita, Shigeharu; Endo, Hirotoshi; Endo, Kazuyoshi

2013-10-01

137

Annotation of a 95-kb Populus deltoides genomic sequence reveals a disease resistance gene cluster and novel class I and class II transposable elements  

Microsoft Academic Search

Poplar has become a model system for functional genomics in woody plants. Here, we report the sequencing and annotation of the first large contiguous stretch of genomic sequence (95 kb) of poplar, corresponding to a bacterial artificial chromosome clone mapped 0.6 centiMorgan from the Melampsora larici-populina resistance locus. The annotation revealed 15 putative genetic objects, of which five were classified as hypothetical genes

M. Lescot; S. Rombauts; J. Zhang; S. Aubourg; C. Mathé; S. Jansson; P. Rouzé; W. Boerjan

2004-01-01

138

Large-scale collection and annotation of gene models for date palm (Phoenix dactylifera, L.).  

PubMed

The date palm (Phoenix dactylifera L.), famed for its sugar-rich fruits (dates) and cultivated by humans since 4,000 B.C., is an economically important crop in the Middle East, Northern Africa, and increasingly other places where climates are suitable. Despite a long history of human cultivation, the understanding of P. dactylifera genetics and molecular biology is rather limited, hindered by a lack of high-quality basic data from genomics and transcriptomics. Here we report a large-scale effort in generating gene models (expressed sequence tags, or ESTs, assembled and mapped to a genome assembly) for P. dactylifera, using the long-read pyrosequencing platform (Roche/454 GS FLX Titanium) at high coverage. We built fourteen cDNA libraries from different P. dactylifera tissues (cultivar Khalas) and acquired 15,778,993 raw sequencing reads (about one million sequencing reads per library), and the pooled sequences were assembled into 67,651 non-redundant contigs and 301,978 singletons. We annotated 52,725 contigs based on the plant databases and 45 contigs based on functional domains referencing the Pfam database. From the annotated contigs, we assigned GO (Gene Ontology) terms to 36,086 contigs and KEGG pathways to 7,032 contigs. Our comparative analysis showed that 70.6% (47,930), 69.4% (47,089), 68.4% (46,441), and 69.3% (47,048) of the P. dactylifera gene models are shared with rice, sorghum, Arabidopsis, and grapevine, respectively. We also classified our gene models into house-keeping and tissue-specific genes based on their tissue specificity. PMID:22736259

Zhang, Guangyu; Pan, Linlin; Yin, Yuxin; Liu, Wanfei; Huang, Dawei; Zhang, Tongwu; Wang, Lei; Xin, Chengqi; Lin, Qiang; Sun, Gaoyuan; Ba Abdullah, Mohammed M; Zhang, Xiaowei; Hu, Songnian; Al-Mssallem, Ibrahim S; Yu, Jun

2012-06-27

139

Overcoming function annotation errors in the Gram-positive pathogen Streptococcus suis by a proteomics-driven approach  

PubMed Central

Background Annotation of protein-coding genes is a key step in sequencing projects. Protein functions are mainly assigned on the basis of the amino acid sequence alone by searching of homologous proteins. However, fully automated annotation processes often lead to wrong prediction of protein functions, and therefore time-intensive manual curation is often essential. Here we describe a fast and reliable way to correct function annotation in sequencing projects, focusing on surface proteomes. We use a proteomics approach, previously proven to be very powerful for identifying new vaccine candidates against Gram-positive pathogens. It consists of shaving the surface of intact cells with two proteases, the specific cleavage-site trypsin and the unspecific proteinase K, followed by LC/MS/MS analysis of the resulting peptides. The identified proteins are contrasted by computational analysis and their sequences are inspected to correct possible errors in function prediction. Results When applied to the zoonotic pathogen Streptococcus suis, of which two strains have been recently sequenced and annotated, we identified a set of surface proteins without cytoplasmic contamination: all the proteins identified had exporting or retention signals towards the outside and/or the cell surface, and viability of protease-treated cells was not affected. The combination of both experimental evidences and computational methods allowed us to determine that two of these proteins are putative extracellular new adhesins that had been previously attributed a wrong cytoplasmic function. One of them is a putative component of the pilus of this bacterium. Conclusion We illustrate the complementary nature of laboratory-based and computational methods to examine in concert the localization of a set of proteins in the cell, and demonstrate the utility of this proteomics-based strategy to experimentally correct function annotation errors in sequencing projects. 
This approach also contributes to provide strong experimental evidences that can be used to annotate those proteins for which a Gene Ontology (GO) term has not been assigned so far. Function annotation correction would then improve the identification of surface-associated proteins in bacterial pathogens, thus accelerating the discovery of new vaccines in infectious disease research.

Rodriguez-Ortega, Manuel J; Luque, Inmaculada; Tarradas, Carmen; Barcena, Jose A

2008-01-01

140

Structural and Functional Annotation of the Porcine Immunome  

Technology Transfer Automated Retrieval System (TEKTRAN)

The domestic pig is known as an excellent model for human immunology and the two species share many pathogens. Susceptibility to infectious disease is one of the major constraints on swine performance, yet the structure and function of genes comprising the pig immunome are not well-characterized. H...

141

Functional annotation from the genome sequence of the giant panda.  

PubMed

The giant panda is one of the most critically endangered species due to the fragmentation and loss of its habitat. Studying the functions of proteins in this animal, especially specific trait-related proteins, is therefore necessary to protect the species. In this work, the functions of these proteins were investigated using the genome sequence of the giant panda. Data on 21,001 proteins and their functions were stored in the Giant Panda Protein Database, in which the proteins were divided into two groups: 20,179 proteins whose functions can be predicted by GeneScan formed the known-function group, whereas 822 proteins whose functions cannot be predicted by GeneScan comprised the unknown-function group. For the known-function group, we further classified the proteins by molecular function, biological process, cellular component, and tissue specificity. For the unknown-function group, we developed a strategy in which the proteins were filtered by cross-Blast to identify panda-specific proteins under the assumption that proteins related to the panda-specific traits in the unknown-function group exist. After this filtering procedure, we identified 32 proteins (2 of which are membrane proteins) specific to the giant panda genome as compared against the dog and horse genomes. Based on their amino acid sequences, these 32 proteins were further analyzed by functional classification using SVM-Prot, motif prediction using MyHits, and interacting protein prediction using the Database of Interacting Proteins. Nineteen proteins were predicted to be zinc-binding proteins, thus affecting the activities of nucleic acids. The 32 panda-specific proteins will be further investigated by structural and functional analysis. PMID:22865348

Huo, Tong; Zhang, Yinjie; Lin, Jianping

2012-07-21

142

Exploiting proteomic data for genome annotation and gene model validation in Aspergillus niger  

Microsoft Academic Search

BACKGROUND: Proteomic data is a potentially rich, but arguably unexploited, data source for genome annotation. Peptide identifications from tandem mass spectrometry provide prima facie evidence for gene predictions and can discriminate over a set of candidate gene models. Here we apply this to the recently sequenced Aspergillus niger fungal genome from the Joint Genome Institutes (JGI) and another predicted protein

James C Wright; Deana Sugden; Sue Francis-McIntyre; Isabel Riba-Garcia; Simon J Gaskell; Igor V Grigoriev; Scott E Baker; Robert J Beynon; Simon J Hubbard

2009-01-01

143

Text-based over-representation analysis of microarray gene lists with annotation bias  

PubMed Central

A major challenge in microarray data analysis is the functional interpretation of gene lists. A common approach to address this is over-representation analysis (ORA), which uses the hypergeometric test (or its variants) to evaluate whether a particular functionally defined group of genes is represented more than expected by chance within a gene list. Existing applications of ORA have been largely limited to pre-defined terminologies such as GO and KEGG. We report our explorations of whether ORA can be applied to a wider mining of free-text. We found that a hitherto underappreciated feature of experimentally derived gene lists is that the constituents have substantially more annotation associated with them, as they have been researched upon for a longer period of time. This bias, a result of patterns of research activity within the biomedical community, is a major problem for classical hypergeometric test-based ORA approaches, which cannot account for such bias. We have therefore developed three approaches to overcome this bias, and demonstrate their usability in a wide range of published datasets covering different species. A comparison with existing tools that use GO terms suggests that mining PubMed abstracts can reveal additional biological insight that may not be possible by mining pre-defined ontologies alone.

Leong, Hui Sun; Kipling, David

2009-01-01
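
The hypergeometric test that underlies classical ORA, as described in the abstract above, can be illustrated with a minimal sketch. It assumes the simplest form of the test (a one-sided upper-tail p-value) and does not reproduce the bias-corrected approaches the authors developed. The function name `hypergeom_upper_tail` and the toy numbers are hypothetical and chosen only for illustration.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """Return P(X >= k) for a hypergeometric draw.

    N: genes in the background set
    K: background genes annotated with the term of interest
    n: genes in the experimentally derived gene list
    k: genes in the list that carry the term
    """
    total = comb(N, n)  # number of ways to draw the gene list
    upper = min(K, n)   # the overlap cannot exceed either set size
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible overlaps contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Toy example (hypothetical numbers): 20 background genes, 5 carry the term,
# the gene list has 4 genes and all 4 carry the term.
p_value = hypergeom_upper_tail(k=4, N=20, K=5, n=4)
print(f"p = {p_value:.6f}")
```

Note that this test treats every gene as equally likely to carry an annotation, which is exactly the assumption the abstract identifies as problematic when well-studied genes have accumulated more annotation.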

144

BrEPS: a flexible and automatic protocol to compute enzyme-specific sequence profiles for functional annotation  

PubMed Central

Background Models for the simulation of metabolic networks require the accurate prediction of enzyme function. Based on a genomic sequence, enzymatic functions of gene products are today mainly predicted by sequence database searching and operon analysis. Other methods can support these techniques: We have developed an automatic method "BrEPS" that creates highly specific sequence patterns for the functional annotation of enzymes. Results The enzymes in the UniprotKB are identified and their sequences compared against each other with BLAST. The enzymes are then clustered into a number of trees, where each tree node is associated with a set of EC-numbers. The enzyme sequences in the tree nodes are aligned with ClustalW. The conserved columns of the resulting multiple alignments are used to construct sequence patterns. In the last step, we verify the quality of the patterns by computing their specificity. Patterns with low specificity are omitted and recomputed further down in the tree. The final high-quality patterns can be used for functional annotation. We ran our protocol on a recent Swiss-Prot release and show statistics, as well as a comparison to PRIAM, a probabilistic method that is also specialized on the functional annotation of enzymes. We determine the amount of true positive annotations for five common microorganisms with data from BRENDA and AMENDA serving as standard of truth. BrEPS is almost on par with PRIAM, a fact which we discuss in the context of five manually investigated cases. Conclusions Our protocol computes highly specific sequence patterns that can be used to support the functional annotation of enzymes. The main advantages of our method are that it is automatic and unsupervised, and quite fast once the patterns are evaluated. The results show that BrEPS can be a valuable addition to the reconstruction of metabolic networks.

2010-01-01
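
The pattern-verification step described above (compute each pattern's specificity and discard patterns that score too low) can be sketched roughly as follows. This is an illustration under stated assumptions, not the BrEPS implementation: the abstract does not give the pattern syntax or the exact specificity formula, so the sketch uses Python regular expressions as a stand-in for sequence patterns and defines specificity as the fraction of matching sequences that carry the target EC number. The function name, sequences and EC numbers are hypothetical toy data.

```python
import re


def pattern_specificity(pattern, labelled_sequences, target_ec):
    """Fraction of sequences matched by `pattern` that carry `target_ec`.

    labelled_sequences: iterable of (sequence, ec_number) pairs.
    Returns None when the pattern matches nothing.
    """
    matched_ecs = [ec for seq, ec in labelled_sequences if re.search(pattern, seq)]
    if not matched_ecs:
        return None
    return sum(ec == target_ec for ec in matched_ecs) / len(matched_ecs)


# Hypothetical toy data: (sequence, EC number)
toy_db = [
    ("MKHGDSAA", "1.1.1.1"),
    ("TTHGDSLL", "1.1.1.1"),
    ("QQHGESPP", "2.7.1.1"),
    ("AAAAAAAA", "2.7.1.1"),
]

loose = pattern_specificity(r"HG[DE]S", toy_db, "1.1.1.1")   # also hits the 2.7.1.1 entry
strict = pattern_specificity(r"HGDS", toy_db, "1.1.1.1")     # hits only 1.1.1.1 entries
print(loose, strict)
```

In this toy setting the looser pattern would be a candidate for omission or recomputation, while the stricter one would be kept.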

145

Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research  

Microsoft Academic Search

Summary: We present here Blast2GO (B2G), a research tool designed with the main purpose of enabling Gene Ontology (GO) based data mining on sequence data for which no GO annotation is yet available. B2G joins in one application GO annotation based on similarity searches with statistical analysis and highlighted visualization on directed acyclic graphs. This tool offers a suitable

Ana Conesa; Stefan Götz; Juan Miguel García-gómez; Javier Terol; Manuel Talón; Montserrat Robles

2005-01-01

146

Re-annotation of the CAZy genes of Trichoderma reesei and transcription in the presence of lignocellulosic substrates  

PubMed Central

Background Trichoderma reesei is a soft rot Ascomycota fungus utilised for industrial production of secreted enzymes, especially lignocellulose degrading enzymes. About 30 carbohydrate active enzymes (CAZymes) of T. reesei have been biochemically characterised. Genome sequencing has revealed a large number of novel candidates for CAZymes, thus increasing the potential for identification of enzymes with novel activities and properties. Plenty of data exists on the carbon source dependent regulation of the characterised hydrolytic genes. However, information on the expression of the novel CAZyme genes, especially on complex biomass material, is very limited. Results In this study, the CAZyme gene content of the T. reesei genome was updated and the annotations of the genes refined using both computational and manual approaches. Phylogenetic analysis was done to assist the annotation and to identify functionally diversified CAZymes. The analyses identified 201 glycoside hydrolase genes, 22 carbohydrate esterase genes and five polysaccharide lyase genes. Updated or novel functional predictions were assigned to 44 genes, and the phylogenetic analysis indicated further functional diversification within enzyme families or groups of enzymes. GH3 β-glucosidases, GH27 α-galactosidases and GH18 chitinases were especially functionally diverse. The expression of the lignocellulose degrading enzyme system of T. reesei was studied by cultivating the fungus in the presence of different inducing substrates and by subjecting the cultures to transcriptional profiling. The substrates included both defined and complex lignocellulose related materials, such as pretreated bagasse, wheat straw, spruce, xylan, Avicel cellulose and sophorose. The analysis revealed co-regulated groups of CAZyme genes, such as genes induced in all the conditions studied and also genes induced preferentially by a certain set of substrates. Conclusions In this study, the CAZyme content of the T. reesei genome was updated, the discrepancies between the different genome versions and published literature were removed and the annotation of many of the genes was refined. Expression analysis of the genes gave information on the enzyme activities potentially induced by the presence of the different substrates. Comparison of the expression profiles of the CAZyme genes under the different conditions identified co-regulated groups of genes, suggesting common regulatory mechanisms for the gene groups.

2012-01-01

147

Joint stage recognition and anatomical annotation of drosophila gene expression patterns  

PubMed Central

Motivation: Staining the mRNA of a gene via in situ hybridization (ISH) during the development of a Drosophila melanogaster embryo delivers the detailed spatio-temporal patterns of the gene expression. Many related biological problems such as the detection of co-expressed genes, co-regulated genes and transcription factor binding motifs rely heavily on the analysis of these image patterns. To provide text-based pattern searching that facilitates related biological studies, the images in the Berkeley Drosophila Genome Project (BDGP) study are annotated manually by domain experts with developmental stage terms and anatomical ontology terms. Due to the rapid increase in the number of such images and the inevitably biased annotations by human curators, it is necessary to develop an automatic method to recognize the developmental stage and annotate anatomical terms. Results: In this article, we propose a novel computational model for joint stage classification and anatomical term annotation of Drosophila gene expression patterns. We propose a novel Tri-Relational Graph (TG) model that comprises the data graph, the anatomical term graph and the developmental stage term graph, and connects them by two additional graphs induced from stage or annotation label assignments. Upon the TG model, we introduce a Preferential Random Walk (PRW) method to jointly recognize developmental stage and annotate anatomical terms by utilizing the interrelations between the two tasks. The experimental results on two refined BDGP datasets demonstrate that our joint learning method achieves better prediction results on both tasks than the state-of-the-art methods. Availability: http://ranger.uta.edu/%7eheng/Drosophila/ Contact: heng@uta.edu

Cai, Xiao; Wang, Hua; Huang, Heng; Ding, Chris

2012-01-01

148

Analysis of EST-driven gene annotation in human genomic sequence.  

PubMed

We have performed a systematic analysis of gene identification in genomic sequence by similarity search against expressed sequence tags (ESTs) to assess the suitability of this method for automated annotation of the human genome. A BLAST-based strategy was constructed to examine the potential of this approach, and was applied to test sets containing all human genomic sequences longer than 5 kb in public databases, plus 300 kb of exhaustively characterized benchmark sequence. At high stringency, 70%-90% of all annotated genes are detected by near-identity to EST sequence; >95% of ESTs aligning with well-annotated sequences overlap a gene. These ESTs provide immediate access to the corresponding cDNA clones for follow-up laboratory verification and subsequent biologic analysis. At lower stringency, up to 97% of annotated genes were identified by similarity to ESTs. The apparent false-positive rate rose to 55% of ESTs among all sequences and 20% among benchmark sequences at the lowest stringency, indicating that many genes in public database entries are unannotated. Approximately half of the alignments span multiple exons, and thus aid in the construction of gene predictions and elucidation of alternative splicing. In addition, ESTs from multiple cDNA libraries frequently cluster over genes, providing a starting point for crude expression profiles. Clone IDs may be used to form EST pairs, and particularly to extend models by associating alignments of lower stringency with high-quality alignments. These results demonstrate that EST similarity search is a practical general-purpose annotation technique that complements pattern recognition methods as a tool for gene characterization. PMID:9548972

Bailey, L C; Searls, D B; Overton, G C

1998-04-01

149

Re-Annotation of Protein-Coding Genes in 10 Complete Genomes of Neisseriaceae Family by Combining Similarity-Based and Composition-Based Methods  

PubMed Central

In this paper, we performed a comprehensive re-annotation of protein-coding genes by a systematic method combining composition- and similarity-based approaches in 10 complete bacterial genomes of the family Neisseriaceae. First, 418 hypothetical genes were predicted as non-coding using the composition-based method and 413 were eliminated from the gene list. Both the scatter plot and cluster of orthologous groups (COG) fraction analyses supported the result. Second, from 20 to 400 hypothetical proteins were assigned functions in each of the 10 strains based on the homology search. Among the newly assigned functions, 397 are detailed enough to have definite gene names. Third, 106 genes missed by the original annotations were picked up by an ab initio gene finder combined with similarity alignment. Transcriptional experiments validated the effectiveness of this method in Laribacter hongkongensis and Chromobacterium violaceum. Among the 106 newly found genes, some deserve particular interest. For example, 27 transposases were newly found in Neisseria meningitidis alpha14. In Neisseria gonorrhoeae NCCP11945, four new genes with putative functions and definite names (nusG, rpsN, rpmD and infA) were found, and their homologues are usually essential for survival in bacteria. The updated annotations for the 10 Neisseriaceae genomes provide a more accurate prediction of protein-coding genes and more detailed functional information on hypothetical proteins. They will benefit research into the lifestyle, metabolism, environmental adaptation and pathogenicity of the Neisseriaceae species. The re-annotation procedure could be used directly, or after adaptation of the detailed methods, for checking annotations of any other bacterial or archaeal genomes.

Guo, Feng-Biao; Xiong, Lifeng; Teng, Jade L. L.; Yuen, Kwok-Yung; Lau, Susanna K. P.; Woo, Patrick C. Y.

2013-01-01

150

A Statistical Framework for Improving Genomic Annotations of Prokaryotic Essential Genes  

PubMed Central

Large-scale systematic analysis of gene essentiality is an important step toward unraveling the complex relationship between genotypes and phenotypes. Such analysis cannot be accomplished without unbiased and accurate annotations of essential genes. In current genomic databases, most of the essential gene annotations are derived from whole-genome transposon mutagenesis (TM), the most frequently used experimental approach for determining essential genes in microorganisms under defined conditions. However, there are substantial systematic biases associated with TM experiments. In this study, we developed a novel Poisson model–based statistical framework to simulate the TM insertion process and subsequently correct the experimental biases. We first quantitatively assessed the effects of major factors that potentially influence the accuracy of TM and subsequently incorporated relevant factors into the framework. By iteratively optimizing parameters, we inferred the actual insertion events that occurred and described each gene’s essentiality with a probability measure. Evaluated against the definitive mapping of the essential gene profile in Escherichia coli, our model significantly improved the accuracy of the original TM datasets, resulting in more accurate annotations of essential genes. Our method also showed encouraging results in improving subsaturation level TM datasets. To test our model’s broad applicability to other bacteria, we applied it to Pseudomonas aeruginosa PAO1 and Francisella tularensis novicida TM datasets. We validated our predictions by literature as well as allelic exchange experiments in PAO1. Our model was correct on six of the seven tested genes. Remarkably, in all three cases where our predictions contradicted the TM assignments, experimental validations supported our predictions. In summary, our method will be a promising tool in improving genomic annotations of essential genes and enabling large-scale explorations of gene essentiality. 
Our contribution is timely considering the rapidly increasing essential gene sets. A Webserver has been set up to provide convenient access to this tool. All results and source codes are available for download upon publication at http://research.cchmc.org/essentialgene/.

Deng, Jingyuan; Su, Shengchang; Lin, Xiaodong; Hassett, Daniel J.; Lu, Long Jason

2013-01-01
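
To illustrate why a Poisson view of transposon insertion matters for essentiality calls, here is a minimal sketch. It is not the authors' framework, which the abstract says incorporates several bias factors and iterative parameter optimization. The sketch assumes only a uniform insertion rate across the genome and asks how likely a non-essential gene is to receive zero insertions purely by chance. The function name and all numbers are hypothetical.

```python
import math


def prob_no_insertion(gene_length_bp, total_insertions, genome_length_bp):
    """Chance that a gene receives zero insertions under a uniform Poisson model.

    Assumption (not from the source): insertions land uniformly at random,
    so the expected count in a gene is rate_per_bp * gene_length_bp.
    """
    rate_per_bp = total_insertions / genome_length_bp
    expected_hits = rate_per_bp * gene_length_bp
    return math.exp(-expected_hits)  # Poisson P(X = 0)


# Hypothetical subsaturated library: 4,600 insertions in a 4.6 Mb genome.
for length in (300, 1000, 3000):
    p0 = prob_no_insertion(length, 4_600, 4_600_000)
    print(f"{length} bp gene: P(no insertion by chance) = {p0:.3f}")
```

Under these toy numbers a short gene is quite likely to escape insertion by chance alone, which is one reason a gene without insertions in a subsaturated library should not automatically be annotated as essential.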

151

Application and evaluation of automated semantic annotation of gene expression experiments  

PubMed Central

Motivation: Many microarray datasets are available online with formalized standards describing the probe sequences and expression values. Unfortunately, the description, conditions and parameters of the experiments are less commonly formalized and often occur as natural language text. This hinders searching, high-throughput analysis, organization and integration of the datasets. Results: We use the lexical resources and software tools from the Unified Medical Language System (UMLS) to extract concepts from text. We then link the UMLS concepts to classes in open biomedical ontologies. The result is accessible and clear semantic annotations of gene expression experiments. We applied the method to 595 expression experiments from Gemma, a resource for re-use and meta-analysis of gene expression profiling data. We evaluated and corrected all stages of the annotation process. The majority of missed annotations were due to a lack of cross-references. The most error-prone stage was the extraction of concepts from phrases. Final review of the annotations in context of the experiments revealed 89% precision. A naive system, lacking the phrase to concept corrections is 68% precise. We have integrated this annotation pipeline into Gemma. Availability: The source code, documentation and Supplementary Materials are available at http://www.chibi.ubc.ca/GEOMMTX. The results of the manual evaluations are provided as Supplementary Material. Both manual and predicted annotations can be viewed and searched via the Gemma website at http://www.chibi.ubc.ca/Gemma. The complete set of predicted annotations is available as a machine readable resource description framework graph. Contact: paul@chibi.ubc.ca

French, Leon; Lane, Suzanne; Law, Tamryn; Xu, Lydia; Pavlidis, Paul

2009-01-01

152

A statistical framework for improving genomic annotations of prokaryotic essential genes.  

PubMed

Large-scale systematic analysis of gene essentiality is an important step toward unraveling the complex relationship between genotypes and phenotypes. Such analysis cannot be accomplished without unbiased and accurate annotations of essential genes. In current genomic databases, most of the essential gene annotations are derived from whole-genome transposon mutagenesis (TM), the most frequently used experimental approach for determining essential genes in microorganisms under defined conditions. However, there are substantial systematic biases associated with TM experiments. In this study, we developed a novel Poisson model-based statistical framework to simulate the TM insertion process and subsequently correct the experimental biases. We first quantitatively assessed the effects of major factors that potentially influence the accuracy of TM and subsequently incorporated relevant factors into the framework. By iteratively optimizing parameters, we inferred the insertion events that actually occurred and described each gene's essentiality as a probability. Evaluated against the definitive mapping of the essential gene profile in Escherichia coli, our model significantly improved the accuracy of the original TM datasets, resulting in more accurate annotations of essential genes. Our method also showed encouraging results in improving subsaturation-level TM datasets. To test our model's broad applicability to other bacteria, we applied it to Pseudomonas aeruginosa PAO1 and Francisella tularensis novicida TM datasets. We validated our predictions against the literature as well as by allelic exchange experiments in PAO1. Our model was correct on six of the seven tested genes. Remarkably, in all three cases where our predictions contradicted the TM assignments, experimental validation supported our predictions. In summary, our method will be a promising tool for improving genomic annotations of essential genes and enabling large-scale explorations of gene essentiality. Our contribution is timely considering the rapidly increasing number of essential gene sets. A web server has been set up to provide convenient access to this tool. All results and source code are available for download upon publication at http://research.cchmc.org/essentialgene/. PMID:23520492

Deng, Jingyuan; Su, Shengchang; Lin, Xiaodong; Hassett, Daniel J; Lu, Long Jason

2013-03-08
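
The abstract above does not give the details of its Poisson framework, so the following is only a minimal illustrative sketch of the general idea, not the authors' model. It assumes insertions land uniformly at a hypothetical rate `insertions_per_bp`, so a non-essential gene of length L receives zero insertions with probability exp(-rate * L). It then applies Bayes' rule with a hypothetical prior to show why an insertion-free short gene is weaker evidence of essentiality than an insertion-free long gene. All function names, parameters and numbers are assumptions made for illustration.

```python
import math


def prob_zero_insertions(gene_length_bp, insertions_per_bp):
    """Poisson probability that a non-essential gene receives no insertion."""
    return math.exp(-insertions_per_bp * gene_length_bp)


def posterior_essential(gene_length_bp, insertions_per_bp,
                        observed_insertions, prior_essential=0.1):
    """Posterior probability of essentiality under the simplified model.

    Simplifying assumption (not from the abstract): an essential gene
    never tolerates an insertion, so any observed insertion gives 0.0.
    """
    if observed_insertions > 0:
        return 0.0
    p0 = prob_zero_insertions(gene_length_bp, insertions_per_bp)
    return prior_essential / (prior_essential + (1.0 - prior_essential) * p0)


if __name__ == "__main__":
    # Hypothetical density: one insertion per 100 bp on average.
    for length in (100, 500, 2000):
        print(length, round(posterior_essential(length, 0.01, 0), 3))
```

With these assumed values, a 100 bp gene with no insertions remains fairly likely to be non-essential, while a 2000 bp gene with no insertions is almost certainly essential. This is the kind of length-dependent bias a statistical correction has to account for.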

153

Long non-coding RNAs function annotation: a global prediction method based on bi-colored networks.  

PubMed

Growing evidence demonstrates that long non-coding RNAs (lncRNAs) play key roles in diverse biological processes. There is a critical need to annotate the functions of the increasing number of available lncRNAs. In this article, we apply a global network-based strategy to tackle this issue for the first time. We develop a bi-colored network-based global function predictor, long non-coding RNA global function predictor ('lnc-GFP'), to predict probable functions for lncRNAs at large scale by integrating gene expression data and protein interaction data. The performance of lnc-GFP is evaluated on protein-coding and lncRNA genes. Cross-validation tests on protein-coding genes with known function annotations indicate that our method can achieve a precision of up to 95% with a suitable parameter setting. Among the 1713 lncRNAs in the bi-colored network, the 1625 (94.9%) lncRNAs in the maximum connected component are all functionally characterized. For the lncRNAs expressed in mouse embryonic stem cells and neuronal cells, the putative functions inferred by our method closely match those in the known literature. PMID:23132350

Guo, Xingli; Gao, Lin; Liao, Qi; Xiao, Hui; Ma, Xiaoke; Yang, Xiaofei; Luo, Haitao; Zhao, Guoguang; Bu, Dechao; Jiao, Fei; Shao, Qixiang; Chen, RunSheng; Zhao, Yi

2012-11-05

154

TriAnnot: A Versatile and High Performance Pipeline for the Automated Annotation of Plant Genomes  

PubMed Central

In support of the international effort to obtain a reference sequence of the bread wheat genome and to provide plant communities dealing with large and complex genomes with a versatile, easy-to-use online automated tool for annotation, we have developed the TriAnnot pipeline. Its modular architecture allows for the annotation and masking of transposable elements, the structural and functional annotation of protein-coding genes with an evidence-based quality indexing, and the identification of conserved non-coding sequences and molecular markers. The TriAnnot pipeline is parallelized on a 712 CPU computing cluster that can run a 1-Gb sequence annotation in less than 5 days. It is accessible through a web interface for small-scale analyses or through a server for large-scale annotations. The performance of TriAnnot was evaluated in terms of sensitivity, specificity, and general fitness using curated reference sequence sets from rice and wheat. In less than 8 h, TriAnnot was able to predict more than 83% of the 3,748 CDS from rice chromosome 1 with a fitness of 67.4%. On a set of 12 reference Mb-sized contigs from wheat chromosome 3B, TriAnnot predicted and annotated 93.3% of the genes, among which 54% were perfectly identified in accordance with the reference annotation. It also allowed the curation of 12 genes based on new biological evidence, increasing the percentage of perfect gene prediction to 63%. TriAnnot systematically showed a higher fitness than other annotation pipelines that are not improved for wheat. As it is easily adaptable to the annotation of other plant genomes, TriAnnot should become a useful resource for the annotation of large and complex genomes in the future.

Leroy, Philippe; Guilhot, Nicolas; Sakai, Hiroaki; Bernard, Aurelien; Choulet, Frederic; Theil, Sebastien; Reboux, Sebastien; Amano, Naoki; Flutre, Timothee; Pelegrin, Celine; Ohyanagi, Hajime; Seidel, Michael; Giacomoni, Franck; Reichstadt, Mathieu; Alaux, Michael; Gicquello, Emmanuelle; Legeai, Fabrice; Cerutti, Lorenzo; Numa, Hisataka; Tanaka, Tsuyoshi; Mayer, Klaus; Itoh, Takeshi; Quesneville, Hadi; Feuillet, Catherine

2012-01-01

155

De Novo Assembly, Characterization and Functional Annotation of Pineapple Fruit Transcriptome through Massively Parallel Sequencing  

PubMed Central

Background Pineapple (Ananas comosus var. comosus) is an important tropical non-climacteric fruit with high commercial potential. Understanding the mechanisms and processes underlying fruit ripening would enable scientists to improve quality traits such as flavor, texture, appearance and fruit sweetness. Although the pineapple is an important fruit, insufficient transcriptomic or genomic information is available in public databases. Application of high-throughput transcriptome sequencing to profile the pineapple fruit transcripts is therefore needed. Methodology/Principal Findings To facilitate this, we have performed transcriptome sequencing of ripe yellow pineapple fruit flesh using Illumina technology. About 4.7 million Illumina paired-end reads were generated and assembled using the Velvet de novo assembler. The assembly produced 28,728 unique transcripts with a mean length of approximately 200 bp. Sequence similarity search against the non-redundant NCBI database identified a total of 16,932 unique transcripts (58.93%) with significant hits. Out of these, 15,507 unique transcripts were assigned to gene ontology terms. Functional annotation against the Kyoto Encyclopedia of Genes and Genomes pathway database identified 13,598 unique transcripts (47.33%), which were mapped to 126 pathways. The assembly revealed many transcripts that were previously unknown. Conclusions The unique transcripts derived from this work have rapidly increased the number of pineapple fruit mRNA transcripts now available in public databases. This information can be further utilized in gene expression, genomics and other functional genomics studies in pineapple.

Ong, Wen Dee; Voo, Lok-Yung Christopher; Kumar, Vijay Subbiah

2012-01-01

156

The DAWGPAWS pipeline for the annotation of genes and transposable elements in plant genomes  

Microsoft Academic Search

BACKGROUND: High-quality annotation of the genes and transposable elements in complex genomes requires a human-curated integration of multiple sources of computational evidence. This evidence includes results from a diversity of ab initio prediction programs as well as homology-based searches. Most of these programs operate on a single contiguous sequence at a time, and the results are generated in a

James C Estill; Jeffrey L Bennetzen

2009-01-01

157

Assessment of Gene Annotation Accuracy by Inferring Transcripts from RNA-Seq  

Microsoft Academic Search

Next generation sequencing is quickly changing long-standing paradigms of genomics in terms of what is feasible to accomplish within a "research life time" and what is supposed to remain beyond limits of reliable experimental analysis. Sequencing and mapping of a prokaryote transcriptome can provide experimental validation for computationally predicted genes annotated in a prokaryotic genome. In this study we

Jeffrey Martin; Wenhan Zhu; Nicholas H. Bergman; Mark Borodovsky

2009-01-01

158

The Physalis peruviana leaf transcriptome: assembly, annotation and gene model prediction  

PubMed Central

Background Physalis peruviana commonly known as Cape gooseberry is a member of the Solanaceae family that has an increasing popularity due to its nutritional and medicinal values. A broad range of genomic tools is available for other Solanaceae, including tomato and potato. However, limited genomic resources are currently available for Cape gooseberry. Results We report the generation of a total of 652,614 P. peruviana Expressed Sequence Tags (ESTs), using 454 GS FLX Titanium technology. ESTs, with an average length of 371 bp, were obtained from a normalized leaf cDNA library prepared using a Colombian commercial variety. De novo assembling was performed to generate a collection of 24,014 isotigs and 110,921 singletons, with an average length of 1,638 bp and 354 bp, respectively. Functional annotation was performed using NCBI’s BLAST tools and Blast2GO, which identified putative functions for 21,191 assembled sequences, including gene families involved in all the major biological processes and molecular functions as well as defense response and amino acid metabolism pathways. Gene model predictions in P. peruviana were obtained by using the genomes of Solanum lycopersicum (tomato) and Solanum tuberosum (potato). We predict 9,436 P. peruviana sequences with multiple-exon models and conserved intron positions with respect to the potato and tomato genomes. Additionally, to study species diversity we developed 5,971 SSR markers from assembled ESTs. Conclusions We present the first comprehensive analysis of the Physalis peruviana leaf transcriptome, which will provide valuable resources for development of genetic tools in the species. Assembled transcripts with gene models could serve as potential candidates for marker discovery with a variety of applications including: functional diversity, conservation and improvement to increase productivity and fruit quality. P. peruviana was estimated to be phylogenetically branched out before the divergence of five other Solanaceae family members, S. lycopersicum, S. tuberosum, Capsicum spp, S. melongena and Petunia spp.

2012-01-01

159

Evolutionary Trace Annotation Server: automated enzyme function prediction in protein structures using 3D templates  

PubMed Central

Summary: The Evolutionary Trace Annotation (ETA) Server predicts enzymatic activity. ETA starts with a structure of unknown function, such as those from structural genomics, and with no prior knowledge of its mechanism uses the phylogenetic Evolutionary Trace (ET) method to extract key functional residues and propose a function-associated 3D motif, called a 3D template. ETA then searches previously annotated structures for geometric template matches that suggest molecular and thus functional mimicry. In order to maximize the predictive value of these matches, ETA next applies distinctive specificity filters: evolutionary similarity, function plurality and match reciprocity. In large scale controls on enzymes, prediction coverage is 43% but the positive predictive value rises to 92%, thus minimizing false annotations. Users may modify any search parameter, including the template. ETA thus expands the ET suite for protein structure annotation, and can contribute to the annotation efforts of metaservers. Availability: The ETA Server is a web application available at http://mammoth.bcm.tmc.edu/eta/. Contact: lichtarge@bcm.edu

Matthew Ward, R.; Venner, Eric; Daines, Bryce; Murray, Stephen; Erdin, Serkan; Kristensen, David M.; Lichtarge, Olivier

2009-01-01
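
The ETA abstract above reports two figures, prediction coverage and positive predictive value, that trade off against each other. The short sketch below shows how the two quantities are conventionally computed from raw counts. The function name and the counts in the usage example are hypothetical, chosen only for illustration, and are not taken from the ETA evaluation.

```python
def coverage_and_ppv(n_queries, n_predicted, n_correct):
    """Return (coverage, positive predictive value) from raw counts.

    coverage = fraction of queries that received any prediction
    PPV      = fraction of the issued predictions that were correct
    """
    if n_queries <= 0:
        raise ValueError("n_queries must be positive")
    coverage = n_predicted / n_queries
    ppv = n_correct / n_predicted if n_predicted else 0.0
    return coverage, ppv


if __name__ == "__main__":
    # Hypothetical counts: 200 query structures, 80 predictions, 72 correct.
    cov, ppv = coverage_and_ppv(200, 80, 72)
    print(f"coverage={cov:.2f} PPV={ppv:.2f}")
```

Stricter filters typically lower `n_predicted` (reducing coverage) while removing mostly incorrect matches (raising PPV), which is the trade-off the abstract describes.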

160

The DAVID Gene Functional Classification Tool: a novel biological module-centric algorithm to functionally analyze large gene lists  

PubMed Central

The DAVID Gene Functional Classification Tool uses a novel agglomeration algorithm to condense a list of genes or associated biological terms into organized classes of related genes or biology, called biological modules. This organization is accomplished by mining the complex biological co-occurrences found in multiple sources of functional annotation. It is a powerful method to group functionally related genes and terms into a manageable number of biological modules for efficient interpretation of gene lists in a network context.

Huang, Da Wei; Sherman, Brad T; Tan, Qina; Collins, Jack R; Alvord, W Gregory; Roayaei, Jean; Stephens, Robert; Baseler, Michael W; Lane, H Clifford; Lempicki, Richard A

2007-01-01

161

Differential annotation of tRNA genes with anticodon CAT in bacterial genomes  

Microsoft Academic Search

We have developed three strategies to discriminate among the three types of tRNA genes with anticodon CAT (tRNAIle, elongator tRNAMet and initiator tRNAfMet) in bacterial genomes. With these strategies, we have classified the tRNA genes from 234 bacterial and several organellar genomes. These sequences, in an aligned or unaligned format, may be used for the identification and annotation of tRNA

Francisco J. Silva; Eugeni Belda; Santiago E. Talens

2006-01-01

162

BioBuilder as a database development and functional annotation platform for proteins  

PubMed Central

Background The explosion in biological information creates the need for databases that are easy to develop, easy to maintain and can be easily manipulated by annotators who are most likely to be biologists. However, deployment of scalable and extensible databases is not an easy task and generally requires substantial expertise in database development. Results BioBuilder is a Zope-based software tool that was developed to facilitate intuitive creation of protein databases. Protein data can be entered and annotated through web forms along with the flexibility to add customized annotation features to protein entries. A built-in review system permits a global team of scientists to coordinate their annotation efforts. We have already used BioBuilder to develop the Human Protein Reference Database, a comprehensive annotated repository of the human proteome. The data can be exported in the extensible markup language (XML) format, which is rapidly becoming the standard format for data exchange. Conclusions As the proteomic data for several organisms begin to accumulate, BioBuilder will prove to be an invaluable platform for functional annotation and development of customizable protein-centric databases. BioBuilder is open source and is available under the terms of LGPL.

Navarro, J Daniel; Talreja, Naveen; Peri, Suraj; Vrushabendra, BM; Rashmi, BP; Padma, N; Surendranath, Vineeth; Jonnalagadda, Chandra Kiran; Kousthub, PS; Deshpande, Nandan; Shanker, K; Pandey, Akhilesh

2004-01-01

163

Towards a semi-automatic functional annotation tool based on decision-tree techniques  

PubMed Central

Background Due to the continuous improvements of high-throughput technologies and experimental procedures, the number of sequenced genomes is increasing exponentially. Ultimately, the task of annotating these data relies on the expertise of biologists. The necessity for annotation to be supervised by human experts is the rate-limiting step of the data analysis. To face the deluge of new genomic data, the need for automating, as much as possible, the annotation process becomes critical. Results We consider annotation of a protein with terms of the functional hierarchy that has been used to annotate Bacillus subtilis and propose a set of rules that predict classes in terms of elements of the functional hierarchy, i.e., a class is a node or a leaf of the hierarchy tree. The rules are obtained through two decision-tree techniques: first-order decision trees and multilabel attribute-value decision trees, using as training data the proteins from two lactic bacteria: Lactobacillus sakei and Lactobacillus bulgaricus. We tested the two methods, first independently, then in a combined approach, and evaluated the obtained results using hierarchical evaluation measures. Results obtained for the two approaches on both genomes are comparable and show good precision together with a high prediction rate. Using combined approaches increases the recall and the prediction rate. Conclusion The combination of the two approaches is very encouraging and we will further refine these combinations in order to get rules even more useful for the annotators. This first study is a crucial step towards designing a semi-automatic functional annotation tool.

Aze, Jerome; Gentils, Lucie; Toffano-Nioche, Claire; Loux, Valentin; Gibrat, Jean-Francois; Bessieres, Philippe; Rouveirol, Celine; Poupon, Anne; Froidevaux, Christine

2008-01-01

164

A computational approach to candidate gene prioritization for X-linked mental retardation using annotation-based binary filtering and motif-based linear discriminatory analysis  

PubMed Central

Background Several computational candidate gene selection and prioritization methods have recently been developed. These in silico selection and prioritization techniques are usually based on two central approaches: the examination of similarities to known disease genes and/or the evaluation of functional annotation of genes. Each of these approaches has its own caveats. Here we employ a previously described method of candidate gene prioritization based mainly on gene annotation, together with a technique based on the evaluation of pertinent sequence motifs or signatures, in an attempt to refine the gene prioritization approach. We apply this approach to X-linked mental retardation (XLMR), a group of heterogeneous disorders for which some of the underlying genetics is known. Results The gene annotation-based binary filtering method yielded a ranked list of putative XLMR candidate genes with good plausibility of being associated with the development of mental retardation. In parallel, a motif-finding approach based on linear discriminatory analysis (LDA) was employed to identify short sequence patterns that may discriminate XLMR from non-XLMR genes. High rates (>80%) of correct classification were achieved, suggesting that the identification of these motifs effectively captures genomic signals associated with XLMR vs. non-XLMR genes. The computational tools developed for the motif-based LDA are integrated into the freely available genomic analysis portal Galaxy (http://main.g2.bx.psu.edu/). Nine genes (APLN, ZC4H2, MAGED4, MAGED4B, RAP2C, FAM156A, FAM156B, TBL1X, and UXT) were highlighted as highly ranked XLMR candidates. Conclusions The combination of gene annotation information and sequence motif-oriented computational candidate gene prediction methods highlights an added benefit in generating a list of plausible candidate genes, as has been demonstrated for XLMR. 
Reviewers: This article was reviewed by Dr Barbara Bardoni (nominated by Prof Juergen Brosius); Prof Neil Smalheiser and Dr Dustin Holloway (nominated by Prof Charles DeLisi).

2011-01-01

165

Annotation and comparative analysis of the glycoside hydrolase genes in Brachypodium distachyon  

PubMed Central

Background Glycoside hydrolases cleave the bond between a carbohydrate and another carbohydrate, a protein, lipid or other moiety. Genes encoding glycoside hydrolases are found in a wide range of organisms, from archaea to animals, and are relatively abundant in plant genomes. In plants, these enzymes are involved in diverse processes, including starch metabolism, defense, and cell-wall remodeling. Glycoside hydrolase genes have been previously cataloged for Oryza sativa (rice), the model dicotyledonous plant Arabidopsis thaliana, and the fast-growing tree Populus trichocarpa (poplar). To improve our understanding of glycoside hydrolases in plants generally and in grasses specifically, we annotated the glycoside hydrolase genes in the grasses Brachypodium distachyon (an emerging monocotyledonous model) and Sorghum bicolor (sorghum). We then compared the glycoside hydrolases across species, at the levels of the whole genome and individual glycoside hydrolase families. Results We identified 356 glycoside hydrolase genes in Brachypodium and 404 in sorghum. The corresponding proteins fell into the same 34 families that are represented in rice, Arabidopsis, and poplar, helping to define a glycoside hydrolase family profile which may be common to flowering plants. For several glycoside hydrolase families (GH5, GH13, GH18, GH19, GH28, and GH51), we present a detailed literature review together with an examination of the family structures. This analysis of individual families revealed both similarities and distinctions between monocots and eudicots, as well as between species. Shared evolutionary histories appear to be modified by lineage-specific expansions or deletions. Within GH families, the Brachypodium and sorghum proteins generally cluster with those from other monocots. Conclusions This work provides the foundation for further comparative and functional analyses of plant glycoside hydrolases. 
Defining the Brachypodium glycoside hydrolases sets the stage for Brachypodium to be a grass model for investigations of these enzymes and their diverse roles in planta. Insights gained from Brachypodium will inform translational research studies, with applications for the improvement of cereal crops and bioenergy grasses.

2010-01-01

166

A practical guide for the functional annotation of genetic variations using SNPnexus.  

PubMed

Broader functional annotation of known as well as putative genetic variations is a valuable means of prioritizing targets in disease studies and large-scale genotyping projects. In this article, we present a practical guide to SNPnexus, a web-based tool that provides an aggregate set of functional annotations for genomic variation data by characterizing related consequences at the transcriptome/proteome levels with in-depth analysis of potential deleterious effects, inferring physical and cytogenetic mapping, reporting related HapMap data, finding overlaps with potential regulatory, structural as well as conserved elements and retrieving links with previously reported genetic disease studies. We focus on the SNPnexus query system, its annotation categories and the biological interpretation of results. PMID:23395730

Dayem Ullah, Abu Z; Lemoine, Nicholas R; Chelala, Claude

2013-02-08

167

VAT: a computational framework to functionally annotate variants in personal genomes within a cloud-computing environment  

PubMed Central

Summary: The functional annotation of variants obtained through sequencing projects is generally assumed to be a simple intersection of genomic coordinates with genomic features. However, complexities arise for several reasons, including the differential effects of a variant on alternatively spliced transcripts, as well as the difficulty in assessing the impact of small insertions/deletions and large structural variants. Taking these factors into consideration, we developed the Variant Annotation Tool (VAT) to functionally annotate variants from multiple personal genomes at the transcript level as well as obtain summary statistics across genes and individuals. VAT also allows visualization of the effects of different variants, integrates allele frequencies and genotype data from the underlying individuals and facilitates comparative analysis between different groups of individuals. VAT can either be run through a command-line interface or as a web application. Finally, in order to enable on-demand access and to minimize unnecessary transfers of large data files, VAT can be run as a virtual machine in a cloud-computing environment. Availability and Implementation: VAT is implemented in C and PHP. The VAT web service, Amazon Machine Image, source code and detailed documentation are available at vat.gersteinlab.org. Contact: lukas.habegger@yale.edu or mark.gerstein@yale.edu Supplementary Information: Supplementary data are available at Bioinformatics online.

Habegger, Lukas; Balasubramanian, Suganthi; Chen, David Z.; Khurana, Ekta; Sboner, Andrea; Harmanci, Arif; Rozowsky, Joel; Clarke, Declan; Snyder, Michael; Gerstein, Mark

2012-01-01

168

Genome-wide metabolic (re-) annotation of Kluyveromyces lactis  

PubMed Central

Background Even before having its genome sequence published in 2004, Kluyveromyces lactis had long been considered a model organism for studies in genetics and physiology. Research on Kluyveromyces lactis is quite advanced and this yeast species is one of the few with which it is possible to perform formal genetic analysis. Nevertheless, until now, no complete metabolic functional annotation has been performed for the proteins encoded in the Kluyveromyces lactis genome. Results In this work, a new metabolic genome-wide functional re-annotation of the proteins encoded in the Kluyveromyces lactis genome was performed, resulting in the annotation of 1759 genes with metabolic functions, and the development of a methodology supported by merlin (software developed in-house). The new annotation includes novelties, such as the assignment of transporter superfamily numbers to genes identified as transporter proteins. Thus, the genes annotated with metabolic functions could be exclusively enzymatic (1410 genes), exclusively transporter-encoding (301 genes) or have both metabolic activities (48 genes). The new annotation produced by this work largely surpasses the currently available Kluyveromyces lactis annotations. A comparison with KEGG’s annotation revealed a match with 844 (~90%) of the genes annotated by KEGG, while adding 850 new gene annotations. Moreover, there are 32 genes with annotations different from KEGG. Conclusions The methodology developed throughout this work can be used to re-annotate any yeast or, with a small adjustment of the reference organism, the proteins encoded in any sequenced genome. The new annotation provided by this study offers basic knowledge which might be useful for the scientific community working on this model yeast, because new functions have been identified for the so-called metabolic genes. 
Furthermore, it served as the basis for the reconstruction of a compartmentalized, genome-scale metabolic model of Kluyveromyces lactis, which is currently being finished.

2012-01-01

169

Rapid Annotation of Anonymous Sequences from Genome Projects Using Semantic Similarities and a Weighting Scheme in Gene Ontology  

PubMed Central

Background Large-scale sequencing projects have now become routine lab practice and this has led to the development of a new generation of tools involving function prediction methods, bringing the latter back to the fore. The advent of Gene Ontology, with its structured vocabulary and paradigm, has provided computational biologists with an appropriate means for this task. Methodology We present here a novel method called ARGOT (Annotation Retrieval of Gene Ontology Terms) that is able to quickly process thousands of sequences for functional inference. The tool exploits for the first time an integrated approach which combines clustering of GO terms, based on their semantic similarities, with a weighting scheme which assesses retrieved hits sharing a certain number of biological features with the sequence to be annotated. These hits may be obtained by different methods and in this work we have based ARGOT processing on BLAST results. Conclusions The extensive benchmark involved 10,000 protein sequences, the complete S. cerevisiae genome and a small subset of proteins for purposes of comparison with other available tools. The algorithm was shown to outperform existing methods and to be suitable for function prediction of single proteins due to its high degree of sensitivity, specificity and coverage.

Fontana, Paolo; Cestaro, Alessandro; Velasco, Riccardo; Formentin, Elide; Toppo, Stefano

2009-01-01

170

Annotating novel genes by integrating synthetic lethals and genomic information  

Microsoft Academic Search

BACKGROUND: Large scale screening for synthetic lethality serves as a common tool in yeast genetics to systematically search for genes that play a role in specific biological processes. Often the amounts of data resulting from a single large scale screen far exceed the capacities of experimental characterization of every identified target. Thus, there is need for computational tools that select

Daniel Schöner; Markus Kalisch; Christian Leisner; Lukas Meier; Marc Sohrmann; Mahamadou Faty; Yves Barral; Matthias Peter; Wilhelm Gruissem; Peter Bühlmann

2008-01-01

171

GOblet: a platform for Gene Ontology annotation of anonymous sequence data  

Microsoft Academic Search

GOblet is a comprehensive web server application providing the annotation of anonymous sequence data with Gene Ontology (GO) terms. It uses a variety of different protein databases (human, murines, invertebrates, plants, sp-trembl) and their respective GO mappings. The user selects the appropriate database and alignment threshold and thereafter submits single or multiple nucleotide or protein sequences. Results are shown in different ways, e.g. as survey

Detlef Groth; Hans Lehrach; Steffen Hennig

2004-01-01

172

Assessing the impact of comparative genomic sequence data on the functional annotation of the Drosophila genome  

Microsoft Academic Search

Background: It is widely accepted that comparative sequence data can aid the functional annotation of genome sequences; however, the most informative species and features of genome evolution for comparison remain to be determined. Results: We analyzed conservation in eight genomic regions (apterous, even-skipped, fushi tarazu, twist, and Rhodopsins 1, 2, 3 and 4) from four Drosophila species (D. erecta, D.

Casey M Bergman; Barret D Pfeiffer; Diego E Rincón-Limas; Roger A Hoskins; Andreas Gnirke; Chris J Mungall; Adrienne M Wang; Brent Kronmiller; Joanne Pacleb; Soo Park; Mark Stapleton; Kenneth Wan; Reed A George; Pieter J de Jong; Juan Botas; Gerald M Rubin; Susan E Celniker

2002-01-01

173

ANNOTATION OF CADHERIN GENES IN THE TRIBOLIUM GENOME  

Technology Transfer Automated Retrieval System (TEKTRAN)

Cadherins represent a highly conserved family of proteins that mediate calcium-dependent cell-cell adhesion. Cadherins also function as receptors for the insecticidal proteins from Bacillus thuringiensis (Bt), and some examples of insect resistance to Bt have been attributed to disruptions in cadhe...

174

Identification and annotation of abiotic stress responsive candidate genes in peanut ESTs  

PubMed Central

Peanut (Arachis hypogaea L.) ranks fifth among the world oil crops and is widely grown in India and neighbouring countries. Due to its large and unknown genome size, studies on genomics and genetic modification of peanut are still scanty compared to other model crops like Arabidopsis, rice, cotton and soybean. Because peanut is well suited to cultivation in semi-arid regions, the study of abiotic stress-responsive genes and their regulation in peanut is very important. Therefore, we aim to identify and annotate the abiotic stress-responsive candidate genes in peanut ESTs. Expression data of drought stress-responsive genes and the corresponding EST sequences were screened from dot blot experiments shown as heat maps and supplementary tables, respectively, as reported by Govind et al. (2009). For some of the screened genes, which had no information about their ESTs in the above-mentioned supplementary tables, sequences were retrieved from NCBI. A phylogenetic analysis was performed to find a group of the most similar ESTs for each selected gene. Individual ESTs of each group were further searched in peanut ESTs (178,490 whole EST sequences) using standalone BLAST. For the prediction as well as annotation of the selected abiotic stress-responsive genes, various tools (such as Vec-Screen, Repeat Masker, EST-Trimmer, DNA Baser, WISE2 and I-TASSER) were used. Here we report the predicted contigs, domains and 3D structures for HSP 17.3KDa protein, DnaJ protein and Type 2 Metallothionein protein.

Kumari, Archana; Kumar, Ashutosh; Wany, Aakanksha; Prajapati, Gopal Kumar; Pandey, Dev Mani

2012-01-01

175

Identification and annotation of abiotic stress responsive candidate genes in peanut ESTs.  

PubMed

Peanut (Arachis hypogaea L.) ranks fifth among the world oil crops and is widely grown in India and neighbouring countries. Owing to its large genome of unknown size, studies on the genomics and genetic modification of peanut remain scarce compared with model crops such as Arabidopsis, rice, cotton and soybean. Because peanut is favourably cultivated in semi-arid regions, the study of abiotic stress responsive genes and their regulation in peanut is very important. Therefore, we aim to identify and annotate the abiotic stress responsive candidate genes in peanut ESTs. Expression data of drought stress responsive genes and the corresponding EST sequences were screened from dot blot experiments shown as heat maps and supplementary tables, respectively, as reported by Govind et al. (2009). Screened genes with no EST information in those supplementary tables were retrieved from NCBI. A phylogenetic analysis was performed to find a group of the most similar ESTs for each selected gene. Individual ESTs of each group were then searched against peanut ESTs (178,490 whole EST sequences) using standalone BLAST. For the prediction and annotation of the selected abiotic stress responsive genes, various tools (such as VecScreen, RepeatMasker, EST-Trimmer, DNA Baser, WISE2 and I-TASSER) were used. Here we report the predicted contigs, domains and 3D structures for the HSP 17.3 kDa protein, DnaJ protein and Type 2 Metallothionein protein. PMID:23275722

Kumari, Archana; Kumar, Ashutosh; Wany, Aakanksha; Prajapati, Gopal Kumar; Pandey, Dev Mani

2012-12-08

176

Linking enzyme sequence to function using conserved property difference locator to identify and annotate positions likely to control specific functionality  

Microsoft Academic Search

BACKGROUND: Families of homologous enzymes evolved from common progenitors. The availability of multiple sequences representing each activity presents an opportunity for extracting information specifying the functionality of individual homologs. We present a straightforward method for the identification of residues likely to determine class specific functionality in which multiple sequence alignments are converted to an annotated graphical form by the Conserved

Kimberly M. Mayer; Sean R. Mccorkle; John Shanklin

2005-01-01

177

DISCLOSE : DISsection of CLusters Obtained by SEries of transcriptome data using functional annotations and putative transcription factor binding sites  

PubMed Central

Background: A typical step in the analysis of gene expression data is the determination of clusters of genes that exhibit similar expression patterns. Researchers are confronted with the seemingly arbitrary choice between numerous algorithms to perform cluster analysis. Results: We developed an exploratory application that benchmarks the results of clustering methods using functional annotations. In addition, a de novo DNA motif discovery algorithm is integrated in our program, which identifies overrepresented DNA binding sites in the upstream DNA sequences of genes from the clusters that are indicative of sites of transcriptional control. The performance of our program was evaluated by comparing the original results of a time course experiment with the findings of our application. Conclusion: DISCLOSE assists researchers in the prokaryotic research community in systematically evaluating the results of applying a range of clustering algorithms to transcriptome data. Different performance measures make it possible to determine quickly and comprehensively the best-suited clustering approach for a given dataset.

Blom, Evert-Jan; van Hijum, Sacha AFT; Hofstede, Klaas J; Silvis, Remko; Roerdink, Jos BTM; Kuipers, Oscar P

2008-01-01

178

Annotation Transfer for Genomics: Measuring Functional Divergence in Multi-Domain Proteins  

PubMed Central

Annotation transfer is a principal process in genome annotation. It involves “transferring” structural and functional annotation to uncharacterized open reading frames (ORFs) in a newly completed genome from experimentally characterized proteins similar in sequence. To prevent errors in genome annotation, it is important that this process be robust and statistically well-characterized, especially with regard to how it depends on the degree of sequence similarity. Previously, we and others have analyzed annotation transfer in single-domain proteins. Multi-domain proteins, which make up the bulk of the ORFs in eukaryotic genomes, present more complex issues in functional conservation. Here we present a large-scale survey of annotation transfer in these proteins, using SCOP superfamilies to define domain folds and a thesaurus based on SWISS-PROT keywords to define functional categories. Our survey reveals that multi-domain proteins have significantly less functional conservation than single-domain ones, except when they share the exact same combination of domain folds. In particular, we find that for multi-domain proteins, approximate function can be accurately transferred with only 35% certainty for pairs of proteins sharing one structural superfamily. In contrast, this value is 67% for pairs of single-domain proteins sharing the same structural superfamily. On the other hand, if two multi-domain proteins contain the same combination of two structural superfamilies, the probability of their sharing the same function increases to 80%; in the case of complete coverage along the full length of both proteins, this value increases further to >90%. Moreover, we found that only 70 of the current total of 455 structural superfamilies are found in both single- and multi-domain proteins, and only 14 of these were associated with the same function in both categories of proteins. 
We also investigated the degree to which function could be transferred between pairs of multi-domain proteins with respect to the degree of sequence similarity between them, finding that functional divergence at a given amount of sequence similarity is always about two-fold greater for pairs of multi-domain proteins (sharing similarity over a single domain) in comparison to pairs of single-domain ones, though the overall shape of the relationship is quite similar. Further information is available at http://partslist.org/func or http://bioinfo.mbb.yale.edu/partslist/func.

Hegyi, Hedi; Gerstein, Mark

2001-01-01
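
The functional-conservation percentages quoted in the abstract above are, in essence, the fraction of protein pairs in each category that share a function. A minimal Python sketch of that bookkeeping follows. The function name `conservation_rate`, the category labels, and the toy pairs are hypothetical and chosen only for illustration; they are not the authors' data or code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def conservation_rate(pairs: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return, per category, the fraction of protein pairs sharing a function.

    Each element of `pairs` is (category, same_function), where category is a
    free-text label such as "single-domain" (hypothetical input layout).
    """
    totals = defaultdict(int)     # pairs seen per category
    conserved = defaultdict(int)  # pairs with the same function per category
    for category, same_function in pairs:
        totals[category] += 1
        if same_function:
            conserved[category] += 1
    return {cat: conserved[cat] / totals[cat] for cat in totals}


# Toy pairs, not values from the survey.
toy_pairs = [
    ("single-domain", True),
    ("single-domain", True),
    ("single-domain", False),
    ("single-domain", True),
    ("multi-domain, one shared superfamily", True),
    ("multi-domain, one shared superfamily", False),
    ("multi-domain, one shared superfamily", False),
    ("multi-domain, one shared superfamily", False),
]
rates = conservation_rate(toy_pairs)
for category, rate in rates.items():
    print(f"{category}: {rate:.2f}")
```

With real data, each pair would additionally carry a sequence-similarity value so that the rate could be computed per similarity bin, which is the relationship the abstract describes.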

179

The proteome of Toxoplasma gondii: integration with the genome provides novel insights into gene expression and annotation  

PubMed Central

Background: Although the genomes of many of the most important human and animal pathogens have now been sequenced, our understanding of the actual proteins expressed by these genomes and how well they predict protein sequence and expression is still deficient. We have used three complementary approaches (two-dimensional electrophoresis, gel-liquid chromatography linked tandem mass spectrometry and MudPIT) to analyze the proteome of Toxoplasma gondii, a parasite of medical and veterinary significance, and have developed a public repository for these data within ToxoDB, making for the first time proteomics data an integral part of this key genome resource. Results: The draft genome for Toxoplasma predicts around 8,000 genes with varying degrees of confidence. Our data demonstrate how proteomics can inform these predictions and help discover new genes. We have identified nearly one-third (2,252) of all the predicted proteins, with 2,477 intron-spanning peptides providing supporting evidence for correct splice site annotation. Functional predictions for each protein and key pathways were determined from the proteome. Importantly, we show evidence for many proteins that match alternative gene models, or previously unpredicted genes. For example, approximately 15% of peptides matched more convincingly to alternative gene models. We also compared our data with existing transcriptional data in which we highlight apparent discrepancies between gene transcription and protein expression. Conclusion: Our data demonstrate the importance of protein data in expression profiling experiments and highlight the necessity of integrating proteomic with genomic data so that iterative refinements of both annotation and expression models are possible.

Xia, Dong; Sanderson, Sanya J; Jones, Andrew R; Prieto, Judith H; Yates, John R; Bromley, Elizabeth; Tomley, Fiona M; Lal, Kalpana; Sinden, Robert E; Brunk, Brian P; Roos, David S; Wastling, Jonathan M

2008-01-01

180

PIPA: A High-Throughput Pipeline for Protein Function Annotation.  

National Technical Information Service (NTIS)

Traditional experimental methods to determine the functions of proteins encoded in genomic sequences cannot keep pace with the avalanche of sequence data produced by new high-throughput sequencing technologies. This prompted the development of numerous bi...

C. Yu; J. Reifman; N. Zavaljevski; V. Desai

2008-01-01

181

A nuclear magnetic resonance based approach to accurate functional annotation of putative enzymes in the methanogen Methanosarcina acetivorans  

Microsoft Academic Search

Background: Correct annotation of function is essential if one is to take full advantage of the vast amounts of genomic sequence data. The accuracy of sequence-based functional annotations is often variable, particularly if the sequence homology to a known function is low. Indeed recent work has shown that even proteins with very high sequence identity can have different folds and functions,

Yihong Chen; Ethel Apolinario; Libuse Brachova; Zvi Kelman; Zhuo Li; Basil J Nikolau; Lucas Showman; Kevin Sowers; John Orban

2011-01-01

182

Developmental gene discovery in a hemimetabolous insect: de novo assembly and annotation of a transcriptome for the cricket Gryllus bimaculatus.  

PubMed

Most genomic resources available for insects represent the Holometabola, which are insects that undergo complete metamorphosis like beetles and flies. In contrast, the Hemimetabola (direct developing insects), representing the basal branches of the insect tree, have very few genomic resources. We have therefore created a large and publicly available transcriptome for the hemimetabolous insect Gryllus bimaculatus (cricket), a well-developed laboratory model organism whose potential for functional genetic experiments is currently limited by the absence of genomic resources. cDNA was prepared using mRNA obtained from adult ovaries containing all stages of oogenesis, and from embryo samples on each day of embryogenesis. Using 454 Titanium pyrosequencing, we sequenced over four million raw reads, and assembled them into 21,512 isotigs (predicted transcripts) and 120,805 singletons with an average coverage per base pair of 51.3. We annotated the transcriptome manually for over 400 conserved genes involved in embryonic patterning, gametogenesis, and signaling pathways. BLAST comparison of the transcriptome against the NCBI non-redundant protein database (nr) identified significant similarity to nr sequences for 55.5% of transcriptome sequences, and suggested that the transcriptome may contain 19,874 unique transcripts. For predicted transcripts without significant similarity to known sequences, we assessed their similarity to other orthopteran sequences, and determined that these transcripts contain recognizable protein domains, largely of unknown function. We created a searchable, web-based database to allow public access to all raw, assembled and annotated data. This database is to our knowledge the largest de novo assembled and annotated transcriptome resource available for any hemimetabolous insect. We therefore anticipate that these data will contribute significantly to more effective and higher-throughput deployment of molecular analysis tools in Gryllus. 
PMID:23671567

Zeng, Victor; Ewen-Campen, Ben; Horch, Hadley W; Roth, Siegfried; Mito, Taro; Extavour, Cassandra G

2013-05-06

183

Integrative Annotation of 21,037 Human Genes Validated by Full-Length cDNA Clones  

SciTech Connect

The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult task, as the varieties of transcripts of a gene may vary to a great extent. We thus performed an exhaustive integrative characterization of 41,118 full-length cDNAs that capture the gene transcripts as complete functional cassettes, providing an unequivocal report of structural and functional diversity at the gene level. Our international collaboration has validated 21,037 human gene candidates by analysis of high-quality full-length cDNA clones through curation using unified criteria. This led to the identification of 5,155 new gene candidates. It also manifested the most reliable way to control the quality of the cDNA clones. We have developed a human gene database, called the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/). It provides the following: integrative annotation of human genes, description of gene structures, details of novel alternative splicing isoforms, non-protein-coding RNAs, functional domains, subcellular localizations, metabolic pathways, predictions of protein three-dimensional structure, mapping of known single nucleotide polymorphisms (SNPs), identification of polymorphic microsatellite repeats within human genes, and comparative results with mouse full-length cDNAs. The H-InvDB analysis has shown that up to 4 percent of the human genome sequence (National Center for Biotechnology Information build 34 assembly) may contain misassembled or missing regions. 
We found that 6.5 percent of the human gene candidates (1,377 loci) did not have a good protein-coding open reading frame, of which 296 loci are strong candidates for non-protein-coding RNA genes. In addition, among 72,027 uniquely mapped SNPs and insertions/deletions localized within human genes, 13,215 nonsynonymous SNPs, 315 nonsense SNPs, and 452 indels occurred in coding regions. Together with 25 polymorphic microsatellite repeats present in coding regions, they may alter protein structure, causing phenotypic effects or resulting in disease. The H-InvDB platform represents a substantial contribution to resources needed for the exploration of human biology and pathology.

Imanishi, Tadashi; Itoh, Takeshi; Suzuki, Yutaka; O'Donovan, Claire; Fukuchi, Satoshi; Koyanagi, Kanako O.; Barrero, Roberto A.; Tamura, Takuro; Yamaguchi-Kabata, Yumi; Tanino, Motohiko; Yura, Kei; Miyazaki, Satoru; Ikeo, Kazuho; Homma, Keiichi; Kasprzyk, Arek; Nishikawa, Tetsuo; Hirakawa, Mika; Thierry-Mieg, Jean; Thierry-Mieg, Danielle; Ashurst, Jennifer; Jia, Libin; Nakao, Mitsuteru; Thomas, Michael A.; Mulder, Nicola; Karavidopoulou, Youla; Jin, Lihua; Kim, Sangsoo; Yasuda, Tomohiro; Lenhard, Boris; Eveno, Eric; Suzuki, Yoshiyuki; Yamasaki, Chisato; Takeda, Jun-ichi; Gough, Craig; Hilton, Phillip; Fujii, Yasuyuki; Sakai, Hiroaki; Tanaka, Susumu; Amid, Clara; Bellgard, Matthew; de Fatima Bonaldo, Maria; Bono, Hidemasa; Bromberg, Susan K.; Brookes, Anthony J.; Bruford, Elspeth; Carninci, Piero; Chelala, Claude; Couillault, Christine; de Souza, Sandro J.; Debily, Marie-Anne; Devignes, Marie-Dominique; Dubchak, Inna; Endo, Toshinori; Estreicher, Anne; Eyras, Eduardo; Fukami-Kobayashi, Kaoru; Gopinath, Gopal R.; Graudens, Esther; Hahn, Yoonsoo; Han, Michael; Han, Ze-Guang; Hanada, Kousuke; Hanaoka, Hideki; Harada, Erimi; Hashimoto, Katsuyuki; Hinz, Ursula; Hirai, Momoki; Hishiki, Teruyoshi; Hopkinson, Ian; Imbeaud, Sandrine; Inoko, Hidetoshi; Kanapin, Alexander; Kaneko, Yayoi; Kasukawa, Takeya; Kelso, Janet; Kersey, Paul; Kikuno, Reiko; Kimura, Kouichi; Korn, Bernhard; Kuryshev, Vladimir; Makalowska, Izabela; Makino, Takashi; Mano, Shuhei; Mariage-Samson, Regine; Mashima, Jun; Matsuda, Hideo; Mewes, Hans-Werner; Minoshima, Shinsei; Nagai, Keiichi; Nagasaki, Hideki; Nagata, Naoki; Nigam, Rajni; Ogasawara, Osamu; Ohara, Osamu; Ohtsubo, Masafumi; Okada, Norihiro; Okido, Toshihisa; Oota, Satoshi; Ota, Motonori; Ota, Toshio; Otsuki, Tetsuji; Piatier-Tonneau, Dominique; Poustka, Annemarie; Ren, Shuang-Xi; Saitou, Naruya; Sakai, Katsunaga; Sakamoto, Shigetaka; Sakate, Ryuichi; Schupp, Ingo; Servant, Florence; Sherry, Stephen; Shiba, Rie; et al.

2004-01-15

184

Integrative Annotation of 21,037 Human Genes Validated by Full-Length cDNA Clones  

PubMed Central

The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult task, as the varieties of transcripts of a gene may vary to a great extent. We thus performed an exhaustive integrative characterization of 41,118 full-length cDNAs that capture the gene transcripts as complete functional cassettes, providing an unequivocal report of structural and functional diversity at the gene level. Our international collaboration has validated 21,037 human gene candidates by analysis of high-quality full-length cDNA clones through curation using unified criteria. This led to the identification of 5,155 new gene candidates. It also manifested the most reliable way to control the quality of the cDNA clones. We have developed a human gene database, called the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/). It provides the following: integrative annotation of human genes, description of gene structures, details of novel alternative splicing isoforms, non-protein-coding RNAs, functional domains, subcellular localizations, metabolic pathways, predictions of protein three-dimensional structure, mapping of known single nucleotide polymorphisms (SNPs), identification of polymorphic microsatellite repeats within human genes, and comparative results with mouse full-length cDNAs. The H-InvDB analysis has shown that up to 4% of the human genome sequence (National Center for Biotechnology Information build 34 assembly) may contain misassembled or missing regions. We found that 6.5% of the human gene candidates (1,377 loci) did not have a good protein-coding open reading frame, of which 296 loci are strong candidates for non-protein-coding RNA genes. 
In addition, among 72,027 uniquely mapped SNPs and insertions/deletions localized within human genes, 13,215 nonsynonymous SNPs, 315 nonsense SNPs, and 452 indels occurred in coding regions. Together with 25 polymorphic microsatellite repeats present in coding regions, they may alter protein structure, causing phenotypic effects or resulting in disease. The H-InvDB platform represents a substantial contribution to resources needed for the exploration of human biology and pathology.

2004-01-01

185

Involving Undergraduates in the Annotation and Analysis of Global Gene Expression Studies: Creation of a Maize Shoot Apical Meristem Expression Database  

PubMed Central

Through a multi-university and interdisciplinary project we have involved undergraduate biology and computer science research students in the functional annotation of maize genes and the analysis of their microarray expression patterns. We have created a database to house the results of our functional annotation of >4400 genes identified as being differentially regulated in the maize shoot apical meristem (SAM). This database is located at http://sam.truman.edu and is now available for public use. The undergraduate students involved in constructing this unique SAM database received hands-on training in an intellectually challenging environment, which has prepared them for graduate and professional careers in biological sciences. We describe our experiences with this project as a model for effective research-based teaching of undergraduate biology and computer science students, as well as for a rich professional development experience for faculty at predominantly undergraduate institutions.

Buckner, Brent; Beck, Jon; Browning, Kate; Fritz, Ashleigh; Grantham, Lisa; Hoxha, Eneda; Kamvar, Zhian; Lough, Ashley; Nikolova, Olga; Schnable, Patrick S.; Scanlon, Michael J.; Janick-Buckner, Diane

2007-01-01

186

Generation, analysis and functional annotation of expressed sequence tags from the ectoparasitic mite Psoroptes ovis  

PubMed Central

Background: Sheep scab is caused by Psoroptes ovis and is arguably the most important ectoparasitic disease affecting sheep in the UK. The disease is highly contagious, causes considerable pruritus and irritation, and is therefore a major welfare concern. Current methods of treatment are unsustainable, and in order to elucidate novel methods of disease control a more comprehensive understanding of the parasite is required. To date, no full genomic DNA sequence or large scale transcript datasets are available, and prior to this study only 484 P. ovis expressed sequence tags (ESTs) were accessible in public databases. Results: In order to further expand upon the transcriptomic coverage of P. ovis, thus facilitating novel insights into the mite biology, we undertook a larger scale EST approach, incorporating newly generated and previously described P. ovis transcript data and representing the largest collection of P. ovis ESTs to date. We sequenced 1,574 ESTs and assembled these along with 484 previously generated P. ovis ESTs, which resulted in the identification of 1,545 unique P. ovis sequences. BLASTX searches identified 961 ESTs with significant hits (E-value < 1E-04) and 584 novel P. ovis ESTs. Gene Ontology (GO) analysis allowed the functional annotation of 880 ESTs and included predictions of signal peptide and transmembrane domains, allowing the identification of potential P. ovis excreted/secreted factors and the mapping of metabolic pathways. Conclusions: This dataset currently represents the largest collection of P. ovis ESTs, all of which are publicly available in the GenBank EST database (dbEST) (accession numbers FR748230 - FR749648). Functional analysis of this dataset identified important homologues, including house dust mite allergens and tick salivary factors. These findings offer new insights into the underlying biology of P. ovis, facilitating further investigations into mite biology and the identification of novel methods of intervention.

2011-01-01
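
As a rough illustration of the split described in the abstract above (sequences with a significant BLASTX hit at E-value < 1E-04 versus sequences without one), the following Python sketch partitions sequence IDs by their best E-value. The function name `split_by_evalue`, the input layout (a mapping from sequence ID to best E-value, or `None` when the search returned nothing), and the toy values are assumptions made for illustration, not the authors' pipeline.

```python
from typing import Dict, List, Optional, Tuple

E_VALUE_CUTOFF = 1e-4  # threshold quoted in the abstract (E-value < 1E-04)


def split_by_evalue(
    best_evalue: Dict[str, Optional[float]],
    cutoff: float = E_VALUE_CUTOFF,
) -> Tuple[List[str], List[str]]:
    """Partition sequence IDs into (significant, novel) by best-hit E-value.

    `best_evalue` maps a sequence ID to the lowest E-value among its hits,
    or None when the search returned no hit at all (hypothetical layout).
    """
    significant: List[str] = []
    novel: List[str] = []
    for seq_id, evalue in best_evalue.items():
        # Strict inequality, matching "E-value < 1E-04" in the abstract.
        if evalue is not None and evalue < cutoff:
            significant.append(seq_id)
        else:
            novel.append(seq_id)
    return significant, novel


# Toy input, not data from the study.
toy = {"est_001": 3e-40, "est_002": 0.5, "est_003": None, "est_004": 1e-4}
hits, no_hits = split_by_evalue(toy)
print(hits)     # ['est_001']
print(no_hits)  # ['est_002', 'est_003', 'est_004']
```

The two groups are disjoint and exhaustive, which is consistent with the counts in the abstract: 961 sequences with significant hits plus 584 novel sequences gives the 1,545 unique sequences reported.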

187

Genome-scale phylogenetic function annotation of large and diverse protein families.  

PubMed

The Statistical Inference of Function Through Evolutionary Relationships (SIFTER) framework uses a statistical graphical model that applies phylogenetic principles to automate precise protein function prediction. Here we present a revised approach (SIFTER version 2.0) that enables annotations on a genomic scale. SIFTER 2.0 produces equivalently precise predictions compared to the earlier version on a carefully studied family and on a collection of 100 protein families. We have added an approximation method to SIFTER 2.0 and show a 500-fold improvement in speed with minimal impact on prediction results in the functionally diverse sulfotransferase protein family. On the Nudix protein family, previously inaccessible to the SIFTER framework because of the 66 possible molecular functions, SIFTER achieved 47.4% accuracy on experimental data (where BLAST achieved 34.0%). Finally, we used SIFTER to annotate all of the Schizosaccharomyces pombe proteins with experimental functional characterizations, based on annotations from proteins in 46 fungal genomes. SIFTER precisely predicted molecular function for 45.5% of the characterized proteins in this genome, as compared with four current function prediction methods that precisely predicted function for 62.6%, 30.6%, 6.0%, and 5.7% of these proteins. We use both precision-recall curves and ROC analyses to compare these genome-scale predictions across the different methods and to assess performance on different types of applications. SIFTER 2.0 is capable of predicting protein molecular function for large and functionally diverse protein families using an approximate statistical model, enabling phylogenetics-based protein function prediction for genome-wide analyses. The code for SIFTER and protein family data are available at http://sifter.berkeley.edu. PMID:21784873

Engelhardt, Barbara E; Jordan, Michael I; Srouji, John R; Brenner, Steven E

2011-07-22

188

Genome-scale phylogenetic function annotation of large and diverse protein families  

PubMed Central

The Statistical Inference of Function Through Evolutionary Relationships (SIFTER) framework uses a statistical graphical model that applies phylogenetic principles to automate precise protein function prediction. Here we present a revised approach (SIFTER version 2.0) that enables annotations on a genomic scale. SIFTER 2.0 produces equivalently precise predictions compared to the earlier version on a carefully studied family and on a collection of 100 protein families. We have added an approximation method to SIFTER 2.0 and show a 500-fold improvement in speed with minimal impact on prediction results in the functionally diverse sulfotransferase protein family. On the Nudix protein family, previously inaccessible to the SIFTER framework because of the 66 possible molecular functions, SIFTER achieved 47.4% accuracy on experimental data (where BLAST achieved 34.0%). Finally, we used SIFTER to annotate all of the Schizosaccharomyces pombe proteins with experimental functional characterizations, based on annotations from proteins in 46 fungal genomes. SIFTER precisely predicted molecular function for 45.5% of the characterized proteins in this genome, as compared with four current function prediction methods that precisely predicted function for 62.6%, 30.6%, 6.0%, and 5.7% of these proteins. We use both precision-recall curves and ROC analyses to compare these genome-scale predictions across the different methods and to assess performance on different types of applications. SIFTER 2.0 is capable of predicting protein molecular function for large and functionally diverse protein families using an approximate statistical model, enabling phylogenetics-based protein function prediction for genome-wide analyses. The code for SIFTER and protein family data are available at http://sifter.berkeley.edu.

Engelhardt, Barbara E.; Jordan, Michael I.; Srouji, John R.; Brenner, Steven E.

2011-01-01

189

NuChart: An R Package to Study Gene Spatial Neighbourhoods with Multi-Omics Annotations.  

PubMed

Long-range chromosomal associations between genomic regions, and their repositioning in the 3D space of the nucleus, are now considered to be key contributors to the regulation of gene expression and important links have been highlighted with other genomic features involved in DNA rearrangements. Recent Chromosome Conformation Capture (3C) measurements performed with high throughput sequencing (Hi-C) and molecular dynamics studies show that there is a large correlation between colocalization and coregulation of genes, but this important research is hampered by the lack of biologist-friendly analysis and visualisation software. Here, we describe NuChart, an R package that allows the user to annotate and statistically analyse a list of input genes with information relying on Hi-C data, integrating knowledge about genomic features that are involved in the chromosome spatial organization. NuChart works directly with sequenced reads to identify the related Hi-C fragments, with the aim of creating gene-centric neighbourhood graphs on which multi-omics features can be mapped. Predictions about CTCF binding sites, isochores and cryptic Recombination Signal Sequences are provided directly with the package for mapping, although other annotation data in bed format can be used (such as methylation profiles and histone patterns). Gene expression data can be automatically retrieved and processed from the Gene Expression Omnibus and ArrayExpress repositories to highlight the expression profile of genes in the identified neighbourhood. Moreover, statistical inferences about the graph structure and correlations between its topology and multi-omics features can be performed using Exponential-family Random Graph Models. The Hi-C fragment visualisation provided by NuChart allows the comparison of cells in different conditions, thus providing the possibility of identifying novel biomarkers. 
NuChart is compliant with the Bioconductor standard and it is freely available at ftp://fileserver.itb.cnr.it/nuchart. PMID:24069388

Merelli, Ivan; Liò, Pietro; Milanesi, Luciano

2013-09-19

190

NuChart: An R Package to Study Gene Spatial Neighbourhoods with Multi-Omics Annotations  

PubMed Central

Long-range chromosomal associations between genomic regions, and their repositioning in the 3D space of the nucleus, are now considered to be key contributors to the regulation of gene expression and important links have been highlighted with other genomic features involved in DNA rearrangements. Recent Chromosome Conformation Capture (3C) measurements performed with high throughput sequencing (Hi-C) and molecular dynamics studies show that there is a large correlation between colocalization and coregulation of genes, but this important research is hampered by the lack of biologist-friendly analysis and visualisation software. Here, we describe NuChart, an R package that allows the user to annotate and statistically analyse a list of input genes with information relying on Hi-C data, integrating knowledge about genomic features that are involved in the chromosome spatial organization. NuChart works directly with sequenced reads to identify the related Hi-C fragments, with the aim of creating gene-centric neighbourhood graphs on which multi-omics features can be mapped. Predictions about CTCF binding sites, isochores and cryptic Recombination Signal Sequences are provided directly with the package for mapping, although other annotation data in bed format can be used (such as methylation profiles and histone patterns). Gene expression data can be automatically retrieved and processed from the Gene Expression Omnibus and ArrayExpress repositories to highlight the expression profile of genes in the identified neighbourhood. Moreover, statistical inferences about the graph structure and correlations between its topology and multi-omics features can be performed using Exponential-family Random Graph Models. The Hi-C fragment visualisation provided by NuChart allows the comparison of cells in different conditions, thus providing the possibility of identifying novel biomarkers. 
NuChart is compliant with the Bioconductor standard and it is freely available at ftp://fileserver.itb.cnr.it/nuchart.

Merelli, Ivan; Lio, Pietro; Milanesi, Luciano

2013-01-01

191

Introduction to the Proceedings of the Avian Genomics and Gene Ontology Annotation Workshop  

PubMed Central

The Avian Genomics Conference and Gene Ontology Annotation Workshop brought together researchers and students from around the world to present their latest research addressing the delivery of value from the billions of base-pairs of Archosaur sequence that have become available in the last few years. This editorial describes the conference itself and introduces the ten peer-reviewed manuscripts accepted for publication in the proceedings. These manuscripts address issues ranging from the poultry industry view of USDA genomics policy to the genomics of a wide variety of Archosaur species including chicken, duck, alligator, and condors and their pathogens.

2009-01-01

192

Automatic annotation techniques for gene expression images of the fruit fly embryo  

NASA Astrophysics Data System (ADS)

We present an application of image analysis techniques to automatically annotate biological images depicting gene expression patterns in developing embryos of fruit fly (Drosophila melanogaster), a model organism to study gene interaction. The aim is to determine the view (lateral versus dorsal/ventral [non-lateral]), orientation (anterior-left or anterior-right), and the developmental stage of the embryo. We employed contour curvature analysis, symmetry of the gene expression patterns, and shape differences at the anterior and posterior ends of the embryo, among others, for these purposes. An analysis of a pilot database of 3500 images indicates that view was correctly identified in 62%, orientation in 85%, and developmental stage in 73% of the images. We observed that correct inferences had better separation in feature space than incorrect inferences. This means that, although these methods do not exhibit very high classification accuracy, they could be employed to identify images which need manual intervention, thereby reducing the target set for biologists. The novelty in this work is in the integration of well-established image analysis with the biological knowledge for annotating the embryos. Our examinations show that features that provide discrimination ability among different views, different orientations, and different developmental stages are often restricted to certain regions of the embryo, which agrees with the longstanding knowledge in the developmental biological community.

Gargesha, Madhusudhana; Yang, Jian; Van Emden, Bernard; Panchanathan, Sethuraman; Kumar, Sudhir

2005-07-01

193

Functional Annotation Analytics of Bacillus Genomes Reveals Stress Responsive Acetate Utilization and Sulfate Uptake in the Biotechnologically Relevant Bacillus megaterium  

PubMed Central

Bacillus species form a heterogeneous group of Gram-positive bacteria that include members that are disease-causing, biotechnologically-relevant, and can serve as biological research tools. A common feature of Bacillus species is their ability to survive in harsh environmental conditions by formation of resistant endospores. Genes encoding the universal stress protein (USP) domain confer cellular and organismal survival during unfavorable conditions such as nutrient depletion. As of February 2012, the genome sequences and a variety of functional annotations for at least 123 Bacillus isolates including 45 Bacillus cereus isolates were available in public domain bioinformatics resources. Additionally, the genome sequencing status of 10 of the B. cereus isolates was annotated as finished, with each genome encoding 3 USP genes. The conservation of gene neighborhood of the 140 aa universal stress protein in the B. cereus genomes led to the identification of a predicted plasmid-encoded transcriptional unit that includes a USP gene and a sulfate uptake gene in the soil-inhabiting Bacillus megaterium. Gene neighborhood analysis combined with visual analytics of chemical ligand binding sites data provided knowledge-building biological insights on possible cellular functions of B. megaterium universal stress proteins. These functions include sulfate and potassium uptake, acid extrusion, cellular energy-level sensing, survival in high oxygen conditions and acetate utilization. Of particular interest was a two-gene transcriptional unit that consisted of genes for a universal stress protein and a sirtuin Sir2 (deacetylase enzyme for NAD+-dependent acetate utilization). The predicted transcriptional units for stress responsive inorganic sulfate uptake and acetate utilization could explain biological mechanisms for survival of soil-inhabiting Bacillus species in sulfate and acetate limiting conditions.
Considering the key role of sirtuins in mammalian physiology, additional research on the USP-Sir2 transcriptional unit of B. megaterium could help explain mammalian acetate metabolism in glucose-limiting conditions such as caloric restriction. Finally, the deep-rooted position of B. megaterium in the phylogeny of Bacillus species makes the investigation of the functional coupling of acetate utilization and stress response compelling.

Williams, Baraka S.; Isokpehi, Raphael D.; Mbah, Andreas N.; Hollman, Antoinesha L.; Bernard, Christina O.; Simmons, Shaneka S.; Ayensu, Wellington K.; Garner, Bianca L.

2012-01-01

194

Functional annotation of risk loci identified through genome-wide association studies for prostate cancer  

PubMed Central

Background The majority of established prostate cancer risk-associated Single Nucleotide Polymorphisms (SNPs) identified from genome-wide association studies do not fall into protein coding regions. Therefore, the mechanisms by which these SNPs affect prostate cancer risk remain unclear. Here, we used a series of bioinformatic tools and databases to provide possible molecular insights into the actions of risk SNPs. Methodology/Principal Findings We performed a comprehensive assessment of the potential functional impact of 33 SNPs that were identified and confirmed as associated with PCa risk in previous studies. For these 33 SNPs and additional SNPs in Linkage Disequilibrium (LD) (r² ≥ 0.5), we first mapped them to genomic functional annotation databases, including the Encyclopedia of DNA Elements (ENCODE), eleven genomic regulatory elements databases defined by the University of California Santa Cruz (UCSC) table browser, and Androgen Receptor (AR) binding sites defined by a ChIP-chip technique. Enrichment analysis was then carried out to assess whether the risk SNP blocks were enriched in the various annotation sets. Risk SNP blocks were significantly enriched over that expected by chance in two annotation sets, including AR binding sites (p=0.003), and FoxA1 binding sites (p=0.05). About one third of the 33 risk SNP blocks are located within AR binding regions. Conclusions/Significance The significant enrichment of risk SNPs in AR binding sites may suggest a potential molecular mechanism for these SNPs in prostate cancer initiation, and provide guidance for future functional studies.
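
The abstract does not say which statistic was used for the enrichment analysis. As one common possibility, the sketch below computes a hypergeometric upper-tail probability: the chance of seeing at least the observed number of annotated blocks among those drawn. The function name and the background counts in the usage example are hypothetical; only the 33 blocks and "about one third" come from the abstract.

```python
from math import comb

def hypergeom_upper_tail(population, annotated, drawn, observed):
    """P(X >= observed) when `drawn` items are sampled without replacement
    from `population` items, of which `annotated` carry the annotation."""
    total = comb(population, drawn)
    upper = min(annotated, drawn)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, drawn - k)
        for k in range(observed, upper + 1)
    )
    return tail / total

# Hypothetical background: 1000 candidate blocks, 150 overlapping the annotation.
# 33 risk blocks with 11 overlaps mirrors "about one third of the 33 blocks".
p_value = hypergeom_upper_tail(population=1000, annotated=150, drawn=33, observed=11)
print(f"enrichment p-value under these assumed counts: {p_value:.4g}")
```

A permutation test against matched random SNP sets would be an equally plausible design; the choice depends on how the background is defined.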

Lu, Yizhen; Zhang, Zheng; Yu, Hongjie; Zheng, S. Lily; Isaacs, William B.; Xu, Jianfeng; Sun, Jielin

2010-01-01

195

Annotated genetic linkage maps of Pinus pinaster Ait. from a Central Spain population using microsatellite and gene based markers  

PubMed Central

Background Pinus pinaster Ait. is a major resin producing species in Spain. Genetic linkage mapping can facilitate marker-assisted selection (MAS) through the identification of Quantitative Trait Loci and selection of allelic variants of interest in breeding populations. In this study, we report annotated genetic linkage maps for two individuals (C14 and C15) belonging to a breeding program aiming to increase resin production. We use different types of DNA markers, including last-generation molecular markers. Results We obtained 13 and 14 linkage groups for C14 and C15 maps, respectively. A total of 211 and 215 markers were positioned on the respective maps, and estimated genome length was between 1,870 and 2,166 cM, which represents nearly 65% genome coverage. Comparative mapping with previously developed genetic linkage maps for P. pinaster based on about 60 common markers enabled aligning linkage groups to this reference map. The comparison of our annotated linkage maps and linkage maps reporting QTL information revealed 11 annotated SNPs in candidate genes that co-localized with previously reported QTLs for wood properties and water use efficiency. Conclusions This study provides genetic linkage maps from a Spanish population that shows high levels of genetic divergence with French populations from which segregating progenies have been previously mapped. These genetic maps will be of interest to construct a reliable consensus linkage map for the species. The importance of developing functional genetic linkage maps is highlighted, especially when working with breeding populations for its future application in MAS for traits of interest.

2012-01-01

196

Judging the Quality of Gene Expression-Based Clustering Methods Using Gene Annotation  

PubMed Central

We compare several commonly used expression-based gene clustering algorithms using a figure of merit based on the mutual information between cluster membership and known gene attributes. By studying various publicly available expression data sets we conclude that enrichment of clusters for biological function is, in general, highest at rather low cluster numbers. As a measure of dissimilarity between the expression patterns of two genes, no method outperforms Euclidean distance for ratio-based measurements, or Pearson distance for non-ratio-based measurements at the optimal choice of cluster number. We show the self-organized-map approach to be best for both measurement types at higher numbers of clusters. Clusters of genes derived from single- and average-linkage hierarchical clustering tend to produce worse-than-random results. [The algorithm described is available at http://llama.med.harvard.edu, under Software.]
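
The quantity underlying this figure of merit, the mutual information between cluster membership and a gene attribute, can be sketched in a few lines of Python. This is only the textbook definition for discrete labels; the authors' actual figure of merit may normalise or aggregate it differently, and the function name and toy labels are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(cluster_labels, attribute_labels):
    """Mutual information (in bits) between two equal-length label lists."""
    n = len(cluster_labels)
    joint = Counter(zip(cluster_labels, attribute_labels))
    cluster_counts = Counter(cluster_labels)
    attribute_counts = Counter(attribute_labels)
    mi = 0.0
    for (cluster, attribute), count in joint.items():
        p_joint = count / n
        p_cluster = cluster_counts[cluster] / n
        p_attribute = attribute_counts[attribute] / n
        mi += p_joint * log2(p_joint / (p_cluster * p_attribute))
    return mi

# Toy example: clusters that match the annotation perfectly versus not at all.
clusters = [0, 0, 1, 1]
print(mutual_information(clusters, ["kinase", "kinase", "transport", "transport"]))
print(mutual_information(clusters, ["kinase", "transport", "kinase", "transport"]))
```

In the first call the clustering fully determines the attribute, so the score is one bit; in the second the two labelings are independent and the score is zero.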

Gibbons, Francis D.; Roth, Frederick P.

2002-01-01

197

Avoiding inconsistencies over time and tracking difficulties in Applied Biosystems AB1700™/Panther™ probe-to-gene annotations  

Microsoft Academic Search

Background Significant inconsistencies in probe-to-gene annotations between different releases of probe set identifiers by commercial microarray platform solutions have been reported. Such inconsistencies lead to misleading or ambiguous interpretation of published gene expression results. Results We report here similar inconsistencies in the probe-to-gene annotation of Applied Biosystems AB1700 data, demonstrating that this is not an isolated concern. Moreover, the online information source

Sebastian Noth; Arndt Benecke

2005-01-01

198

Generation, analysis and functional annotation of expressed sequence tags from the sheepshead minnow (Cyprinodon variegatus)  

PubMed Central

Background Sheepshead minnow (Cyprinodon variegatus) are small fish capable of withstanding exposure to very low levels of dissolved oxygen, as well as extreme temperatures and salinities. The species is an important model for understanding the impacts of, and biological response to, hypoxia and co-occurring compounding stressors such as polycyclic aromatic hydrocarbons, endocrine disrupting chemicals, metals and herbicides. Here, we initiated a project to sequence and analyze over 10,000 ESTs generated from the Sheepshead minnow (Cyprinodon variegatus) as a resource for investigating stressor responses. Results We sequenced 10,858 EST clones using a normalized cDNA library made from larval, embryonic and adult suppression subtractive hybridization-PCR (SSH) libraries. Post-sequencing processing led to 8,099 high quality sequences. Clustering analysis of these ESTs identified 4,223 unique sequences containing 1,053 contigs and 3,170 singletons. BLASTX searches produced 1,394 significant (E-value < 10^-5) hits and further Gene Ontology (GO) analysis annotated 388 of these genes. All the EST sequences were deposited in the Expressed Sequence Tags database (dbEST) at GenBank (GenBank: GE329585 to GE337683). Gene discovery and annotations are presented and discussed. This set of ESTs represents a significant proportion of the Sheepshead minnow (Cyprinodon variegatus) transcriptome, and provides a material basis for the development of microarrays useful for further gene expression studies in association with stressors such as hypoxia, cadmium, chromium and pyrene.
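
The significance cut-off mentioned above (E-value below 10^-5) can be illustrated with a short Python sketch. It assumes BLAST results in the standard 12-column tabular layout, where the E-value is the eleventh field; the abstract does not state which output format the authors used, and the function name and sample rows are hypothetical.

```python
def significant_queries(tabular_lines, max_evalue=1e-5):
    """Return the set of query IDs with at least one hit below max_evalue."""
    hits = set()
    for line in tabular_lines:
        # Ignore blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue < max_evalue:
            hits.add(query_id)
    return hits

# Hypothetical rows: query, subject, %identity, length, mismatches, gap opens,
# query start/end, subject start/end, E-value, bit score.
sample = [
    "est_0001\tprot_A\t78.5\t200\t43\t0\t1\t600\t10\t209\t2e-30\t120",
    "est_0002\tprot_B\t35.0\t60\t39\t1\t5\t184\t40\t99\t0.5\t28.1",
    "est_0001\tprot_C\t40.1\t150\t90\t2\t1\t450\t3\t152\t1e-3\t45.0",
]
print(sorted(significant_queries(sample)))
```

Counting distinct query IDs rather than rows matters here, because one EST can have several hits of differing quality.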

2010-01-01

199

Annotation of cis-regulatory elements by identification, subclassification, and functional assessment of multispecies conserved sequences  

PubMed Central

An important step toward improving the annotation of the human genome is to identify cis-acting regulatory elements from primary DNA sequence. One approach is to compare sequences from multiple, divergent species. This approach distinguishes multispecies conserved sequences (MCS) in noncoding regions from more rapidly evolving neutral DNA. Here, we have analyzed a region of approximately 238 kb containing the human α globin cluster that was sequenced and/or annotated across the syntenic region in 22 species spanning 500 million years of evolution. Using a variety of bioinformatic approaches and correlating the results with many aspects of chromosome structure and function in this region, we were able to identify and evaluate the importance of 24 individual MCSs. This approach sensitively and accurately identified previously characterized regulatory elements but also discovered unidentified promoters, exons, splicing, and transcriptional regulatory elements. Together, these studies demonstrate an integrated approach by which to identify, subclassify, and predict the potential importance of MCSs.

Hughes, Jim R.; Cheng, Jan-Fang; Ventress, Nicki; Prabhakar, Shyam; Clark, Kevin; Anguita, Eduardo; De Gobbi, Marco; de Jong, Pieter; Rubin, Eddy; Higgs, Douglas R.

2005-01-01

200

Metalloproteomics: High-Throughput Structural and Functional Annotation of Proteins in Structural Genomics  

SciTech Connect

A high-throughput method for measuring transition metal content based on quantitation of X-ray fluorescence signals was used to analyze 654 proteins selected as targets by the New York Structural GenomiX Research Consortium. Over 10% showed the presence of transition metal atoms in stoichiometric amounts; these totals as well as the abundance distribution are similar to those of the Protein Data Bank. Bioinformatics analysis of the identified metalloproteins in most cases supported the metalloprotein annotation; identification of the conserved metal binding motif was also shown to be useful in verifying structural models of the proteins. Metalloproteomics provides a rapid structural and functional annotation for these sequences and is shown to be approximately 95% accurate in predicting the presence or absence of stoichiometric metal content. The project's goal is to assay at least 1 member from each Pfam family; approximately 500 Pfam families have been characterized with respect to transition metal content so far.

Shi, W.; Zhan, C.; Ignatov, A.; Manjasetty, B.; Marinkovic, N.; Sullivan, M.; Huang, R.; Chance, M.; Li, H.; et al.

2005-01-01

201

MetaSAMS--a novel software platform for taxonomic classification, functional annotation and comparative analysis of metagenome datasets.  

PubMed

Metagenomics aims at exploring microbial communities concerning their composition and functioning. Application of high-throughput sequencing technologies for the analysis of environmental DNA-preparations can generate large sets of metagenome sequence data which have to be analyzed by means of bioinformatics tools to unveil the taxonomic composition of the analyzed community as well as the repertoire of genes and gene functions. A bioinformatics software platform is required that allows the automated taxonomic and functional analysis and interpretation of metagenome datasets without manual effort. To address current demands in metagenome data analyses, the novel platform MetaSAMS was developed. MetaSAMS automatically accomplishes the tasks necessary for analyzing the composition and functional repertoire of a given microbial community from metagenome sequence data by implementing two software pipelines: (i) the first pipeline consists of three different classifiers performing the taxonomic profiling of metagenome sequences and (ii) the second functional pipeline accomplishes region predictions on assembled contigs and assigns functional information to predicted coding sequences. Moreover, MetaSAMS provides tools for statistical and comparative analyses based on the taxonomic and functional annotations. The capabilities of MetaSAMS are demonstrated for two metagenome datasets obtained from a biogas-producing microbial community of a production-scale biogas plant. The MetaSAMS web interface is available at https://metasams.cebitec.uni-bielefeld.de. PMID:23026555

Zakrzewski, Martha; Bekel, Thomas; Ander, Christina; Pühler, Alfred; Rupp, Oliver; Stoye, Jens; Schlüter, Andreas; Goesmann, Alexander

2012-09-29

202

Accurate Protein Structure Annotation through Competitive Diffusion of Enzymatic Functions over a Network of Local Evolutionary Similarities  

PubMed Central

High-throughput Structural Genomics yields many new protein structures without known molecular function. This study aims to uncover these missing annotations by globally comparing select functional residues across the structural proteome. First, Evolutionary Trace Annotation, or ETA, identifies which proteins have local evolutionary and structural features in common; next, these proteins are linked together into a proteomic network of ETA similarities; then, starting from proteins with known functions, competing functional labels diffuse link-by-link over the entire network. Every node is thus assigned a likelihood z-score for every function, and the most significant one at each node wins and defines its annotation. In high-throughput controls, this competitive diffusion process recovered enzyme activity annotations with 99% and 97% accuracy at half-coverage for the third and fourth Enzyme Commission (EC) levels, respectively. This corresponds to false positive rates 4-fold lower than nearest-neighbor and 5-fold lower than sequence-based annotations. In practice, experimental validation of the predicted carboxylesterase activity in a protein from Staphylococcus aureus illustrated the effectiveness of this approach in the context of an increasingly drug-resistant microbe. This study further links molecular function to a small number of evolutionarily important residues recognizable by Evolutionary Tracing and it points to the specificity and sensitivity of functional annotation by competitive global network diffusion. A web server is at http://mammoth.bcm.tmc.edu/networks.
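
The competitive diffusion step described above can be sketched roughly as follows. This Python example is a simplified illustration, not the published method: it diffuses raw scores with a fixed damping factor instead of computing likelihood z-scores, and the function name, the `alpha` and `iterations` parameters, and the toy network are all assumptions.

```python
def diffuse_labels(neighbours, known, functions, alpha=0.5, iterations=50):
    """Diffuse each function's label over the network, then let the
    highest-scoring function win at every node."""
    # Seed score is 1.0 where a protein is already annotated with the function.
    seed = {
        f: {node: 1.0 if known.get(node) == f else 0.0 for node in neighbours}
        for f in functions
    }
    score = {f: dict(seed[f]) for f in functions}
    for _ in range(iterations):
        for f in functions:
            updated = {}
            for node, nbrs in neighbours.items():
                spread = sum(score[f][m] for m in nbrs) / len(nbrs) if nbrs else 0.0
                # Blend the node's own seed with the average of its neighbours.
                updated[node] = (1 - alpha) * seed[f][node] + alpha * spread
            score[f] = updated
    # Competition step: each node takes the function with the highest score.
    return {node: max(functions, key=lambda f: score[f][node]) for node in neighbours}

# Hypothetical chain of four proteins linked by structural similarity.
neighbours = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
known = {"a": "esterase", "d": "kinase"}
print(diffuse_labels(neighbours, known, ["esterase", "kinase"]))
```

In this toy chain each unlabelled protein ends up with the function of the nearer annotated protein, which is the intuition behind letting labels compete link by link.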

Erdin, Serkan; Ward, R. Matthew; Amin, Shivas R.; Lichtarge, Olivier

2010-01-01

203

Accurate protein structure annotation through competitive diffusion of enzymatic functions over a network of local evolutionary similarities.  

PubMed

High-throughput Structural Genomics yields many new protein structures without known molecular function. This study aims to uncover these missing annotations by globally comparing select functional residues across the structural proteome. First, Evolutionary Trace Annotation, or ETA, identifies which proteins have local evolutionary and structural features in common; next, these proteins are linked together into a proteomic network of ETA similarities; then, starting from proteins with known functions, competing functional labels diffuse link-by-link over the entire network. Every node is thus assigned a likelihood z-score for every function, and the most significant one at each node wins and defines its annotation. In high-throughput controls, this competitive diffusion process recovered enzyme activity annotations with 99% and 97% accuracy at half-coverage for the third and fourth Enzyme Commission (EC) levels, respectively. This corresponds to false positive rates 4-fold lower than nearest-neighbor and 5-fold lower than sequence-based annotations. In practice, experimental validation of the predicted carboxylesterase activity in a protein from Staphylococcus aureus illustrated the effectiveness of this approach in the context of an increasingly drug-resistant microbe. This study further links molecular function to a small number of evolutionarily important residues recognizable by Evolutionary Tracing and it points to the specificity and sensitivity of functional annotation by competitive global network diffusion. A web server is at http://mammoth.bcm.tmc.edu/networks. PMID:21179190

Venner, Eric; Lisewski, Andreas Martin; Erdin, Serkan; Ward, R Matthew; Amin, Shivas R; Lichtarge, Olivier

2010-12-13

204

Annotation of the domestic dog genome sequence: finding the missing genes.  

PubMed

There are over 350 genetically distinct breeds of domestic dog that present considerable variation in morphology, physiology, and disease susceptibility. The genome sequence of the domestic dog was assembled and released in 2005, providing an estimated 20,000 protein-coding genes that are a great asset to the scientific community that uses the dog system as a genetic biomedical model and for comparative and evolutionary studies. Although the canine gene set had been predicted using a combination of ab initio methods, homology studies, motif analysis, and similarity-based programs, it still requires a deep annotation of noncoding genes, alternative splicing, pseudogenes, regulatory regions, and gain and loss events. Such analyses could benefit from new sequencing technologies (RNA-Seq) to better exploit the advantages of the canine genetic system in tracking disease genes. Here, we review the catalog of canine protein-coding genes and the search for missing genes, and we propose rationales for an accurate identification of noncoding genes through next-generation sequencing. PMID:22076420

Derrien, Thomas; Vaysse, Amaury; André, Catherine; Hitte, Christophe

2011-11-11

205

Functional Characterization of Two M42 Aminopeptidases Erroneously Annotated as Cellulases  

PubMed Central

Several aminopeptidases of the M42 family have been described as tetrahedral-shaped dodecameric (TET) aminopeptidases. A current hypothesis suggests that these enzymes are involved, along with the tricorn peptidase, in degrading peptides produced by the proteasome. Yet the M42 family remains ill defined, as some members have been annotated as cellulases because of their homology with CelM, formerly described as an endoglucanase of Clostridium thermocellum. Here we describe the catalytic functions and substrate profiles of CelM and of TmPep1050, the latter having been annotated as an endoglucanase of Thermotoga maritima. Both enzymes were shown to catalyze hydrolysis of nonpolar aliphatic L-amino acid-pNA substrates, the L-leucine derivative appearing as the best substrate. No significant endoglucanase activity was measured, either for TmPep1050 or CelM. Addition of cobalt ions enhanced the activity of both enzymes significantly, while both the chelating agent EDTA and bestatin, a specific inhibitor of metalloaminopeptidases, proved inhibitory. Our results strongly suggest that one should avoid annotating members of the M42 aminopeptidase family as cellulases. In an updated assessment of the distribution of M42 aminopeptidases, we found TET aminopeptidases to be distributed widely amongst archaea and bacteria. We additionally observed that several phyla lack both TET and tricorn. This suggests that other complexes may act downstream from the proteasome.

Dutoit, Raphael; Brandt, Nathalie; Legrain, Christianne; Bauvois, Cedric

2012-01-01

206

Functional characterization of two M42 aminopeptidases erroneously annotated as cellulases.  

PubMed

Several aminopeptidases of the M42 family have been described as tetrahedral-shaped dodecameric (TET) aminopeptidases. A current hypothesis suggests that these enzymes are involved, along with the tricorn peptidase, in degrading peptides produced by the proteasome. Yet the M42 family remains ill defined, as some members have been annotated as cellulases because of their homology with CelM, formerly described as an endoglucanase of Clostridium thermocellum. Here we describe the catalytic functions and substrate profiles of CelM and of TmPep1050, the latter having been annotated as an endoglucanase of Thermotoga maritima. Both enzymes were shown to catalyze hydrolysis of nonpolar aliphatic L-amino acid-pNA substrates, the L-leucine derivative appearing as the best substrate. No significant endoglucanase activity was measured, either for TmPep1050 or CelM. Addition of cobalt ions enhanced the activity of both enzymes significantly, while both the chelating agent EDTA and bestatin, a specific inhibitor of metalloaminopeptidases, proved inhibitory. Our results strongly suggest that one should avoid annotating members of the M42 aminopeptidase family as cellulases. In an updated assessment of the distribution of M42 aminopeptidases, we found TET aminopeptidases to be distributed widely amongst archaea and bacteria. We additionally observed that several phyla lack both TET and tricorn. This suggests that other complexes may act downstream from the proteasome. PMID:23226342

Dutoit, Raphaël; Brandt, Nathalie; Legrain, Christianne; Bauvois, Cédric

2012-11-30

207

Annotation extension through protein family annotation coherence metrics.  

PubMed

Protein functional annotation consists in associating proteins with textual descriptors elucidating their biological roles. The bulk of annotation is done via automated procedures that ultimately rely on annotation transfer. Despite a large number of existing protein annotation procedures the ever growing protein space is never completely annotated. One of the facets of annotation incompleteness derives from annotation uncertainty. Often when protein function cannot be predicted with enough specificity it is instead conservatively annotated with more generic terms. In a scenario of protein families or functionally related (or even dissimilar) sets this leads to a more difficult task of using annotations to compare the extent of functional relatedness among all family or set members. However, we postulate that identifying sub-sets of functionally coherent proteins annotated at a very specific level, can help the annotation extension of other incompletely annotated proteins within the same family or functionally related set. As an example we analyse the status of annotation of a set of CAZy families belonging to the Polysaccharide Lyase class. We show that through the use of visualization methods and semantic similarity based metrics it is possible to identify families and respective annotation terms within them that are suitable for possible annotation extension. Based on our analysis we then propose a semi-automatic methodology leading to the extension of single annotation terms within these partially annotated protein sets or families. PMID:24130572

Bastos, Hugo P; Clarke, Luka A; Couto, Francisco M

2013-10-11

208

Annotation extension through protein family annotation coherence metrics  

PubMed Central

Protein functional annotation consists in associating proteins with textual descriptors elucidating their biological roles. The bulk of annotation is done via automated procedures that ultimately rely on annotation transfer. Despite a large number of existing protein annotation procedures the ever growing protein space is never completely annotated. One of the facets of annotation incompleteness derives from annotation uncertainty. Often when protein function cannot be predicted with enough specificity it is instead conservatively annotated with more generic terms. In a scenario of protein families or functionally related (or even dissimilar) sets this leads to a more difficult task of using annotations to compare the extent of functional relatedness among all family or set members. However, we postulate that identifying sub-sets of functionally coherent proteins annotated at a very specific level, can help the annotation extension of other incompletely annotated proteins within the same family or functionally related set. As an example we analyse the status of annotation of a set of CAZy families belonging to the Polysaccharide Lyase class. We show that through the use of visualization methods and semantic similarity based metrics it is possible to identify families and respective annotation terms within them that are suitable for possible annotation extension. Based on our analysis we then propose a semi-automatic methodology leading to the extension of single annotation terms within these partially annotated protein sets or families.

Bastos, Hugo P.; Clarke, Luka A.; Couto, Francisco M.

2013-01-01

209

A novel genetic island of meningitic Escherichia coli K1 containing the ibeA invasion gene (GimA): functional annotation and carbon-source-regulated invasion of human brain microvascular endothelial cells  

Microsoft Academic Search

The IbeA (ibe10) gene is an invasion determinant contributing to E. coli K1 invasion of the blood-brain barrier. This gene has been cloned and characterized from the chromosome of an invasive cerebrospinal fluid isolate of E. coli K1, strain RS218 (O18:K1:H7). In the present study, a genetic island of meningitic E. coli containing ibeA (GimA) has been identified.

Sheng-He Huang; Yu-Hua Chen; Guoying Kong; Steven H. M. Chen; John Besemer; Mark Borodovsky; Ambrose Jong

2001-01-01

210

Protein function annotation with Structurally Aligned Local Sites of Activity (SALSAs)  

PubMed Central

Background The prediction of biochemical function from the 3D structure of a protein has proved to be much more difficult than was originally foreseen. A reliable method to test the likelihood of putative annotations and to predict function from structure would add tremendous value to structural genomics data. We report on a new method, Structurally Aligned Local Sites of Activity (SALSA), for the prediction of biochemical function based on a local structural match at the predicted catalytic or binding site. Results Implementation of the SALSA method is described. For the structural genomics protein PY01515 (PDB ID 2aqw) from Plasmodium yoelii, it is shown that the putative annotation, Orotidine 5'-monophosphate decarboxylase (OMPDC), is most likely correct. SALSA analysis of YP_001304206.1 (PDB ID 3h3l), a putative sugar hydrolase from Parabacteroides distasonis, shows that its active site does not bear close resemblance to any previously characterized member of its superfamily, the Concanavalin A-like lectins/glucanases. It is noted that three residues in the active site of the thermophilic beta-1,4-xylanase from Nonomuraea flexuosa (PDB ID 1m4w), Y78, E87, and E176, overlap with POOL-predicted residues of similar type, Y168, D153, and E232, in YP_001304206.1. The substrate recognition regions of the two proteins are rather different, suggesting that YP_001304206.1 is a new functional type within the superfamily. A structural genomics protein from Mycobacterium avium (PDB ID 3q1t) has been reported to be an enoyl-CoA hydratase (ECH), but SALSA analysis shows a poor match between the predicted residues for the SG protein and those of known ECHs. A better local structural match is obtained with Anabaena beta-diketone hydrolase (ABDH), a known β-diketone hydrolase from Cyanobacterium anabaena (PDB ID 2j5s). This suggests that the reported ECH function of the SG protein is incorrect and that it is more likely a β-diketone hydrolase.
Conclusions A local site match provides a more compelling function prediction than that obtainable from a simple 3D structure match. The present method can confirm putative annotations, identify misannotation, and in some cases suggest a more probable annotation.

2013-01-01

211

Assessing functional annotation transfers with inter-species conserved coexpression: application to Plasmodium falciparum  

Microsoft Academic Search

BACKGROUND: Plasmodium falciparum is the main causative agent of malaria. Of the 5 484 predicted genes of P. falciparum, about 57% do not have sufficient sequence similarity to characterized genes in other species to warrant functional assignments. Non-homology methods are thus needed to obtain functional clues for these uncharacterized genes. Gene expression data have been widely used in the recent

Laurent Bréhélin; Isabelle Florent; Olivier Gascuel; Éric Maréchal

2010-01-01

212

VIRGO: computational prediction of gene functions.  

PubMed

Dramatic advances in sequencing technology and sophisticated experimental assays that interrogate the cell, combined with the public availability of the resulting data, herald the era of systems biology. However, the biological functions of more than 40% of the genes in sequenced genomes are unknown, posing a fundamental barrier to progress in systems biology. The large scale and diversity of available data requires the development of techniques that can automatically utilize these datasets to make quantified and robust predictions of gene function that can be experimentally verified. We present a service called the VIRtual Gene Ontology (VIRGO) that (i) constructs a functional linkage network (FLN) from gene expression and molecular interaction data, (ii) labels genes in the FLN with their functional annotations in the Gene Ontology and (iii) systematically propagates these labels across the FLN in order to precisely predict the functions of unlabelled genes. VIRGO assigns confidence estimates to predicted functions so that a biologist can prioritize predictions for further experimental study. For each prediction, VIRGO also provides an informative 'propagation diagram' that traces the flow of information in the FLN that led to the prediction. VIRGO is available at http://whipple.cs.vt.edu:8080/virgo. PMID:16845022

Massjouni, Naveed; Rivera, Corban G; Murali, T M

2006-07-01
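
The label propagation step described in the VIRGO abstract above can be illustrated with a generic sketch. This is not VIRGO's actual algorithm, which the abstract does not specify; it is a simple weighted-average propagation on a small graph. The function name `propagate_labels`, the 0.5 starting score, and the toy gene names are all assumptions made for illustration.

```python
def propagate_labels(edges, labels, iterations=50):
    """Spread known annotation labels across a weighted linkage network.

    edges:  dict gene -> dict neighbour -> positive weight (listed in both directions)
    labels: dict gene -> 1.0 (annotated with the term) or 0.0 (known not to have it)
    Returns a dict gene -> score in [0, 1]; labelled genes keep their label.
    """
    genes = set(edges) | set(labels)
    for neighbours in edges.values():
        genes.update(neighbours)

    # Unlabelled genes start at an uninformative score (an assumption of this sketch).
    scores = {g: labels.get(g, 0.5) for g in genes}

    for _ in range(iterations):
        updated = {}
        for g in genes:
            if g in labels:
                updated[g] = labels[g]  # clamp known annotations
                continue
            neighbours = edges.get(g, {})
            total = sum(neighbours.values())
            if total > 0:
                # Weighted average of the neighbours' current scores.
                updated[g] = sum(w * scores[n] for n, w in neighbours.items()) / total
            else:
                updated[g] = scores[g]  # isolated gene: nothing to propagate
        scores = updated
    return scores


# Toy network: B is linked strongly to A (annotated) and weakly to C (not annotated).
toy_edges = {"A": {"B": 3.0}, "B": {"A": 3.0, "C": 1.0}, "C": {"B": 1.0}}
toy_scores = propagate_labels(toy_edges, {"A": 1.0, "C": 0.0})
print(toy_scores["B"])  # 0.75
```

The resulting score plays the role of a confidence estimate: a gene that is linked mainly to annotated neighbours ends up closer to 1.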

213

Systematic genome-wide annotation of spliceosomal proteins reveals differential gene family expansion.  

PubMed

Although more than 200 human spliceosomal and splicing-associated proteins are known, the evolution of the splicing machinery has not been studied extensively. The recent near-complete sequencing and annotation of distant vertebrate and chordate genomes provides the opportunity for an exhaustive comparative analysis of splicing factors across eukaryotes. We describe here our semiautomated computational pipeline to identify and annotate splicing factors in representative species of eukaryotes. We focused on protein families whose role in splicing is confirmed by experimental evidence. We visually inspected 1894 proteins and manually curated 224 of them. Our analysis shows a general conservation of the core spliceosomal proteins across the eukaryotic lineage, contrasting with selective expansions of protein families known to play a role in the regulation of splicing, most notably of SR proteins in metazoans and of heterogeneous nuclear ribonucleoproteins (hnRNP) in vertebrates. We also observed vertebrate-specific expansion of the CLK and SRPK kinases (which phosphorylate SR proteins), and the CUG-BP/CELF family of splicing regulators. Furthermore, we report several intronless genes amongst splicing proteins in mammals, suggesting that retrotransposition contributed to the complexity of the mammalian splicing apparatus. PMID:16344558

Barbosa-Morais, Nuno L; Carmo-Fonseca, Maria; Aparício, Samuel

2005-12-12

214

Pairagon+N-SCAN_EST: a model-based gene annotation pipeline  

PubMed Central

Background This paper describes Pairagon+N-SCAN_EST, a gene annotation pipeline that uses only native alignments. For each expressed sequence it chooses the best genomic alignment. Systems like ENSEMBL and ExoGean rely on trans alignments, in which expressed sequences are aligned to the genomic loci of putative homologs. Trans alignments contain a high proportion of mismatches, gaps, and/or apparently unspliceable introns, compared to alignments of cDNA sequences to their native loci. The Pairagon+N-SCAN_EST pipeline's first stage is Pairagon, a cDNA-to-genome alignment program based on a PairHMM probability model. This model relies on prior knowledge, such as the fact that introns must begin with GT, GC, or AT and end with AG or AC. It produces very precise alignments of high quality cDNA sequences. In the genomic regions between Pairagon's cDNA alignments, the pipeline combines EST alignments with de novo gene prediction by using N-SCAN_EST. N-SCAN_EST is based on a generalized HMM probability model augmented with a phylogenetic conservation model and EST alignments. It can predict complete transcripts by extending or merging EST alignments, but it can also predict genes in regions without EST alignments. Because they are based on probability models, both Pairagon and N-SCAN_EST can be trained automatically for new genomes and data sets. Results On the ENCODE regions of the human genome, Pairagon+N-SCAN_EST was as accurate as any other system tested in the EGASP assessment, including ENSEMBL and ExoGean. Conclusion With sufficient mRNA/EST evidence, genome annotation without trans alignments can compete successfully with systems like ENSEMBL and ExoGean, which use trans alignments.

Arumugam, Manimozhiyan; Wei, Chaochun; Brown, Randall H; Brent, Michael R

2006-01-01
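
The prior knowledge about intron boundaries mentioned in the Pairagon abstract above can be expressed as a small check. The sketch below is not part of Pairagon, where such knowledge is encoded in a PairHMM rather than applied as a hard filter; the function name `intron_boundaries_ok` is hypothetical. It tests only the start and end dinucleotides listed in the abstract and does not enforce which start pairs with which end.

```python
# Dinucleotides taken from the abstract: introns begin with GT, GC, or AT
# and end with AG or AC.
INTRON_STARTS = {"GT", "GC", "AT"}
INTRON_ENDS = {"AG", "AC"}


def intron_boundaries_ok(intron):
    """Return True if the intron sequence has an allowed start and end dinucleotide."""
    seq = intron.upper()
    if len(seq) < 4:  # too short to have distinct start and end dinucleotides
        return False
    return seq[:2] in INTRON_STARTS and seq[-2:] in INTRON_ENDS


print(intron_boundaries_ok("GTAAGTTTTCAG"))  # True
print(intron_boundaries_ok("CTAAGTAG"))      # False
```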

215

Assessing the impact of comparative genomic sequence data on the functional annotation of the Drosophila genome  

Microsoft Academic Search

Background It is widely accepted that comparative sequence data can aid the functional annotation of genome sequences; however, the most informative species and features of genome evolution for comparison remain to be determined. Results We analyzed conservation in eight genomic regions (apterous, even-skipped, fushi tarazu, twist, and Rhodopsins 1, 2, 3 and 4) from four Drosophila species (D. erecta, D. pseudoobscura, D.

Casey M Bergman; Barret D Pfeiffer; Diego E Rincón-Limas; Roger A Hoskins; Andreas Gnirke; Chris J Mungall; Adrienne M Wang; Brent Kronmiller; Joanne Pacleb; Soo Park; Mark Stapleton; Kenneth Wan; Reed A George; Pieter J de Jong; Juan Botas; Gerald M Rubin; Susan E Celniker

2002-01-01

216

Comparative Analyses Between Lolium/Festuca Introgression Lines and Rice Reveal the Major Fraction of Functionally Annotated Gene Models Is Located in Recombination-Poor/Very Recombination-Poor Regions of the Genome  

PubMed Central

Publication of the rice genome sequence has allowed an in-depth analysis of genome organization in a model monocot plant species. This has provided a powerful tool for genome analysis in large-genome unsequenced agriculturally important monocot species such as wheat, barley, rye, Lolium, etc. Previous data have indicated that the majority of genes in large-genome monocots are located toward the ends of chromosomes in gene-rich regions that undergo high frequencies of recombination. Here we demonstrate that a substantial component of the coding sequences in monocots is localized proximally in regions of very low and even negligible recombination frequencies. The implications of our findings are that during domestication of monocot plant species selection has concentrated on genes located in the terminal regions of chromosomes within areas of high recombination frequency. Thus a large proportion of the genetic variation available for selection of superior plant genotypes has not been exploited. In addition our findings raise the possibility of the evolutionary development of large supergene complexes that confer a selective advantage to the individual.

King, Julie; Armstead, Ian P.; Donnison, S. Iain; Roberts, Luned A.; Harper, John A.; Skøt, Kirsten; Elborough, Kieran; King, Ian P.

2007-01-01

217

Comparative analyses between lolium/festuca introgression lines and rice reveal the major fraction of functionally annotated gene models is located in recombination-poor/very recombination-poor regions of the genome.  

PubMed

Publication of the rice genome sequence has allowed an in-depth analysis of genome organization in a model monocot plant species. This has provided a powerful tool for genome analysis in large-genome unsequenced agriculturally important monocot species such as wheat, barley, rye, Lolium, etc. Previous data have indicated that the majority of genes in large-genome monocots are located toward the ends of chromosomes in gene-rich regions that undergo high frequencies of recombination. Here we demonstrate that a substantial component of the coding sequences in monocots is localized proximally in regions of very low and even negligible recombination frequencies. The implications of our findings are that during domestication of monocot plant species selection has concentrated on genes located in the terminal regions of chromosomes within areas of high recombination frequency. Thus a large proportion of the genetic variation available for selection of superior plant genotypes has not been exploited. In addition our findings raise the possibility of the evolutionary development of large supergene complexes that confer a selective advantage to the individual. PMID:17603095

King, Julie; Armstead, Ian P; Donnison, S Iain; Roberts, Luned A; Harper, John A; Skøt, Kirsten; Elborough, Kieran; King, Ian P

2007-07-01

218

Automated pipeline for atlas-based annotation of gene expression patterns: Application to postnatal day 7 mouse brain  

PubMed Central

Massive amounts of image data have been collected and continue to be generated for representing cellular gene expression throughout the mouse brain. Critical to exploiting this key effort of the post-genomic era is the ability to place these data into a common spatial reference that enables rapid interactive queries, analysis, data sharing, and visualization. In this paper, we present a set of automated protocols for generating and annotating gene expression patterns suitable for the establishment of a database. The steps include imaging tissue slices, detecting cellular gene expression levels, spatial registration with an atlas, and textual annotation. Using high-throughput in situ hybridization to generate serial sets of tissues displaying gene expression, this process was applied towards the establishment of a database representing over 200 genes in the postnatal day 7 mouse brain. These data using this protocol are now well-suited for interactive comparisons, analysis, queries, and visualization.

Carson, James; Ju, Tao; Bello, Musodiq; Thaller, Christina; Warren, Joe; Kakadiaris, Ioannis A.; Chiu, Wah; Eichele, Gregor

2009-01-01

219

Optimization of de novo transcriptome assembly from high-throughput short read sequencing data improves functional annotation for non-model organisms  

PubMed Central

Background The k-mer hash length is a key factor affecting the output of de novo transcriptome assembly packages using de Bruijn graph algorithms. Assemblies constructed with varying single k-mer choices might result in the loss of unique contiguous sequences (contigs) and relevant biological information. A common solution to this problem is the clustering of single k-mer assemblies. Even though annotation is one of the primary goals of a transcriptome assembly, the success of assembly strategies does not consider the impact of k-mer selection on the annotation output. This study provides an in-depth k-mer selection analysis that is focused on the degree of functional annotation achieved for a non-model organism where no reference genome information is available. Individual k-mers and clustered assemblies (CA) were considered using three representative software packages. Pair-wise comparison analyses (between individual k-mers and CAs) were produced to reveal missing Kyoto Encyclopedia of Genes and Genomes (KEGG) ortholog identifiers (KOIs), and to determine a strategy that maximizes the recovery of biological information in a de novo transcriptome assembly. Results Analyses of single k-mer assemblies resulted in the generation of various quantities of contigs and functional annotations within the selection window of k-mers (k-19 to k-63). For each k-mer in this window, generated assemblies contained certain unique contigs and KOIs that were not present in the other k-mer assemblies. Producing a non-redundant CA of k-mers 19 to 63 resulted in a more complete functional annotation than any single k-mer assembly. However, a fraction of unique annotations remained (~0.19 to 0.27% of total KOIs) in the assemblies of individual k-mers (k-19 to k-63) that were not present in the non-redundant CA. A workflow to recover these unique annotations is presented. 
Conclusions This study demonstrated that different k-mer choices result in various quantities of unique contigs per single k-mer assembly which affects biological information that is retrievable from the transcriptome. This undesirable effect can be minimized, but not eliminated, with clustering of multi-k assemblies with redundancy removal. The complete extraction of biological information in de novo transcriptomics studies requires both the production of a CA and efforts to identify unique contigs that are present in individual k-mer assemblies but not in the CA.

2012-01-01
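
The pair-wise comparison described in the abstract above (finding annotations present in single k-mer assemblies but absent from the clustered assembly) comes down to set operations. The sketch below is an illustration under assumptions, not the authors' workflow: the function names are hypothetical and the KO identifiers are placeholders rather than data from the study.

```python
def unique_to_each_k(koi_by_k):
    """For each k, return the KOIs found in that assembly and in no other single-k assembly."""
    result = {}
    for k, kois in koi_by_k.items():
        others = set().union(*(v for kk, v in koi_by_k.items() if kk != k))
        result[k] = set(kois) - others
    return result


def missing_from_clustered(koi_by_k, clustered_kois):
    """Return KOIs recovered by at least one single-k assembly but absent from the clustered assembly."""
    all_single = set().union(*koi_by_k.values())
    return all_single - set(clustered_kois)


# Placeholder identifiers for two single k-mer assemblies and one clustered assembly.
koi_by_k = {19: {"K00001", "K00002"}, 31: {"K00002", "K00003"}}
clustered = {"K00001", "K00002"}

print(unique_to_each_k(koi_by_k))
print(missing_from_clustered(koi_by_k, clustered))
```

Annotations returned by `missing_from_clustered` are the ones a recovery step would need to add back to the clustered assembly.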

220

Integrating genome annotation and QTL position to identify candidate genes for productivity, architecture and water-use efficiency in Populus spp  

PubMed Central

Background Hybrid poplar species are candidates for biomass production, but breeding efforts are needed to combine productivity and water use efficiency in improved cultivars. Understanding the genetic architecture of growth in poplar through a Quantitative Trait Loci (QTL) approach can help to elucidate the molecular basis of such integrative traits, but identifying candidate genes underlying these QTLs remains difficult. Nevertheless, the increase in genomic information, together with access to a reference genome sequence (Populus trichocarpa Nisqually-1), makes it possible to bridge QTL information on genetic maps and the physical location of candidate genes on the genome. The objective of the study is to identify QTLs controlling productivity, architecture and leaf traits in a P. deltoides x P. trichocarpa F1 progeny and to identify candidate genes underlying QTLs based on the anchoring of genetic maps on the genome and the gene ontology information linked to genome annotation. The strategy to explore genome annotation was to use Gene Ontology enrichment tools to test whether some functional categories are statistically over-represented in QTL regions. Results Four leaf traits and 7 growth traits were measured on 330 F1 P. deltoides x P. trichocarpa progeny. A total of 77 QTLs controlling 11 traits were identified, explaining from 1.8 to 17.2% of the variation of traits. For 58 QTLs, confidence intervals could be projected on the genome. An extended functional annotation was built based on data retrieved from the plant genome database Phytozome and from an inference of function using homology between Populus and the model plant Arabidopsis. Genes located within QTL confidence intervals were retrieved and enrichments in gene ontology (GO) terms were determined using different methods. Significant enrichments were found for all traits.
Particularly relevant biological process GO terms were identified for QTLs controlling the number of sylleptic branches: intervals were enriched in biological process GO terms such as ‘ripening’ and ‘adventitious roots development’. Conclusion Beyond the simple identification of QTLs, this study is the first to use a global approach of GO term enrichment analysis to fully explore gene function under QTL confidence intervals in plants. This global approach may lead to the identification of new candidate genes for traits of interest.

2012-01-01
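
Testing whether a GO term is over-represented among genes inside a QTL confidence interval, as described in the abstract above, is commonly done with a one-sided hypergeometric test. The abstract says only that "different methods" were used, so the choice of test here is an assumption, and the function name and example counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(genome_genes, genome_with_term, interval_genes, interval_with_term):
    """One-sided hypergeometric p-value for GO term over-representation.

    Probability of seeing at least `interval_with_term` annotated genes when
    `interval_genes` genes are drawn without replacement from a genome of
    `genome_genes` genes, of which `genome_with_term` carry the term.
    """
    upper = min(genome_with_term, interval_genes)
    total = comb(genome_genes, interval_genes)
    favourable = sum(
        comb(genome_with_term, i) * comb(genome_genes - genome_with_term, interval_genes - i)
        for i in range(interval_with_term, upper + 1)
    )
    return favourable / total


# Hypothetical counts: 10 genes in the genome, 3 carry the term,
# and all 3 genes in the interval carry it.
print(hypergeom_enrichment_p(10, 3, 3, 3))  # 1/120
```

In a real analysis the p-values would also need correction for the number of GO terms tested, which this sketch leaves out.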

221

The Otter Annotation System  

PubMed Central

With the completion of the human genome sequence and genome sequence available for other vertebrate genomes, the task of manual annotation at the large genome scale has become a priority. Possibly even more important is the requirement to curate and improve this annotation in the light of future data. For this to be possible, there is a need for tools to access and manage the annotation. Ensembl provides an excellent means for storing gene structures, genome features, and sequence, but it does not support the extra textual data necessary for manual annotation. We have extended Ensembl to create the Otter manual annotation system. This comprises a relational database schema for storing the manual annotation data, an application-programming interface (API) to access it, an extensible markup language (XML) format to allow transfer of the data, and a server to allow multiuser/multimachine access to the data. We have also written a data-adaptor plugin for the Apollo Browser/Editor to enable it to utilize an Otter server. The Otter database is currently used by the Vertebrate Genome Annotation (VEGA) site (http://vega.sanger.ac.uk), which provides access to manually curated human chromosomes. Support is also being developed for using the AceDB annotation editor, FMap, via a Perl wrapper called Lace. The Human and Vertebrate Annotation (HAVANA) group annotators at the Sanger center are using this to annotate human chromosomes 1 and 20.

Searle, Stephen M.J.; Gilbert, James; Iyer, Vivek; Clamp, Michele

2004-01-01

222

Coordinated and sequential transcription of the cyprinid herpesvirus-3 annotated genes.  

PubMed

Cyprinid herpesvirus-3 (CyHV-3) is the cause of a fatal disease in carp and koi fish. The disease is seasonal and appears when water temperatures range from 18 to 28°C. CyHV-3 is a member of the Alloherpesviridae, a family in the Herpesvirales order that encompasses mammalian, avian and reptilian viruses. CyHV-3 is a large double-stranded DNA (dsDNA) herpesvirus with a genome of approximately 295 kbp, divergent from other mammalian, avian and reptilian herpesviruses, but bearing several genes similar to cyprinid herpesvirus-1 (CyHV-1), CyHV-2, anguillid herpesvirus-1 (AngHV-1), ictalurid herpesvirus-1 (IcHV-1) and ranid herpesvirus-1 (RaHV-1). Here we show that viral DNA synthesis commences 4-8 h post-infection (p.i.), and is completely inhibited by pre-treatment with cytosine β-D-arabinofuranoside (Ara-C). Transcription of CyHV-3 genes initiates after infection as early as 1-2 h p.i., and precedes viral DNA synthesis. All 156 annotated open reading frames (ORFs) of the CyHV-3 genome are transcribed into RNAs, most of which can be classified into immediate early (IE or α), early (E or β) and late (L or γ) classes, similar to all other herpesviruses. Several ORFs belonging to these groups are clustered along the viral genome. PMID:22841491

Ilouze, Maya; Dishon, Arnon; Kotler, Moshe

2012-07-24

223

Meta4: a web application for sharing and annotating metagenomic gene predictions using web services.  

PubMed

Whole-genome shotgun metagenomics experiments produce DNA sequence data from entire ecosystems, and provide a huge amount of novel information. Gene discovery projects require up-to-date information about sequence homology and domain structure for millions of predicted proteins to be presented in a simple, easy-to-use system. There is a lack of simple, open, flexible tools that allow the rapid sharing of metagenomics datasets with collaborators in a format they can easily interrogate. We present Meta4, a flexible and extensible web application that can be used to share and annotate metagenomic gene predictions. Proteins and predicted domains are stored in a simple relational database, with a dynamic front-end which displays the results in an internet browser. Web services are used to provide up-to-date information about the proteins from homology searches against public databases. Information about Meta4 can be found on the project website, code is available on Github, a cloud image is available, and an example implementation can be seen at. PMID:24046776

Richardson, Emily J; Escalettes, Franck; Fotheringham, Ian; Wallace, Robert J; Watson, Mick

2013-09-05

224

Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3.  

PubMed

Current sequencing methods produce large amounts of data, but genome assemblies constructed from these data are often fragmented and incomplete. Incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. This means that methods attempting to estimate rates of gene duplication and loss often will be misled by such errors and that rates of gene family evolution will be consistently overestimated. Here, we present a method that takes these errors into account, allowing one to accurately infer rates of gene gain and loss among genomes even with low assembly and annotation quality. The method is implemented in the newest version of the software package CAFE, along with several other novel features. We demonstrate the accuracy of the method with extensive simulations and reanalyze several previously published data sets. Our results show that errors in genome annotation do lead to higher inferred rates of gene gain and loss but that CAFE 3 sufficiently accounts for these errors to provide accurate estimates of important evolutionary parameters. PMID:23709260

Han, Mira V; Thomas, Gregg W C; Lugo-Martinez, Jose; Hahn, Matthew W

2013-05-24

225

Avoiding inconsistencies over time and tracking difficulties in Applied Biosystems AB1700™/Panther™ probe-to-gene annotations  

Microsoft Academic Search

BACKGROUND: Significant inconsistencies between probe-to-gene annotations between different releases of probe set identifiers by commercial microarray platform solutions have been reported. Such inconsistencies lead to misleading or ambiguous interpretation of published gene expression results. RESULTS: We report here similar inconsistencies in the probe-to-gene annotation of Applied Biosystems AB1700 data, demonstrating that this is not an isolated concern. Moreover, the online

Sebastian Noth; Arndt Benecke

2005-01-01

226

Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies  

PubMed Central

Due to the rapid release of new data from genome sequencing projects, the majority of protein sequences in public databases have not been experimentally characterized; rather, sequences are annotated using computational analysis. The level of misannotation and the types of misannotation in large public databases are currently unknown and have not been analyzed in depth. We have investigated the misannotation levels for molecular function in four public protein sequence databases (UniProtKB/Swiss-Prot, GenBank NR, UniProtKB/TrEMBL, and KEGG) for a model set of 37 enzyme families for which extensive experimental information is available. The manually curated database Swiss-Prot shows the lowest annotation error levels (close to 0% for most families); the two other protein sequence databases (GenBank NR and TrEMBL) and the protein sequences in the KEGG pathways database exhibit similar and surprisingly high levels of misannotation that average 5%–63% across the six superfamilies studied. For 10 of the 37 families examined, the level of misannotation in one or more of these databases is >80%. Examination of the NR database over time shows that misannotation has increased from 1993 to 2005. The types of misannotation that were found fall into several categories, most associated with “overprediction” of molecular function. These results suggest that misannotation in enzyme superfamilies containing multiple families that catalyze different reactions is a larger problem than has been recognized. Strategies are suggested for addressing some of the systematic problems contributing to these high levels of misannotation.

Schnoes, Alexandra M.; Brown, Shoshana D.; Dodevski, Igor; Babbitt, Patricia C.

2009-01-01

227

Analysis of the leaf transcriptome of Musa acuminata during interaction with Mycosphaerella musicola: gene assembly, annotation and marker development  

PubMed Central

Background Although banana (Musa sp.) is an important edible crop, contributing towards poverty alleviation and food security, limited transcriptome datasets are available for use in accelerated molecular-based breeding in this genus. 454 GS-FLX Titanium technology was employed to determine the sequence of gene transcripts in genotypes of Musa acuminata ssp. burmannicoides Calcutta 4 and M. acuminata subgroup Cavendish cv. Grande Naine, contrasting in resistance to the fungal pathogen Mycosphaerella musicola, causal organism of Sigatoka leaf spot disease. To enrich for transcripts under biotic stress responses, full length-enriched cDNA libraries were prepared from whole plant leaf materials, both uninfected and artificially challenged with pathogen conidiospores. Results The study generated 846,762 high quality sequence reads, with an average length of 334 bp and totalling 283 Mbp. De novo assembly generated 36,384 and 35,269 unigene sequences for M. acuminata Calcutta 4 and Cavendish Grande Naine, respectively. A total of 64.4% of the unigenes were annotated through Basic Local Alignment Search Tool (BLAST) similarity analyses against public databases. Assembled sequences were functionally mapped to Gene Ontology (GO) terms, with unigene functions covering a diverse range of molecular functions, biological processes and cellular components. Genes from a number of defense-related pathways were observed in transcripts from each cDNA library. Over 99% of contig unigenes mapped to exon regions in the reference M. acuminata DH Pahang whole genome sequence. A total of 4068 genic-SSR loci were identified in Calcutta 4 and 4095 in Cavendish Grande Naine. A subset of 95 potential defense-related gene-derived simple sequence repeat (SSR) loci were validated for specific amplification and polymorphism across M. acuminata accessions. 
Fourteen loci were polymorphic, with alleles per polymorphic locus ranging from 3 to 8 and polymorphism information content ranging from 0.34 to 0.82. Conclusions A large set of unigenes were characterized in this study for both M. acuminata Calcutta 4 and Cavendish Grande Naine, increasing the number of public domain Musa ESTs. This transcriptome is an invaluable resource for furthering our understanding of biological processes elicited during biotic stresses in Musa. Gene-based markers will facilitate molecular breeding strategies, forming the basis of genetic linkage mapping and analysis of quantitative trait loci.

2013-01-01
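
Identifying simple sequence repeat (SSR) loci in unigene sequences, as reported in the abstract above, can be illustrated with a regular-expression scan. The abstract does not say which tool or thresholds were used, so the motif length range (2 to 6 bp), the minimum of five repeats, and the function name `find_ssrs` are all assumptions of this sketch.

```python
import re


def find_ssrs(sequence, min_repeats=5, min_motif=2, max_motif=6):
    """Return (start, motif, repeat_count) for each tandem repeat found.

    A hit is a motif of min_motif..max_motif bases repeated at least
    min_repeats times in a row. The shortest repeating motif is preferred.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_motif, max_motif, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits


# Toy sequence with a (GA)5 repeat starting at offset 2.
print(find_ssrs("TT" + "GA" * 5 + "CC"))  # [(2, 'GA', 5)]
```

Polymorphism across accessions, as validated in the study, would then be assessed by comparing repeat counts at the same locus between genotypes.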

228

An innovative plant genomics and gene annotation program for high school, community college, and university faculty.  

PubMed

Today's biology educators face the challenge of training their students in modern molecular biology techniques including genomics and bioinformatics. The Dolan DNA Learning Center (DNALC) of Cold Spring Harbor Laboratory has developed and disseminated a bench- and computer-based plant genomics curriculum for biology faculty. In 2007, a five-day "Plant Genomics and Gene Annotation" workshop was held at Florida A&M University in Tallahassee, FL, to enhance participants' knowledge and understanding of plant molecular genetics and assist them in developing and honing their laboratory and computer skills. Florida A&M University is a historically black university with over 95% African-American student enrollment. Sixteen participants, including high school (56%) and community college faculty (25%), attended the workshop. Participants carried out in vitro and in silico experiments with maize, Arabidopsis, soybean, and food products to determine the genotype of the samples. Benefits of the workshop included increased awareness of plant biology research for high school and college level students. Participants completed pre- and postworkshop evaluations for the measurement of effectiveness. Participants demonstrated an overall improvement in their postworkshop evaluation scores. This article provides a detailed description of workshop activities, as well as assessment and long-term support for broad classroom implementation. PMID:18765753

Hacisalihoglu, Gokhan; Hilgert, Uwe; Nash, E Bruce; Micklos, David A

2008-01-01

229

The NFI-Regulome Database: A tool for annotation and analysis of control regions of genes regulated by Nuclear Factor I transcription factors  

PubMed Central

Background Genome annotation plays an essential role in the interpretation and use of genome sequence information. While great strides have been made in the annotation of coding regions of genes, less success has been achieved in the annotation of the regulatory regions of genes, including promoters, enhancers/silencers, and other regulatory elements. One reason for this disparity in annotated information is that coding regions can be assessed using high-throughput techniques such as EST sequencing, while annotation of regulatory regions often requires a gene-by-gene approach. Results The NFI-Regulome database http://nfiregulome.ccr.buffalo.edu was designed to promote easy annotation of the regulatory regions of genes that contain binding sites for the NFI (Nuclear Factor I) family of transcription factors, using data from the published literature. Binding sites are annotated together with the sequence of the gene, obtained from the UCSC Genome site, and the locations of all binding sites for multiple genes can be displayed in a number of formats designed to facilitate inter-gene comparisons. Classes of genes based on expression pattern, disease involvement, or types of binding sites present can be readily compared in order to assess common "architectural" structures in the regulatory regions. Conclusions The NFI-Regulome database allows rapid display of the relative locations and number of transcription factor binding sites of individual or defined sets of genes that contain binding sites for NFI transcription factors. This database may in the future be expanded into a distributed database structure including other families of transcription factors. Such databases may be useful for identifying common regulatory structures in genes essential for organ development, tissue-specific gene expression or those genes related to specific diseases.

2011-01-01

230

Using Functional Annotation for the Empirical Determination of Bayes Factors for Genome-Wide Association Study Analysis  

Microsoft Academic Search

A genome wide association study (GWAS) typically results in a few highly significant ‘hits’ and a much larger set of suggestive signals (‘near-hits’). The latter group are expected to be a mixture of true and false associations. One promising strategy to help separate these is to use functional annotations for prioritisation of variants for follow-up. A key task is to

Jo Knight; Michael R. Barnes; Gerome Breen; Michael E. Weale; Thomas Mailund

2011-01-01

231

Evaluation of high-throughput functional categorization of human disease genes  

PubMed Central

Background Biological data that are well-organized by an ontology, such as Gene Ontology, enable high-throughput availability of the semantic web. They can also be used to facilitate high-throughput classification of biomedical information. However, to our knowledge, no evaluation has been published on automating classifications of human disease genes using Gene Ontology. In this study, we evaluate automated classifications of well-defined human disease genes using their Gene Ontology annotations and compare them to a gold standard. This gold standard was independently conceived by Valle's research group, and contains 923 human disease genes organized in 14 categories of protein function. Results Two automated methods were applied to investigate the classification of human disease genes into independently pre-defined categories of protein function. One method used the structure of Gene Ontology by pre-selecting 74 Gene Ontology terms assigned to 11 protein function categories. The second method was based on the similarity of human disease genes clustered according to the information-theoretic distance of their Gene Ontology annotations. Compared to the categorization of human disease genes found in the gold standard, our automated methods can achieve an overall 56% and 47% precision with 62% and 71% recall respectively. However, approximately 15% of the studied human disease genes remain without GO annotations. Conclusion Automated methods can recapitulate a significant portion of the classification of the human disease genes. The method using information-theoretic distance performs slightly better on the precision with some loss in recall. For some protein function categories, such as 'hormone' and 'transcription factor', the automated methods perform particularly well, achieving precision and recall levels above 75%.
In summary, this study demonstrates that for semantic webs, methods to automatically classify or analyze a majority of human disease genes require significant progress in both the Gene Ontology annotations and particularly in the utilization of these annotations.

Chen, James L; Liu, Yang; Sam, Lee T; Li, Jianrong; Lussier, Yves A

2007-01-01
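
The precision and recall figures quoted in the preceding abstract compare automated category assignments against a gold standard. The following is a minimal sketch of that kind of comparison, not the authors' code. The gene names, category labels, and the `precision_recall` helper are hypothetical, and each gene is assumed to carry exactly one category.

```python
def precision_recall(predicted, gold):
    """Compare single-label assignments against a gold standard.

    predicted, gold: dict mapping gene -> category.
    Genes missing from `predicted` count as unclassified
    (for example, genes without any GO annotation).
    """
    correct = sum(1 for gene, cat in predicted.items() if gold.get(gene) == cat)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: four gold-standard genes, three of them classified.
gold = {
    "G1": "hormone",
    "G2": "enzyme",
    "G3": "transcription factor",
    "G4": "enzyme",
}
predicted = {
    "G1": "hormone",
    "G2": "transporter",
    "G3": "transcription factor",
}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Leaving unannotated genes out of `predicted` lowers recall without affecting precision, which is one way the roughly 15% of genes lacking GO annotations could show up in such an evaluation.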

232

Characterization of Liaoning Cashmere Goat Transcriptome: Sequencing, De Novo Assembly, Functional Annotation and Comparative Analysis  

PubMed Central

Background Liaoning cashmere goat is a famous goat breed for cashmere wool. In order to increase the transcriptome data and accelerate genetic improvement for this breed, we performed de novo transcriptome sequencing to generate the first expressed sequence tag dataset for the Liaoning cashmere goat, using next-generation sequencing technology. Results Transcriptome sequencing of Liaoning cashmere goat on a Roche 454 platform yielded 804,601 high-quality reads. Clustering and assembly of these reads produced a non-redundant set of 117,854 unigenes, comprising 13,194 isotigs and 104,660 singletons. Based on similarity searches with known proteins, 17,356 unigenes were assigned to 6,700 GO categories, and the terms were summarized into three main GO categories and 59 sub-categories. 3,548 and 46,778 unigenes had significant similarity to existing sequences in the KEGG and COG databases, respectively. Comparative analysis revealed that 42,254 unigenes were aligned to 17,532 different sequences in NCBI non-redundant nucleotide databases. 97,236 (82.51%) unigenes were mapped to the 30 goat chromosomes. 35,551 (30.17%) unigenes were matched to 11,438 reported goat protein-coding genes. The remaining non-matched unigenes were further compared with cattle and human reference genes, and 67 putative new goat genes were discovered. Additionally, 2,781 potential simple sequence repeats were initially identified from all unigenes. Conclusion The transcriptome of Liaoning cashmere goat was deeply sequenced, de novo assembled, and annotated, providing abundant data to better understand the Liaoning cashmere goat transcriptome. The potential simple sequence repeats provide a material basis for future genetic linkage and quantitative trait loci analyses.

Liu, Hongliang; Wang, Tingting; Wang, Jinke; Quan, Fusheng; Zhang, Yong

2013-01-01

233

Heterologous expression of plasmodial proteins for structural studies and functional annotation  

PubMed Central

Malaria remains the world's most devastating tropical infectious disease with as many as 40% of the world population living in risk areas. The widespread resistance of Plasmodium parasites to the cost-effective chloroquine and antifolates has forced the introduction of more costly drug combinations, such as Coartem®. In the absence of a vaccine in the foreseeable future, one strategy to address the growing malaria problem is to identify and characterize new and durable antimalarial drug targets, the majority of which are parasite proteins. Biochemical and structure-activity analysis of these proteins is ultimately essential in the characterization of such targets but requires large amounts of functional protein. Even though heterologous protein production has now become a relatively routine endeavour for most proteins of diverse origins, the functional expression of soluble plasmodial proteins is highly problematic and slows the progress of antimalarial drug target discovery. Here the status quo of heterologous production of plasmodial proteins is presented, constraints are highlighted and alternative strategies and hosts for functional expression and annotation of plasmodial proteins are reviewed.

Birkholtz, Lyn-Marie; Blatch, Gregory; Coetzer, Theresa L; Hoppe, Heinrich C; Human, Esmare; Morris, Elizabeth J; Ngcete, Zoleka; Oldfield, Lyndon; Roth, Robyn; Shonhai, Addmore; Stephens, Linda; Louw, Abraham I

2008-01-01

234

Structural proteomics: a tool for genome annotation  

Microsoft Academic Search

In any newly sequenced genome, 30% to 50% of genes encode proteins with unknown molecular or cellular function. Fortunately, structural genomics is emerging as a powerful approach of functional annotation. Because of recent developments in high-throughput technologies, ongoing structural genomics projects are generating new structures at an unprecedented rate. In the past year, structural studies have identified many new structural

Alexander F Yakunin; Adelinda A Yee; Alexei Savchenko; Aled M Edwards; Cheryl H Arrowsmith

2004-01-01

235

PANNOTATOR: an automated tool for annotation of pan-genomes.  

PubMed

Due to next-generation sequence technologies, sequencing of bacterial genomes is no longer one of the main bottlenecks in bacterial research and the number of new genomes deposited in public databases continues to increase at an accelerating rate. Among these new genomes, several belong to the same species and were generated for pan-genomic studies. A pan-genomic study allows investigation of strain phenotypic differences based on genotypic differences. Along with a need for good assembly quality, it is also fundamental to guarantee good functional genome annotation of the different strains. In order to ensure quality and standards for functional genome annotation among different strains, we developed and made available PANNOTATOR (http://bnet.egr.vcu.edu/iioab/agenote.php), a web-based automated pipeline for the annotation of closely related and well-suited genomes for pan-genome studies, aiming at reducing the manual work to generate reports and corrections of various genome strains. PANNOTATOR achieved 98% and 76% correctness for gene name and function, respectively, as a result of annotation transfer with a similarity cut-off of 70%, compared with a gold standard annotation for the same species. These results surpassed the RAST and BASys software by 41 and 21% and 66 and 17% for gene name and function annotation, respectively, when there were reliable genome annotations of closely related species. PANNOTATOR provides fast and reliable pan-genome annotation, thereby allowing us to maintain the research focus on the main genotype differences between strains. PMID:24065654

Santos, A R; Barbosa, E; Fiaux, K; Zurita-Turk, M; Chaitankar, V; Kamapantula, B; Abdelzaher, A; Ghosh, P; Tiwari, S; Barve, N; Jain, N; Barh, D; Silva, A; Miyoshi, A; Azevedo, V

2013-08-16
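
The annotation transfer with a 70% similarity cut-off described in the preceding abstract can be illustrated with a short sketch. This is not PANNOTATOR's implementation. The `transfer_annotations` helper, the hit table layout, and the example genes are hypothetical, and treating the cut-off as inclusive (`>=`) is an assumption.

```python
def transfer_annotations(hits, reference, cutoff=70.0):
    """Copy (gene name, function) from a reference genome to query genes.

    hits: dict mapping query gene -> (best reference gene, percent similarity).
    reference: dict mapping reference gene -> (gene name, function).
    Only hits at or above `cutoff` (assumed inclusive) are transferred.
    """
    transferred = {}
    for query, (ref_gene, similarity) in hits.items():
        if similarity >= cutoff and ref_gene in reference:
            transferred[query] = reference[ref_gene]
    return transferred


# Hypothetical reference annotation and similarity-search results.
reference = {
    "refA": ("dnaA", "chromosomal replication initiator"),
    "refB": ("recA", "recombinase A"),
}
hits = {
    "q1": ("refA", 95.2),  # above the cut-off: annotation is transferred
    "q2": ("refB", 64.0),  # below the cut-off: left unannotated
    "q3": ("refC", 88.0),  # reference gene has no annotation to transfer
}

result = transfer_annotations(hits, reference)
print(result)
```

Genes that fall below the cut-off stay unannotated rather than receiving a weak guess, which fits the abstract's emphasis on closely related strains with reliable reference annotations.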

236

Large-scale identification and characterization of alternative splicing variants of human gene transcripts using 56,419 completely sequenced and manually annotated full-length cDNAs.  

PubMed

We report the first genome-wide identification and characterization of alternative splicing in human gene transcripts based on analysis of the full-length cDNAs. Applying both manual and computational analyses for 56,419 completely sequenced and precisely annotated full-length cDNAs selected for the H-Invitational human transcriptome annotation meetings, we identified 6,877 alternative splicing genes with 18,297 different alternative splicing variants. A total of 37,670 exons were involved in these alternative splicing events. The encoded protein sequences were affected in 6,005 of the 6,877 genes. Notably, alternative splicing affected protein motifs in 3,015 genes, subcellular localizations in 2,982 genes and transmembrane domains in 1,348 genes. We also identified interesting patterns of alternative splicing, in which two distinct genes seemed to be bridged, nested or having overlapping protein coding sequences (CDSs) of different reading frames (multiple CDS). In these cases, completely unrelated proteins are encoded by a single locus. Genome-wide annotations of alternative splicing, relying on full-length cDNAs, should lay firm groundwork for exploring in detail the diversification of protein function, which is mediated by the fast expanding universe of alternative splicing variants. PMID:16914452

Takeda, Jun-ichi; Suzuki, Yutaka; Nakao, Mitsuteru; Barrero, Roberto A; Koyanagi, Kanako O; Jin, Lihua; Motono, Chie; Hata, Hiroko; Isogai, Takao; Nagai, Keiichi; Otsuki, Tetsuji; Kuryshev, Vladimir; Shionyu, Masafumi; Yura, Kei; Go, Mitiko; Thierry-Mieg, Jean; Thierry-Mieg, Danielle; Wiemann, Stefan; Nomura, Nobuo; Sugano, Sumio; Gojobori, Takashi; Imanishi, Tadashi

2006-08-12

237

CASTp: computed atlas of surface topography of proteins with structural and topographical mapping of functionally annotated residues  

Microsoft Academic Search

Cavities on a protein's surface as well as specific amino acid positioning within it create the physicochemical properties needed for a protein to perform its function. CASTp (http://cast.engr.uic.edu) is an online tool that locates and measures pockets and voids on 3D protein structures. This new version of CASTp includes annotated functional information of specific residues on the protein structure.

Joe Dundas; Zheng Ouyang; Jeffery Tseng; T. Andrew Binkowski; Yaron Turpaz; Jie Liang

2006-01-01

238

QTL MatchMaker: a multi-species quantitative trait loci (QTL) database and query system for annotation of genes and QTL  

PubMed Central

Identifying genes that underlie quantitative trait loci (QTL) is a challenging task. Here, we present a new QTL software system, named QTL MatchMaker. The system is designed to integrate and mine QTL information across human, mouse and rat genomes and to annotate functional genomic data. It combines and organizes information from relevant public databases and publications and integrates QTL, physical, genetic and cytogenetic maps across human, mouse and rat. To make this application available to the research community we have developed a website for high-throughput mapping of expressed sequences to QTL and for selection of candidate genes in the physiological genomics context of complex traits. QTL MatchMaker is accessible at

Star, Kremena V.; Song, Quingbin; Zhu, Andy; Bottinger, Erwin P.

2006-01-01

239

Genomic Sequence around Butterfly Wing Development Genes: Annotation and Comparative Analysis  

Microsoft Academic Search

Background Analysis of genomic sequence allows characterization of genome content and organization, and access beyond gene-coding regions for identification of functional elements. BAC libraries, where relatively large genomic regions are made readily available, are especially useful for species without a fully sequenced genome and can increase genomic coverage of phylogenetic and biological diversity. For example, no butterfly genome is yet available

Inês C. Conceição; Anthony D. Long; Jonathan D. Gruber; Patrícia Beldade

2011-01-01

240

Proteogenomic Analysis of Polymorphisms and Gene Annotation Divergences in Prokaryotes using a Clustered Mass Spectrometry-Friendly Database*  

PubMed Central

Precise annotation of genes or open reading frames is still a difficult task that results in divergence even for data generated from the same genomic sequence. This has an impact in further proteomic studies, and also compromises the characterization of clinical isolates with many specific genetic variations that may not be represented in the selected database. We recently developed software called multistrain mass spectrometry prokaryotic database builder (MSMSpdbb) that can merge protein databases from several sources and be applied on any prokaryotic organism, in a proteomic-friendly approach. We generated a database for the Mycobacterium tuberculosis complex (using three strains of Mycobacterium bovis and five of M. tuberculosis), and analyzed data collected from two laboratory strains and two clinical isolates of M. tuberculosis. We identified 2561 proteins, of which 24 were present in M. tuberculosis H37Rv samples, but not annotated in the M. tuberculosis H37Rv genome. We were also able to identify 280 nonsynonymous single amino acid polymorphisms and confirm 367 translational start sites. As a proof of concept we applied the database to whole-genome DNA sequencing data of one of the clinical isolates, which allowed the validation of 116 predicted single amino acid polymorphisms and the annotation of 131 N-terminal start sites. Moreover we identified regions not present in the original M. tuberculosis H37Rv sequence, indicating strain divergence or errors in the reference sequence. In conclusion, we demonstrated the potential of using a merged database to better characterize laboratory or clinical bacterial strains.

de Souza, Gustavo A.; Arntzen, Magnus Ø.; Fortuin, Suereta; Schurch, Anita C.; Malen, Hiwa; McEvoy, Christopher R. E.; van Soolingen, Dick; Thiede, Bernd; Warren, Robin M.; Wiker, Harald G.

2011-01-01

241

Cancer markers: Integratively annotated classification.  

PubMed

Translational cancer genomics research aims to ensure that experimental knowledge is subject to computational analysis, and integrated with a variety of records from omics and clinical sources. The data retrieval from such sources is not trivial, due to their redundancy and heterogeneity, and the presence of false evidence. In silico marker identification, therefore, remains a complex task that is mainly motivated by the impact that target identification from the elucidation of gene co-expression dynamics and regulation mechanisms, combined with the discovery of genotype-phenotype associations, may have for clinical validation. Based on the reuse of publicly available gene expression data, our aim is to propose cancer marker classification by integrating the prediction power of multiple annotation sources. In particular, with reference to the functional annotation for colorectal markers, we indicate a classification of markers into diagnostic and prognostic classes combined with susceptibility and risk factors. PMID:23928109

Orsini, M; Travaglione, A; Capobianco, E

2013-08-06

242

The Institute for Genomic Research Osa1 Rice Genome Annotation Database  

PubMed Central

We have developed a rice (Oryza sativa) genome annotation database (Osa1) that provides structural and functional annotation for this emerging model species. Using the sequence of O. sativa subsp. japonica cv Nipponbare from the International Rice Genome Sequencing Project, pseudomolecules, or virtual contigs, of the 12 rice chromosomes were constructed. Our most recent release, version 3, represents our third build of the pseudomolecules and is composed of 98% finished sequence. Genes were identified using a series of computational methods developed for Arabidopsis (Arabidopsis thaliana) that were modified for use with the rice genome. In release 3 of our annotation, we identified 57,915 genes, of which 14,196 are related to transposable elements. Of these 43,719 nontransposable element-related genes, 18,545 (42.4%) were annotated with a putative function, 5,777 (13.2%) were annotated as encoding an expressed protein with no known function, and the remaining 19,397 (44.4%) were annotated as encoding a hypothetical protein. Multiple splice forms (5,873) were detected for 2,538 genes, resulting in a total of 61,250 gene models in the rice genome. We incorporated experimental evidence into 18,252 gene models to improve the quality of the structural annotation. A series of functional data types has been annotated for the rice genome that includes alignment with genetic markers, assignment of gene ontologies, identification of flanking sequence tags, alignment with homologs from related species, and syntenic mapping with other cereal species. All structural and functional annotation data are available through interactive search and display windows as well as through download of flat files. To integrate the data with other genome projects, the annotation data are available through a Distributed Annotation System and a Genome Browser. All data can be obtained through the project Web pages at http://rice.tigr.org.

Yuan, Qiaoping; Ouyang, Shu; Wang, Aihui; Zhu, Wei; Maiti, Rama; Lin, Haining; Hamilton, John; Haas, Brian; Sultana, Razvan; Cheung, Foo; Wortman, Jennifer; Buell, C. Robin

2005-01-01
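
The gene counts in the preceding abstract are internally consistent, which a few lines of arithmetic can confirm. The sketch below only re-derives figures quoted in the abstract. Reading the 5,873 splice forms as the total number of forms for the 2,538 alternatively spliced genes is an interpretation, chosen because it reproduces the reported 61,250 gene models.

```python
# Figures quoted in the Osa1 release 3 abstract.
total_genes = 57_915
te_related = 14_196
non_te = total_genes - te_related  # genes not related to transposable elements

putative, expressed, hypothetical = 18_545, 5_777, 19_397

# The three annotation classes should partition the non-TE genes.
classes_sum = putative + expressed + hypothetical

# Percentage of non-TE genes in each class, rounded to one decimal place.
shares = [round(100 * n / non_te, 1) for n in (putative, expressed, hypothetical)]

# Assumed interpretation: 5,873 splice forms across 2,538 genes add
# (5,873 - 2,538) gene models beyond one model per gene.
extra_isoforms = 5_873 - 2_538
gene_models = total_genes + extra_isoforms

print(non_te, classes_sum, shares, gene_models)
```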

243

Bovine Genome Database: integrated tools for genome annotation and discovery  

PubMed Central

The Bovine Genome Database (BGD; http://BovineGenome.org) strives to improve annotation of the bovine genome and to integrate the genome sequence with other genomics data. BGD includes GBrowse genome browsers, the Apollo Annotation Editor, a quantitative trait loci (QTL) viewer, BLAST databases and gene pages. Genome browsers, available for both scaffold and chromosome coordinate systems, display the bovine Official Gene Set (OGS), RefSeq and Ensembl gene models, non-coding RNA, repeats, pseudogenes, single-nucleotide polymorphism, markers, QTL and alignments to complementary DNAs, ESTs and protein homologs. The Bovine QTL viewer is connected to the BGD Chromosome GBrowse, allowing for the identification of candidate genes underlying QTL. The Apollo Annotation Editor connects directly to the BGD Chado database to provide researchers with remote access to gene evidence in a graphical interface that allows editing and creating new gene models. Researchers may upload their annotations to the BGD server for review and integration into the subsequent release of the OGS. Gene pages display information for individual OGS gene models, including gene structure, transcript variants, functional descriptions, gene symbols, Gene Ontology terms, annotator comments and links to National Center for Biotechnology Information and Ensembl. Each gene page is linked to a wiki page to allow input from the research community.

Childers, Christopher P.; Reese, Justin T.; Sundaram, Jaideep P.; Vile, Donald C.; Dickens, C. Michael; Childs, Kevin L.; Salih, Hanni; Bennett, Anna K.; Hagen, Darren E.; Adelson, David L.; Elsik, Christine G.

2011-01-01

244

A semi-supervised learning approach to predict synthetic genetic interactions by combining functional and topological properties of functional gene network  

Microsoft Academic Search

BACKGROUND: Genetic interaction profiles are highly informative and helpful for understanding the functional linkages between genes, and therefore have been extensively exploited for annotating gene functions and dissecting specific pathway structures. However, our understanding is rather limited to the relationship between double concurrent perturbation and various higher level phenotypic changes, e.g. those in cells, tissues or organs. Modifier screens, such

Zhu-Hong You; Zheng Yin; Kyungsook Han; De-Shuang Huang; Xiaobo Zhou

2010-01-01

245

The UniProt-GO Annotation database in 2011  

PubMed Central

The GO annotation dataset provided by the UniProt Consortium (GOA: http://www.ebi.ac.uk/GOA) is a comprehensive set of evidence-based associations between terms from the Gene Ontology resource and UniProtKB proteins. Currently supplying over 100 million annotations to 11 million proteins in more than 360,000 taxa, this resource has increased 2-fold over the last 2 years and has benefited from a wealth of checks to improve annotation correctness and consistency as well as now supplying a greater information content enabled by GO Consortium annotation format developments. Detailed, manual GO annotations obtained from the curation of peer-reviewed papers are directly contributed by all UniProt curators and supplemented with manual and electronic annotations from 36 model organism and domain-focused scientific resources. The inclusion of high-quality, automatic annotation predictions ensures the UniProt GO annotation dataset supplies functional information to a wide range of proteins, including those from poorly characterized, non-model organism species. UniProt GO annotations are freely available in a range of formats accessible by both file downloads and web-based views. In addition, the introduction of a new, normalized file format in 2010 has made for easier handling of the complete UniProt-GOA data set.

Dimmer, Emily C.; Huntley, Rachael P.; Alam-Faruque, Yasmin; Sawford, Tony; O'Donovan, Claire; Martin, Maria J.; Bely, Benoit; Browne, Paul; Mun Chan, Wei; Eberhardt, Ruth; Gardner, Michael; Laiho, Kati; Legge, Duncan; Magrane, Michele; Pichler, Klemens; Poggioli, Diego; Sehra, Harminder; Auchincloss, Andrea; Axelsen, Kristian; Blatter, Marie-Claude; Boutet, Emmanuel; Braconi-Quintaje, Silvia; Breuza, Lionel; Bridge, Alan; Coudert, Elizabeth; Estreicher, Anne; Famiglietti, Livia; Ferro-Rojas, Serenella; Feuermann, Marc; Gos, Arnaud; Gruaz-Gumowski, Nadine; Hinz, Ursula; Hulo, Chantal; James, Janet; Jimenez, Silvia; Jungo, Florence; Keller, Guillaume; Lemercier, Phillippe; Lieberherr, Damien; Masson, Patrick; Moinat, Madelaine; Pedruzzi, Ivo; Poux, Sylvain; Rivoire, Catherine; Roechert, Bernd; Schneider, Michael; Stutz, Andre; Sundaram, Shyamala; Tognolli, Michael; Bougueleret, Lydie; Argoud-Puy, Ghislaine; Cusin, Isabelle; Duek-Roggli, Paula; Xenarios, Ioannis; Apweiler, Rolf

2012-01-01

246

Inferring gene functions through dissection of relevance networks: interleaving the intra- and inter-species views.  

PubMed

Inference of accurate gene annotations requires integration of existing biological knowledge, structured in a form of ontology, with data from transcriptomics high-throughput technologies. This undertaking requires developing algorithms that integrate genome-scale data, even for model organisms. Gene relevance networks have emerged as a powerful representative of the structure of the data. Such networks can be used for intra-species transfer of gene annotations following the guilt-by-association principle. An analogous principle can serve as a basis for inter-species transfer of gene annotations by comparing well-defined subnetworks. In this review, we compare and contrast the concepts of relevance and proximity networks and briefly review the concept of semantic similarity. We then provide a detailed account of quantitative guilt-by-association inference in the setting of genome-scale relevance networks. Moreover, we systematically survey the existing network-based approaches for automated gene function annotation and categorize them under one umbrella in terms of employed methodology. Furthermore, we discuss suitable data selection strategies required for deriving meaningful and unbiased genome-scale networks from large transcriptomics compendia. Lastly, by simulating gene function prediction with a classical network-based algorithm, we show how the number of genes of unknown function influences prediction within a species and pinpoint the need and the requirements for inter-species knowledge transfer. PMID:22744313

Klie, Sebastian; Mutwil, Marek; Persson, Staffan; Nikoloski, Zoran

2012-06-29
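
The guilt-by-association principle discussed in the preceding abstract assigns a function to an uncharacterized gene from the annotations of its network neighbours. The following is a minimal sketch using a weighted neighbour vote. It is one simple variant and not the specific algorithm simulated in the review. The `predict_function` helper, the gene names, the edge weights, and the annotation terms are all hypothetical.

```python
from collections import Counter


def predict_function(gene, network, annotations):
    """Weighted vote over the annotated neighbours of `gene`.

    network: dict mapping gene -> {neighbour: edge weight}.
    annotations: dict mapping gene -> set of functional terms.
    Returns the top-scoring term, or None if no neighbour is annotated.
    """
    votes = Counter()
    for neighbour, weight in network.get(gene, {}).items():
        for term in annotations.get(neighbour, ()):
            votes[term] += weight
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Hypothetical relevance network and known annotations.
network = {
    "geneX": {"geneA": 0.9, "geneB": 0.8, "geneC": 0.3},
    "geneY": {"geneD": 0.5},
}
annotations = {
    "geneA": {"cell wall biogenesis"},
    "geneB": {"cell wall biogenesis"},
    "geneC": {"photosynthesis"},
}

print(predict_function("geneX", network, annotations))
print(predict_function("geneY", network, annotations))
```

The second call returns `None` because the only neighbour of `geneY` is itself unannotated. That is the situation the abstract highlights: as the share of genes of unknown function grows, within-species prediction has less to work with.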

247

Mining functionally relevant gene sets for analyzing physiologically novel clinical expression data.  

PubMed

Gene set analyses have become a standard approach for increasing the sensitivity of transcriptomic studies. However, analytical methods incorporating gene sets require the availability of pre-defined gene sets relevant to the underlying physiology being studied. For novel physiological problems, relevant gene sets may be unavailable or existing gene set databases may bias the results towards only the best-studied of the relevant biological processes. We describe a successful attempt to mine novel functional gene sets for translational projects where the underlying physiology is not necessarily well characterized in existing annotation databases. We choose targeted training data from public expression data repositories and define new criteria for selecting biclusters to serve as candidate gene sets. Many of the discovered gene sets show little or no enrichment for informative Gene Ontology terms or other functional annotation. However, we observe that such gene sets show coherent differential expression in new clinical test data sets, even if derived from different species, tissues, and disease states. We demonstrate the efficacy of this method on a human metabolic data set, where we discover novel, uncharacterized gene sets that are diagnostic of diabetes, and on additional data sets related to neuronal processes and human development. Our results suggest that our approach may be an efficient way to generate a collection of gene sets relevant to the analysis of data for novel clinical applications where existing functional annotation is relatively incomplete. PMID:21121032

Turcan, Sevin; Vetter, Douglas E; Maron, Jill L; Wei, Xintao; Slonim, Donna K

2011-01-01

248

Evolution and Biochemistry of Family 4 Glycosidases: Implications for Assigning Enzyme Function in Sequence Annotations  

PubMed Central

Glycosyl hydrolase Family 4 (GH4) is exceptional among the 114 families in this enzyme superfamily. Members of GH4 exhibit unusual cofactor requirements for activity, and an essential cysteine residue is present at the active site. Of greatest significance is the fact that members of GH4 employ a unique catalytic mechanism for cleavage of the glycosidic bond. By phylogenetic analysis, and from available substrate specificities, we have assigned a majority of the enzymes of GH4 to five subgroups. Our classification revealed an unexpected relationship between substrate specificity and the presence, in each subgroup, of a motif of four amino acids that includes the active-site Cys residue: α-glucosidase, CHE(I/V); α-galactosidase, CHSV; α-glucuronidase, CHGx; 6-phospho-α-glucosidase, CDMP; and 6-phospho-β-glucosidase, CN(V/I)P. The question arises: Does the presence of a particular motif sufficiently predict the catalytic function of an unassigned GH4 protein? To test this hypothesis, we have purified and characterized the α-glucoside-specific GH4 enzyme (PalH) from the phytopathogen, Erwinia rhapontici. The CHEI motif in this protein has been changed by site-directed mutagenesis, and the effects upon substrate specificity have been determined. The change to CHSV caused the loss of all α-glucosidase activity, but the mutant protein exhibited none of the anticipated α-galactosidase activity. The Cys-containing motif may be suggestive of enzyme specificity, but phylogenetic placement is required for confidence in that specificity. The Acholeplasma laidlawii GH4 protein is phylogenetically a phospho-β-glucosidase but has a unique SSSP motif. Lacking the initial Cys in that motif, it cannot hydrolyze glycosides by the normal GH4 mechanism because the Cys is required to position the metal ion for hydrolysis, nor can it use the more common single or double-displacement mechanism of Koshland. Several considerations suggest that the protein has acquired a new function as the consequence of positive selection. This study emphasizes the importance of automatic annotation systems that, by integrating phylogenetic analysis, functional motifs, and bioinformatics data, may lead to innovative experiments that further our understanding of biological systems.

Pikis, Andreas; Thompson, John

2009-01-01
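
The four-residue, Cys-containing motifs listed in the preceding abstract can be scanned for with regular expressions. The sketch below is illustrative only. The `find_gh4_motifs` helper and the toy sequences are hypothetical, "x" in CHGx is assumed to mean any residue, and the scan covers the whole sequence rather than only the active-site region. As the abstract stresses, a motif hit is suggestive of specificity but does not replace phylogenetic placement.

```python
import re

# Motif names as written in the abstract, mapped to regular expressions.
# Assumption: "x" in CHGx stands for any single residue.
MOTIFS = {
    "CHE(I/V)": re.compile(r"CHE[IV]"),
    "CHSV": re.compile(r"CHSV"),
    "CHGx": re.compile(r"CHG[A-Z]"),
    "CDMP": re.compile(r"CDMP"),
    "CN(V/I)P": re.compile(r"CN[VI]P"),
}


def find_gh4_motifs(sequence):
    """Return the names of all listed motifs found anywhere in `sequence`."""
    return [name for name, pattern in MOTIFS.items() if pattern.search(sequence)]


# Hypothetical toy sequences, not real GH4 proteins.
print(find_gh4_motifs("MKAACHEIGG"))  # contains CHEI
print(find_gh4_motifs("MKSSSPLL"))    # SSSP-like sequence: no listed motif
```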

249

The RAST server: rapid annotations using subsystems technology.  

SciTech Connect

The number of prokaryotic genome sequences becoming available is growing steadily and is growing faster than our ability to accurately annotate them. We describe a fully automated service for annotating bacterial and archaeal genomes. The service identifies protein-encoding, rRNA and tRNA genes, assigns functions to the genes, predicts which subsystems are represented in the genome, uses this information to reconstruct the metabolic network and makes the output easily downloadable for the user. In addition, the annotated genome can be browsed in an environment that supports comparative analysis with the annotated genomes maintained in the SEED environment. The service normally makes the annotated genome available within 12-24 hours of submission, but ultimately the quality of such a service will be judged in terms of accuracy, consistency, and completeness of the produced annotations. We summarize our attempts to address these issues and discuss plans for incrementally enhancing the service. By providing accurate, rapid annotation freely to the community we have created an important community resource. The service has now been utilized by over 120 external users annotating over 350 distinct genomes.

Aziz, R. K.; Bartels, D.; Best, A. A.; DeJongh, M.; Disz, T.; Edwards, R. A.; Formsma, K.; Gerdes, S.; Glass, E. M.; Kubal, M.; Meyer, F.; Olsen, G. J.; Olson, R.; Osterman, A. L.; Overbeek, R. A.; McNeil, L. K.; Paarmann, D.; Paczian, T.; Parrello, B.; Pusch, G. D.; Reich, C.; Stevens, R.; Vassieva, O.; Vonstein, V.; Wilke, A.; Zagnitko, O.; Mathematics and Computer Science; Fellowship for Interpretation of Genomes; Univ. of Chicago; Univ. of Illinois; The Burnham Inst.; Hope Coll.; Univ. of Tenn.; Cairo Univ.

2008-02-08

250

Genomic Sequence around Butterfly Wing Development Genes: Annotation and Comparative Analysis  

PubMed Central

Background Analysis of genomic sequence allows characterization of genome content and organization, and access beyond gene-coding regions for identification of functional elements. BAC libraries, where relatively large genomic regions are made readily available, are especially useful for species without a fully sequenced genome and can increase genomic coverage of phylogenetic and biological diversity. For example, no butterfly genome is yet available despite the unique genetic and biological properties of this group, such as diversified wing color patterns. The evolution and development of these patterns is being studied in a few target species, including Bicyclus anynana, where a whole-genome BAC library allows targeted access to large genomic regions. Methodology/Principal Findings We characterize ~1.3 Mb of genomic sequence around 11 selected genes expressed in B. anynana developing wings. Extensive manual curation of in silico predictions, also making use of a large dataset of expressed genes for this species, identified repetitive elements and protein coding sequence, and highlighted an expansion of Alcohol dehydrogenase genes. Comparative analysis with orthologous regions of the lepidopteran reference genome allowed assessment of conservation of fine-scale synteny (with detection of new inversions and translocations) and of DNA sequence (with detection of high levels of conservation of non-coding regions around some, but not all, developmental genes). Conclusions The general properties and organization of the available B. anynana genomic sequence are similar to the lepidopteran reference, despite the more than 140 MY divergence. Our results lay the groundwork for further studies of new interesting findings in relation to both coding and non-coding sequence: 1) the Alcohol dehydrogenase expansion with higher similarity between the five tandemly-repeated B. anynana paralogs than with the corresponding B. mori orthologs, and 2) the high conservation of non-coding sequence around the genes wingless and Ecdysone receptor, both involved in multiple developmental processes including wing pattern formation.

Conceicao, Ines C.; Long, Anthony D.; Gruber, Jonathan D.; Beldade, Patricia

2011-01-01

251

Work and Family Functioning: An Annotated Bibliography Selected from Family Database.  

ERIC Educational Resources Information Center

This annotated bibliography lists works published in Australia on issues regarding work obligations and family responsibilities. All works cited are included in Australia's FAMILY database. The following topics are covered: (1) adolescents and attitudes to employment (14 citations); (2) the aged and employment (20 citations); (3) career…

Davis, Mari, Comp.

252

Re-annotating the Mycoplasma pneumoniae genome sequence: adding value, function and reading frames  

Microsoft Academic Search

Four years after the original sequence submission, we have re-annotated the genome of Mycoplasma pneumoniae to incorporate novel data. The total number of ORFs has been increased from 677 to 688 (10 new proteins were predicted in intergenic regions, two further were newly identified by mass spectrometry and one protein ORF was dismissed) and the number of RNAs from 39

Thomas Dandekar; Martijn Huynen; Jörg Thomas Regula; Barbara Ueberle; Carl Ulrich Zimmermann; Miguel A. Andrade; Tobias Doerks; Luis Sánchez-Pulido; Berend Snel; Mikita Suyama; P. Yuan; Richard Herrmann; Peer Bork

2000-01-01

253

Linking enzyme sequence to function using conserved property difference locator to identify and annotate positions likely to control specific functionality  

PubMed Central

Background Families of homologous enzymes evolved from common progenitors. The availability of multiple sequences representing each activity presents an opportunity for extracting information specifying the functionality of individual homologs. We present a straightforward method for the identification of residues likely to determine class specific functionality in which multiple sequence alignments are converted to an annotated graphical form by the Conserved Property Difference Locator (CPDL) program. Results Three test cases, each comprised of two groups of functionally distinct homologs, are presented. Of the test cases, one is a membrane and two are soluble enzyme families. The desaturase/hydroxylase data was used to design and test the CPDL algorithm because a comparative sequence approach had been successfully applied to manipulate the specificity of these enzymes. The other two cases, ATP/GTP cyclases, and MurD/MurE synthases were chosen because they are well characterized structurally and biochemically. For the desaturase/hydroxylase enzymes, the ATP/GTP cyclases and the MurD/MurE synthases, groups of 8 (of ~400), 4 (of ~150) and 10 (of >400) residues, respectively, of interest were identified that contain empirically defined specificity determining positions. Conclusion CPDL consistently identifies positions near enzyme active sites that include those predicted from structural and/or biochemical studies to be important for specificity and/or function. This suggests that CPDL will have broad utility for the identification of potential class determining residues based on multiple sequence analysis of groups of homologous proteins. Because the method is sequence-based rather than structure-based, it is equally well suited for designing structure-function experiments to investigate membrane and soluble proteins.
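The idea of locating positions that are conserved within each group of homologs but differ between the two groups can be illustrated with a minimal sketch. This is not the CPDL program itself: CPDL works on conserved residue properties, whereas the sketch below uses plain residue identity, and the function names, the conservation threshold and the toy alignment are all assumptions made for illustration.

```python
from collections import Counter

def conserved_residue(column, min_fraction=0.8):
    """Return the dominant residue of an alignment column if it reaches
    min_fraction of the sequences (gaps never count), else None."""
    residue, count = Counter(column).most_common(1)[0]
    if residue != "-" and count / len(column) >= min_fraction:
        return residue
    return None

def class_specific_positions(group_a, group_b, min_fraction=0.8):
    """List (1-based position, residue in A, residue in B) for columns that are
    conserved inside each group but carry different residues in the two groups."""
    hits = []
    for i in range(len(group_a[0])):
        res_a = conserved_residue([seq[i] for seq in group_a], min_fraction)
        res_b = conserved_residue([seq[i] for seq in group_b], min_fraction)
        if res_a and res_b and res_a != res_b:
            hits.append((i + 1, res_a, res_b))
    return hits

# Invented, pre-aligned toy sequences for two hypothetical functional classes.
group_a = ["MKTAHLV", "MKTAHLV", "MKSAHLV"]
group_b = ["MKTGHIV", "MKTGHIV", "MRTGHIV"]
positions = class_specific_positions(group_a, group_b, min_fraction=0.6)
print(positions)
```

Columns 4 and 6 are reported because each group is internally conserved there while the groups disagree; columns with within-group variation but the same dominant residue are ignored.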

Mayer, Kimberly M; McCorkle, Sean R; Shanklin, John

2005-01-01

254

Comprehensive functional annotation of seventy-one breast cancer risk Loci.  

PubMed

Breast Cancer (BCa) genome-wide association studies revealed allelic frequency differences between cases and controls at index single nucleotide polymorphisms (SNPs). To date, 71 loci have thus been identified and replicated. More than 320,000 SNPs at these loci define BCa risk due to linkage disequilibrium (LD). We propose that BCa risk resides in a subgroup of SNPs that functionally affects breast biology. Such a shortlist will aid in framing hypotheses to prioritize a manageable number of likely disease-causing SNPs. We extracted all the SNPs, residing in 1 Mb windows around breast cancer risk index SNP from the 1000 genomes project to find correlated SNPs. We used FunciSNP, an R/Bioconductor package developed in-house, to identify potentially functional SNPs at 71 risk loci by coinciding them with chromatin biofeatures. We identified 1,005 SNPs in LD with the index SNPs (r² ≥ 0.5) in three categories: 21 in exons of 18 genes, 76 in transcription start site (TSS) regions of 25 genes, and 921 in enhancers. Thirteen SNPs were found in more than one category. We found two correlated and predicted non-benign coding variants (rs8100241 in exon 2 and rs8108174 in exon 3) of the gene, ANKLE1. Most putative functional LD SNPs, however, were found in either epigenetically defined enhancers or in gene TSS regions. Fifty-five percent of these non-coding SNPs are likely functional, since they affect response element (RE) sequences of transcription factors. Functionality of these SNPs was assessed by expression quantitative trait loci (eQTL) analysis and allele-specific enhancer assays. Unbiased analyses of SNPs at BCa risk loci revealed new and overlooked mechanisms that may affect risk of the disease, thereby providing a valuable resource for follow-up studies. PMID:23717510
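The category counts in this abstract can be reconciled with the reported total by a short worked check. The sketch assumes that each of the thirteen multi-category SNPs falls in exactly two categories, which the abstract does not state; the variable names are illustrative only.

```python
# Per-category counts as reported in the abstract.
exon_snps = 21
tss_snps = 76
enhancer_snps = 921
snps_in_multiple_categories = 13
reported_total = 1005

# Summing the categories counts every shared SNP once per category it appears in.
category_sum = exon_snps + tss_snps + enhancer_snps

# Assumption: each shared SNP lies in exactly two categories, so it was
# counted one time too many and is subtracted once.
unique_snps = category_sum - snps_in_multiple_categories

print(category_sum, unique_snps, unique_snps == reported_total)
```

Under that assumption the three categories sum to 1,018 and removing the 13 double-counted SNPs gives the 1,005 unique SNPs quoted in the text.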

Rhie, Suhn Kyong; Coetzee, Simon G; Noushmehr, Houtan; Yan, Chunli; Kim, Jae Mun; Haiman, Christopher A; Coetzee, Gerhard A

2013-05-22

255

Comprehensive Functional Annotation of Seventy-One Breast Cancer Risk Loci  

PubMed Central

Breast Cancer (BCa) genome-wide association studies revealed allelic frequency differences between cases and controls at index single nucleotide polymorphisms (SNPs). To date, 71 loci have thus been identified and replicated. More than 320,000 SNPs at these loci define BCa risk due to linkage disequilibrium (LD). We propose that BCa risk resides in a subgroup of SNPs that functionally affects breast biology. Such a shortlist will aid in framing hypotheses to prioritize a manageable number of likely disease-causing SNPs. We extracted all the SNPs, residing in 1 Mb windows around breast cancer risk index SNP from the 1000 genomes project to find correlated SNPs. We used FunciSNP, an R/Bioconductor package developed in-house, to identify potentially functional SNPs at 71 risk loci by coinciding them with chromatin biofeatures. We identified 1,005 SNPs in LD with the index SNPs (r² ≥ 0.5) in three categories: 21 in exons of 18 genes, 76 in transcription start site (TSS) regions of 25 genes, and 921 in enhancers. Thirteen SNPs were found in more than one category. We found two correlated and predicted non-benign coding variants (rs8100241 in exon 2 and rs8108174 in exon 3) of the gene, ANKLE1. Most putative functional LD SNPs, however, were found in either epigenetically defined enhancers or in gene TSS regions. Fifty-five percent of these non-coding SNPs are likely functional, since they affect response element (RE) sequences of transcription factors. Functionality of these SNPs was assessed by expression quantitative trait loci (eQTL) analysis and allele-specific enhancer assays. Unbiased analyses of SNPs at BCa risk loci revealed new and overlooked mechanisms that may affect risk of the disease, thereby providing a valuable resource for follow-up studies.

Rhie, Suhn Kyong; Coetzee, Simon G.; Noushmehr, Houtan; Yan, Chunli; Kim, Jae Mun; Haiman, Christopher A.; Coetzee, Gerhard A.

2013-01-01

256

PHYLOGENOMICS - GUIDED VALIDATION OF FUNCTION FOR CONSERVED UNKNOWN GENES  

SciTech Connect

Identifying functions for all gene products in all sequenced organisms is a central challenge of the post-genomic era. However, at least 30-50% of the proteins encoded by any given genome are of unknown function, or wrongly or vaguely annotated. Many of these 'unknown' proteins are common to prokaryotes and plants. We accordingly set out to predict and experimentally test the functions of such proteins. Our approach to functional prediction is integrative, coupling the extensive post-genomic resources available for plants with comparative genomics based on hundreds of microbial genomes, and functional genomic datasets from model microorganisms. The early phase is computer-assisted; later phases incorporate intellectual input from expert plant and microbial biochemists. The approach thus bridges the gap between automated homology-based annotations and the classical gene discovery efforts of experimentalists, and is much more powerful than purely computational approaches to identifying gene-function associations. Among Arabidopsis genes, we focused on those (2,325 in total) that (i) are unique or belong to families with no more than three members, (ii) are conserved between plants and prokaryotes, and (iii) have unknown or poorly known functions. Computer-assisted selection of promising targets for deeper analysis was based on homology-independent characteristics associated in the SEED database with the prokaryotic members of each family, specifically gene clustering and phyletic spread, as well as availability of functional genomics data, and publications that could link candidate families to general metabolic areas, or to specific functions. In-depth comparative genomic analysis was then performed for about 500 top candidate families, which connected ~55 of them to general areas of metabolism and led to specific functional predictions for a subset of ~25 more. Twenty predicted functions were experimentally tested in at least one prokaryotic organism via reverse genetics, metabolic profiling, functional complementation, and recombinant protein biochemistry. Our approach predicted and validated functions for 10 formerly uncharacterized protein families common to plants and prokaryotes; none of these functions had previously been correctly predicted by computational methods. The functions of five more are currently being validated. Experimental testing of diverse representatives of these families combined with in silico analysis allowed accurate projection of the annotations to hundreds more sequenced genomes.

De Crecy-Lagard, V; Hanson, A D

2012-01-03

257

Characterization of transcriptome dynamics during watermelon fruit development: sequencing, assembly, annotation and gene expression profiles  

PubMed Central

Background Cultivated watermelon [Citrullus lanatus (Thunb.) Matsum. & Nakai var. lanatus] is an important agriculture crop world-wide. The fruit of watermelon undergoes distinct stages of development with dramatic changes in its size, color, sweetness, texture and aroma. In order to better understand the genetic and molecular basis of these changes and significantly expand the watermelon transcript catalog, we have selected four critical stages of watermelon fruit development and used Roche/454 next-generation sequencing technology to generate a large expressed sequence tag (EST) dataset and a comprehensive transcriptome profile for watermelon fruit flesh tissues. Results We performed half Roche/454 GS-FLX run for each of the four watermelon fruit developmental stages (immature white, white-pink flesh, red flesh and over-ripe) and obtained 577,023 high quality ESTs with an average length of 302.8 bp. De novo assembly of these ESTs together with 11,786 watermelon ESTs collected from GenBank produced 75,068 unigenes with a total length of approximately 31.8 Mb. Overall 54.9% of the unigenes showed significant similarities to known sequences in GenBank non-redundant (nr) protein database and around two-thirds of them matched proteins of cucumber, the most closely-related species with a sequenced genome. The unigenes were further assigned with gene ontology (GO) terms and mapped to biochemical pathways. More than 5,000 SSRs were identified from the EST collection. Furthermore we carried out digital gene expression analysis of these ESTs and identified 3,023 genes that were differentially expressed during watermelon fruit development and ripening, which provided novel insights into watermelon fruit biology and a comprehensive resource of candidate genes for future functional analysis. We then generated profiles of several interesting metabolites that are important to fruit quality including pigmentation and sweetness. 
Integrative analysis of metabolite and digital gene expression profiles helped elucidate molecular mechanisms governing these important quality-related traits during watermelon fruit development. Conclusion We have generated a large collection of watermelon ESTs, which represents a significant expansion of the current transcript catalog of watermelon and a valuable resource for future studies on the genomics of watermelon and other closely-related species. Digital expression analysis of this EST collection allowed us to identify a large set of genes that were differentially expressed during watermelon fruit development and ripening, which provide a rich source of candidates for future functional analysis and represent a valuable increase in our knowledge base of watermelon fruit biology.

2011-01-01

258

Information-based methods for predicting gene function from systematic gene knock-downs  

PubMed Central

Background The rapid annotation of genes on a genome-wide scale is now possible for several organisms using high-throughput RNA interference assays to knock down the expression of a specific gene. To date, dozens of RNA interference phenotypes have been recorded for the nematode Caenorhabditis elegans. Although previous studies have demonstrated the merit of using knock-down phenotypes to predict gene function, it is unclear how the data can be used most effectively. An open question is how to optimally make use of phenotypic observations, possibly in combination with other functional genomics datasets, to identify genes that share a common role. Results We compared several methods for detecting gene-gene functional similarity from phenotypic knock-down profiles. We found that information-based measures, which explicitly incorporate a phenotype's genomic frequency when calculating gene-gene similarity, outperform non-information-based methods. We report the presence of newly predicted modules identified from an integrated functional network containing phenotypic congruency links derived from an information-based measure. One such module is a set of genes predicted to play a role in regulating body morphology based on their multiply-supported interactions with members of the TGF-β signaling pathway. Conclusion Information-based metrics significantly improve the comparison of phenotypic knock-down profiles, based upon their ability to enhance gene function prediction and identify novel functional modules.
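To illustrate what it means for a similarity measure to incorporate a phenotype's genomic frequency, the sketch below weights each shared phenotype by its information content, -log2 of the fraction of genes showing it, and compares the result with an unweighted Jaccard index. The specific formula is an assumption chosen for illustration and is not necessarily one of the measures evaluated in the study; the gene and phenotype names are invented.

```python
import math

def information_content(profiles):
    """Map each phenotype to -log2(fraction of genes that show it)."""
    n_genes = len(profiles)
    counts = {}
    for phenotypes in profiles.values():
        for phenotype in phenotypes:
            counts[phenotype] = counts.get(phenotype, 0) + 1
    return {p: -math.log2(c / n_genes) for p, c in counts.items()}

def weighted_similarity(a, b, ic):
    """Information-weighted Jaccard: rare shared phenotypes count for more."""
    shared = sum(ic[p] for p in a & b)
    union = sum(ic[p] for p in a | b)
    return shared / union if union else 0.0

def jaccard(a, b):
    """Unweighted baseline that ignores how common each phenotype is."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Invented knock-down profiles: gene -> set of observed phenotypes.
profiles = {
    "gene1": {"lethal", "small_body"},
    "gene2": {"lethal", "small_body"},
    "gene3": {"lethal"},
    "gene4": {"lethal", "slow_growth"},
}
ic = information_content(profiles)

print(jaccard(profiles["gene1"], profiles["gene3"]))                  # common phenotype still scores
print(weighted_similarity(profiles["gene1"], profiles["gene3"], ic))  # ubiquitous phenotype adds nothing
print(weighted_similarity(profiles["gene1"], profiles["gene2"], ic))  # shared rare phenotype dominates
```

In this toy data every gene is "lethal", so that phenotype carries no information and the weighted score between gene1 and gene3 drops to zero, while the unweighted index still reports 0.5.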

Weirauch, Matthew T; Wong, Christopher K; Byrne, Alexandra B; Stuart, Joshua M

2008-01-01

259

Elucidating metabolic pathways and digging for genes of unknown function in microbial communities: the riboswitch approach.  

PubMed

In the current post-genomic era, only 3% of all genes have been annotated based on experimental evidence. Even though functions can readily be predicted for many genes, 25% of these are likely to be wrong. The most widely used methods for function prediction rely on sequence similarity, which might be misleading in many cases. Other methods such as genomic context or phylogenetic profiles have been developed to increase gene annotation accuracy; nevertheless these are only efficient when complete genome sequences are available. Here we propose a new approach based on riboswitch identification. Riboswitches are highly conserved regulators of gene expression located in the 5' untranslated region of certain genes. When transcribed they adopt three-dimensional structures that recognize their ligands with great affinity and specificity. This specificity is a key issue for our method, allowing functional assignment with great accuracy. PMID:22647046

Gutiérrez-Preciado, A; Merino, E

2012-07-01

260

GeneViTo: Visualizing gene-product functional and structural features in genomic datasets  

PubMed Central

Background The availability of increasing amounts of sequence data from completely sequenced genomes boosts the development of new computational methods for automated genome annotation and comparative genomics. Therefore, there is a need for tools that facilitate the visualization of raw data and results produced by bioinformatics analysis, providing new means for interactive genome exploration. Visual inspection can be used as a basis to assess the quality of various analysis algorithms and to aid in-depth genomic studies. Results GeneViTo is a JAVA-based computer application that serves as a workbench for genome-wide analysis through visual interaction. The application deals with various experimental information concerning both DNA and protein sequences (derived from public sequence databases or proprietary data sources) and meta-data obtained by various prediction algorithms, classification schemes or user-defined features. Interaction with a Graphical User Interface (GUI) allows easy extraction of genomic and proteomic data referring to the sequence itself, sequence features, or general structural and functional features. Emphasis is laid on the potential comparison between annotation and prediction data in order to offer a supplement to the provided information, especially in cases of "poor" annotation, or an evaluation of available predictions. Moreover, desired information can be output in high quality JPEG image files for further elaboration and scientific use. A compilation of properly formatted GeneViTo input data for demonstration is available to interested readers for two completely sequenced prokaryotes, Chlamydia trachomatis and Methanococcus jannaschii. Conclusions GeneViTo offers an inspectional view of genomic functional elements, concerning data stemming both from database annotation and analysis tools for an overall analysis of existing genomes. 
The application is compatible with Linux or Windows ME-2000-XP operating systems, provided that the appropriate Java Runtime Environment is already installed in the system.

Vernikos, Georgios S; Gkogkas, Christos G; Promponas, Vasilis J; Hamodrakas, Stavros J

2003-01-01

261

Gene Ontology annotation highlights shared and divergent pathogenic strategies of type III effector proteins deployed by the plant pathogen Pseudomonas syringae pv tomato DC3000 and animal pathogenic Escherichia coli strains.  

PubMed

Genome-informed identification and characterization of Type III effector repertoires in various bacterial strains and species is revealing important insights into the critical roles that these proteins play in the pathogenic strategies of diverse bacteria. However, non-systematic discipline-specific approaches to their annotation impede analysis of the accumulating wealth of data and inhibit easy communication of findings among researchers working on different experimental systems. The development of Gene Ontology (GO) terms to capture biological processes occurring during the interaction between organisms creates a common language that facilitates cross-genome analyses. The application of these terms to annotate type III effector genes in different bacterial species - the plant pathogen Pseudomonas syringae pv tomato DC3000 and animal pathogenic strains of Escherichia coli - illustrates how GO can effectively describe fundamental similarities and differences among different gene products deployed as part of diverse pathogenic strategies. The GO annotations for the P. syringae pv tomato DC3000 effector AvrPtoB and the E. coli effector Tir are described in depth, with special emphasis given to the capability of GO for capturing information about interacting proteins and taxa. GO-highlighted similarities in biological process and molecular function for effectors from additional pathosystems are also discussed. PMID:19278552

Lindeberg, Magdalen; Biehl, Bryan S; Glasner, Jeremy D; Perna, Nicole T; Collmer, Alan; Collmer, Candace W

2009-02-19

262

Gene Ontology annotation highlights shared and divergent pathogenic strategies of type III effector proteins deployed by the plant pathogen Pseudomonas syringae pv tomato DC3000 and animal pathogenic Escherichia coli strains  

PubMed Central

Genome-informed identification and characterization of Type III effector repertoires in various bacterial strains and species is revealing important insights into the critical roles that these proteins play in the pathogenic strategies of diverse bacteria. However, non-systematic discipline-specific approaches to their annotation impede analysis of the accumulating wealth of data and inhibit easy communication of findings among researchers working on different experimental systems. The development of Gene Ontology (GO) terms to capture biological processes occurring during the interaction between organisms creates a common language that facilitates cross-genome analyses. The application of these terms to annotate type III effector genes in different bacterial species – the plant pathogen Pseudomonas syringae pv tomato DC3000 and animal pathogenic strains of Escherichia coli – illustrates how GO can effectively describe fundamental similarities and differences among different gene products deployed as part of diverse pathogenic strategies. The GO annotations for the P. syringae pv tomato DC3000 effector AvrPtoB and the E. coli effector Tir are described in depth, with special emphasis given to the capability of GO for capturing information about interacting proteins and taxa. GO-highlighted similarities in biological process and molecular function for effectors from additional pathosystems are also discussed.

Lindeberg, Magdalen; Biehl, Bryan S; Glasner, Jeremy D; Perna, Nicole T; Collmer, Alan; Collmer, Candace W

2009-01-01

263

A Framework for Annotating Human Genome in Disease Context  

PubMed Central

Identification of gene-disease associations is crucial to understanding disease mechanisms. The rapid growth of the biomedical literature, driven by advances in genome-scale technologies, makes it difficult for manually curated annotation databases to characterize gene-disease associations effectively and in a timely manner. We propose an automatic method, the Disease Ontology Annotation Framework (DOAF), to provide a comprehensive annotation of the human genome using the computable Disease Ontology (DO), the NCBO Annotator service and NCBI Gene Reference Into Function (GeneRIF). DOAF can keep the resulting knowledgebase current by periodically executing an automatic pipeline that re-annotates the human genome using the latest DO and GeneRIF releases at any chosen frequency, such as daily or monthly. Further, DOAF provides a computable and programmable environment that enables large-scale, integrative analysis by working with external analytic software or online service platforms. A user-friendly web interface (doa.nubic.northwestern.edu) is implemented to allow users to efficiently query, download, and view disease annotations and the underlying evidence.

Cheng, Wenqing; Fu, Dong; Xia, Tian; Kibbe, Warren A.; Lin, Simon M.

2012-01-01

264

Integrative Annotation of 21,037 Human Genes Validated by Full-Length cDNA Clones  

Microsoft Academic Search

The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult

Tadashi Imanishi; Takeshi Itoh; Yutaka Suzuki; Claire ODonovan; Satoshi Fukuchi; Kanako O. Koyanagi; Roberto A. Barrero; Takuro Tamura; Yumi Yamaguchi-Kabata; Motohiko Tanino; Kei Yura; Satoru Miyazaki; Kazuho Ikeo; Keiichi Homma; Arek Kasprzyk; Tetsuo Nishikawa; Mika Hirakawa; Jean Thierry-Mieg; Danielle Thierry-Mieg; Jennifer Ashurst; Libin Jia; Mitsuteru Nakao; Michael A. Thomas; Nicola Mulder; Youla Karavidopoulou; Lihua Jin; Sangsoo Kim; Tomohiro Yasuda; Boris Lenhard; Eric Eveno; Yoshiyuki Suzuki; Chisato Yamasaki; Jun-ichi Takeda; Craig Gough; Phillip Hilton; Yasuyuki Fujii; Hiroaki Sakai; Susumu Tanaka; Clara Amid; Matthew Bellgard; Maria de Fatima Bonaldo; Hidemasa Bono; Susan K. Bromberg; Anthony J. Brookes; Elspeth Bruford; Piero Carninci; Claude Chelala; Christine Couillault; Sandro J. de Souza; Marie-Anne Debily; Marie-Dominique Devignes; Inna Dubchak; Toshinori Endo; Anne Estreicher; Eduardo Eyras; Kaoru Fukami-Kobayashi; Gopal R. Gopinath; Esther Graudens; Yoonsoo Hahn; Michael Han; Ze-Guang Han; Kousuke Hanada; Hideki Hanaoka; Erimi Harada; Katsuyuki Hashimoto; Ursula Hinz; Momoki Hirai; Teruyoshi Hishiki; Ian Hopkinson; Sandrine Imbeaud; Hidetoshi Inoko; Alexander Kanapin; Yayoi Kaneko; Takeya Kasukawa; Janet Kelso; Paul Kersey; Reiko Kikuno; Kouichi Kimura; Bernhard Korn; Vladimir Kuryshev; Izabela Makalowska; Takashi Makino; Shuhei Mano; Regine Mariage-Samson; Jun Mashima; Hideo Matsuda; Hans-Werner Mewes; Shinsei Minoshima; Keiichi Nagai; Hideki Nagasaki; Naoki Nagata; Rajni Nigam; Osamu Ogasawara; Osamu Ohara; Masafumi Ohtsubo; Norihiro Okada; Toshihisa Okido; Satoshi Oota; Motonori Ota; Toshio Ota; Tetsuji Otsuki; Dominique Piatier-Tonneau; Annemarie Poustka; Shuang-Xi Ren; Naruya Saitou; Katsunaga Sakai; Shigetaka Sakamoto; Ryuichi Sakate; Ingo Schupp; Florence Servant; Stephen Sherry; Rie Shiba; Nobuyoshi Shimizu; Mary Shimoyama; Andrew J Simpson; Bento Soares; Charles Steward; Makiko Suwa; Mami Suzuki; Aiko Takahashi; Gen Tamiya; Hiroshi 
Tanaka; Todd Taylor; Joseph D Terwilliger; Per Unneberg; Vamsi Veeramachaneni; Shinya Watanabe; Laurens Wilming; Norikazu Yasuda; Hyang-Sook Yoo; Marvin Stodolsky; Wojciech Makalowski; Mitiko Go; Kenta Nakai; Toshihisa Takagi; Minoru Kanehisa; Yoshiyuki Sakaki; John Quackenbush; Yasushi Okazaki; Yoshihide Hayashizaki; Winston Hide; Ranajit Chakraborty; Ken Nishikawa; Hideaki Sugawara; Yoshio Tateno; Zhu Chen; Michio Oishi; Peter Tonellato; Rolf Apweiler; Kousaku Okubo; Lukas Wagner; Stefan Wiemann; Robert L Strausberg; Takao Isogai; Charles Auffray; Nobuo Nomura; Takashi Gojobori; Sumio Sugano

2004-01-01

265

Accumulation, functional annotation, and comparative analysis of expressed sequence tags in eggplant (Solanum melongena L.), the third pole of the genus Solanum species after tomato and potato.  

PubMed

Eggplant (Solanum melongena L.) is a widely grown vegetable crop that belongs to the genus Solanum, which comprises more than 1000 species of wide genetic and phenotypic variation. Unlike tomato and potato, Solanum crops that belong to subgenus Potatoe and have been targets for comprehensive genomic studies, eggplant is endemic to the Old World and belongs to a different subgenus, Leptostemonum, and would therefore be a unique member for comparative molecular biology in Solanum. In this study, more than 60,000 eggplant cDNA clones from various tissues and treatments were sequenced from both the 5'- and 3'-ends, and a unigene set consisting of 16,245 unique sequences was constructed. Functional annotations based on sequence similarity to known plant reference datasets revealed a distribution of functional categories similar to that of tomato, while 1316 unigenes were suggested to be eggplant-specific. Sequence-based comparative analysis using putative orthologous gene groups set up by reciprocal sequence comparison among six solanaceous species suggested that eggplant and its wild ally Solanum torvum clustered separately from subgenus Potatoe species, and that all Solanum species in turn clustered separately from the genus Capsicum. Microsatellite motif distribution differed among species and is likely to coincide with the phylogenetic relationships. Furthermore, the eggplant unigene dataset exhibited its utility in transcriptome analysis by the SAGE strategy, where a considerable number of short tag sequences of interest were successfully assigned to unigenes and their functional annotations. The eggplant ESTs and 16k unigene set developed in this study would be a useful resource not only for molecular genetics and breeding in eggplant itself, but also for expanding the scope of comparative biology in Solanum species. PMID:19857557

Fukuoka, Hiroyuki; Yamaguchi, Hirotaka; Nunome, Tsukasa; Negoro, Satomi; Miyatake, Koji; Ohyama, Akio

2010-01-15

266

Analysis and functional annotation of expressed sequence tags from the fall armyworm Spodoptera frugiperda  

PubMed Central

Background Little is known about the genome sequences of lepidopteran insects, although this group of insects has been studied extensively in the fields of endocrinology, development, immunity, and pathogen-host interactions. In addition, cell lines derived from Spodoptera frugiperda and other lepidopteran insects are routinely used for baculovirus foreign gene expression. This study reports the results of an expressed sequence tag (EST) sequencing project in cells from the lepidopteran insect S. frugiperda, the fall armyworm. Results We have constructed an EST database using two cDNA libraries from the S. frugiperda-derived cell line, SF-21. The database consists of 2,367 ESTs which were assembled into 244 contigs and 951 singlets for a total of 1,195 unique sequences. Conclusion S. frugiperda is an agriculturally important pest insect and genomic information will be instrumental for establishing initial transcriptional profiling and gene function studies, and for obtaining information about genes manipulated during infections by insect pathogens such as baculoviruses.

Deng, Youping; Dong, Yinghua; Thodima, Venkata; Clem, Rollie J; Passarelli, A Lorena

2006-01-01

267

DAVID: Database for Annotation, Visualization, and Integrated Discovery  

Microsoft Academic Search

BACKGROUND: Functional annotation of differentially expressed genes is a necessary and critical step in the analysis of microarray data. The distributed nature of biological knowledge frequently requires researchers to navigate through numerous web-accessible databases gathering information one gene at a time. A more judicious approach is to provide query-based access to an integrated database that disseminates biologically rich information across

Glynn Dennis Jr; Brad T Sherman; Douglas A Hosack; Jun Yang; Wei Gao; Richard A Lempicki

2003-01-01

268

Mining and gene ontology based annotation of SSR markers from expressed sequence tags of Humulus lupulus.  

PubMed

Humulus lupulus is commonly known as hops, a member of the family Moraceae. Many projects are currently underway that are leading to the accumulation of voluminous genomic and expressed sequence tag (EST) sequences in public databases. The genetically characterized domains in these databases are limited due to the non-availability of reliable molecular markers. A large set of EST sequences is available for hops. In the present study, simple sequence repeat (SSR) markers extracted from EST data are used as molecular markers for genetic characterization. In total, 25,495 EST sequences were examined and assembled to get full-length sequences. The maximum frequency was shown by mononucleotide SSR motifs (60.44% in contigs and 62.16% in singletons), whereas the minimum frequencies were observed for hexanucleotide SSRs in contigs (0.09%) and pentanucleotide SSRs in singletons (0.12%). The most frequent trinucleotide motif was GAA, which codes for glutamic acid, while AT/TA was the most frequent dinucleotide repeat. Flanking primer pairs were designed in silico for the SSR-containing sequences. Functional categorization of the SSR-containing sequences was done through gene ontology terms such as biological process, cellular component and molecular function. PMID:22368382
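Mining SSRs from EST sequences amounts to scanning for short motifs repeated in tandem. The following is a minimal regular-expression sketch of that step, not the pipeline used in the study: the minimum repeat counts per motif length, the function name and the test sequence are assumptions made for illustration.

```python
import re

# Assumed minimum number of tandem repeats for each motif length (1 to 6 bases).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}

def find_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return (1-based start, motif, repeat count) for every perfect SSR found."""
    sequence = sequence.upper()
    hits = []
    for unit_len, n in sorted(min_repeats.items()):
        # One captured motif followed by at least n - 1 further copies of it.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, n - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip motifs that are just a shorter motif repeated (e.g. "ATAT").
            if any(motif == motif[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            hits.append((match.start() + 1, motif, len(match.group(0)) // unit_len))
    return hits

# Invented sequence containing an (AT)7 and a (GAA)5 repeat.
est = "GGC" + "AT" * 7 + "CCG" + "GAA" * 5 + "TTC"
ssrs = find_ssrs(est)
print(ssrs)
```

Real SSR tools also handle imperfect and compound repeats and report flanking sequence for primer design, which this sketch deliberately omits.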

Singh, Swati; Gupta, Sanchita; Mani, Ashutosh; Chaturvedi, Anoop

2012-02-03

269

Mining and gene ontology based annotation of SSR markers from expressed sequence tags of Humulus lupulus  

PubMed Central

Humulus lupulus is commonly known as hops, a member of the family Moraceae. Many projects are currently underway that are leading to the accumulation of voluminous genomic and expressed sequence tag (EST) sequences in public databases. The genetically characterized domains in these databases are limited due to the non-availability of reliable molecular markers. A large set of EST sequences is available for hops. In the present study, simple sequence repeat (SSR) markers extracted from EST data are used as molecular markers for genetic characterization. In total, 25,495 EST sequences were examined and assembled to get full-length sequences. The maximum frequency was shown by mononucleotide SSR motifs (60.44% in contigs and 62.16% in singletons), whereas the minimum frequencies were observed for hexanucleotide SSRs in contigs (0.09%) and pentanucleotide SSRs in singletons (0.12%). The most frequent trinucleotide motif was GAA, which codes for glutamic acid, while AT/TA was the most frequent dinucleotide repeat. Flanking primer pairs were designed in silico for the SSR-containing sequences. Functional categorization of the SSR-containing sequences was done through gene ontology terms such as biological process, cellular component and molecular function.

Singh, Swati; Gupta, Sanchita; Mani, Ashutosh; Chaturvedi, Anoop

2012-01-01
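
The SSR mining step described in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `find_ssrs` and `is_primitive` and the minimum repeat counts in `MIN_REPEATS` are assumptions chosen for illustration, and only perfect (uninterrupted) repeats of motif length 1 to 6 are reported.

```python
import re

# Hypothetical minimum repeat counts per motif length (not taken from the study).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'AA' is not)."""
    n = len(motif)
    return not any(n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n))


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for k, n in sorted(min_repeats.items()):
        # A k-mer followed by at least n-1 further copies of itself.
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(2)
            # Skip e.g. 'AA' x 6, which is already counted as a mononucleotide run.
            if is_primitive(motif):
                hits.append((m.start(), motif, len(m.group(1)) // k))
    return sorted(hits)


example = "TTG" + "A" * 12 + "CCGG" + "AT" * 7 + "GGC" + "GAA" * 5 + "TC"
for start, motif, count in find_ssrs(example):
    print(start, motif, count)
```

Because the regular expression scans left to right without overlap, a repeat may be reported in a shifted phase (for example TA instead of AT) depending on the flanking bases; dedicated SSR tools handle such cases more carefully.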

270

Multiconstrained gene clustering based on generalized projections  

Microsoft Academic Search

Background: Gene clustering for annotating gene functions is one of the fundamental issues in bioinformatics. The best clustering solution is often regularized by multiple constraints such as gene expressions, Gene Ontology (GO) annotations and gene network structures. How to integrate multiple pieces of constraints for an optimal clustering solution still remains an unsolved problem. Results: We propose a novel multiconstrained

Jia Zeng; Shanfeng Zhu; Alan Wee-Chung Liew; Hong Yan

2010-01-01

271

Correlation between Gene Expression and GO Semantic Similarity  

Microsoft Academic Search

This research analyzes some aspects of the relationship between gene expression, gene function, and gene annotation. Many recent studies are implicitly based on the assumption that gene products that are biologically and functionally related would maintain this similarity both in their expression profiles as well as in their Gene Ontology (GO) annotation. We analyze how accurate this assumption proves to

Jose L. Sevilla; Victor Segura; Adam Podhorski; Elizabeth Guruceaga; Jose M. Mato; Luis A. Martinez-Cruz; Fernando J. Corrales; Angel Rubio

2005-01-01

272

Representative Proteomes: A Stable, Scalable and Unbiased Proteome Set for Sequence Analysis and Functional Annotation  

PubMed Central

The accelerating growth in the number of protein sequences taxes both the computational and manual resources needed to analyze them. One approach to dealing with this problem is to minimize the number of proteins subjected to such analysis in a way that minimizes loss of information. To this end we have developed a set of Representative Proteomes (RPs), each selected from a Representative Proteome Group (RPG) containing similar proteomes calculated based on co-membership in UniRef50 clusters. A Representative Proteome is the proteome that can best represent all the proteomes in its group in terms of the majority of the sequence space and information. RPs at 75%, 55%, 35% and 15% co-membership threshold (CMT) are provided to allow users to decrease or increase the granularity of the sequence space based on their requirements. We find that a CMT of 55% (RP55) most closely follows standard taxonomic classifications. Further analysis of this set reveals that sequence space is reduced by more than 80% relative to UniProtKB, while retaining both sequence diversity (over 95% of InterPro domains) and annotation information (93% of experimentally characterized proteins). All sets can be browsed and are available for sequence similarity searches and download at http://www.proteininformationresource.org/rps, while the set of 637 RPs determined using a 55% CMT are also available for text searches. Potential applications include sequence similarity searches, protein classification and targeted protein annotation and characterization.

Finn, Robert D.; Huang, Hongzhan; Zhang, Jian; Wu, Cathy H.; Mazumder, Raja

2011-01-01

273

Automatic annotation of organellar genomes with DOGMA  

SciTech Connect

Dual Organellar GenoMe Annotator (DOGMA) automates the annotation of extra-nuclear organellar (chloroplast and animal mitochondrial) genomes. It is a web-based package that allows the use of comparative BLAST searches to identify and annotate genes in a genome. DOGMA presents a list of putative genes to the user in a graphical format for viewing and editing. Annotations are stored on our password-protected server. Complete annotations can be extracted for direct submission to GenBank. Furthermore, intergenic regions of specified length can be extracted, as well as the nucleotide sequences and amino acid sequences of the genes.

Wyman, Stacia; Jansen, Robert K.; Boore, Jeffrey L.

2004-06-01
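
The extraction of intergenic regions of a specified length, mentioned in the abstract above, can be sketched as follows. This is only an illustration and not DOGMA's code: the function name `intergenic_regions`, the 0-based half-open coordinates, and the treatment of the genome as linear (organellar genomes are typically circular) are all assumptions.

```python
def intergenic_regions(genome_length, genes, min_length=1):
    """Return (start, end) gaps between annotated genes.

    genes: iterable of (start, end) intervals, 0-based and half-open.
    Only gaps of at least min_length bases are returned.
    The genome is treated as linear, which is a simplification.
    """
    regions = []
    prev_end = 0
    for start, end in sorted(genes):
        if start - prev_end >= min_length:
            regions.append((prev_end, start))
        # Overlapping genes must not move the boundary backwards.
        prev_end = max(prev_end, end)
    if genome_length - prev_end >= min_length:
        regions.append((prev_end, genome_length))
    return regions


genes = [(10, 30), (25, 40), (60, 90)]
print(intergenic_regions(100, genes, min_length=15))
```

The sequences of the returned intervals could then be sliced from the genome string, for example `genome[start:end]`.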

274

IMG ER: A System for Microbial Genome Annotation Expert Review and Curation  

SciTech Connect

A rapidly increasing number of microbial genomes are sequenced by organizations worldwide and are eventually included into various public genome data resources. The quality of the annotations depends largely on the original dataset providers, with erroneous or incomplete annotations often carried over into the public resources and difficult to correct. We have developed an Expert Review (ER) version of the Integrated Microbial Genomes (IMG) system, with the goal of supporting systematic and efficient revision of microbial genome annotations. IMG ER provides tools for the review and curation of annotations of both new and publicly available microbial genomes within IMG's rich integrated genome framework. New genome datasets are included into IMG ER prior to their public release either with their native annotations or with annotations generated by IMG ER's annotation pipeline. IMG ER tools allow addressing annotation problems detected with IMG's comparative analysis tools, such as genes missed by gene prediction pipelines or genes without an associated function. Over the past year, IMG ER was used for improving the annotations of about 150 microbial genomes.

Markowitz, Victor M.; Mavromatis, Konstantinos; Ivanova, Natalia N.; Chen, I-Min A.; Chu, Ken; Kyrpides, Nikos C.

2009-05-25

275

Predictive screening for regulators of conserved functional gene modules (gene batteries) in mammals  

PubMed Central

Background The expression of gene batteries, genomic units of functionally linked genes which are activated by similar sets of cis- and trans-acting regulators, has been proposed as a major determinant of cell specialization in metazoans. We developed a predictive procedure to screen the mouse and human genomes and transcriptomes for cases of gene-battery-like regulation. Results In a screen that covered ~40 per cent of all annotated protein-coding genes, we identified 21 co-expressed gene clusters with statistically supported sharing of cis-regulatory sequence elements. 66 predicted cases of over-represented transcription factor binding motifs were validated against the literature and fell into three categories: (i) previously described cases of gene battery-like regulation, (ii) previously unreported cases of gene battery-like regulation with some support in a limited number of genes, and (iii) predicted cases that currently lack experimental support. The novel predictions include for example Sox 17 and RFX transcription factor binding sites that were detected in ~10% of all testis specific genes, and HNF-1 and 4 binding sites that were detected in ~30% of all kidney specific genes respectively. The results are publicly available at . Conclusion 21 co-expressed gene clusters were enriched for a total of 66 shared cis-regulatory sequence elements. A majority of these predictions represent novel cases of potential co-regulation of functionally coupled proteins. Critical technical parameters were evaluated, and the results and the methods provide a valuable resource for future experimental design.

Nelander, Sven; Larsson, Erik; Kristiansson, Erik; Mansson, Robert; Nerman, Olle; Sigvardsson, Mikael; Mostad, Petter; Lindahl, Per

2005-01-01

276

The meaning of it all: web-based resources for large-scale functional annotation and visualization of DNA microarray data.  

PubMed

The vast amount of unstructured data emerging from the various genome projects has led to the development of a number of web-based tools designed to annotate genes with biological information. Here we discuss a selection of these tools with regards to their scope, limitations and ease of use. PMID:23573656

Guffanti, Alessandro; Reid, James F; Alcalay, Myriam; Simon, Gyorgy

2004-12-15

277

The meaning of it all: web-based resources for large-scale functional annotation and visualization of DNA microarray data.  

PubMed

The vast amount of unstructured data emerging from the various genome projects has led to the development of a number of web-based tools designed to annotate genes with biological information. Here we discuss a selection of these tools with regards to their scope, limitations and ease of use. PMID:12414191

Guffanti, Alessandro; Reid, James F; Alcalay, Myriam; Simon, Gyorgy

2002-11-01

278

Automatic annotation of organellar genomes with DOGMA.  

PubMed

The Dual Organellar GenoMe Annotator (DOGMA) automates the annotation of organellar (plant chloroplast and animal mitochondrial) genomes. It is a Web-based package that allows the use of BLAST searches against a custom database, and conservation of basepairing in the secondary structure of animal mitochondrial tRNAs to identify and annotate genes. DOGMA provides a graphical user interface for viewing and editing annotations. Annotations are stored on our password-protected server to enable repeated sessions of working on the same genome. Finished annotations can be extracted for direct submission to GenBank. PMID:15180927

Wyman, Stacia K; Jansen, Robert K; Boore, Jeffrey L

2004-06-04

279

Annotated embryonic CNS expression patterns of 5000 GMR GAL4 lines: a resource for manipulating gene expression and analyzing cis-regulatory modules  

PubMed Central

Here we describe the embryonic CNS expression of 5,000 GAL4 lines made using molecularly defined cis-regulatory DNA inserted into a single attP genomic location. We document and annotate the patterns in early embryos when neurogenesis is at its peak, and in older embryos where there is maximal neuronal diversity and the first neural circuits are established. We note expression in other tissues such as the lateral body wall (muscle, sensory neurons, trachea) and viscera. Companion papers report on the adult brain and larval imaginal discs, and the integrated datasets are available online (www.janelia.org/flylight/gal4-gen1). This collection of embryonically-expressed GAL4 lines will be valuable for determining neuronal morphology and function; the 1862 lines expressed in small subsets of neurons (<20/segment) will be especially valuable for characterizing interneuronal diversity and function, as interneurons comprise the majority of all CNS neurons, yet their gene expression profile and function remain virtually unexplored.

Manning, Laurina; Heckscher, Ellie S.; Purice, Maria D.; Roberts, Jourdain; Bennett, Alysha L.; Kroll, Jason R.; Pollard, Jill L.; Strader, Marie E.; Lupton, Josh R.; Dyukareva, Anna V.; Doan, Phuong Nam; Bauer, David M.; Wilbur, Allison N.; Tanner, Stephanie; Kelly, Jimmy J.; Lai, Sen-Lin; Tran, Khoa D.; Kohwi, Minoree; Laverty, Todd R.; Pearson, Joseph C.; Crews, Stephen T.; Rubin, Gerald M.; Doe, Chris Q.

2012-01-01

280

A weighted power framework for integrating multisource information: gene function prediction in yeast.  

PubMed

Predicting the functions of unannotated genes is one of the major challenges of biological investigation. In this study, we propose a weighted power scoring framework, called weighted power biological score (WPBS), for combining different biological data sources and predicting the function of some of the unclassified yeast Saccharomyces cerevisiae genes. The relative power and weight coefficients of different data sources, in the proposed score, are estimated systematically by utilizing functional annotations [yeast Gene Ontology (GO)-Slim: Process] of classified genes, available from Saccharomyces Genome Database. Genes are then clustered by applying the k-medoids algorithm on WPBS, and functional categories of 334 unclassified genes are predicted using a P-value cutoff of 1 × 10^-5. The WPBS is available online at http://www.isical.ac.in/~shubhra/WPBS/WPBS.html, where one can download WPBS, related files, and a MATLAB code to predict functions of unclassified genes. PMID:22318478

Ray, Shubhra Sankar; Bandyopadhyay, Sanghamitra; Pal, Sankar K

2012-02-03

281

Annotation and comparative analysis of the glycoside hydrolase genes in Brachypodium distachyon  

Microsoft Academic Search

BACKGROUND: Glycoside hydrolases cleave the bond between a carbohydrate and another carbohydrate, a protein, lipid or other moiety. Genes encoding glycoside hydrolases are found in a wide range of organisms, from archaea to animals, and are relatively abundant in plant genomes. In plants, these enzymes are involved in diverse processes, including starch metabolism, defense, and cell-wall remodeling. Glycoside hydrolase genes

Ludmila Tyler; Jennifer N Bragg; Jiajie Wu; Xiaohan Yang; Gerald A Tuskan; John P Vogel

2010-01-01

282

APPRIS: annotation of principal and alternative splice isoforms  

PubMed Central

Here, we present APPRIS (http://appris.bioinfo.cnio.es), a database that houses annotations of human splice isoforms. APPRIS has been designed to provide value to manual annotations of the human genome by adding reliable protein structural and functional data and information from cross-species conservation. The visual representation of the annotations provided by APPRIS for each gene allows annotators and researchers alike to easily identify functional changes brought about by splicing events. In addition to collecting, integrating and analyzing reliable predictions of the effect of splicing events, APPRIS also selects a single reference sequence for each gene, here termed the principal isoform, based on the annotations of structure, function and conservation for each transcript. APPRIS identifies a principal isoform for 85% of the protein-coding genes in the GENCODE 7 release for ENSEMBL. Analysis of the APPRIS data shows that at least 70% of the alternative (non-principal) variants would lose important functional or structural information relative to the principal isoform.

Rodriguez, Jose Manuel; Maietta, Paolo; Ezkurdia, Iakes; Pietrelli, Alessandro; Wesselink, Jan-Jaap; Lopez, Gonzalo; Valencia, Alfonso; Tress, Michael L.

2013-01-01

283

AGeS: A Software System for Microbial Genome Sequence Annotation  

PubMed Central

Background The annotation of genomes from next-generation sequencing platforms needs to be rapid, high-throughput, and fully integrated and automated. Although a few Web-based annotation services have recently become available, they may not be the best solution for researchers that need to annotate a large number of genomes, possibly including proprietary data, and store them locally for further analysis. To address this need, we developed a standalone software application, the Annotation of microbial Genome Sequences (AGeS) system, which incorporates publicly available and in-house-developed bioinformatics tools and databases, many of which are parallelized for high-throughput performance. Methodology The AGeS system supports three main capabilities. The first is the storage of input contig sequences and the resulting annotation data in a central, customized database. The second is the annotation of microbial genomes using an integrated software pipeline, which first analyzes contigs from high-throughput sequencing by locating genomic regions that code for proteins, RNA, and other genomic elements through the Do-It-Yourself Annotation (DIYA) framework. The identified protein-coding regions are then functionally annotated using the in-house-developed Pipeline for Protein Annotation (PIPA). The third capability is the visualization of annotated sequences using GBrowse. To date, we have implemented these capabilities for bacterial genomes. AGeS was evaluated by comparing its genome annotations with those provided by three other methods. Our results indicate that the software tools integrated into AGeS provide annotations that are in general agreement with those provided by the compared methods. This is demonstrated by a >94% overlap in the number of identified genes, a significant number of identical annotated features, and a >90% agreement in enzyme function predictions.

Kumar, Kamal; Desai, Valmik; Cheng, Li; Khitrov, Maxim; Grover, Deepak; Satya, Ravi Vijaya; Yu, Chenggang; Zavaljevski, Nela; Reifman, Jaques

2011-01-01

284

Coexpression Analysis of Tomato Genes and Experimental Verification of Coordinated Expression of Genes Found in a Functionally Enriched Coexpression Module  

PubMed Central

Gene-to-gene coexpression analysis is a powerful approach to infer the function of uncharacterized genes. Here, we report comprehensive identification of coexpression gene modules of tomato (Solanum lycopersicum) and experimental verification of coordinated expression of module member genes. On the basis of the gene-to-gene correlation coefficient calculated from 67 microarray hybridization data points, we performed a network-based analysis. This facilitated the identification of 199 coexpression modules. A gene ontology annotation search revealed that 75 out of the 199 modules are enriched with genes associated with common functional categories. To verify the coexpression relationships between module member genes, we focused on one module enriched with genes associated with the flavonoid biosynthetic pathway. A non-enzyme, non-transcription factor gene encoding a zinc finger protein in this module was overexpressed in S. lycopersicum cultivar Micro-Tom, and expression levels of flavonoid pathway genes were investigated. Flavonoid pathway genes included in the module were up-regulated in the plant overexpressing the zinc finger gene. This result demonstrates that coexpression modules, at least the ones identified in this study, represent actual transcriptional coordination between genes, and can facilitate the inference of tomato gene function.

Ozaki, Soichi; Ogata, Yoshiyuki; Suda, Kunihiro; Kurabayashi, Atsushi; Suzuki, Tatsuya; Yamamoto, Naoki; Iijima, Yoko; Tsugane, Taneaki; Fujii, Takashi; Konishi, Chiaki; Inai, Shuji; Bunsupa, Somnuk; Yamazaki, Mami; Shibata, Daisuke; Aoki, Koh

2010-01-01
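
The coexpression analysis described in the abstract above can be sketched in a few lines of Python. This is a simplified stand-in, not the study's method: the correlation threshold of 0.9, the function names, and the use of connected components of the thresholded correlation graph as "modules" are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def coexpression_modules(profiles, threshold=0.9):
    """Group genes into modules: connected components of the graph whose
    edges link gene pairs with correlation >= threshold (hypothetical cutoff)."""
    genes = sorted(profiles)
    neighbors = {g: set() for g in genes}
    for a, b in combinations(genes, 2):
        if pearson(profiles[a], profiles[b]) >= threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)
    modules, seen = [], set()
    for g in genes:
        if g in seen:
            continue
        stack, module = [g], set()
        while stack:  # depth-first traversal of one component
            cur = stack.pop()
            if cur not in module:
                module.add(cur)
                stack.extend(neighbors[cur] - module)
        seen |= module
        if len(module) > 1:  # singletons are not reported as modules
            modules.append(sorted(module))
    return modules


profiles = {
    "geneA": [1, 2, 3, 4],
    "geneB": [2, 4, 6, 8],
    "geneC": [4, 3, 2, 1],
    "geneD": [1, 3, 2, 4],
    "geneE": [8, 6, 4, 2],
}
print(coexpression_modules(profiles))
```

In a real analysis each profile would hold one value per hybridization (67 in the study above), and each module would then be tested for enrichment of shared functional categories.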

285

Cloning, annotation and expression analysis of mycoparasitism-related genes in Trichoderma harzianum 88.  

PubMed

Trichoderma harzianum 88, a filamentous soil fungus, is an effective biocontrol agent against several plant pathogens. High-throughput sequencing was used here to study the mycoparasitism mechanisms of T. harzianum 88. Plate confrontation tests of T. harzianum 88 against plant pathogens were conducted, and a cDNA library was constructed from T. harzianum 88 mycelia in the presence of plant pathogen cell walls. Randomly selected transcripts from the cDNA library were compared with eukaryotic plant and fungal genomes. Of the 1,386 transcripts sequenced, the most abundant Gene Ontology (GO) classification group was "physiological process". Differential expression of 19 genes was confirmed by real-time RT-PCR at different mycoparasitism stages against plant pathogens. Gene expression analysis revealed the transcription of various genes involved in mycoparasitism of T. harzianum 88. Our study provides helpful insights into the mechanisms of T. harzianum 88-plant pathogen interactions. PMID:23625217

Yao, Lin; Yang, Qian; Song, Jinzhu; Tan, Chong; Guo, Changhong; Wang, Li; Qu, Lianhai; Wang, Yun

2013-04-27

286

Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation  

Microsoft Academic Search

We have developed a DNA tag sequencing and mapping strategy called gene identification signature (GIS) analysis, in which 5' and 3' signatures of full-length cDNAs are accurately extracted into paired-end ditags (PETs) that are concatenated for efficient sequencing and mapped to genome sequences to demarcate the transcription boundaries of every gene. GIS analysis is potentially 30-fold more efficient than standard

Patrick Ng; Chia-Lin Wei; Wing-Kin Sung; Kuo Ping Chiu; Leonard Lipovich; Chin Chin Ang; Sanjay Gupta; Atif Shahab; Azmi Ridwan; Chee Hong Wong; Edison T Liu; Yijun Ruan

2005-01-01

287

INTERFEROME v2.0: an updated database of annotated interferon-regulated genes  

PubMed Central

Interferome v2.0 (http://interferome.its.monash.edu.au/interferome/) is an update of an earlier version of the Interferome DB published in the 2009 NAR database edition. Vastly improved computational infrastructure now enables more complex and faster queries, and supports more data sets from types I, II and III interferon (IFN)-treated cells, mice or humans. Quantitative, MIAME compliant data are collected, subjected to thorough, standardized, quantitative and statistical analyses and then significant changes in gene expression are uploaded. Comprehensive manual collection of metadata in v2.0 allows flexible, detailed search capacity including the parameters: range of -fold change, IFN type, concentration and time, and cell/tissue type. There is no limit to the number of genes that can be used to search the database in a single query. Secondary analysis such as gene ontology, regulatory factors, chromosomal location or tissue expression plots of IFN-regulated genes (IRGs) can be performed in Interferome v2.0, or data can be downloaded in convenient text formats compatible with common secondary analysis programs. Given the importance of IFN to innate immune responses in infectious, inflammatory diseases and cancer, this upgrade of the Interferome to version 2.0 will facilitate the identification of gene signatures of importance in the pathogenesis of these diseases.

Rusinova, Irina; Forster, Sam; Yu, Simon; Kannan, Anitha; Masse, Marion; Cumming, Helen; Chapman, Ross; Hertzog, Paul J.

2013-01-01

288

Interferome v2.0: an updated database of annotated interferon-regulated genes.  

PubMed

Interferome v2.0 (http://interferome.its.monash.edu.au/interferome/) is an update of an earlier version of the Interferome DB published in the 2009 NAR database edition. Vastly improved computational infrastructure now enables more complex and faster queries, and supports more data sets from types I, II and III interferon (IFN)-treated cells, mice or humans. Quantitative, MIAME compliant data are collected, subjected to thorough, standardized, quantitative and statistical analyses and then significant changes in gene expression are uploaded. Comprehensive manual collection of metadata in v2.0 allows flexible, detailed search capacity including the parameters: range of -fold change, IFN type, concentration and time, and cell/tissue type. There is no limit to the number of genes that can be used to search the database in a single query. Secondary analysis such as gene ontology, regulatory factors, chromosomal location or tissue expression plots of IFN-regulated genes (IRGs) can be performed in Interferome v2.0, or data can be downloaded in convenient text formats compatible with common secondary analysis programs. Given the importance of IFN to innate immune responses in infectious, inflammatory diseases and cancer, this upgrade of the Interferome to version 2.0 will facilitate the identification of gene signatures of importance in the pathogenesis of these diseases. PMID:23203888

Rusinova, Irina; Forster, Sam; Yu, Simon; Kannan, Anitha; Masse, Marion; Cumming, Helen; Chapman, Ross; Hertzog, Paul J

2012-11-29

289

Semantic annotation of biological concepts interplaying microbial cellular responses  

PubMed Central

Background Automated extraction systems have become a time saving necessity in Systems Biology. Considerable human effort is needed to model, analyse and simulate biological networks. Thus, one of the challenges posed to Biomedical Text Mining tools is that of learning to recognise a wide variety of biological concepts with different functional roles to assist in these processes. Results Here, we present a novel corpus concerning the integrated cellular responses to nutrient starvation in the model-organism Escherichia coli. Our corpus is a unique resource in that it annotates biomedical concepts that play a functional role in expression, regulation and metabolism. Namely, it includes annotations for genetic information carriers (genes and DNA, RNA molecules), proteins (transcription factors, enzymes and transporters), small metabolites, physiological states and laboratory techniques. The corpus consists of 130 full-text papers with a total of 59043 annotations for 3649 different biomedical concepts; the two dominant classes are genes (highest number of unique concepts) and compounds (most frequently annotated concepts), whereas other important cellular concepts such as proteins account for no more than 10% of the annotated concepts. Conclusions To the best of our knowledge, a corpus that details such a wide range of biological concepts has never been presented to the text mining community. The inter-annotator agreement statistics provide evidence of the importance of a consolidated background when dealing with such complex descriptions, the ambiguities naturally arising from the terminology and their impact for modelling purposes. Availability is granted for the full-text corpora of 130 freely accessible documents, the annotation scheme and the annotation guidelines. Also, we include a corpus of 340 abstracts.

2011-01-01

290

Annotation and comparative analysis of the glycoside hydrolase genes in Brachypodium distachyon  

Technology Transfer Automated Retrieval System (TEKTRAN)

Glycoside hydrolase genes have been previously cataloged for Oryza sativa (rice), the model dicotyledonous plant Arabidopsis thaliana, and the fast-growing tree Populus trichocarpa (poplar). To improve our understanding of glycoside hydrolases in plants generally and in grasses specifically, we ann...

291

DAVID: Database for Annotation, Visualization, and Integrated Discovery  

PubMed Central

The distributed nature of biological knowledge poses a major challenge to the interpretation of genome-scale datasets, including those derived from microarray and proteomic studies. This report describes DAVID, a web-accessible program that integrates functional genomic annotations with intuitive graphical summaries. Lists of gene or protein identifiers are rapidly annotated and summarized according to shared categorical data for Gene Ontology, protein domain, and biochemical pathway membership. DAVID assists in the interpretation of genome-scale datasets by facilitating the transition from data collection to biological meaning.

Dennis, Glynn; Sherman, Brad T; Hosack, Douglas A; Yang, Jun; Gao, Wei; Lane, H Clifford; Lempicki, Richard A

2003-01-01

292

5' Long serial analysis of gene expression (LongSAGE) and 3' LongSAGE for transcriptome characterization and genome annotation  

Microsoft Academic Search

Complete genome annotation relies on precise identification of transcription units bounded by a transcription initiation site (TIS) and a polyadenylation site (PAS). To facilitate this process, we developed a set of two complementary methods, 5' Long serial analysis of gene expression (LS) and 3'LS. These analyses are based on the original SAGE and LS methods coupled with full-length cDNA cloning,

Chia-Lin Wei; Patrick Ng; Kuo Ping Chiu; Chee Hong Wong; Chin Chin Ang; Leonard Lipovich; Edison T. Liu; Yijun Ruan

2004-01-01

293

Conserved spatially interacting motifs of protein superfamilies: application to fold recognition and function annotation of genome data.  

PubMed

Limitations in techniques for the elucidation of protein function have led to an increasing gap between the annotated proteins and those encoded in a genome. The functional selection and three-dimensional structural constraints of proteins in nature often relate to the retention of significant sequence similarity between proteins of similar fold and function despite poor sequence identity. We identify spatially interacting conserved regions, or motifs, within protein superfamilies that are critical for structure and/or function. A search in sequence databases using these descriptors as additional constraints is an approach to identifying putative additional members of superfamilies. Such constrained searches have been tested against proteins of known structure to demonstrate high percentage specificity (93) with a low error rate of 0.0004. This approach has been compared with other sensitive sequence search methods (e.g., PSI-BLAST, HMMsearch, and IMPALA). It has been extended to analyze the distribution of 11 superfamilies in 93 genomes, including the human genome. PMID:14997562

Bhaduri, Anirban; Ravishankar, R; Sowdhamini, R

2004-03-01

294

Columba: Multidimensional Data Integration of Protein Annotations  

Microsoft Academic Search

We present COLUMBA, an integrated database of protein annotations. COLUMBA is centered around proteins whose structure has been resolved and adds as many annotations as possible to those proteins, describing their properties such as function, sequence, classification, textual description, participation in pathways, etc. Annotations are extracted from seven (soon eleven) external data sources. In this paper we describe the

Kristian Rother; Heiko Müller; Silke Trissl; Ina Koch; Thomas Steinke; Robert Preissner; Cornelius Frömmel; Ulf Leser

2004-01-01

295

Annotated document: Scoring and ranking method  

Microsoft Academic Search

Semantic annotation represents the metadata of a document based on a domain ontology. The purpose of this paper is to present a semantic similarity framework for ranking document annotations given a user's query. The framework features related concepts inclusion and applies appropriate weighting functions. Our aim is to rank and score semantic document annotations based on document richness. We also compare our approach

Syarifah Bahiyah Rahayu; Shahrul Azman Noah

2011-01-01

296

TranSeqAnnotator: large-scale analysis of transcriptomic data  

PubMed Central

Background The transcriptome of an organism can be studied through the analysis of expressed sequence tag (EST) data sets, which offers a rapid and cost-effective approach, with several new and updated bioinformatics approaches and tools available for assembly and annotation. Together with genome and proteome analysis, these comprehensive analyses help to understand an organism. With the advent of large-scale sequencing projects and the generation of sequence data at protein and cDNA levels, an automated analysis pipeline is necessary to store, organize and annotate ESTs. Results TranSeqAnnotator is a workflow for large-scale analysis of transcriptomic data with the most appropriate bioinformatics tools for data management and analysis. The pipeline automatically cleans, clusters, assembles and generates consensus sequences, conceptually translates these into possible protein products and assigns putative function based on various DNA and protein similarity searches. Excretory/secretory (ES) proteins inferred from ESTs/short reads are also identified. TranSeqAnnotator accepts raw and quality ESTs in FASTA format, along with protein and short read sequences, which are analysed with user-selected programs. After pre-processing and assembly, the dataset is annotated at the nucleotide, protein and ES protein levels. Conclusion TranSeqAnnotator has been developed on a Linux cluster to perform an exhaustive and reliable analysis and provide detailed annotation. TranSeqAnnotator outputs gene ontologies and protein functional identifications in terms of mapping to protein domains and metabolic pathways. The pipeline is applied to annotate large EST datasets to identify several novel and known genes with therapeutic experimental validations that could serve as potential targets for parasite intervention. TranSeqAnnotator is freely available for the scientific community at http://estexplorer.biolinfo.org/TranSeqAnnotator/.

2012-01-01

297

Cloning, analysis and functional annotation of expressed sequence tags from the Earthworm Eisenia fetida  

Microsoft Academic Search

BACKGROUND: Eisenia fetida, commonly known as red wiggler or compost worm, belongs to the Lumbricidae family of the Annelida phylum. Little is known about its genome sequence although it has been extensively used as a test organism in terrestrial ecotoxicology. In order to understand its gene expression response to environmental contaminants, we cloned 4032 cDNAs or expressed sequence tags (ESTs)

Mehdi Pirooznia; Ping Gong; Xin Guan; Laura S. Inouye; Kuan Yang; Edward J. Perkins; Youping Deng

2007-01-01

298

An approach to predict transcription factor DNA binding site specificity based upon gene and transcription factor functional categorization  

Microsoft Academic Search

Motivation: To understand transcription regulatory mechanisms, it is indispensable to investigate transcription factor (TF) DNA binding preferences. We noted that the generally acknowledged information of functional annotations of TFs as well as that of their target genes should provide useful hints in determining TF DNA binding preferences. Results: In this contribution, we developed an integrative method based on the Nearest

Ziliang Qian; Lingyi Lu; Xiao-jun Liu; Yu-dong Cai; Yixue Li

2007-01-01

299

Discovery and Annotation of Functional Chromatin Signatures in the Human Genome  

Microsoft Academic Search

Transcriptional regulation in human cells is a complex process involving a multitude of regulatory elements encoded by the genome. Recent studies have shown that distinct chromatin signatures mark a variety of functional genomic elements and that subtle variations of these signatures mark elements with different functions. To identify novel chromatin signatures in the human genome, we apply a de novo

Gary Hon; Wei Wang; Bing Ren

2009-01-01

300

Visualizing the genome: techniques for presenting human genome data and annotations  

PubMed Central

Background In order to take full advantage of the newly available public human genome sequence data and associated annotations, biologists require visualization tools ("genome browsers") that can accommodate the high frequency of alternative splicing in human genes and other complexities. Results In this article, we describe visualization techniques for presenting human genomic sequence data and annotations in an interactive, graphical format. These techniques include: one-dimensional, semantic zooming to show sequence data alongside gene structures; color-coding exons to indicate frame of translation; adjustable, moveable tiers to permit easier inspection of a genomic scene; and display of protein annotations alongside gene structures to show how alternative splicing impacts protein structure and function. These techniques are illustrated using examples from two genome browser applications: the Neomorphic GeneViewer annotation tool and ProtAnnot, a prototype viewer which shows protein annotations in the context of genomic sequence. Conclusion By presenting techniques for visualizing genomic data, we hope to provide interested software developers with a guide to what features are most likely to meet the needs of biologists as they seek to make sense of the rapidly expanding body of public genomic data and annotations.

2002-01-01

301

MPromDb update 2010: an integrated resource for annotation and visualization of mammalian gene promoters and ChIP-seq experimental data  

PubMed Central

MPromDb (Mammalian Promoter Database) is a curated database that strives to annotate gene promoters identified from ChIP-seq results with the goal of providing an integrated resource for mammalian transcriptional regulation and epigenetics. We analyzed 507 million uniquely aligned RNAP-II ChIP-seq reads from 26 different data sets that include six human cell types and 10 distinct mouse cells/tissues. The updated MPromDb version consists of computationally predicted (novel) and known active RNAP-II promoters (42,893 human and 48,366 mouse promoters) from various data sets freely available at the NCBI GEO database. We found that 36% and 40% of protein-coding genes have alternative promoters in the human and mouse genomes, respectively, and ~40% of promoters are tissue/cell specific. The identified RNAP-II promoters were annotated using various known and novel gene models. Additionally, for novel promoters we looked into other evidence: GenBank mRNAs, spliced ESTs, CAGE promoter tags and mRNA-seq reads. Users can search the database based on gene id/symbol, or by specific tissue/cell type and filter results based on any combination of tissue/cell specificity, Known/Novel, CpG/NonCpG, and protein-coding/non-coding gene promoters. We have also integrated the GBrowse genome browser with MPromDb for visualization of ChIP-seq profiles and to display the annotations. The current release of MPromDb can be accessed at http://bioinformatics.wistar.upenn.edu/MPromDb/.

Gupta, Ravi; Bhattacharyya, Anirban; Agosto-Perez, Francisco J.; Wickramasinghe, Priyankara; Davuluri, Ramana V.

2011-01-01

302

RIDDLE: reflective diffusion and local extension reveal functional associations for unannotated gene sets via proximity in a gene network.  

PubMed

The growing availability of large-scale functional networks has promoted the development of many successful techniques for predicting functions of genes. Here we extend these network-based principles and techniques to functionally characterize whole sets of genes. We present RIDDLE (Reflective Diffusion and Local Extension), which applies well-developed guilt-by-association principles to a human gene network to identify associations of gene sets. RIDDLE is particularly adept at characterizing sets with no annotations, a major challenge where most traditional set analyses fail. Notably, RIDDLE found microRNA-450a to be strongly implicated in ocular diseases and development. A web application is available at http://www.functionalnet.org/RIDDLE. PMID:23268829

Wang, Peggy I; Hwang, Sohyun; Kincaid, Rodney P; Sullivan, Christopher S; Lee, Insuk; Marcotte, Edward M

2012-12-26

303

Partially-supervised protein subclass discovery with simultaneous annotation of functional residues  

PubMed Central

Background The study of functional subfamilies of protein domain families and the identification of the residues which determine substrate specificity is an important question in the analysis of protein domains. One way to address this question is the use of clustering methods for protein sequence data and approaches to predict functional residues based on such clusterings. The locations of putative functional residues in known protein structures provide insights into how different substrate specificities are reflected on the protein structure level. Results We have developed an extension of the context-specific independence mixture model clustering framework which allows for the integration of experimental data. As these are usually known only for a few proteins, our algorithm implements a partially-supervised learning approach. We discover domain subfamilies and predict functional residues for four protein domain families: phosphatases, pyridoxal dependent decarboxylases, WW and SH3 domains to demonstrate the usefulness of our approach. Conclusion The partially-supervised clustering revealed biologically meaningful subfamilies even for highly heterogeneous domains and the predicted functional residues provide insights into the basis of the different substrate specificities.

Georgi, Benjamin; Schultz, Jorg; Schliep, Alexander

2009-01-01

304

Protein surface analysis for function annotation in high-throughput structural genomics pipeline.  

PubMed

Structural genomics (SG) initiatives are expanding the universe of protein fold space by rapidly determining structures of proteins that were intentionally selected on the basis of low sequence similarity to proteins of known structure. Often these proteins have no associated biochemical or cellular functions. The SG success has resulted in an accelerated deposition of novel structures. In some cases the structural bioinformatics analysis applied to these novel structures has provided specific functional assignment. However, this approach has also uncovered limitations in the functional analysis of uncharacterized proteins using traditional sequence and backbone structure methodologies. A novel method, named pvSOAR (pocket and void Surface of Amino Acid Residues), for comparing the protein surfaces of geometrically defined pockets and voids was developed. pvSOAR was able to detect previously unrecognized and novel functional relationships between surface features of proteins. In this study, pvSOAR is applied to several structural genomics proteins. We examined the surfaces of YecM, BioH, and RpiB from Escherichia coli as well as the CBS domains from inosine-5'-monophosphate dehydrogenase from Streptococcus pyogenes, conserved hypothetical protein Ta549 from Thermoplasma acidophilum, and CBS domain protein mt1622 from Methanobacterium thermoautotrophicum with the goal of inferring information about their biochemical function. PMID:16322579

Binkowski, T Andrew; Joachimiak, Andrzej; Liang, Jie

2005-12-01

305

Disease candidate gene identification and prioritization using protein interaction networks  

Microsoft Academic Search

BACKGROUND: Although most of the current disease candidate gene identification and prioritization methods depend on functional annotations, the coverage of the gene functional annotations is a limiting factor. In the current study, we describe a candidate gene prioritization method that is entirely based on protein-protein interaction network (PPIN) analyses. RESULTS: For the first time, extended versions of the PageRank and

Jing Chen; Bruce J. Aronow; Anil G. Jegga

2009-01-01
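
The idea of ranking candidate genes purely from a protein-protein interaction network can be sketched with a personalized PageRank computed by power iteration. This is a generic illustration, not the extended algorithms of the study above, which the truncated abstract does not specify. The function name `personalized_pagerank`, the toy protein names and the damping value of 0.85 are all assumptions.

```python
# Illustrative sketch only: personalized PageRank on an undirected
# protein-protein interaction network, restarting at known disease genes.
def personalized_pagerank(edges, seeds, damping=0.85, iterations=100):
    """Score every node by its network proximity to the seed nodes."""
    # Build an undirected adjacency map from the edge list.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    nodes = sorted(neighbors)
    known_seeds = [s for s in seeds if s in neighbors]
    if not known_seeds:
        raise ValueError("none of the seed genes occur in the network")

    # Restart distribution: uniform over the seed genes, zero elsewhere.
    restart = {n: (1.0 / len(known_seeds) if n in known_seeds else 0.0)
               for n in nodes}
    score = dict(restart)

    for _ in range(iterations):
        updated = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            # Each node passes its damped score equally to its neighbours.
            share = damping * score[n] / len(neighbors[n])
            for m in neighbors[n]:
                updated[m] += share
        score = updated
    return score


if __name__ == "__main__":
    # Hypothetical toy network: SEED is a known disease gene.
    toy_edges = [("SEED", "A"), ("A", "B"), ("B", "C")]
    scores = personalized_pagerank(toy_edges, seeds=["SEED"])
    candidates = sorted(("A", "B", "C"), key=scores.get, reverse=True)
    print(candidates)
```

Candidates closer to the seed genes in the network receive higher scores, which is the property a prioritization method of this kind relies on.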

306

Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs.  

PubMed

Only a small proportion of the mouse genome is transcribed into mature messenger RNA transcripts. There is an international collaborative effort to identify all full-length mRNA transcripts from the mouse, and to ensure that each is represented in a physical collection of clones. Here we report the manual annotation of 60,770 full-length mouse complementary DNA sequences. These are clustered into 33,409 'transcriptional units', contributing 90.1% of a newly established mouse transcriptome database. Of these transcriptional units, 4,258 are new protein-coding and 11,665 are new non-coding messages, indicating that non-coding RNA is a major component of the transcriptome. 41% of all transcriptional units showed evidence of alternative splicing. In protein-coding transcripts, 79% of splice variations altered the protein product. Whole-transcriptome analyses resulted in the identification of 2,431 sense-antisense pairs. The present work, completely supported by physical clones, provides the most comprehensive survey of a mammalian transcriptome so far, and is a valuable resource for functional genomics. PMID:12466851

Okazaki, Y; Furuno, M; Kasukawa, T; Adachi, J; Bono, H; Kondo, S; Nikaido, I; Osato, N; Saito, R; Suzuki, H; Yamanaka, I; Kiyosawa, H; Yagi, K; Tomaru, Y; Hasegawa, Y; Nogami, A; Schönbach, C; Gojobori, T; Baldarelli, R; Hill, D P; Bult, C; Hume, D A; Quackenbush, J; Schriml, L M; Kanapin, A; Matsuda, H; Batalov, S; Beisel, K W; Blake, J A; Bradt, D; Brusic, V; Chothia, C; Corbani, L E; Cousins, S; Dalla, E; Dragani, T A; Fletcher, C F; Forrest, A; Frazer, K S; Gaasterland, T; Gariboldi, M; Gissi, C; Godzik, A; Gough, J; Grimmond, S; Gustincich, S; Hirokawa, N; Jackson, I J; Jarvis, E D; Kanai, A; Kawaji, H; Kawasawa, Y; Kedzierski, R M; King, B L; Konagaya, A; Kurochkin, I V; Lee, Y; Lenhard, B; Lyons, P A; Maglott, D R; Maltais, L; Marchionni, L; McKenzie, L; Miki, H; Nagashima, T; Numata, K; Okido, T; Pavan, W J; Pertea, G; Pesole, G; Petrovsky, N; Pillai, R; Pontius, J U; Qi, D; Ramachandran, S; Ravasi, T; Reed, J C; Reed, D J; Reid, J; Ring, B Z; Ringwald, M; Sandelin, A; Schneider, C; Semple, C A M; Setou, M; Shimada, K; Sultana, R; Takenaka, Y; Taylor, M S; Teasdale, R D; Tomita, M; Verardo, R; Wagner, L; Wahlestedt, C; Wang, Y; Watanabe, Y; Wells, C; Wilming, L G; Wynshaw-Boris, A; Yanagisawa, M; Yang, I; Yang, L; Yuan, Z; Zavolan, M; Zhu, Y; Zimmer, A; Carninci, P; Hayatsu, N; Hirozane-Kishikawa, T; Konno, H; Nakamura, M; Sakazume, N; Sato, K; Shiraki, T; Waki, K; Kawai, J; Aizawa, K; Arakawa, T; Fukuda, S; Hara, A; Hashizume, W; Imotani, K; Ishii, Y; Itoh, M; Kagawa, I; Miyazaki, A; Sakai, K; Sasaki, D; Shibata, K; Shinagawa, A; Yasunishi, A; Yoshino, M; Waterston, R; Lander, E S; Rogers, J; Birney, E; Hayashizaki, Y

2002-12-01

307

mirTools 2.0 for non-coding RNA discovery, profiling, and functional annotation based on high-throughput sequencing.  

PubMed

Next-generation sequencing has been widely applied to understand the complexity of non-coding RNAs (ncRNAs) in a cost-effective way. In this study, we developed mirTools 2.0, an updated version of mirTools 1.0, which includes the following new features. (1) From miRNA discovery in mirTools 1.0, mirTools 2.0 allows users to detect and profile various types of ncRNAs, such as miRNA, tRNA, snRNA, snoRNA, rRNA, and piRNA. (2) From miRNA profiling in mirTools 1.0, mirTools 2.0 allows users to identify miRNA-targeted genes and performs detailed functional annotation of miRNA targets, including Gene Ontology, KEGG pathway and protein-protein interaction. (3) From comparison of two samples for differentially expressed miRNAs in mirTools 1.0, mirTools 2.0 allows users to detect differentially expressed ncRNAs between two experimental groups or among multiple samples. (4) Other significant improvements include strategies used to detect novel miRNAs and piRNAs, more taxonomy categories to discover more known miRNAs and a stand-alone version of mirTools 2.0. In conclusion, we believe that mirTools 2.0 (122.228.158.106/mr2_dev and centre.bioinformatics.zj.cn/mr2_dev) will provide researchers with more detailed insight into small RNA transcriptomes. PMID:23778453

Wu, Jinyu; Liu, Qi; Wang, Xin; Zheng, Jiayong; Wang, Tao; You, Mingcong; Sheng Sun, Zhong; Shi, Qinghua

2013-05-29

308

Identifying gene functions using functional expression profiles obtained by voxelation  

Microsoft Academic Search

Gene expression profiles have been widely used in functional genomic studies. However, not much work in traditional gene expression profiling takes into account the location information of a gene's expression in the brain. Gene expression maps, which contain spatial information regarding the expression of genes in the mouse brain, are obtained by combining voxelation and microarrays. Based on the idea that

Li An; Desmond J. Smith; Hongbo Xie; Vasileios Megalooikonomou; Zoran Obradovic

2010-01-01

309

Activity-based protein profiling for the functional annotation of enzymes  

Microsoft Academic Search

Activity-based protein profiling (ABPP), the use of active site-directed chemical probes to monitor enzyme function in complex biological systems, is emerging as a powerful post-genomic technology. ABPP probes have been developed for several enzyme classes and have been used to inventory enzyme activities en masse for a range of (patho)physiological processes. By presenting specific examples, we show here that

Katherine T Barglow; Benjamin F Cravatt

2007-01-01

310

Gene discovery and gene function assignment in filamentous fungi  

PubMed Central

Filamentous fungi are a large group of diverse and economically important microorganisms. Large-scale gene disruption strategies developed in budding yeast are not applicable to these organisms because of their larger genomes and lower rate of targeted integration (TI) during transformation. We developed transposon-arrayed gene knockouts (TAGKO) to discover genes and simultaneously create gene disruption cassettes for subsequent transformation and mutant analysis. Transposons carrying a bacterial and fungal drug resistance marker are used to mutagenize individual cosmids or entire libraries in vitro. Cosmids are annotated by DNA sequence analysis at the transposon insertion sites, and cosmid inserts are liberated to direct insertional mutagenesis events in the genome. Based on saturation analysis of a cosmid insert and insertions in a fungal cosmid library, we show that TAGKO can be used to rapidly identify and mutate genes. We further show that insertions can create alterations in gene expression, and we have used this approach to investigate an amino acid oxidation pathway in two important fungal phytopathogens.

Hamer, Lisbeth; Adachi, Kiichi; Montenegro-Chamorro, Maria V.; Tanzer, Matthew M.; Mahanty, Sanjoy K.; Lo, Clive; Tarpey, Rex W.; Skalchunes, Amy R.; Heiniger, Ryan W.; Frank, Sheryl A.; Darveaux, Blaise A.; Lampe, David J.; Slater, Ted M.; Ramamurthy, Lakshman; DeZwaan, Todd M.; Nelson, Grant H.; Shuster, Jeffrey R.; Woessner, Jeffrey; Hamer, John E.

2001-01-01

311

The DNA sequence and biological annotation of human chromosome 1.  

PubMed

The reference sequence for each human chromosome provides the framework for understanding genome function, variation and evolution. Here we report the finished sequence and biological annotation of human chromosome 1. Chromosome 1 is gene-dense, with 3,141 genes and 991 pseudogenes, and many coding sequences overlap. Rearrangements and mutations of chromosome 1 are prevalent in cancer and many other diseases. Patterns of sequence variation reveal signals of recent selection in specific genes that may contribute to human fitness, and also in regions where no function is evident. Fine-scale recombination occurs in hotspots of varying intensity along the sequence, and is enriched near genes. These and other studies of human biology and disease encoded within chromosome 1 are made possible with the highly accurate annotated sequence, as part of the completed set of chromosome sequences that comprise the reference human genome. PMID:16710414

Gregory, S G; Barlow, K F; McLay, K E; Kaul, R; Swarbreck, D; Dunham, A; Scott, C E; Howe, K L; Woodfine, K; Spencer, C C A; Jones, M C; Gillson, C; Searle, S; Zhou, Y; Kokocinski, F; McDonald, L; Evans, R; Phillips, K; Atkinson, A; Cooper, R; Jones, C; Hall, R E; Andrews, T D; Lloyd, C; Ainscough, R; Almeida, J P; Ambrose, K D; Anderson, F; Andrew, R W; Ashwell, R I S; Aubin, K; Babbage, A K; Bagguley, C L; Bailey, J; Beasley, H; Bethel, G; Bird, C P; Bray-Allen, S; Brown, J Y; Brown, A J; Buckley, D; Burton, J; Bye, J; Carder, C; Chapman, J C; Clark, S Y; Clarke, G; Clee, C; Cobley, V; Collier, R E; Corby, N; Coville, G J; Davies, J; Deadman, R; Dunn, M; Earthrowl, M; Ellington, A G; Errington, H; Frankish, A; Frankland, J; French, L; Garner, P; Garnett, J; Gay, L; Ghori, M R J; Gibson, R; Gilby, L M; Gillett, W; Glithero, R J; Grafham, D V; Griffiths, C; Griffiths-Jones, S; Grocock, R; Hammond, S; Harrison, E S I; Hart, E; Haugen, E; Heath, P D; Holmes, S; Holt, K; Howden, P J; Hunt, A R; Hunt, S E; Hunter, G; Isherwood, J; James, R; Johnson, C; Johnson, D; Joy, A; Kay, M; Kershaw, J K; Kibukawa, M; Kimberley, A M; King, A; Knights, A J; Lad, H; Laird, G; Lawlor, S; Leongamornlert, D A; Lloyd, D M; Loveland, J; Lovell, J; Lush, M J; Lyne, R; Martin, S; Mashreghi-Mohammadi, M; Matthews, L; Matthews, N S W; McLaren, S; Milne, S; Mistry, S; Moore, M J F; Nickerson, T; O'Dell, C N; Oliver, K; Palmeiri, A; Palmer, S A; Parker, A; Patel, D; Pearce, A V; Peck, A I; Pelan, S; Phelps, K; Phillimore, B J; Plumb, R; Rajan, J; Raymond, C; Rouse, G; Saenphimmachak, C; Sehra, H K; Sheridan, E; Shownkeen, R; Sims, S; Skuce, C D; Smith, M; Steward, C; Subramanian, S; Sycamore, N; Tracey, A; Tromans, A; Van Helmond, Z; Wall, M; Wallis, J M; White, S; Whitehead, S L; Wilkinson, J E; Willey, D L; Williams, H; Wilming, L; Wray, P W; Wu, Z; Coulson, A; Vaudin, M; Sulston, J E; Durbin, R; Hubbard, T; Wooster, R; Dunham, I; Carter, N P; McVean, G; Ross, M T; Harrow, J; Olson, M V; 
Beck, S; Rogers, J; Bentley, D R; Banerjee, R; Bryant, S P; Burford, D C; Burrill, W D H; Clegg, S M; Dhami, P; Dovey, O; Faulkner, L M; Gribble, S M; Langford, C F; Pandian, R D; Porter, K M; Prigmore, E

2006-05-18

312

Metagenomic Annotation Networks: Construction and Applications  

PubMed Central

The derivation and comparison of biological interaction networks are vital for understanding the functional capacity and hierarchical organization of integrated microbial communities. In the current work we present metagenomic annotation networks as a novel taxonomy-free approach for understanding the functional architecture of metagenomes. Specifically, metagenomic operon predictions are exploited to derive functional interactions that are translated and categorized according to their associated functional annotations. The result is a collection of discrete networks of weighted annotation linkages. These networks are subsequently examined for the occurrence of annotation modules that portray the functional and organizational characteristics of various microbial communities. A variety of network perspectives and annotation categories are applied to recover a diverse range of modules with different degrees of annotative cohesiveness. Applications to biocatalyst discovery and human health issues are discussed, as well as the limitations of the current implementation.

Vey, Gregory; Moreno-Hagelsieb, Gabriel

2012-01-01

313

ClustScan: an integrated program package for the semi-automatic annotation of modular biosynthetic gene clusters and in silico prediction of novel chemical structures.  

PubMed

The program package 'ClustScan' (Cluster Scanner) is designed for rapid, semi-automatic, annotation of DNA sequences encoding modular biosynthetic enzymes including polyketide synthases (PKS), non-ribosomal peptide synthetases (NRPS) and hybrid (PKS/NRPS) enzymes. The program displays the predicted chemical structures of products as well as allowing export of the structures in a standard format for analyses with other programs. Recent advances in understanding of enzyme function are incorporated to make knowledge-based predictions about the stereochemistry of products. The program structure allows easy incorporation of additional knowledge about domain specificities and function. The results of analyses are presented to the user in a graphical interface, which also allows easy editing of the predictions to incorporate user experience. The versatility of this program package has been demonstrated by annotating biochemical pathways in microbial, invertebrate animal and metagenomic datasets. The speed and convenience of the package allows the annotation of all PKS and NRPS clusters in a complete Actinobacteria genome in 2-3 man hours. The open architecture of ClustScan allows easy integration with other programs, facilitating further analyses of results, which is useful for a broad range of researchers in the chemical and biological sciences. PMID:18978015

Starcevic, Antonio; Zucko, Jurica; Simunkovic, Jurica; Long, Paul F; Cullum, John; Hranueli, Daslav

2008-10-31

314

The DNA sequence and biological annotation of human chromosome 1  

Microsoft Academic Search

The reference sequence for each human chromosome provides the framework for understanding genome function, variation and evolution. Here we report the finished sequence and biological annotation of human chromosome 1. Chromosome 1 is gene-dense, with 3,141 genes and 991 pseudogenes, and many coding sequences overlap. Rearrangements and mutations of chromosome 1 are prevalent in cancer and many other diseases. Patterns of sequence variation

S. G. Gregory; K. F. Barlow; K. E. McLay; R. Kaul; D. Swarbreck; A. Dunham; C. E. Scott; K. L. Howe; K. Woodfine; C. C. A. Spencer; M. C. Jones; C. Gillson; S. Searle; Y. Zhou; F. Kokocinski; L. McDonald; R. Evans; K. Phillips; A. Atkinson; R. Cooper; C. Jones; R. E. Hall; T. D. Andrews; C. Lloyd; R. Ainscough; J. P. Almeida; K. D. Ambrose; F. Anderson; R. W. Andrew; R. I. S. Ashwell; K. Aubin; A. K. Babbage; C. L. Bagguley; J. Bailey; H. Beasley; G. Bethel; C. P. Bird; S. Bray-Allen; J. Y. Brown; A. J. Brown; D. Buckley; J. Burton; J. Bye; C. Carder; J. C. Chapman; S. Y. Clark; G. Clarke; C. Clee; V. Cobley; R. E. Collier; N. Corby; G. J. Coville; J. Davies; R. Deadman; M. Dunn; M. Earthrowl; A. G. Ellington; H. Errington; A. Frankish; J. Frankland; P. Garner; J. Garnett; L. Gay; M. R. J. Ghori; R. Gibson; L. M. Gilby; W. Gillett; R. J. Glithero; D. V. Grafham; C. Griffiths; S. Griffiths-Jones; R. Grocock; S. Hammond; E. S. I. Harrison; E. Haugen; P. D. Heath; S. Holmes; K. Holt; P. J. Howden; A. R. Hunt; S. E. Hunt; G. Hunter; J. Isherwood; R. James; C. Johnson; D. Johnson; A. Joy; M. Kay; J. K. Kershaw; M. Kibukawa; A. M. Kimberley; A. J. Knights; H. Lad; G. Laird; S. Lawlor; D. A. Leongamornlert; D. M. Lloyd; J. Loveland; J. Lovell; M. J. Lush; R. Lyne; S. Martin; M. Mashreghi-Mohammadi; L. Matthews; N. S. W. Matthews; S. McLaren; S. Milne; S. Mistry; M. J. F. Moore; T. Nickerson; C. N. O'Dell; K. Oliver; A. Palmeiri; S. A. Palmer; A. Parker; D. Patel; A. V. Pearce; A. I. Peck; S. Pelan; K. Phelps; R. Plumb; J. Rajan; C. Raymond; G. Rouse; C. Saenphimmachak; H. K. Sehra; E. Sheridan; R. Shownkeen; S. Sims; C. D. Skuce; M. Smith; C. Steward; S. Subramanian; N. Sycamore; A. Tracey; A. Tromans; Z. van Helmond; M. Wall; J. M. Wallis; S. L. Whitehead; J. E. Wilkinson; D. L. Willey; H. Williams; L. Wilming; P. W. Wray; Z. Wu; A. Coulson; M. Vaudin; J. E. Sulston; R. Durbin; I. Dunham; N. P. Carter; G. McVean; M. T. Ross; J. Harrow; M. V. Olson; S. Beck; J. Rogers; D. R. Bentley

2006-01-01

315

A large-scale zebrafish gene knockout resource for the genome-wide study of gene function  

PubMed Central

With the completion of the zebrafish genome sequencing project, it becomes possible to analyze the function of zebrafish genes in a systematic way. The first step in such an analysis is to inactivate each protein-coding gene by targeted or random mutation. Here we describe a streamlined pipeline using proviral insertions coupled with high-throughput sequencing and mapping technologies to widely mutagenize genes in the zebrafish genome. We also report the first 6144 mutagenized and archived F1's predicted to carry up to 3776 mutations in annotated genes. Using in vitro fertilization, we have rescued and characterized ~0.5% of the predicted mutations, showing mutation efficacy and a variety of phenotypes relevant to both developmental processes and human genetic diseases. Mutagenized fish lines are being made freely available to the public through the Zebrafish International Resource Center. These fish lines establish an important milestone for zebrafish genetics research and should greatly facilitate systematic functional studies of the vertebrate genome.

Varshney, Gaurav K.; Lu, Jing; Gildea, Derek E.; Huang, Haigen; Pei, Wuhong; Yang, Zhongan; Huang, Sunny C.; Schoenfeld, David; Pho, Nam H.; Casero, David; Hirase, Takashi; Mosbrook-Davis, Deborah; Zhang, Suiyuan; Jao, Li-En; Zhang, Bo; Woods, Ian G.; Zimmerman, Steven; Schier, Alexander F.; Wolfsberg, Tyra G.; Pellegrini, Matteo; Burgess, Shawn M.; Lin, Shuo

2013-01-01

316

Rice Annotation Database (RAD): a contig-oriented database for map-based rice genomics.  

PubMed

A contig-oriented database for annotation of the rice genome has been constructed to facilitate map-based rice genomics. The Rice Annotation Database has the following functional features: (i) extensive effort of manual annotations of P1-derived artificial chromosome/bacterial artificial chromosome clones can be merged at chromosome and contig-level; (ii) concise visualization of the annotation information such as the predicted genes, results of various prediction programs (RiceHMM, Genscan, Genscan+, Fgenesh, GeneMark, etc.), homology to expressed sequence tag, full-length cDNA and protein; (iii) user-friendly clone / gene query system; (iv) download functions for nucleotide, amino acid and coding sequences; (v) analysis of various features of the genome (GC-content, average value, etc.); and (vi) genome-wide homology search (BLAST) of contig- and chromosome-level genome sequence to allow comparative analysis with the genome sequence of other organisms. As of October 2004, the database contains a total of 215 Mb sequence with relevant annotation results including 30 000 manually curated genes. The database can provide the latest information on manual annotation as well as a comprehensive structural analysis of various features of the rice genome. The database can be accessed at http://rad.dna.affrc.go.jp/. PMID:15608281

Ito, Yuichi; Arikawa, Kohji; Antonio, Baltazar A; Ohta, Isamu; Naito, Shinji; Mukai, Yoshiyuki; Shimano, Atsuko; Masukawa, Masatoshi; Shibata, Michie; Yamamoto, Mayu; Ito, Yukiyo; Yokoyama, Junri; Sakai, Yasumichi; Sakata, Katsumi; Nagamura, Yoshiaki; Namiki, Nobukazu; Matsumoto, Takashi; Higo, Kenichi; Sasaki, Takuji

2005-01-01

317

Function of the DISC1 Gene  

NSDL National Science Digital Library

As a result of the human genome project, we now know largely where our genes are, and what structure they have. The search to uncover each gene's function, on the other hand, is only in its infancy. Functional genomics is an area of research dedicated to studying what protein is produced by a gene, and what happens in the body when it is activated. Understanding gene function is the next major hurdle in genomic research, which holds the key to developing revolutionary therapeutics.

2009-04-14

318

Utilizing protein networks to determine novel annotations  

NASA Astrophysics Data System (ADS)

Proteins are a key element of life because they are involved in every metabolic process, yet a majority of proteins remain unannotated. Current chemical and physical annotation methods are inaccurate, inefficient, or expensive. Without proper annotation, understanding of organisms' metabolic pathways is limited. Based on the hypothesis that proteins with similar primary structures have similar characteristics, we theorize that a method for protein annotation can be developed using protein networking, which was previously thought to be useful in determining the evolutionary paths of proteins. A large, diverse database of proteins is used to connect protein fragments by using a preset identity threshold. With this method, unknown proteins are connected to known ones. By observing the number of links to proteins with annotated functions, a likely annotation candidate will be reached. This procedure can potentially facilitate the process of finding more accurate annotations. We have used and validated this approach to annotate putative uncharacterized proteins. Results will be presented at the conference.

Shiao, Kenneth; Feng, Jerry; Doan, Tina; Gorin, Andrey

2011-10-01
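
The linking-and-counting procedure described in the abstract above can be sketched as a simple neighbour vote. This is an assumption-laden illustration rather than the authors' implementation: the function name `predict_annotation`, the precomputed pairwise identity table, the protein and function labels, and the threshold of 0.4 are all hypothetical.

```python
# Illustrative sketch only: annotate an unknown protein by counting links
# to annotated proteins whose sequence identity passes a preset threshold.
from collections import Counter


def predict_annotation(query, identities, annotations, threshold=0.4):
    """Return the most common annotation among linked proteins, or None.

    identities  -- dict mapping (protein_a, protein_b) to fractional identity
    annotations -- dict mapping annotated proteins to a function label
    """
    votes = Counter()
    for (a, b), identity in identities.items():
        if identity < threshold:
            continue  # below the threshold, no link is drawn
        if a == query:
            other = b
        elif b == query:
            other = a
        else:
            continue  # this link does not involve the query protein
        label = annotations.get(other)
        if label is not None:
            votes[label] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical identities and annotations for a toy example.
    identities = {
        ("query", "p1"): 0.8,
        ("query", "p2"): 0.7,
        ("p3", "query"): 0.9,
        ("query", "p4"): 0.2,
    }
    annotations = {"p1": "kinase", "p2": "kinase",
                   "p3": "phosphatase", "p4": "transporter"}
    print(predict_annotation("query", identities, annotations))
```

A plain count treats every link equally; weighting votes by identity would be one possible refinement, though the abstract does not say whether the authors do this.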

319

EST Express: PHP/MySQL based automated annotation of ESTs from expression libraries  

Microsoft Academic Search

BACKGROUND: Several biological techniques result in the acquisition of functional sets of cDNAs that must be sequenced and analyzed. The emergence of redundant databases such as UniGene and centralized annotation engines such as Entrez Gene has allowed the development of software that can analyze a great number of sequences in a matter of seconds. RESULTS: We have developed

Robin P. Smith; William J. Buchser; Marcus B. Lemmon; Jose R. Pardinas; John L. Bixby; Vance P. Lemmon

2008-01-01

320

The Relationship between Protein Sequences and their Gene Ontology Functions  

Microsoft Academic Search

Background: One main research challenge in the post-genomic era is to understand the relationship between protein sequences and their biological functions. In recent years, several automated annotation systems have been developed for the functional assignment of uncharacterized proteins. The underlying assumption of these systems is that similar sequences imply similar biological functions. However, it has been noted that matching sequences

Zhong-hui Duan; Brent Hughes; Lothar Reichel; Ting Shi

2006-01-01

321

BG7: A New Approach for Bacterial Genome Annotation Designed for Next Generation Sequencing Data  

PubMed Central

BG7 is a new system for de novo bacterial, archaeal and viral genome annotation based on a new approach specifically designed for annotating genomes sequenced with next generation sequencing technologies. The system is versatile and able to annotate genes even at the preliminary assembly stage of the genome. It is especially efficient at detecting unexpected genes horizontally acquired from distant bacterial or archaeal genomes, phages, plasmids, and mobile elements. From the initial phases of the gene annotation process, BG7 exploits the massive availability of annotated protein sequences in databases. BG7 predicts ORFs and infers their function based on protein similarity with a wide set of reference proteins, integrating the ORF prediction and functional annotation phases in just one step. BG7 is especially tolerant of sequencing errors in start and stop codons, of frameshifts, and of assembly or scaffolding errors. The system is also tolerant of the high level of gene fragmentation that is frequently found in incompletely assembled genomes. The current version of BG7, which is developed in Java, takes advantage of Amazon Web Services (AWS) cloud computing features, but it can also be run locally on any operating system. BG7 is a fast, automated and scalable system that can cope with the challenge of analyzing the huge number of genomes that are being sequenced with NGS technologies. Its capabilities and efficiency were demonstrated in the 2011 EHEC Germany outbreak, in which BG7 was used to produce the first annotations the day after the first entero-hemorrhagic E. coli genome sequences were made publicly available. The suitability of BG7 for genome annotation has been proved for Illumina, 454, Ion Torrent, and PacBio sequencing technologies. Moreover, thanks to its plasticity, our system could very easily be adapted to work with new technologies in the future.

Pareja-Tobes, Pablo; Manrique, Marina; Pareja-Tobes, Eduardo; Pareja, Eduardo; Tobes, Raquel

2012-01-01

322

Novel Genes from Formation to Function  

PubMed Central

The study of the evolution of novel genes generally focuses on the formation of new coding sequences. However, equally important in the evolution of novel functional genes are the formation of regulatory regions that allow the expression of the genes and the effects of the new genes in the organism as well. Herein, we discuss the current knowledge on the evolution of novel functional genes, and we examine in more detail the youngest genes discovered. We examine the existing data on a very recent and rapidly evolving cluster of duplicated genes, the Sdic gene cluster. This cluster of genes is an excellent model for the evolution of novel genes, as it is very recent and may still be in the process of evolving.

Ponce, Rita; Martinsen, Lene; Vicente, Luis M.; Hartl, Daniel L.

2012-01-01

323

Annotation of Genes Involved in Glycerolipid Biosynthesis in Chlamydomonas reinhardtii: Discovery of the Betaine Lipid Synthase BTA1Cr  

PubMed Central

Lipid metabolism in flowering plants has been intensely studied, and knowledge regarding the identities of genes encoding components of the major fatty acid and membrane lipid biosynthetic pathways is very extensive. We now present an in silico analysis of fatty acid and glycerolipid metabolism in an algal model, enabled by the recent availability of expressed sequence tag and genomic sequences of Chlamydomonas reinhardtii. Genes encoding proteins involved in membrane biogenesis were predicted on the basis of similarity to proteins with confirmed functions and were organized so as to reconstruct the major pathways of glycerolipid synthesis in Chlamydomonas. This analysis accounts for the majority of genes predicted to encode enzymes involved in anabolic reactions of membrane lipid biosynthesis and compares and contrasts these pathways in Chlamydomonas and flowering plants. As an important result of the bioinformatics analysis, we identified and isolated the C. reinhardtii BTA1 (BTA1Cr) gene and analyzed the bifunctional protein that it encodes; we predicted this protein to be sufficient for the synthesis of the betaine lipid diacylglyceryl-N,N,N-trimethylhomoserine (DGTS), a major membrane component in Chlamydomonas. Heterologous expression of BTA1Cr led to DGTS accumulation in Escherichia coli, which normally lacks this lipid, and allowed in vitro analysis of the enzymatic properties of BTA1Cr. In contrast, in the bacterium Rhodobacter sphaeroides, two separate proteins, BtaARs and BtaBRs, are required for the biosynthesis of DGTS. Site-directed mutagenesis of the active sites of the two domains of BTA1Cr allowed us to study their activities separately, demonstrating directly their functional homology to the bacterial orthologs BtaARs and BtaBRs.

Riekhof, Wayne R.; Sears, Barbara B.; Benning, Christoph

2005-01-01

324

An annotated cDNA library and microarray for large-scale gene-expression studies in the ant Solenopsis invicta.  

PubMed

Ants display a range of fascinating behaviors, a remarkable level of intra-species phenotypic plasticity and many other interesting characteristics. Here we present a new tool to study the molecular mechanisms underlying these traits: a tentatively annotated expressed sequence tag (EST) resource for the fire ant Solenopsis invicta. From a normalized cDNA library we obtained 21,715 ESTs, which represent 11,864 putatively different transcripts with very diverse molecular functions. All ESTs were used to construct a cDNA microarray. PMID:17224046

Wang, John; Jemielity, Stephanie; Uva, Paolo; Wurm, Yannick; Gräff, Johannes; Keller, Laurent

2007-01-01

325

DEFOG: Discrete Enrichment of Functionally Organized Genes  

PubMed Central

High-throughput biological experiments commonly result in a list of genes or proteins of interest. In order to understand the observed changes of the genes and to generate new hypotheses, one needs to understand the functions and roles of the genes and how those functions relate to the experimental conditions. Typically, statistical tests are performed in order to detect enriched Gene Ontology categories or Pathways, i.e. the categories are observed in the genes of interest more often than is expected by chance. Depending on the number of genes and the complexity and quantity of functions in which they are involved, such an analysis can easily result in hundreds of enriched terms. To this end we developed DEFOG, a web-based application that facilitates the functional analysis of gene sets by hierarchically organizing the genes into functionally related modules. Our computational pipeline utilizes three powerful tools to achieve this goal: (1) GeneMANIA creates a functional consensus network of the genes of interest based on gene-list-specific data fusion of hundreds of genomic networks from publicly available sources; (2) Transitivity Clustering organizes those genes into a clear hierarchy of functionally related groups, and (3) Ontologizer performs a Gene Ontology enrichment analysis on the resulting gene clusters. DEFOG integrates this computational pipeline within an easy-to-use web interface, thus allowing for a novel visual analysis of gene sets that aids in the discovery of potentially important biological mechanisms and facilitates the creation of new hypotheses. DEFOG is available at http://www.mooneygroup.org/defog.

Wittkop, Tobias; Berman, Ari E.; Fleisch, K. Mathew; Mooney, Sean D.

2012-01-01
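
The DEFOG record above mentions statistical tests for Gene Ontology categories that appear among the genes of interest more often than expected by chance. One common choice for such a test is a one-sided hypergeometric test; the sketch below illustrates it with the Python standard library. The function name, parameter names and example counts are hypothetical, and nothing here is taken from DEFOG or Ontologizer themselves.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int,
                           sample: int, hits: int) -> float:
    """One-sided hypergeometric p-value, P(X >= hits).

    population: number of genes in the background set
    annotated:  background genes carrying the category of interest
    sample:     number of genes in the list of interest
    hits:       genes in the list that carry the category
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability mass of every outcome at least as extreme as `hits`.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Illustrative counts only (not from the paper): 20 background genes,
# 5 of them annotated, and a 5-gene list in which all 5 are annotated.
p_value = hypergeom_enrichment_p(population=20, annotated=5, sample=5, hits=5)
print(p_value)
```

In practice one p-value is computed per category and the results are corrected for multiple testing, which is one reason an analysis can return hundreds of enriched terms that then need organizing.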

326

Annotation of Differentially Expressed Genes in the Somatic Embryogenesis of Musa and Their Location in the Banana Genome  

PubMed Central

Analysis of cDNA-AFLP was used to study the genes expressed in zygotic and somatic embryogenesis of Musa acuminata Colla ssp. malaccensis, and a comparison was made between their differentially expressed transcript-derived fragments (TDFs) and the publicly available sequenced genome of the double haploid (DH) Pahang of the malaccensis subspecies. A total of 253 TDFs were detected with apparent sizes of 100–4000 bp using 5 pairs of AFLP primers, of which 21 were differentially expressed during the different stages of banana embryogenesis; 15 of the sequences matched DH-Pahang chromosomes, with 7 of them being homologous to gene sequences encoding either known or putative protein domains of higher plants. Four TDF sequences were located in all Musa chromosomes, while the rest were located in one or two chromosomes. Their putative individual function is briefly reviewed based on published information, and the potential roles of these genes in embryo development are discussed. Thus the availability of the Musa genome and the TDF sequence information presented here open new possibilities for in-depth molecular and biochemical study of zygotic and somatic embryogenesis of Musa.

Maldonado-Borges, Josefina Ines; Ku-Cauich, Jose Roberto; Escobedo-GraciaMedrano, Rosa Maria

2013-01-01

327

RNA sequencing reveals sexually dimorphic gene expression before gonadal differentiation in chicken and allows comprehensive annotation of the W-chromosome.  

PubMed

BACKGROUND: Birds have a ZZ male: ZW female sex chromosome system and while the Z-linked DMRT1 gene is necessary for testis development, the exact mechanism of sex determination in birds remains unsolved. This is partly due to the poor annotation of the W chromosome, which is speculated to carry a female determinant. Few genes have been mapped to the W and little is known of their expression. RESULTS: We used RNA-seq to produce a comprehensive profile of gene expression in chicken blastoderms and embryonic gonads prior to sexual differentiation. We found robust sexually dimorphic gene expression in both tissues pre-dating gonadogenesis, including sex-linked and autosomal genes. This supports the hypothesis that sexual differentiation at the molecular level is at least partly cell autonomous in birds. Different sets of genes were sexually dimorphic in the two tissues, indicating that molecular sexual differentiation is tissue specific. Further analyses allowed the assembly of full-length transcripts for 26 W chromosome genes, providing a view of the W transcriptome in embryonic tissues. This is the first extensive analysis of W-linked genes and their expression profiles in early avian embryos. CONCLUSION: Sexual differentiation at the molecular level is established in chicken early in embryogenesis, before gonadal sex differentiation. We find that the W chromosome is more transcriptionally active than previously thought, expand the number of known genes to 26 and present complete coding sequences for these W genes. This includes two novel W-linked sequences and three small RNAs reassigned to the W from the Un_Random chromosome. PMID:23531366

Ayers, Katie L; Davidson, Nadia M; Demiyah, Diana; Roeszler, Kelly N; Grützner, Frank; Sinclair, Andrew H; Oshlack, Alicia; Smith, Craig A

2013-03-25

328

FunGene: the functional gene pipeline and repository  

PubMed Central

Ribosomal RNA genes have become the standard molecular markers for microbial community analysis for good reasons, including universal occurrence in cellular organisms, availability of large databases, and ease of rRNA gene region amplification and analysis. As markers, however, rRNA genes have some significant limitations. The rRNA genes are often present in multiple copies, unlike most protein-coding genes. The slow rate of change in rRNA genes means that multiple species sometimes share identical 16S rRNA gene sequences, while many more species share identical sequences in the short 16S rRNA regions commonly analyzed. In addition, the genes involved in many important processes are not distributed in a phylogenetically coherent manner, potentially due to gene loss or horizontal gene transfer. While rRNA genes remain the most commonly used markers, key genes in ecologically important pathways, e.g., those involved in carbon and nitrogen cycling, can provide important insights into community composition and function not obtainable through rRNA analysis. However, working with ecofunctional gene data requires some tools beyond those required for rRNA analysis. To address this, our Functional Gene Pipeline and Repository (FunGene; http://fungene.cme.msu.edu/) offers databases of many common ecofunctional genes and proteins, as well as integrated tools that allow researchers to browse these collections and choose subsets for further analysis, build phylogenetic trees, test primers and probes for coverage, and download aligned sequences. Additional FunGene tools are specialized to process coding gene amplicon data. For example, FrameBot produces frameshift-corrected protein and DNA sequences from raw reads while finding the most closely related protein reference sequence. These tools can help provide better insight into microbial communities by directly studying key genes involved in important ecological processes.

Fish, Jordan A.; Chai, Benli; Wang, Qiong; Sun, Yanni; Brown, C. Titus; Tiedje, James M.; Cole, James R.

2013-01-01

329

FunGene: the functional gene pipeline and repository.  

PubMed

Ribosomal RNA genes have become the standard molecular markers for microbial community analysis for good reasons, including universal occurrence in cellular organisms, availability of large databases, and ease of rRNA gene region amplification and analysis. As markers, however, rRNA genes have some significant limitations. The rRNA genes are often present in multiple copies, unlike most protein-coding genes. The slow rate of change in rRNA genes means that multiple species sometimes share identical 16S rRNA gene sequences, while many more species share identical sequences in the short 16S rRNA regions commonly analyzed. In addition, the genes involved in many important processes are not distributed in a phylogenetically coherent manner, potentially due to gene loss or horizontal gene transfer. While rRNA genes remain the most commonly used markers, key genes in ecologically important pathways, e.g., those involved in carbon and nitrogen cycling, can provide important insights into community composition and function not obtainable through rRNA analysis. However, working with ecofunctional gene data requires some tools beyond those required for rRNA analysis. To address this, our Functional Gene Pipeline and Repository (FunGene; http://fungene.cme.msu.edu/) offers databases of many common ecofunctional genes and proteins, as well as integrated tools that allow researchers to browse these collections and choose subsets for further analysis, build phylogenetic trees, test primers and probes for coverage, and download aligned sequences. Additional FunGene tools are specialized to process coding gene amplicon data. For example, FrameBot produces frameshift-corrected protein and DNA sequences from raw reads while finding the most closely related protein reference sequence. These tools can help provide better insight into microbial communities by directly studying key genes involved in important ecological processes. PMID:24101916

Fish, Jordan A; Chai, Benli; Wang, Qiong; Sun, Yanni; Brown, C Titus; Tiedje, James M; Cole, James R

2013-10-01

330

Gene Ontology consistent protein function prediction: the FALCON algorithm applied to six eukaryotic genomes  

PubMed Central

Gene Ontology (GO) is a hierarchical vocabulary for the description of biological functions and locations, often employed by computational methods for protein function prediction. Due to the structure of GO, function predictions can be self-contradictory. For example, a protein may be predicted to belong to a detailed functional class, but not in a broader class that, due to the vocabulary structure, includes the predicted one. We present a novel discrete optimization algorithm called Functional Annotation with Labeling CONsistency (FALCON) that resolves such contradictions. The GO is modeled as a discrete Bayesian Network. For any given input of GO term membership probabilities, the algorithm returns the most probable GO term assignments that are in accordance with the Gene Ontology structure. The optimization is done using the Differential Evolution algorithm. Performance is evaluated on simulated and also real data from Arabidopsis thaliana showing improvement compared to related approaches. We finally applied the FALCON algorithm to obtain genome-wide function predictions for six eukaryotic species based on data provided by the CAFA (Critical Assessment of Function Annotation) project.

2013-01-01
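
The contradiction described in the FALCON record above (a protein assigned to a detailed class but not to a broader class that contains it) can be made concrete with a small check. The sketch below is not the FALCON algorithm, which searches for the most probable consistent assignment; it only detects inconsistent labelings on a toy hierarchy and shows the simplest repair, adding every missing ancestor. The term names and the `PARENTS` table are hypothetical stand-ins for real GO terms.

```python
from typing import Dict, List, Set, Tuple

# Toy hierarchy (hypothetical labels): each term maps to its direct parents.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "binding": {"root"},
    "catalytic_activity": {"root"},
    "protein_binding": {"binding"},
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every term reachable by following parent links from `term`."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def violations(predicted: Set[str],
               parents: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """List (term, missing_ancestor) pairs that make a labeling inconsistent."""
    return sorted(
        (term, anc)
        for term in predicted
        for anc in ancestors(term, parents)
        if anc not in predicted
    )


def close_upward(predicted: Set[str],
                 parents: Dict[str, Set[str]]) -> Set[str]:
    """Naive repair: add all ancestors of every predicted term."""
    closed = set(predicted)
    for term in predicted:
        closed |= ancestors(term, parents)
    return closed


inconsistent = {"protein_binding", "root"}   # "binding" is missing
print(violations(inconsistent, PARENTS))
print(sorted(close_upward(inconsistent, PARENTS)))
```

Upward closure always yields a consistent labeling, but it ignores the input probabilities; choosing between adding a broad term and dropping a detailed one is the part that calls for an optimization such as the one the abstract describes.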

331

Objective-guided image annotation.  

PubMed

Automatic image annotation, which is usually formulated as a multi-label classification problem, is one of the major tools used to enhance the semantic understanding of web images. Many multimedia applications (e.g., tag-based image retrieval) can greatly benefit from image annotation. However, the insufficient performance of image annotation methods prevents these applications from being practical. On the other hand, specific measures are usually designed to evaluate how well one annotation method performs for a specific objective or application, but most image annotation methods do not consider optimization of these measures, so that they are inevitably trapped into suboptimal performance of these objective-specific measures. To address this issue, we first summarize a variety of objective-guided performance measures under a unified representation. Our analysis reveals that macro-averaging measures are very sensitive to infrequent keywords, and the Hamming measure is easily affected by skewed distributions. We then propose a unified multi-label learning framework, which directly optimizes a variety of objective-specific measures of multi-label learning tasks. Specifically, we first present a multilayer hierarchical structure of learning hypotheses for multi-label problems based on which a variety of loss functions with respect to objective-guided measures are defined. We then formulate these loss functions as relaxed surrogate functions and optimize them by structural SVMs. According to the analysis of various measures and the high time complexity of optimizing micro-averaging measures, in this paper, we focus on example-based measures that are tailor-made for image annotation tasks but are seldom explored in the literature. Experiments show consistency with the formal analysis on two widely used multi-label datasets, and demonstrate the superior performance of our proposed method over state-of-the-art baseline methods in terms of example-based measures on four image annotation datasets. PMID:23247859

Mao, Qi; Tsang, Ivor Wai-Hung; Gao, Shenghua

2012-12-11
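
The record above observes that macro-averaging measures are very sensitive to infrequent keywords, while the authors focus on example-based measures. The sketch below contrasts the two averaging schemes for the F1 score on a toy annotation set; it does not reproduce the paper's learning framework. The keywords, the four "images" and the convention of scoring an empty comparison as 0 are all assumptions made for illustration.

```python
from typing import List, Sequence, Set


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from counts; returns 0.0 when nothing is true or predicted (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: List[Set[str]], y_pred: List[Set[str]],
             labels: Sequence[str]) -> float:
    """Average of per-keyword F1 scores: every keyword counts equally."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)


def example_f1(y_true: List[Set[str]], y_pred: List[Set[str]]) -> float:
    """Average of per-image F1 scores: every image counts equally."""
    scores = [f1(len(t & p), len(p - t), len(t - p))
              for t, p in zip(y_true, y_pred)]
    return sum(scores) / len(scores)


LABELS = ["sky", "tree", "zebra"]          # "zebra" is the infrequent keyword
Y_TRUE = [{"sky", "tree"}, {"sky"}, {"sky", "tree"}, {"zebra"}]
Y_PRED = [{"sky", "tree"}, {"sky"}, {"sky", "tree"}, {"sky"}]  # rare keyword missed

print("example-based F1:", example_f1(Y_TRUE, Y_PRED))
print("macro-averaged F1:", macro_f1(Y_TRUE, Y_PRED, LABELS))
```

With one rare keyword missed on a single image, the example-based score loses a quarter of its value, whereas the macro-averaged score loses more because the rare keyword carries a full third of the weight.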

332

Functional Genes and Proteins of Clonorchis sinensis  

PubMed Central

During the past several decades, research on parasite genetics has progressed from biochemical and serodiagnostic studies to protein chemistry, molecular biology, and functional gene studies. Nowadays, bioinformatics, genomics, and proteomics approaches are being applied by Korean parasitology researchers. As for Clonorchis sinensis, investigations have been carried out to identify its functional genes using forward and reverse genetic approaches and to characterize the biochemical and biological properties of its gene products. The authors review the proteins of cloned genes, which include antigenic proteins, physiologic and metabolic enzymes, and the gene expression profile of Clonorchis sinensis.

Kim, Tae Im; Na, Byoung-Kuk

2009-01-01

333

Curated genome annotation of Oryza sativa ssp. japonica and comparative genome analysis with Arabidopsis thaliana  

PubMed Central

We present here the annotation of the complete genome of rice Oryza sativa L. ssp. japonica cultivar Nipponbare. All functional annotations for proteins and non-protein-coding RNA (npRNA) candidates were manually curated. Functions were identified or inferred in 19,969 (70%) of the proteins, and 131 possible npRNAs (including 58 antisense transcripts) were found. Almost 5000 annotated protein-coding genes were found to be disrupted in insertional mutant lines, which will accelerate future experimental validation of the annotations. The rice loci were determined by using cDNA sequences obtained from rice and other representative cereals. Our conservative estimate based on these loci and an extrapolation suggested that the gene number of rice is ~32,000, which is smaller than previous estimates. We conducted comparative analyses between rice and Arabidopsis thaliana and found that both genomes possessed several lineage-specific genes, which might account for the observed differences between these species, while they had similar sets of predicted functional domains among the protein sequences. A system to control translational efficiency seems to be conserved across large evolutionary distances. Moreover, the evolutionary process of protein-coding genes was examined. Our results suggest that natural selection may have played a role for duplicated genes in both species, so that duplication was suppressed or favored in a manner that depended on the function of a gene.

Itoh, Takeshi; Tanaka, Tsuyoshi; Barrero, Roberto A.; Yamasaki, Chisato; Fujii, Yasuyuki; Hilton, Phillip B.; Antonio, Baltazar A.; Aono, Hideo; Apweiler, Rolf; Bruskiewich, Richard; Bureau, Thomas; Burr, Frances; Costa de Oliveira, Antonio; Fuks, Galina; Habara, Takuya; Haberer, Georg; Han, Bin; Harada, Erimi; Hiraki, Aiko T.; Hirochika, Hirohiko; Hoen, Douglas; Hokari, Hiroki; Hosokawa, Satomi; Hsing, Yue; Ikawa, Hiroshi; Ikeo, Kazuho; Imanishi, Tadashi; Ito, Yukiyo; Jaiswal, Pankaj; Kanno, Masako; Kawahara, Yoshihiro; Kawamura, Toshiyuki; Kawashima, Hiroaki; Khurana, Jitendra P.; Kikuchi, Shoshi; Komatsu, Setsuko; Koyanagi, Kanako O.; Kubooka, Hiromi; Lieberherr, Damien; Lin, Yao-Cheng; Lonsdale, David; Matsumoto, Takashi; Matsuya, Akihiro; McCombie, W. Richard; Messing, Joachim; Miyao, Akio; Mulder, Nicola; Nagamura, Yoshiaki; Nam, Jongmin; Namiki, Nobukazu; Numa, Hisataka; Nurimoto, Shin; O'Donovan, Claire; Ohyanagi, Hajime; Okido, Toshihisa; OOta, Satoshi; Osato, Naoki; Palmer, Lance E.; Quetier, Francis; Raghuvanshi, Saurabh; Saichi, Naomi; Sakai, Hiroaki; Sakai, Yasumichi; Sakata, Katsumi; Sakurai, Tetsuya; Sato, Fumihiko; Sato, Yoshiharu; Schoof, Heiko; Seki, Motoaki; Shibata, Michie; Shimizu, Yuji; Shinozaki, Kazuo; Shinso, Yuji; Singh, Nagendra K.; Smith-White, Brian; Takeda, Jun-ichi; Tanino, Motohiko; Tatusova, Tatiana; Thongjuea, Supat; Todokoro, Fusano; Tsugane, Mika; Tyagi, Akhilesh K.; Vanavichit, Apichart; Wang, Aihui; Wing, Rod A.; Yamaguchi, Kaori; Yamamoto, Mayu; Yamamoto, Naoyuki; Yu, Yeisoo; Zhang, Hao; Zhao, Qiang; Higo, Kenichi; Burr, Benjamin; Gojobori, Takashi; Sasaki, Takuji

2007-01-01

334

Functional annotation of the human brain methylome identifies tissue-specific epigenetic variation across brain and blood  

PubMed Central

BACKGROUND: Dynamic changes to the epigenome play a critical role in establishing and maintaining cellular phenotype during differentiation, but little is known about the normal methylomic differences that occur between functionally distinct areas of the brain. We characterized intra- and inter-individual methylomic variation across whole blood and multiple regions of the brain from multiple donors. RESULTS: Distinct tissue-specific patterns of DNA methylation were identified, with a highly significant over-representation of tissue-specific differentially methylated regions (TS-DMRs) observed at intragenic CpG islands and low CG density promoters. A large proportion of TS-DMRs were located near genes that are differentially expressed across brain regions. TS-DMRs were significantly enriched near genes involved in functional pathways related to neurodevelopment and neuronal differentiation, including BDNF, BMP4, CACNA1A, CACA1AF, EOMES, NGFR, NUMBL, PCDH9, SLIT1, SLITRK1 and SHANK3. Although between-tissue variation in DNA methylation was found to greatly exceed between-individual differences within any one tissue, we found that some inter-individual variation was reflected across brain and blood, indicating that peripheral tissues may have some utility in epidemiological studies of complex neurobiological phenotypes. CONCLUSIONS: This study reinforces the importance of DNA methylation in regulating cellular phenotype across tissues, and highlights genomic patterns of epigenetic variation across functionally distinct regions of the brain, providing a resource for the epigenetics and neuroscience research communities.

2012-01-01

335

Functional characterization and high-throughput proteomic analysis of interrupted genes in the archaeon Sulfolobus solfataricus.  

PubMed

Sequenced genomes often reveal interrupted coding sequences that complicate the annotation process and the subsequent functional characterization of the genes. In the past, interrupted genes were generally considered to be the result of sequencing errors or pseudogenes, that is, gene remnants with little or no biological importance. However, recent lines of evidence support the hypothesis that these coding sequences can be functional; thus, it is crucial to understand whether interrupted genes are expressed in vivo. We addressed this issue by experimentally demonstrating the existence of functional disrupted genes in archaeal genomes. We discovered previously unknown disrupted genes that have interrupted homologues in distantly related species of archaea. The combination of an RT-PCR strategy with shotgun proteomics demonstrates that interrupted genes in the archaeon Sulfolobus solfataricus are expressed in vivo. In addition, the sequences of the peptides determined by LC-MS/MS and in vitro translation experiments allow us to identify a gene expressed by programmed -1 frameshifting. Our findings will enable an accurate reinterpretation of archaeal interrupted genes, shedding light on their function and on archaeal genome evolution. PMID:20192274

Cobucci-Ponzano, Beatrice; Guzzini, Lucia; Benelli, Dario; Londei, Paola; Perrodou, Emmanuel; Lecompte, Odile; Tran, Diem; Sun, Jun; Wei, Jing; Mathur, Eric J; Rossi, Mosè; Moracci, Marco

2010-05-01

336

Integration of gene ontology pathways with North American Rheumatoid Arthritis Consortium genome-wide association data via linear modeling  

PubMed Central

We describe an empirical Bayesian linear model for integration of functional gene annotation data with genome-wide association data. Using case-control study data from the North American Rheumatoid Arthritis Consortium and gene annotation data from the Gene Ontology, we illustrate how the method can be used to prioritize candidate genes for further investigation.

2009-01-01

337

Biomarker discovery across annotated and unannotated microarray datasets using semi-supervised learning.  

PubMed

The growing body of DNA microarray data has the potential to advance our understanding of the molecular basis of disease. However, annotating microarray datasets with clinically useful information is not always possible, as this often requires access to detailed patient records. In this study we introduce GLAD, a new Semi-Supervised Learning (SSL) method for combining independent annotated datasets and unannotated datasets with the aim of identifying more robust sample classifiers. In our method, independent models are developed using subsets of genes for the annotated and unannotated datasets. These models are evaluated according to a scoring function that incorporates terms for classification accuracy on annotated data, and relative cluster separation in unannotated data. Improved models are iteratively generated using a genetic algorithm feature selection technique. Our results show that the addition of unannotated data into training significantly improves classifier robustness. PMID:18831798

Harris, Cole; Ghaffari, Noushin

2008-09-16

338

Assembly, Gene Annotation and Marker Development Using 454 Floral Transcriptome Sequences in Ziziphus Celata (Rhamnaceae), a Highly Endangered, Florida Endemic Plant  

PubMed Central

Large-scale DNA sequence data may enable development of genetic resources in endangered species, thereby facilitating conservation efforts. Ziziphus celata, a federally endangered, self-incompatible plant species occurring in Florida, USA, is one species for which genetic resources are necessary to facilitate new introductions and augmentations essential for recovery of the species. We used 454 pyrosequencing of a Z. celata normalized floral cDNA library to create a genomic resource for gene and marker discovery. A half-plate GS-FLX Titanium run yielded 655 337 reads averaging 250 bp. A total of 474 025 reads were assembled de novo into 84 645 contigs averaging 408 bp, while 181 312 reads remained unassembled. Of the contig consensus sequences, 47% and 43% had BLAST matches to known proteins in the Uniref50 and TAIR9 annotated protein databases, respectively; many contigs fully represented orthologous proteins in TAIR9. A total of 22 707 unique genes were sequenced, indicating substantial coverage of the Z. celata transcriptome. We detected single-nucleotide polymorphisms and simple sequence repeats (SSRs) and developed thousands of SSR primers for use in future genetic studies. As a first step towards understanding self-incompatibility in Z. celata, we identified sequences belonging to the gene family encoding self-incompatibility. This study demonstrates the efficacy of 454 transcriptome sequencing for rapid gene and marker discovery in an endangered plant.

Edwards, Christine E.; Parchman, Thomas L.; Weekley, Carl W.

2012-01-01
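
The Ziziphus celata record above reports detecting simple sequence repeats (SSRs) in assembled contigs. The abstract does not say which tool or thresholds were used, so the sketch below is only a generic illustration of the idea: a regular expression that finds motifs of 2 to 6 bp repeated in tandem at least a minimum number of times. The function name, the default threshold and the example sequence are hypothetical.

```python
import re
from typing import List, Tuple


def find_ssrs(sequence: str, min_repeats: int = 4) -> List[Tuple[int, str, int]]:
    """Return (start, motif, repeat_count) for tandem repeats of 2-6 bp motifs.

    Homopolymer runs (e.g. "AAAAAAAA") are skipped because they would
    otherwise be reported as repeats of "AA".
    """
    # Group 1 is the motif; the back-reference requires further tandem copies.
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_repeats - 1))
    results = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # homopolymer, not a di- to hexanucleotide SSR
        repeat_count = len(match.group(0)) // len(motif)
        results.append((match.start(), motif, repeat_count))
    return results


# Made-up sequence containing five tandem copies of "AC".
print(find_ssrs("TTGACACACACACAGG"))
```

Each reported repeat could then be handed to a primer-design step that picks flanking primers, which is the kind of output the record describes as SSR primers for future genetic studies.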

339

Data Annotation Models and Annotation Query Language  

Microsoft Academic Search

This paper presents data annotation models at five levels of granularity (database, relation, column, tuple, and cell) of relational data to address the problem of unsuitability of most relational databases to express annotations. These models do not require any structural and schematic changes to the underlying database. These models are also flexible, extensible, customizable, database-neutral, and platform-independent. This paper

Neerja Bhatnagar; Benjoe A. Juliano; Renee S. Renner

2007-01-01

340

Annotation of primate miRNAs by high throughput sequencing of small RNA libraries  

PubMed Central

BACKGROUND: In addition to genome sequencing, accurate functional annotation of genomes is required in order to carry out comparative and evolutionary analyses between species. Among primates, the human genome is the most extensively annotated. Human miRNA gene annotation is based on multiple lines of evidence including evidence for expression as well as prediction of the characteristic hairpin structure. In contrast, most miRNA genes in non-human primates are annotated based on homology without any expression evidence. We have sequenced small-RNA libraries from chimpanzee, gorilla, orangutan and rhesus macaque from multiple individuals and tissues. Using patterns of miRNA expression in conjunction with a model of miRNA biogenesis we used these high-throughput sequencing data to identify novel miRNAs in non-human primates. RESULTS: We predicted 47 new miRNAs in chimpanzee, 240 in gorilla, 55 in orangutan and 47 in rhesus macaque. The algorithm we used was able to predict 64% of the previously known miRNAs in chimpanzee, 94% in gorilla, 61% in orangutan and 71% in rhesus macaque. We therefore added evidence for expression in between one and five tissues to miRNAs that were previously annotated based only on homology to human miRNAs. We increased from 60 to 175 the number of miRNAs that are located in orthologous regions in humans and the four non-human primate species studied here. CONCLUSIONS: In this study we provide expression evidence for homology-based annotated miRNAs and predict de novo miRNAs in four non-human primate species. We increased the number of annotated miRNA genes and provided evidence for their expression in four non-human primates. Similar approaches using different individuals and tissues would improve annotation in non-human primates and allow for further comparative studies in the future.

2012-01-01

341

CoCiter: An Efficient Tool to Infer Gene Function by Assessing the Significance of Literature Co-Citation.  

PubMed

A routine approach to inferring functions for a gene set is by using function enrichment analysis based on GO, KEGG or other curated terms and pathways. However, such analysis requires the existence of overlapping genes between the query gene set and those annotated by GO/KEGG. Furthermore, GO/KEGG databases only maintain a very restricted vocabulary. Here, we have developed a tool called "CoCiter" based on literature co-citations to address the limitations in conventional function enrichment analysis. Co-citation analysis is widely used in ranking articles and predicting protein-protein interactions (PPIs). Our algorithm can further assess the co-citation significance of a gene set with any other user-defined gene sets, or with free terms. We show that compared with the traditional approaches, CoCiter is a more accurate and flexible function enrichment analysis method. CoCiter is freely available at www.picb.ac.cn/hanlab/cociter/. PMID:24086311

Qiao, Nan; Huang, Yi; Naveed, Hammad; Green, Christopher D; Han, Jing-Dong J

2013-09-23

342

CoCiter: An Efficient Tool to Infer Gene Function by Assessing the Significance of Literature Co-Citation  

PubMed Central

A routine approach to inferring functions for a gene set is by using function enrichment analysis based on GO, KEGG or other curated terms and pathways. However, such analysis requires the existence of overlapping genes between the query gene set and those annotated by GO/KEGG. Furthermore, GO/KEGG databases only maintain a very restricted vocabulary. Here, we have developed a tool called “CoCiter” based on literature co-citations to address the limitations in conventional function enrichment analysis. Co-citation analysis is widely used in ranking articles and predicting protein-protein interactions (PPIs). Our algorithm can further assess the co-citation significance of a gene set with any other user-defined gene sets, or with free terms. We show that compared with the traditional approaches, CoCiter is a more accurate and flexible function enrichment analysis method. CoCiter is freely available at www.picb.ac.cn/hanlab/cociter/.

Naveed, Hammad; Green, Christopher D.; Han, Jing-Dong J.

2013-01-01

343

GeneCards Version 3: the human gene integrator.  

PubMed

GeneCards (www.genecards.org) is a comprehensive, authoritative compendium of annotative information about human genes, widely used for nearly 15 years. Its gene-centric content is automatically mined and integrated from over 80 digital sources, resulting in a web-based deep-linked card for each of >73,000 human gene entries, encompassing the following categories: protein coding, pseudogene, RNA gene, genetic locus, cluster and uncategorized. We now introduce GeneCards Version 3, featuring a speedy and sophisticated search engine and a revamped, technologically enabling infrastructure, catering to the expanding needs of biomedical researchers. A key focus is on gene-set analyses, which leverage GeneCards' unique wealth of combinatorial annotations. These include the GeneALaCart batch query facility, which tabulates user-selected annotations for multiple genes and GeneDecks, which identifies similar genes with shared annotations, and finds set-shared annotations by descriptor enrichment analysis. Such set-centric features address a host of applications, including microarray data analysis, cross-database annotation mapping and gene-disorder associations for drug targeting. We highlight the new Version 3 database architecture, its multi-faceted search engine, and its semi-automated quality assurance system. Data enhancements include an expanded visualization of gene expression patterns in normal and cancer tissues, an integrated alternative splicing pattern display, and augmented multi-source SNPs and pathways sections. GeneCards now provides direct links to gene-related research reagents such as antibodies, recombinant proteins, DNA clones and inhibitory RNAs and features gene-related drugs and compounds lists. We also portray the GeneCards Inferred Functionality Score annotation landscape tool for scoring a gene's functional information status. Finally, we delineate examples of applications and collaborations that have benefited from the GeneCards suite. 
Database URL: www.genecards.org. PMID:20689021

Safran, Marilyn; Dalah, Irina; Alexander, Justin; Rosen, Naomi; Iny Stein, Tsippi; Shmoish, Michael; Nativ, Noam; Bahir, Iris; Doniger, Tirza; Krug, Hagit; Sirota-Madi, Alexandra; Olender, Tsviya; Golan, Yaron; Stelzer, Gil; Harel, Arye; Lancet, Doron

2010-08-05

344

Teachers Reference: Annotations  

NSDL National Science Digital Library

This collection of 171 annotations was written to enhance and explain the text of the book 'Stone Wall Secrets'. Each annotation consists of a number that refers specifically to the phrase preceding it. Each annotation number is followed by three indexing elements: subject category, one or more keywords, and one or more sample questions with answers.

345

Antagonistic functional duality of cancer genes.  

PubMed

Cancer evolution is a stochastic process at both the genome and gene levels. Most tumors contain multiple genetic subclones that evolve either in succession or in parallel, in a linear or branching manner, with heterogeneous genome and gene alterations and extensively rewired signaling networks, and that are addicted to multiple oncogenes which readily switch with each other during cancer progression and medical intervention. Hundreds of discovered cancer genes are classified according to whether they function in a dominant (oncogenes) or recessive (tumor suppressor genes) manner in a cancer cell. However, there are many cancer "gene-chameleons" that behave in opposite ways in different experimental settings, showing antagonistic duality. In contrast to the widely accepted view that mutant NADP(+)-dependent isocitrate dehydrogenases 1/2 (IDH1/2) and the associated metabolite 2-hydroxyglutarate (R)-enantiomer are intrinsically "the drivers" of tumourigenesis, mutant IDH1/2 inhibited, promoted or had no effect on cell proliferation, growth and tumorigenicity in diverse experiments. Similar behavior was observed for dozens of cancer genes. Gene function depends on the genetic network, which is defined by the genome context. Overall changes in karyotype can alter the role and function of the same genes and pathways. Diverse cell lines and tumor samples have been used in experiments to demonstrate the tumor-promoting or tumor-suppressive activity of genes. They all display heterogeneous individual karyotypes and disturbed signaling networks. Consequently, the effect and function of the gene under investigation can be opposite and versatile in cells with different genomes, which may explain the antagonistic duality of cancer genes and the cell type- or cellular genetic context-dependent response to the same protein. Antagonistic duality of cancer genes might contribute to the failure of chemotherapy. Instructive examples of unexpected activity of cancer genes and "paradoxical" effects of different anticancer drugs depending on the cellular genetic context/signaling network are discussed. PMID:23933273

Stepanenko, A A; Vassetzky, Y S; Kavsan, V M

2013-08-09

346

Steady progress and recent breakthroughs in the accuracy of automated genome annotation.  

PubMed

The sequencing of large, complex genomes has become routine, but understanding how sequences relate to biological function is less straightforward. Although much attention is focused on how to annotate genomic features such as developmental enhancers and non-coding RNAs, there is still no higher eukaryote for which we know the correct exon-intron structure of at least one ORF for each gene. Despite this uncomfortable truth, genome annotation has made remarkable progress since the first drafts of the human genome were analysed. By combining several computational and experimental methods, we are now closer to producing complete and accurate gene catalogues than ever before. PMID:18087260

Brent, Michael R

2008-01-01

347

Functional identity of the gamma tropomyosin gene  

PubMed Central

The actin filament system is fundamental to cellular functions including regulation of shape, motility, cytokinesis, intracellular trafficking and tissue organization. Tropomyosins (Tm) are highly conserved components of actin filaments which differentially regulate filament stability and function. The mammalian Tm family consists of four genes: αTm, βTm, γTm and δTm. Multiple Tm isoforms (>40) are generated by alternative splicing and expression of these isoforms is highly regulated during development. In order to further identify the role of Tm isoforms during development, we tested the specificity of function of products from the γTm gene family in mice using a series of gene knockouts. Ablation of all γTm gene cytoskeletal products results in embryonic lethality. Elimination of just two cytoskeletal products from the γTm gene (NM1,2) resulted in a 50% reduction in embryo viability. It was also not possible to generate homozygous knockout ES cells for the targets which eliminated or reduced embryo viability in mice. In contrast, homozygous knockout ES cells were generated for a different set of isoforms (NM3,5,6,8,9,11) which were not required for embryogenesis. We also observed that males hemizygous for the knockout of all cytoskeletal products from the γTm gene preferentially transmitted the minus allele with 80–100% transmission. Since all four Tm genes are expressed in early embryos, ES cells and sperm, we conclude that isoforms of the γTm gene are functionally unique in their role in embryogenesis, ES cell viability and sperm function.

Hook, Jeff; Lemckert, Frances; Schevzov, Galina; Fath, Thomas

2011-01-01

348

A large-scale RNAi screen identifies functional classes of genes shaping synaptic development and maintenance.  

PubMed

Neuronal circuit development and function require proper synapse formation and maintenance. Genetic screens are one powerful method to identify the mechanisms shaping synaptic development and stability. However, genes with essential roles in non-neural tissues may be missed in traditional loss-of-function screens. In an effort to circumvent this limitation, we used neuron-specific RNAi knockdown in Drosophila and assayed the formation, growth, and maintenance of the neuromuscular junction (NMJ). We examined 1970 Drosophila genes, each of which has a conserved ortholog in mammalian genomes. Knockdown of 158 genes in post-mitotic neurons led to abnormalities in the neuromuscular system, including misapposition of active zone components opposite postsynaptic glutamate receptors, synaptic terminal overgrowth and undergrowth, abnormal accumulation of synaptic material within the axon, and retraction of synaptic terminals from their postsynaptic targets. Bioinformatics analysis demonstrates that genes with overlapping annotated function are enriched within the hits for each phenotype, suggesting that the shared biological function is important for that aspect of synaptic development. For example, genes for proteasome subunits and mitotic spindle organizers are enriched among the genes whose knockdown leads to defects in synaptic apposition and NMJ stability. Such genes play essential roles in all cells; however, the use of tissue- and temporally-restricted RNAi indicates that the proteasome and mitotic spindle organizers participate in discrete aspects of synaptic development. In addition to identifying functional classes of genes shaping synaptic development, this screen also identifies candidate genes whose role at the synapse can be validated by traditional loss-of-function analysis. We present one such example, the dynein-interacting protein NudE, and demonstrate that it is required for proper axonal transport and synaptic maintenance. Thus, this screen has identified both functional classes of genes and individual candidate genes that are critical for synaptic development and will be a useful resource for subsequent mechanistic analysis of synapse formation and maintenance. PMID:22542760

Valakh, Vera; Naylor, Sarah A; Berns, Dominic S; DiAntonio, Aaron

2012-04-19

349

Concept annotation in the CRAFT corpus  

PubMed Central

Background Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. Results This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semanti