These are representative sample records from Science.gov related to your search topic.
For comprehensive and current results, perform a real-time search at Science.gov.
1

FUNCTIONAL ANNOTATION OF OIL PALM GENES USING AN AUTOMATED BIOINFORMATICS APPROACH

E-print Network

FUNCTIONAL ANNOTATION OF OIL PALM GENES USING AN AUTOMATED BIOINFORMATICS APPROACH. LAURA B WILLIS*; PHILIP A LESSARD ... Bank, and duplicate entries were eliminated by pairwise BLAST searches, resulting in a collection of unique oil palm

Sinskey, Anthony J.

2

Predicting Gene Function From Patterns of Annotation  

E-print Network

Department of Biomedical Engineering, Boston University, Boston, Massachusetts 02215, USA. The Gene Ontology Consortium (Gene Ontology Consortium 2000) provides a standardized vocabulary for the annotation ... experimentally. A variety of approaches for predicting Gene Ontology (GO) attributes have been attempted. Natural

Roth, Frederick

3

GeneFarm, structural and functional annotation of Arabidopsis gene and protein families by a network  

E-print Network

GeneFarm, structural and functional annotation of Arabidopsis gene and protein families ... Christophe Geourjon, Jean-Michel Grienenberger, Guy Houlné, Elisabeth Jamet, Fre ... on every member of each gene family. Performing a family-wide annotation makes the task easier and ...

Gent, Universiteit

4

Functional annotation of human cytomegalovirus gene products: an update  

PubMed Central

Human cytomegalovirus (HCMV) is an opportunistic double-stranded DNA virus with one of the largest viral genomes known. The 235 kb genome is divided into a unique long (UL) and a unique short (US) region, which are flanked by terminal and internal repeats. The expression of HCMV genes is highly complex and involves the production of protein coding transcripts, polyadenylated long non-coding RNAs, polyadenylated anti-sense transcripts and a variety of non-polyadenylated RNAs such as microRNAs. Although the function of many of these transcripts is unknown, they are suggested to play a direct or regulatory role in the delicately orchestrated processes that ensure HCMV replication and life-long persistence. This review focuses on annotating the complete viral genome based on three sources of information. First, previous reviews were used as a template for the functional keywords to ensure continuity; second, the UniProt database was used to further enrich the functional database; and finally, the literature was manually curated for novel functions of HCMV gene products. Novel discoveries were discussed in light of the viral life cycle. This functional annotation highlights still poorly understood regions of the genome but, more importantly, it can give insight into functional clusters and/or may be helpful in the analysis of future transcriptomics and proteomics studies. PMID:24904534

Van Damme, Ellen; Van Loock, Marnix

2014-01-01

5

Expression profiling of hypothetical genes in Desulfovibrio vulgaris leads to improved functional annotation  

PubMed Central

Hypothetical (HyP) and conserved HyP genes account for >30% of sequenced bacterial genomes. For the sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough, 347 of the 3634 genes were annotated as conserved HyP (9.5%) along with 887 HyP genes (24.4%). Given the large fraction of the genome, it is plausible that some of these genes serve critical cellular roles. The study goals were to determine which genes were expressed and provide a more functionally based annotation. To accomplish this, expression profiles of 1234 HyP and conserved genes were used from transcriptomic datasets of 11 environmental stresses, complemented with shotgun LC–MS/MS and AMT tag proteomic data. Genes were divided into putatively polycistronic operons and those predicted to be monocistronic, then classified by basal expression levels and grouped according to changes in expression for one or multiple stresses. One thousand two hundred and twelve of these genes were transcribed with 786 producing detectable proteins. There was no evidence for expression of 17 predicted genes. Except for the latter, monocistronic gene annotation was expanded using the above criteria along with matching Clusters of Orthologous Groups. Polycistronic genes were annotated in the same manner with inferences from their proximity to more confidently annotated genes. Two targeted deletion mutants were used as test cases to determine the relevance of the inferred functional annotations. PMID:19293273

Elias, Dwayne A.; Mukhopadhyay, Aindrila; Joachimiak, Marcin P.; Drury, Elliott C.; Redding, Alyssa M.; Yen, Huei-Che B.; Fields, Matthew W.; Hazen, Terry C.; Arkin, Adam P.; Keasling, Jay D.; Wall, Judy D.

2009-01-01

6

Algal functional annotation tool  

SciTech Connect

The Algal Functional Annotation Tool is a web-based comprehensive analysis suite integrating annotation data from several pathway, ontology, and protein family databases. The current version provides annotation for the model alga Chlamydomonas reinhardtii, and in the future will include additional genomes. The site allows users to interpret large gene lists by identifying associated functional terms, and their enrichment. Additionally, expression data for several experimental conditions were compiled and analyzed to provide an expression-based enrichment search. A tool to search for functionally-related genes based on gene expression across these conditions is also provided. Other features include dynamic visualization of genes on KEGG pathway maps and batch gene identifier conversion.

Lopez, D. [UCLA]; Casero, D. [UCLA]; Cokus, S. J. [UCLA]; Merchant, S. S. [UCLA]; Pellegrini, M. [UCLA]

2012-07-01
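
The enrichment step in the record above (finding functional terms over-represented in a user's gene list) is commonly scored with a one-sided hypergeometric test. The abstract does not say which statistic the tool uses, so the following is only a minimal sketch under that assumption; the function name and the toy numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int, sample: int, hits: int) -> float:
    """Return P(X >= hits) for a hypergeometric draw.

    population: number of genes in the background set
    annotated:  background genes that carry the functional term
    sample:     size of the user's gene list
    hits:       genes in the list that carry the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability mass of seeing `hits` or more annotated genes.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy example (made-up numbers): 3 of 10 background genes carry the term,
# and all 3 genes in the submitted list carry it.
p_value = hypergeom_enrichment_p(population=10, annotated=3, sample=3, hits=3)
```

A small p-value suggests the term is over-represented in the list; in practice many terms are tested at once, so a multiple-testing correction would normally follow.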

7

Evaluation of a Hybrid Approach Using UBLAST and BLASTX for Metagenomic Sequences Annotation of Specific Functional Genes  

PubMed Central

The fast development of next generation sequencing (NGS) has dramatically increased the application of metagenomics. Functional annotation is a major step in metagenomics studies. Fast annotation of functional genes has been a challenge because of the deluge of NGS data and expanding databases. A hybrid annotation pipeline proposed previously for taxonomic assignments was evaluated in this study for annotating metagenomic sequences of specific functional genes, such as antibiotic resistance genes, arsenic resistance genes and key genes in nitrogen metabolism. The hybrid approach using UBLAST and BLASTX is 44–177 times faster than direct BLASTX in the annotation using the small protein database for the specific functional genes, at the cost of missing a small portion (<1.8%) of target sequences compared with direct BLASTX hits. Unlike direct BLASTX, the time required for annotating specific functional genes with the hybrid pipeline depends on the abundance of the target genes. Thus this hybrid annotation pipeline is more suitable for annotating specific functional genes than for comprehensive functional gene annotation. PMID:25347677

Yang, Ying; Jiang, Xiao-Tao; Zhang, Tong

2014-01-01
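
One plausible reading of the hybrid pipeline in the record above is a two-stage search: a fast, permissive prefilter over all reads, followed by the slower, stricter search on the survivors only. The abstract does not describe how the two tools are chained, so this sketch shows only that general pattern. The two predicates are hypothetical stand-ins, not calls to UBLAST or BLASTX.

```python
from typing import Callable, List, Sequence, Tuple


def hybrid_annotate(
    reads: Sequence[str],
    fast_hit: Callable[[str], bool],
    slow_hit: Callable[[str], bool],
) -> Tuple[List[str], int]:
    """Apply a cheap prefilter to every read, then the expensive check to survivors.

    Returns the confirmed reads and the number of expensive checks performed.
    """
    candidates = [r for r in reads if fast_hit(r)]
    confirmed = [r for r in candidates if slow_hit(r)]
    return confirmed, len(candidates)


# Toy stand-ins: the "fast" stage looks for a motif, the "slow" stage rejects
# reads ending in "!". Both rules are invented purely for illustration.
reads = ["xxARGyy", "zzzz", "ARGq", "qARG!"]
confirmed, n_slow_calls = hybrid_annotate(
    reads,
    fast_hit=lambda r: "ARG" in r,
    slow_hit=lambda r: not r.endswith("!"),
)
```

Under this reading, the expensive stage runs once per prefilter survivor, which would explain why the abstract reports a runtime that depends on the abundance of the target genes.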

8

Global profiling of Shewanella oneidensis MR-1: Expression of hypothetical genes and improved functional annotations  

SciTech Connect

The gamma-proteobacterium Shewanella oneidensis strain MR-1 is a metabolically versatile organism that can reduce a wide range of organic compounds, metal ions, and radionuclides. Similar to most other sequenced organisms, approximately 40% of the predicted ORFs in the S. oneidensis genome were annotated as uncharacterized "hypothetical" genes. We implemented an integrative approach by using experimental and computational analyses to provide more detailed insight into gene function. Global expression profiles were determined for cells after UV irradiation and under aerobic and suboxic growth conditions. Transcriptomic and proteomic analyses confidently identified 538 hypothetical genes as expressed in S. oneidensis cells both as mRNAs and proteins (33% of all predicted hypothetical proteins). Publicly available analysis tools and databases and the expression data were applied to improve the annotation of these genes. The annotation results were scored by using a seven-category schema that ranked both confidence and precision of the functional assignment. We were able to identify homologs for nearly all of these hypothetical proteins (97%), but could confidently assign exact biochemical functions for only 16 proteins (category 1; 3%). Altogether, computational and experimental evidence provided functional assignments or insights for 240 more genes (categories 2-5; 45%). These functional annotations advance our understanding of genes involved in vital cellular processes, including energy conversion, ion transport, secondary metabolism, and signal transduction. We propose that this integrative approach offers a valuable means to undertake the enormous challenge of characterizing the rapidly growing number of hypothetical proteins with each newly sequenced genome.

Picone, Alex F. [Biatech, Bothell WA]; Galperin, Michael Y. [National Center for Biotechnology Information]; Romine, Margaret [Pacific Northwest National Laboratory (PNNL)]; Higdon, Roger [Biatech, Bothell WA]; Makarova, Kira S. [National Center for Biotechnology Information]; Kolker, Natali [Biatech, Bothell WA]; Anderson, Gordon A [ORNL]; Qiu, Xiaoyun [ORNL]; Babnigg, Gyorgy [Oak Ridge National Laboratory (ORNL)]; Beliaev, Alexander S [ORNL]; Edlefsen, Paul [Biatech, Bothell WA]; Elias, Dwayne A. [Pacific Northwest National Laboratory (PNNL)]; Gorby, Dr. Yuri A. [J. Craig Venter Institute]; Holzman, Ted [Biatech, Bothell WA]; Klappenbach, Joel [Michigan State University, East Lansing]; Konstantinidis, Konstantinos T [Michigan State University, East Lansing]; Land, Miriam L [ORNL]; Lipton, Mary S. [Pacific Northwest National Laboratory (PNNL)]; McCue, Lee Ann [Pacific Northwest National Laboratory (PNNL)]; Monroe, Matthew [Pacific Northwest National Laboratory (PNNL)]; Pasa-Tolic, Ljiljana [Pacific Northwest National Laboratory (PNNL)]; Pinchuk, Grigoriy [Pacific Northwest National Laboratory (PNNL)]; Purvine, Samuel [Pacific Northwest National Laboratory (PNNL)]; Serres, Margrethe H. [Woods Hole Oceanographic Institution (WHOI), Woods Hole, MA]; Tsapin, Sasha [University of Southern California]; Zakrajsek, Brian A. [Pacific Northwest National Laboratory (PNNL)]; Zhu, Wenguang [Harvard University]; Zhou, Jizhong [University of Oklahoma]; Larimer, Frank W [ORNL]; Lawrence, Charles E. [Wadsworth Center, Albany, NY]; Riley, Monica [Woods Hole Oceanographic Institution (WHOI), Woods Hole, MA]; Collart, Frank [Argonne National Laboratory (ANL)]; Yates III, John R. [Scripps Research Institute, The, La Jolla, CA]; Smith, Richard D. [Pacific Northwest National Laboratory (PNNL)]; Nealson, Kenneth H. [University of Southern California]; Fredrickson, James K [Pacific Northwest National Laboratory (PNNL)]; Tiedje, James M. [Michigan State University, East Lansing]

2005-01-01

9

Comparative analysis of grapevine whole-genome gene predictions, functional annotation, categorization and integration of the predicted gene sequences  

PubMed Central

Background: The first draft assembly and gene prediction of the grapevine genome (8X base coverage) was made available to the scientific community in 2007, and functional annotation was developed on this gene prediction. Since then additional Sanger sequences were added to the 8X sequences pool and a new version of the genomic sequence with superior base coverage (12X) was produced. Results: In order to more efficiently annotate the function of the genes predicted in the new assembly, it is important to build on as much of the previous work as possible, by transferring 8X annotation of the genome to the 12X version. The 8X and 12X assemblies and gene predictions of the grapevine genome were compared to answer the question, “Can we uniquely map 8X predicted genes to 12X predicted genes?” The results show that while the assemblies and gene structure predictions are too different to make a complete mapping between them, most genes (18,725) showed a one-to-one relationship between 8X predicted genes and the last version of 12X predicted genes. In addition, reshuffled genomic sequence structures appeared. These highlight regions of the genome where the gene predictions need to be taken with caution. Based on the new grapevine gene functional annotation and in-depth functional categorization, twenty-eight new molecular networks have been created for VitisNet while the existing networks were updated. Conclusions: The outcomes of this study provide a functional annotation of the 12X genes, an update of VitisNet, the system of the grapevine molecular networks, and a new functional categorization of genes. Data are available at the VitisNet website (http://www.sdstate.edu/ps/research/vitis/pathways.cfm). PMID:22554261

2012-01-01

10

Genome, Functional Gene Annotation, and Nuclear Transformation of the Heterokont Oleaginous Alga  

E-print Network

Genome, Functional Gene Annotation, and Nuclear Transformation of the Heterokont Oleaginous Alga ... University, East Lansing, Michigan, United States of America. Abstract: Unicellular marine algae have promise for providing sustainable and scalable biofuel feedstocks, although no single species has emerged as a preferred

Yandell, Mark

11

Gene Expression and Functional Annotation of the Human and Mouse Choroid Plexus Epithelium  

PubMed Central

Background: The choroid plexus epithelium (CPE) is a lobed neuro-epithelial structure that forms the outer blood-brain barrier. The CPE protrudes into the brain ventricles and produces the cerebrospinal fluid (CSF), which is crucial for brain homeostasis. Malfunction of the CPE is possibly implicated in disorders like Alzheimer disease, hydrocephalus or glaucoma. To study human genetic diseases and potential new therapies, mouse models are widely used. This requires a detailed knowledge of similarities and differences in gene expression and functional annotation between the species. The aim of this study is to analyze and compare gene expression and functional annotation of healthy human and mouse CPE. Methods: We performed 44k Agilent microarray hybridizations with RNA derived from laser dissected healthy human and mouse CPE cells. We functionally annotated and compared the gene expression data of human and mouse CPE using the knowledge database Ingenuity. We searched for common and species-specific gene expression patterns and function between human and mouse CPE. We also made a comparison with previously published CPE human and mouse gene expression data. Results: Overall, the human and mouse CPE transcriptomes are very similar. Their major functionalities included epithelial junctions, transport, energy production, neuro-endocrine signaling, as well as immunological, neurological and hematological functions and disorders. The mouse CPE presented two additional functions not found in the human CPE: carbohydrate metabolism and a more extensive list of (neural) developmental functions. We found three genes specifically expressed in the mouse CPE compared to human CPE (ACE, PON1 and TRIM3) and no genes specifically expressed in human CPE compared to mouse CPE. Conclusion: Human and mouse CPE transcriptomes are very similar, and display many common functionalities. Nonetheless, we also identified a few genes and pathways which suggest that the CPE of mouse and man differ with respect to transport and metabolic functions. PMID:24391755

Janssen, Sarah F.; van der Spek, Sophie J. F.; ten Brink, Jacoline B.; Essing, Anke H. W.; Gorgels, Theo G. M. F.; van der Spek, Peter J.; Jansonius, Nomdo M.; Bergen, Arthur A. B.

2013-01-01

12

Global Profiling of Shewanella oneidensis MR-1: Expression of Hypothetical Genes and Improved functional annotations  

SciTech Connect

The γ-proteobacterium Shewanella oneidensis strain MR-1 is a respiratory versatile organism that can reduce a wide range of organics, metals, and radionuclides. Similar to most other sequenced organisms, approximately 40% of the predicted ORFs in the MR-1 genome were annotated as uncharacterized "hypothetical" genes. We implemented an integrative approach using experimental and computational analyses to provide more detailed insight into their function. Global expression studies were conducted using RNA and protein expression profiling of cells cultivated under aerobic, suboxic, and fumarate reducing conditions, phosphate limitation and UV irradiation. Transcriptomic and proteomic analyses confidently identified 538 "hypothetical" genes as expressed in S. oneidensis cells both as mRNAs and proteins (33% of all "hypothetical" proteins). Publicly available analysis tools and databases and our own expression data were applied to improve the annotation of these genes. The annotation results were scored using a seven-category schema that ranked both confidence and precision of the functional assignment. We identified homologs for nearly all of these "hypothetical" proteins (96%), thus allowing us to minimally classify them as "conserved proteins". Computational and/or experimental evidence provided more precise functional assignments for 297 genes (categories 1-4; 55%). These improved functional annotations will significantly widen our understanding of vital cellular processes including signal transduction, ion transport, secondary metabolism, and transcription, as well as structural elements, such as cellular membranes. We propose that this integrative approach offers a viable means to undertake the enormous challenge of characterizing the rapidly growing number of "hypothetical" proteins with each newly sequenced genome.

Kolker, Eugene; Picone, Alessandro F.; Galperin, Michael Y.; Romine, Margaret F.; Higdon, Roger; Makarova, Kira S.; Kolker, Natali; Anderson, Gordon A.; Qiu, Xiaoyun; Auberry, Kenneth J.; Babnigg, Gyorgy; Beliaev, Alex S.; Edlefsen, Paul; Elias, Dwayne A.; Gorby, Yuri A.; Holzman, Ted; Klappenbach, Joel; Konstantinidis, Kostas; Land, Miriam L.; Lipton, Mary S.; McCue, Lee-Ann; Monroe, Matthew E.; Pasa-Tolic, Liljiana; Pinchuk, Grigoriy E.; Purvine, Samuel O.; Serres, Margaret; Tsapin, Sasha; Zakrajsek, Brian A.; Zhu, Wenhong; Zhou, Jizhong; Larimer, Frank; Lawrence, Charles; Riley, Monica; Collart, Frank R.; Yates, III, John R.; Smith, Richard D.; Giometti, Carol S.; Nealson, Kenneth; Fredrickson, Jim K.; Tiedje, James M.

2005-02-08

13

Genome, Functional Gene Annotation, and Nuclear Transformation of the Heterokont Oleaginous Alga Nannochloropsis oceanica CCMP1779  

PubMed Central

Unicellular marine algae have promise for providing sustainable and scalable biofuel feedstocks, although no single species has emerged as a preferred organism. Moreover, adequate molecular and genetic resources prerequisite for the rational engineering of marine algal feedstocks are lacking for most candidate species. Heterokonts of the genus Nannochloropsis naturally have high cellular oil content and are already in use for industrial production of high-value lipid products. First success in applying reverse genetics by targeted gene replacement makes Nannochloropsis oceanica an attractive model to investigate the cell and molecular biology and biochemistry of this fascinating organism group. Here we present the assembly of the 28.7 Mb genome of N. oceanica CCMP1779. RNA sequencing data from nitrogen-replete and nitrogen-depleted growth conditions support a total of 11,973 genes, of which in addition to automatic annotation some were manually inspected to predict the biochemical repertoire for this organism. Among others, more than 100 genes putatively related to lipid metabolism, 114 predicted transcription factors, and 109 transcriptional regulators were annotated. Comparison of the N. oceanica CCMP1779 gene repertoire with the recently published N. gaditana genome identified 2,649 genes likely specific to N. oceanica CCMP1779. Many of these N. oceanica–specific genes have putative orthologs in other species or are supported by transcriptional evidence. However, because similarity-based annotations are limited, functions of most of these species-specific genes remain unknown. Aside from the genome sequence and its analysis, protocols for the transformation of N. oceanica CCMP1779 are provided. 
The availability of genomic and transcriptomic data for Nannochloropsis oceanica CCMP1779, along with efficient transformation protocols, provides a blueprint for future detailed gene functional analysis and genetic engineering of Nannochloropsis species by a growing academic community focused on this genus. PMID:23166516

Tsai, Chia-Hong; Bullard, Blair; Cornish, Adam J.; Harvey, Christopher; Reca, Ida-Barbara; Thornburg, Chelsea; Achawanantakun, Rujira; Buehl, Christopher J.; Campbell, Michael S.; Cavalier, David; Childs, Kevin L.; Clark, Teresa J.; Deshpande, Rahul; Erickson, Erika; Armenia Ferguson, Ann; Handee, Witawas; Kong, Que; Li, Xiaobo; Liu, Bensheng; Lundback, Steven; Peng, Cheng; Roston, Rebecca L.; Sanjaya; Simpson, Jeffrey P.; TerBush, Allan; Warakanont, Jaruswan; Zauner, Simone; Farre, Eva M.; Hegg, Eric L.; Jiang, Ning; Kuo, Min-Hao; Lu, Yan; Niyogi, Krishna K.; Ohlrogge, John; Osteryoung, Katherine W.; Shachar-Hill, Yair; Sears, Barbara B.; Sun, Yanni; Takahashi, Hideki; Yandell, Mark; Shiu, Shin-Han; Benning, Christoph

2012-01-01

14

Gene Ontology annotations and resources.  

PubMed

The Gene Ontology (GO) Consortium (GOC, http://www.geneontology.org) is a community-based bioinformatics resource that classifies gene product function through the use of structured, controlled vocabularies. Over the past year, the GOC has implemented several processes to increase the quantity, quality and specificity of GO annotations. First, the number of manual, literature-based annotations has grown at an increasing rate. Second, as a result of a new 'phylogenetic annotation' process, manually reviewed, homology-based annotations are becoming available for a broad range of species. Third, the quality of GO annotations has been improved through a streamlined process for, and automated quality checks of, GO annotations deposited by different annotation groups. Fourth, the consistency and correctness of the ontology itself have increased by using automated reasoning tools. Finally, the GO has been expanded not only to cover new areas of biology through focused interaction with experts, but also to capture greater specificity in all areas of the ontology using tools for adding new combinatorial terms. The GOC works closely with other ontology developers to support integrated use of terminologies. The GOC supports its user community through the use of e-mail lists, social media and web-based resources. PMID:23161678

Blake, J A; Dolan, M; Drabkin, H; Hill, D P; Li, Ni; Sitnikov, D; Bridges, S; Burgess, S; Buza, T; McCarthy, F; Peddinti, D; Pillai, L; Carbon, S; Dietze, H; Ireland, A; Lewis, S E; Mungall, C J; Gaudet, P; Chrisholm, R L; Fey, P; Kibbe, W A; Basu, S; Siegele, D A; McIntosh, B K; Renfro, D P; Zweifel, A E; Hu, J C; Brown, N H; Tweedie, S; Alam-Faruque, Y; Apweiler, R; Auchinchloss, A; Axelsen, K; Bely, B; Blatter, M -C; Bonilla, C; Bouguerleret, L; Boutet, E; Breuza, L; Bridge, A; Chan, W M; Chavali, G; Coudert, E; Dimmer, E; Estreicher, A; Famiglietti, L; Feuermann, M; Gos, A; Gruaz-Gumowski, N; Hieta, R; Hinz, C; Hulo, C; Huntley, R; James, J; Jungo, F; Keller, G; Laiho, K; Legge, D; Lemercier, P; Lieberherr, D; Magrane, M; Martin, M J; Masson, P; Mutowo-Muellenet, P; O'Donovan, C; Pedruzzi, I; Pichler, K; Poggioli, D; Porras Millán, P; Poux, S; Rivoire, C; Roechert, B; Sawford, T; Schneider, M; Stutz, A; Sundaram, S; Tognolli, M; Xenarios, I; Foulgar, R; Lomax, J; Roncaglia, P; Khodiyar, V K; Lovering, R C; Talmud, P J; Chibucos, M; Giglio, M Gwinn; Chang, H -Y; Hunter, S; McAnulla, C; Mitchell, A; Sangrador, A; Stephan, R; Harris, M A; Oliver, S G; Rutherford, K; Wood, V; Bahler, J; Lock, A; Kersey, P J; McDowall, D M; Staines, D M; Dwinell, M; Shimoyama, M; Laulederkind, S; Hayman, T; Wang, S -J; Petri, V; Lowry, T; D'Eustachio, P; Matthews, L; Balakrishnan, R; Binkley, G; Cherry, J M; Costanzo, M C; Dwight, S S; Engel, S R; Fisk, D G; Hitz, B C; Hong, E L; Karra, K; Miyasato, S R; Nash, R S; Park, J; Skrzypek, M S; Weng, S; Wong, E D; Berardini, T Z; Huala, E; Mi, H; Thomas, P D; Chan, J; Kishore, R; Sternberg, P; Van Auken, K; Howe, D; Westerfield, M

2013-01-01

15

GeneFarm, structural and functional annotation of Arabidopsis gene and protein families by a network of experts  

Microsoft Academic Search

Genomic projects heavily depend on genome annotations and are limited by the current deficiencies in the published predictions of gene structure and function. It follows that, improved annotation will allow better data mining of genomes, and more secure planning and design of experiments. The purpose of the GeneFarm project is to obtain homogeneous, reliable, documented and traceable annotations for Arabidopsis nuclear genes and gene products, and to enter them into an added-value database. This re-annotation project

Sébastien Aubourg; Véronique Brunaud; Clémence Bruyère; Mark Cock; Richard Cooke; Annick Cottet; Arnaud Couloux; Patrice Déhais; Gilbert Deléage; Aymeric Duclert; Manuel Echeverria; Aimée Eschbach; Denis Falconet; Ghislain Filippi; Christine Gaspin; Christophe Geourjon; Jean-Michel Grienenberger; Guy Houlné; Elisabeth Jamet; Frédéric Lechauve; Olivier Leleu; Philippe Leroy; Régis Mache; Christian Meyer; Hafed Nedjari; Ioan Negrutiu; Valérie Orsini; Eric Peyretaillade; Cyril Pommier; Jeroen Raes; Jean-Loup Risler; Stéphane Rivière; Stephane Rombauts; Pierre Rouzé; Michel Schneider; Philippe Schwob; Ian Small; Ghislain Soumayet-Kampetenga; Darko Stankovski; Claire Toffano; Michael Tognolli; Michel Caboche; Alain Lecharny

2005-01-01

16

On the Use of Gene Ontology Annotations to Assess Functional Similarity among Orthologs and Paralogs: A Short Report.  

PubMed

A recent paper (Nehrt et al., PLoS Comput. Biol. 7:e1002073, 2011) has proposed a metric for the "functional similarity" between two genes that uses only the Gene Ontology (GO) annotations directly derived from published experimental results. Applying this metric, the authors concluded that paralogous genes within the mouse genome or the human genome are more functionally similar on average than orthologous genes between these genomes, an unexpected result with broad implications if true. We suggest, based on both theoretical and empirical considerations, that this proposed metric should not be interpreted as a functional similarity, and therefore cannot be used to support any conclusions about the "ortholog conjecture" (or, more properly, the "ortholog functional conservation hypothesis"). First, we reexamine the case studies presented by Nehrt et al. as examples of orthologs with divergent functions, and come to a very different conclusion: they actually exemplify how GO annotations for orthologous genes provide complementary information about conserved biological functions. We then show that there is a global ascertainment bias in the experiment-based GO annotations for human and mouse genes: particular types of experiments tend to be performed in different model organisms. We conclude that the reported statistical differences in annotations between pairs of orthologous genes do not reflect differences in biological function, but rather complementarity in experimental approaches. 
Our results underscore two general considerations for researchers proposing novel types of analysis based on the GO: 1) that GO annotations are often incomplete, potentially in a biased manner, and subject to an "open world assumption" (absence of an annotation does not imply absence of a function), and 2) that conclusions drawn from a novel, large-scale GO analysis should whenever possible be supported by careful, in-depth examination of examples, to help ensure the conclusions have a justifiable biological basis. PMID:22359495

Thomas, Paul D; Wood, Valerie; Mungall, Christopher J; Lewis, Suzanna E; Blake, Judith A

2012-01-01

17

First survey and functional annotation of prohormone and convertase genes in the pig  

PubMed Central

Background: The pig is a biomedical model to study human and livestock traits. Many of these traits are controlled by neuropeptides that result from the cleavage of prohormones by prohormone convertases. Only 45 prohormones have been confirmed in the pig. Sequence homology can be ineffective to annotate prohormone genes in sequenced species like the pig due to the multifactorial nature of prohormone processing. The goal of this study is to undertake the first complete survey of prohormone and prohormone convertase genes in the pig genome. These genes were functionally annotated based on 35 gene expression microarray experiments. The cleavage sites of prohormone sequences into potentially active neuropeptides were predicted. Results: We identified 95 unique prohormone genes, 2 alternative calcitonin-related sequences, 8 prohormone convertases and 1 cleavage facilitator in the pig genome 10.2 assembly and trace archives. Of these, 11 pig prohormone genes have not been reported in the UniProt, UniGene or Gene databases. These genes are intermedin, cortistatin, insulin-like 5, orexigenic neuropeptide QRFP, prokineticin 2, prolactin-releasing peptide, parathyroid hormone 2, urocortin, urocortin 2, urocortin 3, and urotensin 2-related peptide. In addition, a novel neuropeptide S was identified in the pig genome correcting the previously reported pig sequence that is identical to the rabbit sequence. Most differentially expressed prohormone genes were under-expressed in pigs experiencing immune challenge relative to the un-challenged controls, in non-pregnant relative to pregnant sows, in old relative to young embryos, and in non-neural relative to neural tissues. The cleavage prediction based on human sequences had the best performance with a correct classification rate of cleaved and non-cleaved sites of 92%, suggesting that the processing of prohormones in pigs is similar to humans. The cleavage prediction models did not find conclusive evidence supporting the production of the bioactive neuropeptides urocortin 2, urocortin 3, torsin family 2 member A, tachykinin 4, islet amyloid polypeptide, and calcitonin receptor-stimulating peptide 2 in the pig. Conclusions: The present genomic and functional characterization supports the use of the pig as an effective animal model to gain a deeper understanding of prohormones, prohormone convertases and neuropeptides in biomedical and agricultural research. PMID:23153308

2012-01-01

18

Annotation of gene function in citrus using gene expression information and co-expression networks  

PubMed Central

Background: The genus Citrus encompasses major cultivated plants such as sweet orange, mandarin, lemon and grapefruit, among the world’s most economically important fruit crops. With increasing volumes of transcriptomics data available for these species, Gene Co-expression Network (GCN) analysis is a viable option for predicting gene function at a genome-wide scale. GCN analysis is based on a “guilt-by-association” principle whereby genes encoding proteins involved in similar and/or related biological processes may exhibit similar expression patterns across diverse sets of experimental conditions. While bioinformatics resources such as GCN analysis are widely available for efficient gene function prediction in model plant species including Arabidopsis, soybean and rice, in citrus these tools are not yet developed. Results: We have constructed a comprehensive GCN for citrus inferred from 297 publicly available Affymetrix Genechip Citrus Genome microarray datasets, providing gene co-expression relationships at a genome-wide scale (33,000 transcripts). The comprehensive citrus GCN consists of a global GCN (condition-independent) and four condition-dependent GCNs that survey the sweet orange species only, all citrus fruit tissues, all citrus leaf tissues, or stress-exposed plants. All of these GCNs are clustered using genome-wide, gene-centric (guide) and graph clustering algorithms for flexibility of gene function prediction. For each putative cluster, gene ontology (GO) enrichment and gene expression specificity analyses were performed to enhance gene function, expression and regulation pattern prediction. The guide-gene approach was used to infer novel roles of genes involved in disease susceptibility and vitamin C metabolism, and graph-clustering approaches were used to investigate isoprenoid/phenylpropanoid metabolism in citrus peel, and citric acid catabolism via the GABA shunt in citrus fruit. Conclusions: Integration of citrus gene co-expression networks, functional enrichment analysis and gene expression information provides opportunities to infer gene function in citrus. We present a publicly accessible tool, Network Inference for Citrus Co-Expression (NICCE, http://citrus.adelaide.edu.au/nicce/home.aspx), for the gene co-expression analysis in citrus. PMID:25023870

2014-01-01
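
The "guilt-by-association" idea in the record above can be illustrated with a minimal sketch. It assumes Pearson correlation as the similarity measure and a fixed cut-off of 0.9; the abstract states neither, and the gene names and expression values are hypothetical. It is not the NICCE implementation.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def coexpression_edges(profiles, threshold=0.9):
    """Return (gene1, gene2, r) for pairs with |r| >= threshold.

    The threshold is an assumption made for illustration only.
    """
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, round(r, 3)))
    return edges


# Hypothetical expression values across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.1],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
edges = coexpression_edges(profiles)
```

Genes linked by an edge would then be candidates for sharing a biological process, which is the inference the abstract describes.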

19

Expression profiling of hypothetical genes in Desulfovibrio vulgaris leads to improved functional annotation  

Microsoft Academic Search

Hypothetical (HyP) and conserved HyP genes account for >30% of sequenced bacterial genomes. For the sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough, 347 of the 3634 genes were annotated as conserved HyP (9.5%) along with 887 HyP genes (24.4%). Given the large fraction of the genome, it is plausible that some of these genes serve critical cellular roles. The study goals were

Dwayne A. Elias; Aindrila Mukhopadhyay; Marcine P. Joachimiak; Elliott C. Drury; Alyssa M. Redding; Huei-Che B. Yen; Matthew W. Fields; Terry C. Hazen; Adam P. Arkin; Jay D. Keasling; Judy D. Wall

2009-01-01

20

Functional annotation of hierarchical modularity.  

PubMed

In biological networks of molecular interactions in a cell, network motifs that are biologically relevant are also functionally coherent, or form functional modules. These functionally coherent modules combine in a hierarchical manner into larger, less cohesive subsystems, thus revealing one of the essential design principles of system-level cellular organization and function: hierarchical modularity. Arguably, hierarchical modularity has not been explicitly taken into consideration by most, if not all, functional annotation systems. As a result, the existing methods would often fail to assign a statistically significant functional coherence score to biologically relevant molecular machines. We developed a methodology for hierarchical functional annotation. Given the hierarchical taxonomy of functional concepts (e.g., Gene Ontology) and the association of individual genes or proteins with these concepts (e.g., GO terms), our method will assign a Hierarchical Modularity Score (HMS) to each node in the hierarchy of functional modules; the HMS score and its p-value measure functional coherence of each module in the hierarchy. While existing methods annotate each module with a set of "enriched" functional terms in a bag of genes, our complementary method provides the hierarchical functional annotation of the modules and their hierarchically organized components. A hierarchical organization of functional modules often comes as a by-product of cluster analysis of gene expression data or protein interaction data. Otherwise, our method will automatically build such a hierarchy by directly incorporating the functional taxonomy information into the hierarchy search process and by allowing multi-functional genes to be part of more than one component in the hierarchy. In addition, its underlying HMS scoring metric ensures that functional specificity of the terms across different levels of the hierarchical taxonomy is properly treated.
We have evaluated our method using Saccharomyces cerevisiae data from KEGG and MIPS databases and several other computationally derived and curated datasets. The code and additional supplemental files can be obtained from http://code.google.com/p/functional-annotation-of-hierarchical-modularity/ (Accessed 2012 March 13). PMID:22496762

Padmanabhan, Kanchana; Wang, Kuangyu; Samatova, Nagiza F

2012-01-01

21

Functional Annotation of Hierarchical Modularity  

PubMed Central

In biological networks of molecular interactions in a cell, network motifs that are biologically relevant are also functionally coherent, or form functional modules. These functionally coherent modules combine in a hierarchical manner into larger, less cohesive subsystems, thus revealing one of the essential design principles of system-level cellular organization and function: hierarchical modularity. Arguably, hierarchical modularity has not been explicitly taken into consideration by most, if not all, functional annotation systems. As a result, the existing methods would often fail to assign a statistically significant functional coherence score to biologically relevant molecular machines. We developed a methodology for hierarchical functional annotation. Given the hierarchical taxonomy of functional concepts (e.g., Gene Ontology) and the association of individual genes or proteins with these concepts (e.g., GO terms), our method will assign a Hierarchical Modularity Score (HMS) to each node in the hierarchy of functional modules; the HMS score and its p-value measure functional coherence of each module in the hierarchy. While existing methods annotate each module with a set of “enriched” functional terms in a bag of genes, our complementary method provides the hierarchical functional annotation of the modules and their hierarchically organized components. A hierarchical organization of functional modules often comes as a by-product of cluster analysis of gene expression data or protein interaction data. Otherwise, our method will automatically build such a hierarchy by directly incorporating the functional taxonomy information into the hierarchy search process and by allowing multi-functional genes to be part of more than one component in the hierarchy. In addition, its underlying HMS scoring metric ensures that functional specificity of the terms across different levels of the hierarchical taxonomy is properly treated.
We have evaluated our method using Saccharomyces cerevisiae data from KEGG and MIPS databases and several other computationally derived and curated datasets. The code and additional supplemental files can be obtained from http://code.google.com/p/functional-annotation-of-hierarchical-modularity/ (Accessed 2012 March 13). PMID:22496762

Padmanabhan, Kanchana; Wang, Kuangyu; Samatova, Nagiza F.

2012-01-01

22

Comparison of Functional Gene Annotation of Toxascaris leonina and Toxocara canis using CLC Genomics Workbench  

PubMed Central

The ascarids, Toxocara canis and Toxascaris leonina, are probably the most common gastrointestinal helminths encountered in dogs. In order to understand biological differences of 2 ascarids, we analyzed gene expression profiles of female adults of T. canis and T. leonina using CLC Genomics Workbench, and the results were compared with those of free-living nematode Caenorhabditis elegans. A total of 2,880 and 7,949 ESTs were collected from T. leonina and T. canis, respectively. The length of ESTs ranged from 106 to 4,637 bp with an average insert size of 820 bp. Overall, our results showed that most functional gene annotations of 2 ascarids were quite similar to each other in 3 major categories, i.e., cellular component, biological process, and molecular function. Although some different transcript expression categories were found, the distance was short and it was not enough to explain their different lifestyles. However, we found distinguished transcript differences between ascarid parasites and free-living nematodes. Understanding evolutionary genetic changes might be helpful for studies of the lifestyle and evolution of parasites. PMID:24327777

Kim, Ki Uk; Park, Sang Kyun; Kang, Shin Ae; Park, Mi Kyung; Cho, Min Kyoung; Jung, Ho-jin; Kim, Kyung-Yun

2013-01-01

23

Phylogenetic molecular function annotation  

NASA Astrophysics Data System (ADS)

It is now easier to discover thousands of protein sequences in a new microbial genome than it is to biochemically characterize the specific activity of a single protein of unknown function. The molecular functions of protein sequences have typically been predicted using homology-based computational methods, which rely on the principle that homologous proteins share a similar function. However, some protein families include groups of proteins with different molecular functions. A phylogenetic approach for predicting molecular function (sometimes called "phylogenomics") is an effective means to predict protein molecular function. These methods incorporate functional evidence from all members of a family that have functional characterizations using the evolutionary history of the protein family to make robust predictions for the uncharacterized proteins. However, they are often difficult to apply on a genome-wide scale because of the time-consuming step of reconstructing the phylogenies of each protein to be annotated. Our automated approach for function annotation using phylogeny, the SIFTER (Statistical Inference of Function Through Evolutionary Relationships) methodology, uses a statistical graphical model to compute the probabilities of molecular functions for unannotated proteins. Our benchmark tests showed that SIFTER provides accurate functional predictions on various protein families, outperforming other available methods.

Engelhardt, Barbara E.; Jordan, Michael I.; Repo, Susanna T.; Brenner, Steven E.

2009-07-01

24

Quality of Computationally Inferred Gene Ontology Annotations  

PubMed Central

Gene Ontology (GO) has established itself as the undisputed standard for protein function annotation. Most annotations are inferred electronically, i.e. without individual curator supervision, but they are widely considered unreliable. At the same time, we crucially depend on those automated annotations, as most newly sequenced genomes are non-model organisms. Here, we introduce a methodology to systematically and quantitatively evaluate electronic annotations. By exploiting changes in successive releases of the UniProt Gene Ontology Annotation database, we assessed the quality of electronic annotations in terms of specificity, reliability, and coverage. Overall, we not only found that electronic annotations have significantly improved in recent years, but also that their reliability now rivals that of annotations inferred by curators when they use evidence other than experiments from primary literature. This work provides the means to identify the subset of electronic annotations that can be relied upon—an important outcome given that >98% of all annotations are inferred without direct curation. PMID:22693439

Skunca, Nives; Altenhoff, Adrian; Dessimoz, Christophe

2012-01-01

25

Annotating Genes of Known and Unknown Function by Large-Scale Coexpression Analysis

PubMed Central

About 40% of the proteins encoded in eukaryotic genomes are proteins of unknown function (PUFs). Their functional characterization remains one of the main challenges in modern biology. In this study we identified the PUF encoding genes from Arabidopsis (Arabidopsis thaliana) using a combination of sequence similarity, domain-based, and empirical approaches. Large-scale gene expression analyses of 1,310 publicly available Affymetrix chips were performed to associate the identified PUF genes with regulatory networks and biological processes of known function. To generate quality results, the study was restricted to expression sets with replicated samples. First, genome-wide clustering and gene function enrichment analysis of clusters allowed us to associate 1,541 PUF genes with tightly coexpressed genes for proteins of known function (PKFs). Over 70% of them could be assigned to more specific biological process annotations than the ones available in the current Gene Ontology release. The most highly overrepresented functional categories in the obtained clusters were ribosome assembly, photosynthesis, and cell wall pathways. Interestingly, the majority of the PUF genes appeared to be controlled by the same regulatory networks as most PKF genes, because clusters enriched in PUF genes were extremely rare. Second, large-scale analysis of differentially expressed genes was applied to identify a comprehensive set of abiotic stress-response genes. This analysis resulted in the identification of 269 PKF and 104 PUF genes that responded to a wide variety of abiotic stresses, whereas 608 PKF and 206 PUF genes responded predominantly to specific stress treatments. The provided coexpression and differentially expressed gene data represent an important resource for guiding future functional characterization experiments of PUF and PKF genes. 
Finally, the public Plant Gene Expression Database (http://bioweb.ucr.edu/PED) was developed as part of this project to provide efficient access and mining tools for the vast gene expression data of this study. PMID:18354039

Horan, Kevin; Jang, Charles; Bailey-Serres, Julia; Mittler, Ron; Shelton, Christian; Harper, Jeff F.; Zhu, Jian-Kang; Cushman, John C.; Gollery, Martin; Girke, Thomas

2008-01-01
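
The "gene function enrichment analysis of clusters" mentioned in the record above is commonly done with a one-sided hypergeometric test. The sketch below assumes that test; the abstract does not say which statistic the authors used, and the counts in the example are hypothetical.

```python
from math import comb


def enrichment_pvalue(population, annotated, cluster_size, hits):
    """P(X >= hits) under the hypergeometric distribution.

    population   -- number of genes in the background set
    annotated    -- background genes carrying the functional term
    cluster_size -- number of genes in the co-expression cluster
    hits         -- cluster genes carrying the term
    """
    total = comb(population, cluster_size)
    upper = min(annotated, cluster_size)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, cluster_size - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 genes, 150 with the term,
# a cluster of 40 genes of which 6 carry the term.
p_value = enrichment_pvalue(20000, 150, 40, 6)
```

A small p-value suggests the term is over-represented in the cluster. In practice it would still need correction for the number of terms tested.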

26

The Genome Sequence of Leishmania (Leishmania) amazonensis: Functional Annotation and Extended Analysis of Gene Models  

PubMed Central

We present the sequencing and annotation of the Leishmania (Leishmania) amazonensis genome, an etiological agent of human cutaneous leishmaniasis in the Amazon region of Brazil. L. (L.) amazonensis shares features with Leishmania (L.) mexicana but also exhibits unique characteristics regarding geographical distribution and clinical manifestations of cutaneous lesions (e.g. borderline disseminated cutaneous leishmaniasis). Predicted genes were scored for orthologous gene families and conserved domains in comparison with other human pathogenic Leishmania spp. Carboxypeptidase, aminotransferase, and 3′-nucleotidase genes and ATPase, thioredoxin, and chaperone-related domains were represented more abundantly in L. (L.) amazonensis and L. (L.) mexicana species. Phylogenetic analysis revealed that these two species share groups of amastin surface proteins unique to the genus that could be related to specific features of disease outcomes and host cell interactions. Additionally, we describe a hypothetical hybrid interactome of potentially secreted L. (L.) amazonensis proteins and host proteins under the assumption that parasite factors mimic their mammalian counterparts. The model predicts an interaction between an L. (L.) amazonensis heat-shock protein and mammalian Toll-like receptor 9, which is implicated in important immune responses such as cytokine and nitric oxide production. The analysis presented here represents valuable information for future studies of leishmaniasis pathogenicity and treatment. PMID:23857904

Real, Fernando; Vidal, Ramon Oliveira; Carazzolle, Marcelo Falsarella; Mondego, Jorge Mauricio Costa; Costa, Gustavo Gilson Lacerda; Herai, Roberto Hirochi; Wurtele, Martin; de Carvalho, Lucas Miguel; e Ferreira, Renata Carmona; Mortara, Renato Arruda; Barbieri, Clara Lucia; Mieczkowski, Piotr; da Silveira, Jose Franco; Briones, Marcelo Ribeiro da Silva; Pereira, Goncalo Amarante Guimaraes; Bahia, Diana

2013-01-01

27

Predicting potential cancer genes by integrating network properties, sequence features and functional annotations.  

PubMed

The discovery of novel cancer genes is one of the main goals in cancer research. Bioinformatics methods can be used to accelerate cancer gene discovery, which may help in the understanding of cancer and the development of drug targets. In this paper, we describe a classifier to predict potential cancer genes that we have developed by integrating multiple lines of biological evidence, including protein-protein interaction network properties, and sequence and functional features. We detected 55 features that were significantly different between cancer genes and non-cancer genes. Fourteen cancer-associated features were chosen to train the classifier. Four machine learning methods, logistic regression, support vector machines (SVMs), BayesNet and decision tree, were explored in the classifier models to distinguish cancer genes from non-cancer genes. The prediction power of the different models was evaluated by 5-fold cross-validation. The area under the receiver operating characteristic curve for logistic regression, SVM, BayesNet and J48 tree models was 0.834, 0.740, 0.800 and 0.782, respectively. Finally, the logistic regression classifier with multiple biological features was applied to the genes in the Entrez database, and 1976 cancer gene candidates were identified. We found that the integrated prediction model performed much better than the models based on the individual lines of biological evidence, and the network and functional features had stronger predictive power than the sequence features in predicting cancer genes. PMID:23838808

Liu, Wei; Xie, HongWei

2013-08-01
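
The evaluation described in the record above (5-fold cross-validation scored by the area under the ROC curve) can be sketched without any machine-learning library. The helper names `roc_auc` and `k_fold_indices` are hypothetical, the folds are contiguous rather than shuffled or stratified, and the labels and scores are made up; none of this comes from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive scores
    higher than a randomly chosen negative (ties count as one half).
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


# Hypothetical labels (1 = cancer gene) and classifier scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
folds = list(k_fold_indices(12, k=5))
```

In a full evaluation, a model would be fitted on each `train` split and `roc_auc` computed on the matching `test` split, with the five values averaged.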

28

Functional annotation of novel lineage-specific genes using co-expression and promoter analysis  

Microsoft Academic Search

BACKGROUND: The diversity of placental architectures within and among mammalian orders is believed to be the result of adaptive evolution. Although the genetic basis for these differences is unknown, some may arise from rapidly diverging and lineage-specific genes. Previously, we identified 91 novel lineage-specific transcripts (LSTs) from a cow term-placenta cDNA library, which are excellent candidates for adaptive placental functions

Charu G Kumar; Robin E Everts; Juan J Loor; Harris A Lewin

2010-01-01

29

Functional Annotation Analytics of Rhodopseudomonas palustris Genomes  

PubMed Central

Rhodopseudomonas palustris, a nonsulphur purple photosynthetic bacterium, has been extensively investigated for its metabolic versatility, including the ability to produce hydrogen gas from sunlight and biomass. The availability of the finished genome sequences of six R. palustris strains (BisA53, BisB18, BisB5, CGA009, HaA2 and TIE-1) combined with online bioinformatics software for integrated analysis presents new opportunities to determine the genomic basis of metabolic versatility and ecological lifestyles of the bacterial species. The purpose of this investigation was to compare the functional annotations available for multiple R. palustris genomes to identify annotations that can be further investigated for strain-specific or uniquely shared phenotypic characteristics. A total of 2,355 protein family Pfam domain annotations were clustered based on presence or absence in the six genomes. The clustering process identified groups of functional annotations including those that could be verified as strain-specific or uniquely shared phenotypes. For example, genes encoding water/glycerol transport were present in the genome sequences of strains CGA009 and BisB5, but absent in strains BisA53, BisB18, HaA2 and TIE-1. Protein structural homology modeling predicted that the two orthologous 240 aa R. palustris aquaporins have water-specific transport function. Based on observations in other microbes, the presence of aquaporin in R. palustris strains may improve freeze tolerance in natural conditions of rapid freezing such as nitrogen fixation at low temperatures where access to liquid water is a limiting factor for nitrogenase activation. In the case of adaptive loss of aquaporin genes, strains may be better adapted to survive in conditions of high-sugar content such as fermentation of biomass for biohydrogen production. Finally, web-based resources were developed to allow for interactive, user-defined selection of the relationship between protein family annotations and the R. palustris genomes. PMID:22084572

Simmons, Shaneka S.; Isokpehi, Raphael D.; Brown, Shyretha D.; McAllister, Donee L.; Hall, Charnia C.; McDuffy, Wanaki M.; Medley, Tamara L.; Udensi, Udensi K.; Rajnarayanan, Rajendram V.; Ayensu, Wellington K.; Cohly, Hari H.P.

2011-01-01
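
Clustering annotations "based on presence or absence in the six genomes", as in the record above, can be approximated by grouping domains that share the same presence/absence pattern. This is only a sketch under that assumption: the strain names come from the abstract, but the domain labels and the helper names are hypothetical, and the authors' actual clustering procedure is not described here.

```python
from collections import defaultdict

# Strain names as listed in the abstract.
STRAINS = ("BisA53", "BisB18", "BisB5", "CGA009", "HaA2", "TIE-1")


def group_by_pattern(domain_to_strains):
    """Group domains by their presence/absence pattern across STRAINS."""
    groups = defaultdict(list)
    for domain, present in sorted(domain_to_strains.items()):
        pattern = tuple(strain in present for strain in STRAINS)
        groups[pattern].append(domain)
    return dict(groups)


def strains_for(pattern):
    """Translate a boolean pattern back into strain names."""
    return [strain for strain, flag in zip(STRAINS, pattern) if flag]


# Hypothetical domain labels; only the aquaporin pattern follows the abstract.
annotations = {
    "aquaporin_like": {"CGA009", "BisB5"},
    "domain_x": {"CGA009", "BisB5"},
    "domain_y": set(STRAINS),
    "domain_z": {"TIE-1"},
}
groups = group_by_pattern(annotations)
shared_by_two = (False, False, True, True, False, False)
```

Patterns with a single `True` correspond to strain-specific annotations, and patterns such as `shared_by_two` to uniquely shared ones.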

30

Using reasoning to guide annotation with gene ontology terms in GOAT  

Microsoft Academic Search

High-quality annotation of biological data is central to bioinformatics. Annotation using terms from ontologies provides reliable computational access to data. The Gene Ontology (GO), a structured controlled vocabulary of nearly 17,000 terms, is becoming the de facto standard for describing the functionality of gene products. Many prominent biomedical databases use GO as a source of terms for functional annotation of

Michael Bada; Daniele Turi; Robin McEntire; Robert Stevens

2004-01-01

31

Characterizing the state of the art in the computational assignment of gene function: lessons from the first critical assessment of functional annotation (CAFA)  

PubMed Central

The assignment of gene function remains a difficult but important task in computational biology. The establishment of the first Critical Assessment of Functional Annotation (CAFA) was aimed at increasing progress in the field. We present an independent analysis of the results of CAFA, aimed at identifying challenges in assessment and at understanding trends in prediction performance. We found that well-accepted methods based on sequence similarity (i.e., BLAST) have a dominant effect. Many of the most informative predictions turned out to be either recovering existing knowledge about sequence similarity or were "post-dictions" already documented in the literature. These results indicate that deep challenges remain in even defining the task of function assignment, with a particular difficulty posed by the problem of defining function in a way that is not dependent on either flawed gold standards or the input data itself. In particular, we suggest that using the Gene Ontology (or other similar systematizations of function) as a gold standard is unlikely to be the way forward. PMID:23630983

2013-01-01

32

Genetic Annotation of Gain-Of-Function Screens Using RNA Interference and in Situ Hybridization of Candidate Genes in the Drosophila Wing  

PubMed Central

Gain-of-function screens in Drosophila are an effective method with which to identify genes that affect the development of particular structures or cell types. It has been found that a fraction of 2–10% of the genes tested, depending on the particularities of the screen, results in a discernible phenotype when overexpressed. However, it is not clear to what extent a gain-of-function phenotype generated by overexpression is informative about the normal function of the gene. Thus, very few reports attempt to correlate the loss- and overexpression phenotype for collections of genes identified in gain-of-function screens. In this work we use RNA interference and in situ hybridization to annotate a collection of 123 P-GS insertions that in combination with different Gal4 drivers affect the size and/or patterning of the wing. We identify the gene causing the overexpression phenotype by expressing, in a background of overexpression, RNA interference for the genes affected by each P-GS insertion. Then, we compare the loss and gain-of-function phenotypes obtained for each gene and relate them to its expression pattern in the wing disc. We find that 52% of genes identified by their overexpression phenotype are required during normal development. However, only in 9% of the cases analyzed was there some complementarity between the gain- and loss-of-function phenotype, suggesting that, in general, the overexpression phenotypes would not be indicative of the normal requirements of the gene. PMID:22798488

Molnar, Cristina; Casado, Mar; Lopez-Varea, Ana; Cruz, Cristina; de Celis, Jose F.

2012-01-01

33

Functional Annotation, Genome Organization and Phylogeny of the Grapevine (Vitis vinifera) Terpene Synthase Gene Family Based on Genome Assembly, FLcDNA Cloning, and Enzyme Assays  

PubMed Central

Background Terpenoids are among the most important constituents of grape flavour and wine bouquet, and serve as useful metabolite markers in viticulture and enology. Based on the initial 8-fold sequencing of a nearly homozygous Pinot noir inbred line, 89 putative terpenoid synthase genes (VvTPS) were predicted by in silico analysis of the grapevine (Vitis vinifera) genome assembly [1]. The finding of this very large VvTPS family, combined with the importance of terpenoid metabolism for the organoleptic properties of grapevine berries and finished wines, prompted a detailed examination of this gene family at the genomic level as well as an investigation into VvTPS biochemical functions. Results We present findings from the analysis of the up-dated 12-fold sequencing and assembly of the grapevine genome that place the number of predicted VvTPS genes at 69 putatively functional VvTPS, 20 partial VvTPS, and 63 VvTPS probable pseudogenes. Gene discovery and annotation included information about gene architecture and chromosomal location. A dense cluster of 45 VvTPS is localized on chromosome 18. Extensive FLcDNA cloning, gene synthesis, and protein expression enabled functional characterization of 39 VvTPS; this is the largest number of functionally characterized TPS for any species reported to date. Of these enzymes, 23 have unique functions and/or phylogenetic locations within the plant TPS gene family. Phylogenetic analyses of the TPS gene family showed that while most VvTPS form species-specific gene clusters, there are several examples of gene orthology with TPS of other plant species, representing perhaps more ancient VvTPS, which have maintained functions independent of speciation. Conclusions The highly expanded VvTPS gene family underpins the prominence of terpenoid metabolism in grapevine. 
We provide a detailed experimental functional annotation of 39 members of this important gene family in grapevine and comprehensive information about gene structure and phylogeny for the entire currently known VvTPS gene family. PMID:20964856

2010-01-01

34

Automatically identifying and annotating mouse embryo gene expression patterns  

PubMed Central

Motivation: Deciphering the regulatory and developmental mechanisms for multicellular organisms requires detailed knowledge of gene interactions and gene expressions. The availability of large datasets with both spatial and ontological annotation of the spatio-temporal patterns of gene expression in mouse embryo provides a powerful resource to discover the biological function of embryo organization. Ontological annotation of gene expressions consists of labelling images with terms from the anatomy ontology for mouse development. If the spatial genes of an anatomical component are expressed in an image, the image is then tagged with a term of that anatomical component. The current annotation is done manually by domain experts, which is both time consuming and costly. In addition, the level of detail is variable, and inevitably errors arise from the tedious nature of the task. In this article, we present a new method to automatically identify and annotate gene expression patterns in the mouse embryo with anatomical terms. Results: The method takes images from in situ hybridization studies and the ontology for the developing mouse embryo, it then combines machine learning and image processing techniques to produce classifiers that automatically identify and annotate gene expression patterns in these images. We evaluate our method on image data from the EURExpress study, where we use it to automatically classify nine anatomical terms: humerus, handplate, fibula, tibia, femur, ribs, petrous part, scapula and head mesenchyme. The accuracy of our method lies between 70% and 80% with few exceptions. We show that other known methods have lower classification performance than ours. We have investigated the images misclassified by our method and found several cases where the original annotation was not correct. This shows our method is robust against this kind of noise. 
Availability: The annotation result and the experimental dataset in the article can be freely accessed at http://www2.docm.mmu.ac.uk/STAFF/L.Han/geneannotation/. Contact: l.han@mmu.ac.uk Supplementary Information: Supplementary data are available at Bioinformatics online. PMID:21357576

Han, Liangxiu; van Hemert, Jano I.; Baldock, Richard A.

2011-01-01

35

DATABASE Open Access DFLAT: functional annotation for human  

E-print Network

of the developing human fetus and neonate have led to a need for widespread characterization of the functional roles developmental context for scientists wishing to study gene function in the human fetus. Description DATABASE Open Access DFLAT: functional annotation for human development Heather C Wick1* , Harold

Kaski, Samuel

36

GFam: a platform for automatic annotation of gene families.  

PubMed

We have developed GFam, a platform for automatic annotation of gene/protein families. GFam provides a framework for genome initiatives and model organism resources to build domain-based families and derive meaningful functional labels, and offers a seamless approach to propagate functional annotation across periodic genome updates. GFam is a hybrid approach that uses a greedy algorithm to chain component domains from InterPro annotation provided by its 12 member resources, followed by a sequence-based connected component analysis of un-annotated sequence regions, to derive a consensus domain architecture for each sequence and subsequently generate families based on common architectures. For the proteome of Arabidopsis, our integrated approach increases sequence coverage by 7.2 percentage points and residue coverage by 14.6 percentage points relative to the best single-constituent database within InterPro. The true power of GFam lies in maximizing annotation provided by the different InterPro data sources that offer resource-specific coverage for different regions of a sequence. GFam's capability to capture higher sequence and residue coverage can be useful for genome annotation, comparative genomics and functional studies. GFam is general-purpose software and can be used for any collection of protein sequences. The software is open source and can be obtained from http://www.paccanarolab.org/software/gfam/. PMID:22790981

Sasidharan, Rajkumar; Nepusz, Tamás; Swarbreck, David; Huala, Eva; Paccanaro, Alberto

2012-10-01

37

Systematic condition-dependent annotation of metabolic genes  

E-print Network

of California, San Diego, California 92093-0412, USA; 3 School of Medicine, Tel-Aviv University, Tel-Aviv 69978). Both the ontology and the corresponding annotation are constantly updated based on various work has provided several lines of evidence for the condition-dependent nature of gene function

Shamir, Ron

38

Taxonomic Precision of Different Hypervariable Regions of 16S rRNA Gene and Annotation Methods for Functional Bacterial Groups in Biological Wastewater Treatment  

PubMed Central

High-throughput sequencing of the 16S rRNA gene gives a deeper understanding of bacterial diversity in complex environmental samples, but introduces blurring due to the relatively low taxonomic resolution of short reads. For wastewater treatment plants, only those functional bacterial genera categorized as nutrient remediators, bulking/foaming species, and potential pathogens are significant to biological wastewater treatment and environmental impacts. Precise taxonomic assignment of these bacteria, at least at genus level, is important for microbial ecological research and routine wastewater treatment monitoring. Therefore, the focus of this study was to evaluate the taxonomic precision of different ribosomal RNA (rRNA) gene hypervariable regions generated from a mixed activated sludge sample. In addition, three commonly used classification methods, RDP Classifier, BLAST-based best-hit annotation, and the lowest common ancestor annotation by MEGAN, were evaluated by comparing their consistency. An unsupervised analysis of consistency among the classification methods suggests that no hypervariable region gives good taxonomic coverage for all genera. Taxonomic assignment based on certain regions of the 16S rRNA gene, e.g. the V1 and V2 regions, is fairly consistent for a relatively wide range of genera. Hence, it is recommended to use these regions for studying functional groups in activated sludge. Moreover, the inconsistency among methods also demonstrated that a specific method might not be suitable for identification of some bacterial genera using certain 16S rRNA gene regions. As a general rule, drawing conclusions based on only one sequencing region and one classification method should be avoided because of potential false negative results. PMID:24146837

Guo, Feng; Ju, Feng; Cai, Lin; Zhang, Tong

2013-01-01
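
The lowest common ancestor (LCA) annotation mentioned in the record above can be illustrated with a minimal Python sketch. It assumes each hit for a read is already available as a root-to-genus lineage list; the function name and the example lineages are illustrative assumptions, and the sketch is a simplification of the general idea, not MEGAN's actual algorithm or parameters.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxonomic prefix shared by every lineage.

    Each lineage is a list of rank names ordered from root to tip.
    (Hypothetical helper for illustration only.)
    """
    if not lineages:
        return []
    shared = []
    # zip(*lineages) walks all lineages rank by rank, from the root down.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break  # first disagreement: stop at the previous rank
    return shared


# Illustrative lineages for two hits of one read (assumed, not from the study).
hit_a = ["Bacteria", "Proteobacteria", "Betaproteobacteria",
         "Nitrosomonadales", "Nitrosomonadaceae", "Nitrosomonas"]
hit_b = ["Bacteria", "Proteobacteria", "Betaproteobacteria",
         "Nitrosomonadales", "Nitrosomonadaceae", "Nitrosospira"]

print(lowest_common_ancestor([hit_a, hit_b]))
```

Because the two hits disagree at the genus level, the read is assigned only to the family they share, which is why LCA-style annotation tends to be more conservative than best-hit annotation.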

39

DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists  

Microsoft Academic Search

All tools in the DAVID Bioinformatics Resources aim to provide functional interpretation of large lists of genes derived from genomic studies. The newly updated DAVID Bioinformatics Resources consists of the DAVID Knowledgebase and five integrated, web-based functional annotation tool suites: the DAVID Gene Functional Classification Tool, the DAVID Functional Annotation Tool, the DAVID Gene ID Conversion Tool, the DAVID Gene

Da Wei Huang; Brad T. Sherman; Qina Tan; Joseph Kir; David Liu; David Bryant; Yongjian Guo; Robert M. Stephens; Michael W. Baseler; Richard A. Lempicki

2007-01-01

40

Logical Gene Ontology Annotations (GOAL): Exploring gene ontology annotations with OWL  

E-print Network

-protein coupled receptor activity’ [GO:0004930] and ‘signal transduction’ [GO:0007165]. We can create an OWL class that captures the annotations using the following Manchester OWL syntax [31] (note that an axiom annotation is used to assert the evidence code... classes and create a set of defined classes that enable us to query for gene products. For example, for the GO class G-protein coupled receptor activity we would create a new class that queries for the gene product using the following Manchester OWL syntax...

2012-04-24

41

SFannotation: A Simple and Fast Protein Function Annotation System  

PubMed Central

Owing to the generation of vast amounts of sequencing data by using cost-effective, high-throughput sequencing technologies with improved computational approaches, many putative proteins have been discovered after assembly and structural annotation. Putative proteins are typically annotated using a functional annotation system that uses extant databases, but the expansive size of these databases often causes a bottleneck for rapid functional annotation. We developed SFannotation, a simple and fast functional annotation system that rapidly annotates putative proteins against four extant databases, Swiss-Prot, TIGRFAMs, Pfam, and the non-redundant sequence database, by using a best-hit approach with BLASTP and HMMSEARCH. PMID:25031571

Kim, Byung Kwon

2014-01-01
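
The best-hit approach named in the record above can be sketched in a few lines of Python. The sketch assumes BLASTP results in the standard 12-column tabular layout (query ID in the first column, subject ID in the second, bit score in the twelfth) and picks the highest bit score per query; the function name and sample identifiers are hypothetical, and SFannotation's own selection criteria are not stated in the abstract.

```python
def best_hits(tabular_lines):
    """Map each query ID to its (subject ID, bit score) with the highest bit score.

    Assumes 12 tab-separated columns per line, as in BLAST tabular output.
    (Hypothetical helper for illustration only.)
    """
    best = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


# Made-up example rows: two candidate hits for q1, one for q2.
rows = [
    "q1\tprotA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t180.0",
    "q1\tprotB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t200.0",
    "q2\tprotC\t80.0\t120\t24\t0\t1\t120\t1\t120\t1e-40\t150.0",
]
for query, (subject, score) in best_hits(rows).items():
    print(query, subject, score)
```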

42

Predicting Novel Human Gene Ontology Annotations Using Semantic Analysis  

PubMed Central

The correct interpretation of many molecular biology experiments depends in an essential way on the accuracy and consistency of the existing annotation databases. Such databases are meant to act as repositories for our biological knowledge as we acquire and refine it. Hence, by definition, they are incomplete at any given time. In this paper, we describe a technique that improves our previous method for predicting novel GO annotations by extracting implicit semantic relationships between genes and functions. In this work, we use a vector space model and a number of weighting schemes in addition to our previous latent semantic indexing approach. The technique described here is able to take into consideration the hierarchical structure of the Gene Ontology (GO) and can weight differently GO terms situated at different depths. The prediction abilities of 15 different weighting schemes are compared and evaluated. Nine such schemes were previously used in other problem domains, while six of them are introduced in this paper. The best weighting scheme was a novel scheme, n2tn. Out of the top 50 functional annotations predicted using this weighting scheme, we found support in the literature for 84 percent of them, while 6 percent of the predictions were contradicted by the existing literature. For the remaining 10 percent, we did not find any relevant publications to confirm or contradict the predictions. The n2tn weighting scheme also outperformed the simple binary scheme used in our previous approach. PMID:20150671

Done, Bogdan; Khatri, Purvesh; Done, Arina; Draghici, Sorin

2013-01-01

43

HMM-Based Gene Annotation Methods  

SciTech Connect

Development of new statistical methods and computational tools to identify genes in human genomic DNA, and to provide clues to their functions by identifying features such as transcription factor binding sites, tissue-specific expression and splicing patterns, and remote homologies at the protein level with genes of known function.

Haussler, David; Hughey, Richard; Karplus, Keven

1999-09-20

44

Functional annotation of the human retinal pigment epithelium transcriptome  

Microsoft Academic Search

BACKGROUND: To determine level, variability and functional annotation of gene expression of the human retinal pigment epithelium (RPE), the key tissue involved in retinal diseases like age-related macular degeneration and retinitis pigmentosa. Macular RPE cells from six selected healthy human donor eyes (aged 63–78 years) were laser dissected and used for 22k microarray studies (Agilent technologies). Data were analyzed with

Judith C Booij; Simone van Soest; Sigrid MA Swagemakers; Anke HW Essing; Annemieke JMH Verkerk; Peter J van der Spek; Theo GMF Gorgels; Arthur AB Bergen

2009-01-01

45

Visual Presentation as a Welcome Alternative to Textual Presentation of Gene Annotation Information  

PubMed Central

The functions of a gene are traditionally annotated textually using either free text (Gene Reference Into Function or GeneRIF) or controlled vocabularies (e.g., Gene Ontology or Disease Ontology). Inspired by the latest word cloud tools developed by the Information Visualization Group at IBM Research, we have prototyped a visual system for capturing gene annotations, which we named Gene Graph Into Function or GeneGIF. Fully developing the GeneGIF system would be a significant effort. To justify the necessity and to specify the design requirements of GeneGIF, we first surveyed the end-user preferences. From 53 responses, we found that a majority (64%, p < 0.05) of the users were either positive or neutral toward using GeneGIF in their daily work (acceptance); in terms of preference, a slight majority (51%, p > 0.05) of the users favored visual presentation of information (GeneGIF) compared to textual (GeneRIF) information. The results of this study indicate that a visual presentation tool, such as GeneGIF, can complement standard textual presentation of gene annotations. Moreover, the survey participants provided many constructive comments that will specify the development of a phase-two project (http://128.248.174.241/) to visually annotate each gene in the human genome. PMID:20865558

Desai, Jairav; Flatow, Jared M.; Song, Jie; Zhu, Lihua J.; Du, Pan; Huang, Chiang-Ching; Lu, Hui; Lin, Simon M.

2010-01-01

46

Improving functional annotation for industrial microbes: a case study with Pichia pastoris.  

PubMed

The research communities studying microbial model organisms, such as Escherichia coli or Saccharomyces cerevisiae, are well served by model organism databases that have extensive functional annotation. However, this is not true of many industrial microbes that are used widely in biotechnology. In this Opinion piece, we use Pichia (Komagataella) pastoris to illustrate the limitations of the available annotation. We consider the resources that can be implemented in the short term both to improve Gene Ontology (GO) annotation coverage based on annotation transfer, and to establish curation pipelines for the literature corpus of this organism. PMID:24929579

Dikicioglu, Duygu; Wood, Valerie; Rutherford, Kim M; McDowall, Mark D; Oliver, Stephen G

2014-08-01

47

Transcript Annotation in FANTOM3: Mouse Gene Catalog Based on Physical cDNAs  

Microsoft Academic Search

The international FANTOM consortium aims to produce a comprehensive picture of the mammalian transcriptome, based upon an extensive cDNA collection and functional annotation of full-length enriched cDNAs. The previous dataset, FANTOM2, comprised 60,770 full-length enriched cDNAs. Functional annotation revealed that this cDNA dataset contained only about half of the estimated number of mouse protein-coding genes, indicating that a number of

Norihiro Maeda; Takeya Kasukawa; Rieko Oyama; Julian Gough; Martin Frith; Pär G. Engström; Boris Lenhard; Rajith N. Aturaliya; Serge Batalov; Kirk W. Beisel; Carol J. Bult; Colin F. Fletcher; Alistair R. R. Forrest; Masaaki Furuno; David Hill; Masayoshi Itoh; Mutsumi Kanamori-Katayama; Shintaro Katayama; Masaru Katoh; Tsugumi Kawashima; John Quackenbush; Timothy Ravasi; Brian Z. Ring; Kazuhiro Shibata; Koji Sugiura; Yoichi Takenaka; Rohan D. Teasdale; Christine A. Wells; Yunxia Zhu; Chikatoshi Kai; Jun Kawai; David A. Hume; Piero Carninci; Yoshihide Hayashizaki

2006-01-01

48

A robust data-driven approach for gene ontology annotation  

PubMed Central

Gene ontology (GO) and GO annotation are important resources for biological information management and knowledge discovery, but the speed of manual annotation has become a major bottleneck of database curation. The BioCreative IV GO annotation task aims to evaluate the performance of systems that automatically assign GO terms to genes based on the narrative sentences in biomedical literature. This article presents our work in this task as well as the experimental results after the competition. For the evidence sentence extraction subtask, we built a binary classifier to identify evidence sentences using reference distance estimator (RDE), a recently proposed semi-supervised learning method that learns new features from around 10 million unlabeled sentences, achieving an F1 of 19.3% in exact match and 32.5% in relaxed match. In the post-submission experiment, we obtained 22.1% and 35.7% F1 performance by incorporating bigram features in RDE learning. In both development and test sets, the RDE-based method achieved over 20% relative improvement on F1 and AUC performance against classical supervised learning methods, e.g. support vector machine and logistic regression. For the GO term prediction subtask, we developed an information retrieval-based method to retrieve the GO term most relevant to each evidence sentence using a ranking function that combined cosine similarity and the frequency of GO terms in documents, and a filtering method based on high-level GO classes. The best performance of our submitted runs was 7.8% F1 and 22.2% hierarchy F1. We found that the incorporation of frequency information and hierarchy filtering substantially improved the performance. In the post-submission evaluation, we obtained a 10.6% F1 using a simpler setting. Overall, the experimental analysis showed our approaches were robust in both tasks.

Li, Yanpeng; Yu, Hong

2014-01-01

49

A robust data-driven approach for gene ontology annotation.  

PubMed

Gene ontology (GO) and GO annotation are important resources for biological information management and knowledge discovery, but the speed of manual annotation has become a major bottleneck of database curation. The BioCreative IV GO annotation task aims to evaluate the performance of systems that automatically assign GO terms to genes based on the narrative sentences in biomedical literature. This article presents our work in this task as well as the experimental results after the competition. For the evidence sentence extraction subtask, we built a binary classifier to identify evidence sentences using reference distance estimator (RDE), a recently proposed semi-supervised learning method that learns new features from around 10 million unlabeled sentences, achieving an F1 of 19.3% in exact match and 32.5% in relaxed match. In the post-submission experiment, we obtained 22.1% and 35.7% F1 performance by incorporating bigram features in RDE learning. In both development and test sets, the RDE-based method achieved over 20% relative improvement on F1 and AUC performance against classical supervised learning methods, e.g. support vector machine and logistic regression. For the GO term prediction subtask, we developed an information retrieval-based method to retrieve the GO term most relevant to each evidence sentence using a ranking function that combined cosine similarity and the frequency of GO terms in documents, and a filtering method based on high-level GO classes. The best performance of our submitted runs was 7.8% F1 and 22.2% hierarchy F1. We found that the incorporation of frequency information and hierarchy filtering substantially improved the performance. In the post-submission evaluation, we obtained a 10.6% F1 using a simpler setting. Overall, the experimental analysis showed our approaches were robust in both tasks. PMID:25425037

Li, Yanpeng; Yu, Hong

2014-01-01

50

Functional annotation of a full-length mouse cDNA collection  

Microsoft Academic Search

The RIKEN Mouse Gene Encyclopaedia Project, a systematic approach to determining the full coding potential of the mouse genome, involves collection and sequencing of full-length complementary DNAs and physical mapping of the corresponding genes to the mouse genome. We organized an international functional annotation meeting (FANTOM) to annotate the first 21,076 cDNAs to be analysed in this project. Here we

J. Kawai; A. Shinagawa; K. Shibata; M. Yoshino; M. Itoh; Y. Ishii; T. Arakawa; A. Hara; Y. Fukunishi; H. Konno; J. Adachi; S. Fukuda; K. Aizawa; M. Izawa; K. Nishi; H. Kiyosawa; S. Kondo; I. Yamanaka; T. Saito; Y. Okazaki; T. Gojobori; H. Bono; T. Kasukawa; R. Saito; K. Kadota; H. Matsuda; M. Ashburner; S. Batalov; T. Casavant; W. Fleischmann; T. Gaasterland; C. Gissi; B. King; H. Kochiwa; P. Kuehl; S. Lewis; Y. Matsuo; I. Nikaido; G. Pesole; J. Quackenbush; L. M. Schriml; F. Staubli; R. Suzuki; M. Tomita; L. Wagner; T. Washio; K. Sakai; T. Okido; M. Furuno; H. Aono; R. Baldarelli; G. Barsh; J. Blake; D. Boffelli; N. Bojunga; P. Carninci; M. F. de Bonaldo; M. J. Brownstein; C. Bult; C. Fletcher; M. Fujita; M. Gariboldi; S. Gustincich; D. Hill; M. Hofmann; D. A. Hume; M. Kamiya; N. H. Lee; P. Lyons; L. Marchionni; J. Mashima; J. Mazzarelli; P. Mombaerts; P. Nordone; B. Ring; M. Ringwald; I. Rodriguez; N. Sakamoto; H. Sasaki; K. Sato; C. Schönbach; T. Seya; Y. Shibata; K.-F. Storch; H. Suzuki; K. Toyo-oka; K. H. Wang; C. Weitz; C. Whittaker; L. Wilming; A. Wynshaw-Boris; K. Yoshida; Y. Hasegawa; H. Kawaji; S. Kohtsuki; Y. Hayashizaki

2001-01-01

51

Mining the Gene Wiki for functional genomic knowledge  

PubMed Central

Background Ontology-based gene annotations are important tools for organizing and analyzing genome-scale biological data. Collecting these annotations is a valuable but costly endeavor. The Gene Wiki makes use of Wikipedia as a low-cost, mass-collaborative platform for assembling text-based gene annotations. The Gene Wiki is comprised of more than 10,000 review articles, each describing one human gene. The goal of this study is to define and assess a computational strategy for translating the text of Gene Wiki articles into ontology-based gene annotations. We specifically explore the generation of structured annotations using the Gene Ontology and the Human Disease Ontology. Results Our system produced 2,983 candidate gene annotations using the Disease Ontology and 11,022 candidate annotations using the Gene Ontology from the text of the Gene Wiki. Based on manual evaluations and comparisons to reference annotation sets, we estimate a precision of 90-93% for the Disease Ontology annotations and 48-64% for the Gene Ontology annotations. We further demonstrate that this data set can systematically improve the results from gene set enrichment analyses. Conclusions The Gene Wiki is a rapidly growing corpus of text focused on human gene function. Here, we demonstrate that the Gene Wiki can be a powerful resource for generating ontology-based gene annotations. These annotations can be used immediately to improve workflows for building curated gene annotation databases and knowledge-based statistical analyses. PMID:22165947

2011-01-01

52

De Novo Assembly, Functional Annotation and Comparative Analysis of Withania somnifera Leaf and Root Transcriptomes to Identify Putative Genes Involved in the Withanolides Biosynthesis  

PubMed Central

Withania somnifera is one of the most valuable medicinal plants used in Ayurvedic and other indigenous medicine systems due to bioactive molecules known as withanolides. As genomic information regarding this plant is very limited, little information is available about biosynthesis of withanolides. To facilitate the basic understanding about the withanolide biosynthesis pathways, we performed transcriptome sequencing for Withania leaf (101L) and root (101R) which specifically synthesize withaferin A and withanolide A, respectively. Pyrosequencing yielded 8,34,068 and 7,21,755 reads which got assembled into 89,548 and 1,14,814 unique sequences from 101L and 101R, respectively. A total of 47,885 (101L) and 54,123 (101R) could be annotated using TAIR10, NR, tomato and potato databases. Gene Ontology and KEGG analyses provided a detailed view of all the enzymes involved in withanolide backbone synthesis. Our analysis identified members of cytochrome P450, glycosyltransferase and methyltransferase gene families with unique presence or differential expression in leaf and root and might be involved in synthesis of tissue-specific withanolides. We also detected simple sequence repeats (SSRs) in transcriptome data for use in future genetic studies. Comprehensive sequence resource developed for Withania, in this study, will help to elucidate biosynthetic pathway for tissue-specific synthesis of secondary plant products in non-model plant organisms as well as will be helpful in developing strategies for enhanced biosynthesis of withanolides through biotechnological approaches. PMID:23667511

Gupta, Parul; Goel, Ridhi; Pathak, Sumya; Srivastava, Apeksha; Singh, Surya Pratap; Sangwan, Rajender Singh; Asif, Mehar Hasan; Trivedi, Prabodh Kumar

2013-01-01

53

The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology  

Microsoft Academic Search

The Gene Ontology Annotation (GOA) database (http://www.ebi.ac.uk/GOA) aims to provide high-quality electronic and manual annotations to the UniProt Knowledgebase (Swiss-Prot, TrEMBL and PIR-PSD) using the standardized vocabulary of the Gene Ontology (GO). As a supplementary archive of GO annotation, GOA promotes a high level of integration of the knowledge represented in UniProt with other databases. This is achieved

Evelyn Camon; Michele Magrane; Daniel Barrell; Vivian Lee; Emily Dimmer; John Maslen; David Binns; Nicola Harte; Rodrigo Lopez; Rolf Apweiler

2004-01-01

54

Drosophila gene expression pattern annotation through multi-instance multi-label learning.  

PubMed

In the studies of Drosophila embryogenesis, a large number of two-dimensional digital images of gene expression patterns have been produced to build an atlas of spatio-temporal gene expression dynamics across developmental time. Gene expressions captured in these images have been manually annotated with anatomical and developmental ontology terms using a controlled vocabulary (CV), which are useful in research aimed at understanding gene functions, interactions, and networks. With the rapid accumulation of images, the process of manual annotation has become increasingly cumbersome, and computational methods to automate this task are urgently needed. However, the automated annotation of embryo images is challenging. This is because the annotation terms spatially correspond to local expression patterns of images, yet they are assigned collectively to groups of images and it is unknown which term corresponds to which region of which image in the group. In this paper, we address this problem using a new machine learning framework, Multi-Instance Multi-Label (MIML) learning. We first show that the underlying nature of the annotation task is a typical MIML learning problem. Then, we propose two support vector machine algorithms under the MIML framework for the task. Experimental results on the FlyExpress database (a digital library of standardized Drosophila gene expression pattern images) reveal that the exploitation of MIML framework leads to significant performance improvement over state-of-the-art approaches. PMID:21519115

Li, Ying-Xin; Ji, Shuiwang; Kumar, Sudhir; Ye, Jieping; Zhou, Zhi-Hua

2012-01-01

55

GOChase-II: correcting semantic inconsistencies from Gene Ontology-based annotations for gene products  

PubMed Central

Background The Gene Ontology (GO) provides a controlled vocabulary for describing genes and gene products. In spite of the undoubted importance of GO, several drawbacks associated with GO and GO-based annotations have been introduced. We identified three types of semantic inconsistencies in GO-based annotations: semantically redundant, biological-domain inconsistent and taxonomy inconsistent annotations. Methods To determine the semantic inconsistencies in GO annotation, we used the hierarchical structure of the GO graph and the tree structure of the NCBI taxonomy. Twenty-seven biological databases were collected for finding semantically inconsistent annotations. Results The distributions and possible causes of the semantic inconsistencies were investigated using twenty-seven biological databases with GO-based annotations. We found that some evidence codes of annotation were associated with the inconsistencies. The numbers of gene products and species in a database, which are related to the complexity of database management, are also correlated with the inconsistencies. Consequently, numerous annotation errors arise and are propagated throughout biological databases and GO-based high-level analyses. GOChase-II is developed to detect and correct both syntactic and semantic errors in GO-based annotations. Conclusions We identified some inconsistencies in GO-based annotation and provided software, GOChase-II, for correcting these semantic inconsistencies in addition to the previous corrections for the syntactic errors by GOChase-I. PMID:21342572

2011-01-01

56

eggNOG: automated construction and annotation of orthologous groups of genes.  

PubMed

The identification of orthologous genes forms the basis for most comparative genomics studies. Existing approaches either lack functional annotation of the identified orthologous groups, hampering the interpretation of subsequent results, or are manually annotated and thus lag behind the rapid sequencing of new genomes. Here we present the eggNOG database ('evolutionary genealogy of genes: Non-supervised Orthologous Groups'), which contains orthologous groups constructed from Smith-Waterman alignments through identification of reciprocal best matches and triangular linkage clustering. Applying this procedure to 312 bacterial, 26 archaeal and 35 eukaryotic genomes yielded 43 582 coarse-grained orthologous groups of which 9724 are extended versions of those from the original COG/KOG database. We also constructed more fine-grained groups for selected subsets of organisms, such as the 19 914 mammalian orthologous groups. We automatically annotated our non-supervised orthologous groups with functional descriptions, which were derived by identifying common denominators for the genes based on their individual textual descriptions, annotated functional categories, and predicted protein domains. The orthologous groups in eggNOG contain 1 241 751 genes and provide at least a broad functional description for 77% of them. Users can query the resource for individual genes via a web interface or download the complete set of orthologous groups at http://eggnog.embl.de. PMID:17942413

Jensen, Lars Juhl; Julien, Philippe; Kuhn, Michael; von Mering, Christian; Muller, Jean; Doerks, Tobias; Bork, Peer

2008-01-01
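
The "reciprocal best matches" step named in the record above can be sketched as follows in Python. The sketch assumes pairwise alignment scores between genes of two genomes are already computed and stored in dictionaries keyed by (query gene, subject gene); the function names and toy scores are hypothetical, and the triangular linkage clustering that eggNOG applies afterwards is not shown.

```python
def best_match(scores):
    """For each query gene, return the subject gene with the highest score."""
    top = {}
    for (query, subject), score in scores.items():
        if query not in top or score > top[query][1]:
            top[query] = (subject, score)
    return {query: subject for query, (subject, _) in top.items()}


def reciprocal_best_matches(scores_ab, scores_ba):
    """Return sorted (gene_a, gene_b) pairs that are each other's best match.

    scores_ab: scores for genome A genes searched against genome B.
    scores_ba: scores for genome B genes searched against genome A.
    (Hypothetical helper for illustration only.)
    """
    best_ab = best_match(scores_ab)
    best_ba = best_match(scores_ba)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy scores (assumed values): a1 and b1 pick each other; a2 picks b1,
# but b1 prefers a1, so a2 has no reciprocal partner.
scores_ab = {("a1", "b1"): 300, ("a1", "b2"): 120, ("a2", "b1"): 200}
scores_ba = {("b1", "a1"): 310, ("b1", "a2"): 190, ("b2", "a1"): 110}

print(reciprocal_best_matches(scores_ab, scores_ba))
```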

57

SUS-BAR: a database of pig proteins with statistically validated structural and functional annotation.  

PubMed

Given the relevance of the pig proteome in different studies, including human complex maladies, a statistical validation of the annotation is required for a better understanding of the role of specific genes and proteins in the complex networks underlying biological processes in the animal. Presently, approximately 80% of the pig proteome is still poorly annotated, and the existence of protein sequences is routinely inferred automatically by sequence alignment towards preexisting sequences. In this article, we introduce SUS-BAR, a database that derives information mainly from UniProt Knowledgebase and that includes 26 206 pig protein sequences. In SUS-BAR, 16 675 of the pig protein sequences are endowed with statistically validated functional and structural annotation. Our statistical validation is determined by adopting a cluster-centric annotation procedure that allows transfer of different types of annotation, including structure and function. Each sequence in the database can be associated with a set of statistically validated Gene Ontologies (GOs) of the three main sub-ontologies (Molecular Function, Biological Process and Cellular Component), with Pfam functional domains, and when possible, with a cluster Hidden Markov Model that allows modelling the 3D structure of the protein. A database search provides statistics demonstrating the enrichment in both GO and Pfam annotations of the pig proteins as compared with UniProt Knowledgebase annotation. Searching in SUS-BAR allows retrieval of the pig protein annotation for further analysis. The search is also possible on the basis of specific GO terms and this allows retrieval of all the pig sequences participating in a given biological process, after annotation with our system. Alternatively, the search is possible on the basis of structural information, allowing retrieval of all the pig sequences with the same structural characteristics. PMID:24065691

Piovesan, Damiano; Profiti, Giuseppe; Martelli, Pier Luigi; Fariselli, Piero; Fontanesi, Luca; Casadio, Rita

2013-01-01

58

Suppression subtractive hybridization (SSH) combined with bioinformatics method: an integrated functional annotation approach for analysis of differentially expressed immune-genes in insects  

PubMed Central

The suppression subtractive hybridization (SSH) approach, a PCR-based approach which amplifies differentially expressed cDNAs (complementary DNAs) while simultaneously suppressing amplification of common cDNAs, was employed to identify immune-inducible genes in insects. This technique has been used as a suitable tool for experimental identification of novel genes in eukaryotes as well as prokaryotes, both in species whose genomes have been sequenced and in species whose genomes have yet to be sequenced. In this article, I have proposed a method for in silico functional characterization of immune-inducible genes from insects. Apart from immune-inducible genes from insects, this method can be applied for the analysis of genes from other species, from bacteria to plants and animals. This article provides a background of the SSH-based method, taking specific examples from innate immune-inducible genes in insects, and subsequently proposes a bioinformatics pipeline for functional characterization of newly sequenced genes. The workflow proposed here can also be applied to any newly sequenced species generated from Next Generation Sequencing (NGS) platforms. PMID:23519487

Badapanda, Chandan

2013-01-01

59

Functional annotation of colon cancer risk SNPs.  

PubMed

Colorectal cancer (CRC) is a leading cause of cancer-related deaths in the United States. Genome-wide association studies (GWAS) have identified single nucleotide polymorphisms (SNPs) associated with increased risk for CRC. A molecular understanding of the functional consequences of this genetic variation has been complicated because each GWAS SNP is a surrogate for hundreds of other SNPs, most of which are located in non-coding regions. Here we use genomic and epigenomic information to test the hypothesis that the GWAS SNPs and/or correlated SNPs are in elements that regulate gene expression, and identify 23 promoters and 28 enhancers. Using gene expression data from normal and tumour cells, we identify 66 putative target genes of the risk-associated enhancers (10 of which were also identified by promoter SNPs). Employing CRISPR nucleases, we delete one risk-associated enhancer and identify genes showing altered expression. We suggest that similar studies be performed to characterize all CRC risk-associated enhancers. PMID:25268989

Yao, Lijing; Tak, Yu Gyoung; Berman, Benjamin P; Farnham, Peggy J

2014-01-01

60

Functional annotation of colon cancer risk SNPs  

PubMed Central

Colorectal cancer (CRC) is a leading cause of cancer-related deaths in the United States. Genome-wide association studies (GWAS) have identified single nucleotide polymorphisms (SNPs) associated with increased risk for CRC. A molecular understanding of the functional consequences of this genetic variation has been complicated because each GWAS SNP is a surrogate for hundreds of other SNPs, most of which are located in non-coding regions. Here we use genomic and epigenomic information to test the hypothesis that the GWAS SNPs and/or correlated SNPs are in elements that regulate gene expression, and identify 23 promoters and 28 enhancers. Using gene expression data from normal and tumour cells, we identify 66 putative target genes of the risk-associated enhancers (10 of which were also identified by promoter SNPs). Employing CRISPR nucleases, we delete one risk-associated enhancer and identify genes showing altered expression. We suggest that similar studies be performed to characterize all CRC risk-associated enhancers. PMID:25268989

Yao, Lijing; Tak, Yu Gyoung; Berman, Benjamin P.; Farnham, Peggy J.

2014-01-01

61

Functional annotation of a full-length mouse cDNA collection  

SciTech Connect

The RIKEN Mouse Gene Encyclopedia Project, a systematic approach to determining the full coding potential of the mouse genome, involves collection and sequencing of full-length complementary DNAs and physical mapping of the corresponding genes to the mouse genome. We organized an international functional annotation meeting (FANTOM) to annotate the first 21,076 cDNAs to be analyzed in this project. Here we describe the first RIKEN clone collection, which is one of the largest described for any organism. Analysis of these cDNAs extends known gene families and identifies new ones.

Kawai, J.; Shinagawa, A.; Shibata, K.; Yoshino, M.; Itoh, M.; Ishii, Y.; Arakawa, T.; Hara, A.; Fukunishi, Y.; Konno, H.; Adachi, J.; Fukuda, S.; Aizawa, K.; Izawa, M.; Nishi, K.; Kiyosawa, H.; Kondo, S.; Yamanaka, I.; Saito, T.; Okazaki, Y.; Gojobori, T.; Bono, H.; Kasukawa, T.; Saito, R.; Kadota, K.; Matsuda, H.; Ashburner, M.; Batalov, S.; Casavant, T.; Fleischmann, W.; Gaasterland, T.; Gissi, C.; King, B.; Kochiwa, H.; Kuehl, P.; Lewis, S.; Matsuo, Y.; Nikaido, I.; Pesole, G.; Quackenbush, J.; Schriml, L.M.; Staubli, F.; Suzuki, R.; Tomita, M.; Wagner, L.; Washio, T.; Sakai, K.; Okido, T.; Furuno, M.; Aono, H.; Baldarelli, R.; Barsh, G.; Blake, J.; Boffelli, D.; Bojunga, N.; Carninci, P.; de Bonaldo, M.F.; Brownstein, M.J.; Bult, C.; Fletcher, C.; Fujita, M.; Gariboldi, M.; Gustincich, S.; Hill, D.; Hofmann, M.; Hume, D.A.; Kamiya, M.; Lee, N.H.; Lyons, P.; Marchionni, L.; Mashima, J.; Mazzarelli, J.; Mombaerts, P.; Nordone, P.; Ring, B.; Ringwald, M.; Rodriguez, I.; Sakamoto, N.; Sasaki, H.; Sato, K.; Schonbach, C.; Seya, T.; Shibata, Y.; Storch, K.-F.; Suzuki, H.; Toyo-oka, K.; Wang, K.H.; Weitz, C.; Whittaker, C.; Wilming, L.; Wynshaw-Boris, A.; Yoshida, K.; Hasegawa, Y.; Kawaji, H.; Kohtsuki, S.; Hayashizaki, Y.; RIKEN Genome Exploration Research Group Phase II T; FANTOM Consortium

2001-01-01

62

Lynx web services for annotations and systems analysis of multi-gene disorders  

PubMed Central

Lynx is a web-based integrated systems biology platform that supports annotation and analysis of experimental data and generation of weighted hypotheses on molecular mechanisms contributing to human phenotypes and disorders of interest. Lynx has integrated multiple classes of biomedical data (genomic, proteomic, pathways, phenotypic, toxicogenomic, contextual and others) from various public databases as well as manually curated data from our group and collaborators (LynxKB). Lynx provides tools for gene list enrichment analysis using multiple functional annotations and network-based gene prioritization. Lynx provides access to the integrated database and the analytical tools via REST based Web Services (http://lynx.ci.uchicago.edu/webservices.html). This comprises data retrieval services for specific functional annotations, services to search across the complete LynxKB (powered by Lucene), and services to access the analytical tools built within the Lynx platform. PMID:24948611

Sulakhe, Dinanath; Taylor, Andrew; Balasubramanian, Sandhya; Feng, Bo; Xie, Bingqing; Bornigen, Daniela; Dave, Utpal J.; Foster, Ian T.; Gilliam, T. Conrad; Maltsev, Natalia

2014-01-01

63

Predicting gene ontology annotations of orphan GWAS genes using protein-protein interactions  

PubMed Central

Background The number of genome-wide association studies (GWAS) has increased rapidly in the past couple of years, resulting in the identification of genes associated with different diseases. The next step in translating these findings into biomedically useful information is to find out the mechanism of the action of these genes. However, GWAS studies often implicate genes whose functions are currently unknown; for example, MYEOV, ANKLE1, TMEM45B and ORAOV1 are found to be associated with breast cancer, but their molecular function is unknown. Results We carried out Bayesian inference of Gene Ontology (GO) term annotations of genes by employing the directed acyclic graph structure of GO and the network of protein-protein interactions (PPIs). The approach is designed based on the fact that two proteins that interact biophysically would be in physical proximity of each other, would possess complementary molecular function, and play a role in related biological processes. Predicted GO terms were ranked according to their relative association scores and the approach was evaluated quantitatively by plotting the precision versus recall values and F-scores (the harmonic mean of precision and recall) versus varying thresholds. Precisions of ~58% and ~40% for localization and functions respectively of proteins were determined at a threshold of ~30 (top 30 GO terms in the ranked list). Comparison with function prediction based on semantic similarity among nodes in an ontology and incorporation of those similarities in a k-nearest neighbor classifier confirmed that our results compared favorably. Conclusions This approach was applied to predict the cellular component and molecular function GO terms of all human proteins that have interacting partners possessing at least one known GO annotation. The list of predictions is available at http://severus.dbmi.pitt.edu/engo/GOPRED.html. We present the algorithm, evaluations and the results of the computational predictions, especially for genes identified in GWAS studies to be associated with diseases, which are of translational interest. PMID:24708602
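The evaluation described in this abstract (precision, recall and F-score at a rank threshold) can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `precision_recall_f` and the toy term labels are hypothetical, and the only assumption taken from the abstract is that the F-score is the harmonic mean of precision and recall computed over the top-k ranked terms.

```python
def precision_recall_f(ranked_terms, true_terms, k):
    """Score the top-k ranked predictions against a set of known annotations."""
    top_k = ranked_terms[:k]          # predictions kept at this threshold
    true_set = set(true_terms)
    hits = sum(1 for term in top_k if term in true_set)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(true_set) if true_set else 0.0
    if precision + recall == 0:
        f_score = 0.0
    else:
        # Harmonic mean of precision and recall.
        f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical toy labels, not real GO identifiers.
ranked = ["term_a", "term_b", "term_c", "term_d"]
known = {"term_a", "term_c", "term_e"}
p, r, f = precision_recall_f(ranked, known, k=2)
```

Sweeping `k` over a range of thresholds and recording the three values at each step would give the kind of precision-versus-recall and F-score-versus-threshold curves the abstract mentions.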

2014-01-01

64

Meeting Report: Towards a Critical Assessment of Functional Annotation Experiment (CAFAE) for bacterial genome annotation  

PubMed Central

It is widely recognized that, with the advent of very high throughput, short read, and highly parallelized sequencing technologies, the generation of new DNA sequences from microbes, plants, and metagenomes is outpacing the ability to assign functions to (“annotate”) all this data. To begin to try to address this, on May 18 and 19, 2010, a team of roughly fifty people met to define and scope the possibility of a first Critical Assessment of Functional Annotation Experiment (CAFAE) for bacterial genome annotation in Crystal City, Virginia. Due to the fundamental importance of genomic data to its mission, the Department of Energy (DOE) BER program hosted this workshop, funding the attendance of all invitees. The workshop was co-organized by Dan Drell and Susan Gregurick (DOE), Owen White and Nikos Kyrpides. PMID:21304726

Kyrpides, Nikos

2010-01-01

65

Comparative Analysis of Functional Metagenomic Annotation and the Mappability of Short Reads  

PubMed Central

To assess the functional capacities of microbial communities, including those inhabiting the human body, shotgun metagenomic reads are often aligned to a database of known genes. Such homology-based annotation practices critically rely on the assumption that short reads can map to orthologous genes of similar function. This assumption, however, and the various factors that impact short read annotation, have not been systematically evaluated. To address this challenge, we generated an extremely large database of simulated reads (totaling 15.9 Gb), spanning over 500,000 microbial genes and 170 curated genomes and including, for many genomes, every possible read of a given length. We annotated each read using common metagenomic protocols, fully characterizing the effect of read length, sequencing error, phylogeny, database coverage, and mapping parameters. We additionally rigorously quantified gene-, genome-, and protocol-specific annotation biases. Overall, our findings provide a first comprehensive evaluation of the capabilities and limitations of functional metagenomic annotation, providing crucial goal-specific best-practice guidelines to inform future metagenomic research. PMID:25148512
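The phrase "every possible read of a given length" can be made concrete with a short Python sketch. This is only an illustration of exhaustive sliding-window read generation on one strand of one sequence; the function name `all_reads` is hypothetical, and the study's actual simulation (sequencing error, multiple genomes, mapping) is not reproduced here.

```python
def all_reads(sequence, read_length):
    """Yield (start, read) for every window of read_length in sequence."""
    # A sequence of length n has n - read_length + 1 windows; none if it is too short.
    for start in range(len(sequence) - read_length + 1):
        yield start, sequence[start:start + read_length]


# Toy sequence used only for illustration.
reads = list(all_reads("ACGTAC", 4))
```

Each simulated read keeps its start coordinate, so an annotation assigned to the read can later be compared with the gene it was drawn from.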

Carr, Rogan; Borenstein, Elhanan

2014-01-01

66

Proteomic Detection of Non-Annotated Protein-Coding Genes in Pseudomonas fluorescens Pf0-1  

SciTech Connect

Genome sequences are annotated by computational prediction of coding sequences, followed by similarity searches such as BLAST, which provide a layer of (possible) functional information. While the existence of processes such as alternative splicing complicates matters for eukaryote genomes, the view of bacterial genomes as a linear series of closely spaced genes leads to the assumption that computational annotations which predict such arrangements completely describe the coding capacity of bacterial genomes. We undertook a proteomic study to identify proteins expressed by Pseudomonas fluorescens Pf0-1 from genes which were not predicted during the genome annotation. Mapping peptides to the Pf0-1 genome sequence identified sixteen non-annotated protein-coding regions, of which nine were antisense to predicted genes, six were intergenic, and one read in the same direction as an annotated gene but in a different frame. The expression of all but one of the newly discovered genes was verified by RT-PCR. Few clues as to the function of the new genes were gleaned from informatic analyses, but potential orthologues in other Pseudomonas genomes were identified for eight of the new genes. The 16 newly identified genes improve the quality of the Pf0-1 genome annotation, and the detection of antisense protein-coding genes indicates the under-appreciated complexity of bacterial genome organization.

Kim, Wook; Silby, Mark W.; Purvine, Samuel O.; Nicoll, Julie S.; Hixson, Kim K.; Monroe, Matthew E.; Nicora, Carrie D.; Lipton, Mary S.; Levy, Stuart B.

2009-12-24

67

Annotation of human chromosome 21 for relevance to Down syndrome: gene structure and expression analysis.  

PubMed

Down syndrome is caused by an extra copy of human chromosome 21 and the resultant dosage-related overexpression of genes contained within it. To efficiently direct experiments to determine specific gene-phenotype correlations, it is necessary to identify all genes within 21q and assess their functional associations and expression patterns. Analysis of the complete finished sequence of 21q resulted in the annotation of 225 genes and gene models, most of which were incomplete and/or had little or no experimental verification. Here we correct or complete the genomic structures of 16 genes, 4 of which were not reported in the annotation of the complete sequence. Our data include the identification of six genes encoding short or ambiguous open reading frames; the identification of three cases in which alternative splicing produces two structurally unrelated protein sequences; and the identification of six genes encoding proteins with functional motifs, two genes with unusually low similarity to their orthologous mouse proteins, and four genes with significant conservation in Drosophila melanogaster. We further demonstrate that an additional nine gene models represent bona fide transcripts and develop expression patterns for these genes plus nine additional novel chromosome 21 genes and four paralogous genes mapping elsewhere in the human genome. These data have implications for generating complete transcript maps of chromosome 21 and for the entire human genome, and for defining expression abnormalities in Down syndrome and mouse models. PMID:12036298

Gardiner, Katheleen; Slavov, Dobromir; Bechtel, Lawrence; Davisson, Muriel

2002-06-01

68

ncFANs: a web server for functional annotation of long non-coding RNAs.  

PubMed

Recent interest in the non-coding transcriptome has resulted in the identification of large numbers of long non-coding RNAs (lncRNAs) in mammalian genomes, most of which have not been functionally characterized. Computational exploration of the potential functions of these lncRNAs will therefore facilitate further work in this field of research. We have developed a practical and user-friendly web interface called ncFANs (non-coding RNA Function ANnotation server), which is the first web service for functional annotation of human and mouse lncRNAs. On the basis of the re-annotated Affymetrix microarray data, ncFANs provides two alternative strategies for lncRNA functional annotation: one utilizing three aspects of a coding-non-coding gene co-expression (CNC) network, the other identifying condition-related differentially expressed lncRNAs. ncFANs introduces a highly efficient way of re-using the abundant pre-existing microarray data. The present version of ncFANs includes re-annotated CDF files for 10 human and mouse Affymetrix microarrays, and the server will be continuously updated with more re-annotated microarray platforms and lncRNA data. ncFANs is freely accessible at http://www.ebiomed.org/ncFANs/ or http://www.noncode.org/ncFANs/. PMID:21715382

Liao, Qi; Xiao, Hui; Bu, Dechao; Xie, Chaoyong; Miao, Ruoyu; Luo, Haitao; Zhao, Guoguang; Yu, Kuntao; Zhao, Haitao; Skogerbø, Geir; Chen, Runsheng; Wu, Zhongdao; Liu, Changning; Zhao, Yi

2011-07-01

69

Functional annotation prediction: all for one and one for all.  

PubMed

In an era of rapid genome sequencing and high-throughput technology, automatic function prediction for a novel sequence is of utmost importance in bioinformatics. While automatic annotation methods based on local alignment searches can be simple and straightforward, they suffer from several drawbacks, including relatively low sensitivity and assignment of incorrect annotations that are not associated with the region of similarity. ProtoNet is a hierarchical organization of the protein sequences in the UniProt database. Although the hierarchy is constructed in an unsupervised automatic manner, it has been shown to be coherent with several biological data sources. We extend the ProtoNet system in order to assign functional annotations automatically. By leveraging the scaffold of the hierarchical classification, the method is able to overcome some frequent annotation pitfalls. PMID:16672244

Sasson, Ori; Kaplan, Noam; Linial, Michal

2006-06-01

70

Functional Annotation and Comparative Analysis of a Zygopteran Transcriptome.  

PubMed

In this paper we present a de novo assembly of the transcriptome of the damselfly, Enallagma hageni, through the use of 454 pyrosequencing. E. hageni is a member of the suborder Zygoptera within the order Odonata, and the Odonata are the basal lineage of the winged insects (Pterygota). To date, sequence data used in phylogenetic analysis of Enallagma species have been derived from either mtDNA or ribosomal nuclear DNA. This transcriptome contained 31,661 contigs that were assembled and translated into 14,813 individual open reading frames. Using these data, we constructed an extensive dataset of 634 orthologous nuclear protein-coding genes across 11 species of Arthropoda, and used Bayesian techniques to elucidate Enallagma's place in the Arthropod phylogenetic tree. Additionally, we demonstrate that the Enallagma transcriptome contains 169 genes that are evolving at rates that differ relative to the rest of the transcriptome (29 accelerated and 140 decreased), and through multiple Gene Ontology searches and clustering methods, we present the first functional-annotation of any palaeopteran's transcriptome in the literature. PMID:23550132

Shanku, Alexander G; McPeek, Mark A; Kern, Andrew D

2013-03-11

71

Functional Annotation and Comparative Analysis of a Zygopteran Transcriptome  

PubMed Central

In this paper we present a de novo assembly of the transcriptome of the damselfly (Enallagma hageni) through the use of 454 pyrosequencing. E. hageni is a member of the suborder Zygoptera, in the order Odonata, and Odonata organisms form the basal lineage of the winged insects (Pterygota). To date, sequence data used in phylogenetic analysis of Enallagma species have been derived from either mitochondrial DNA or ribosomal nuclear DNA. This Enallagma transcriptome contained 31,661 contigs that were assembled and translated into 14,813 individual open reading frames. Using these data, we constructed an extensive dataset of 634 orthologous nuclear protein-encoding genes across 11 species of Arthropoda and used Bayesian techniques to elucidate the position of Enallagma in the arthropod phylogenetic tree. Additionally, we demonstrated that the Enallagma transcriptome contains 169 genes that are evolving at rates that differ relative to those of the rest of the transcriptome (29 accelerated and 140 decreased), and, through multiple Gene Ontology searches and clustering methods, we present the first functional annotation of any palaeopteran’s transcriptome in the literature. PMID:23550132

Shanku, Alexander G.; McPeek, Mark A.; Kern, Andrew D.

2013-01-01

72

A relation based measure of semantic similarity for Gene Ontology annotations  

PubMed Central

Background Various measures of semantic similarity of terms in bio-ontologies such as the Gene Ontology (GO) have been used to compare gene products. Such measures of similarity have been used to annotate uncharacterized gene products and group gene products into functional groups. There are various ways to measure semantic similarity, either using the topological structure of the ontology, the instances (gene products) associated with terms or a mixture of both. We focus on an instance level definition of semantic similarity while using the information contained in the ontology, both in the graphical structure of the ontology and the semantics of relations between terms, to provide constraints on our instance level description. Semantic similarity of terms is extended to annotations by various approaches, either through aggregation operations such as min, max and average or through an extrapolative method. These approaches introduce assumptions about how semantic similarity of terms relates to the semantic similarity of annotations that do not necessarily reflect how terms relate to each other. Results We exploit the semantics of relations in the GO to construct an algorithm called SSA that provides the basis of a framework that naturally extends instance based methods of semantic similarity of terms, such as Resnik's measure, to describing annotations and not just terms. Our measure attempts to correctly interpret how terms combine via their relationships in the ontological hierarchy. SSA uses these relationships to identify the most specific common ancestors between terms. We outline the set of cases in which terms can combine and associate partial order constraints with each case that order the specificity of terms. These cases form the basis for the SSA algorithm. The set of associated constraints also provide a set of principles that any improvement on our method should seek to satisfy. Conclusion We derive a measure of semantic similarity between annotations that exploits all available information without introducing assumptions about the nature of the ontology or data. We preserve the principles underlying instance based methods of semantic similarity of terms at the annotation level. As a result our measure better describes the information contained in annotations associated with gene products and as a result is better suited to characterizing and classifying gene products through their annotations. PMID:18983678
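Resnik's measure, which the abstract names as the instance-based starting point, can be sketched in a few lines of Python. This is the standard term-level measure only, not the SSA algorithm of the paper. The ontology, the annotation table and all function names below are hypothetical toy data; the sketch assumes the usual definition in which the information content of a term is the negative log of the share of gene products annotated to it or to any of its descendants, and similarity is the information content of the most informative common ancestor.

```python
import math

# Hypothetical toy ontology: term -> set of parent terms (a DAG, not real GO).
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "rna_binding": {"binding"},
}

# Hypothetical direct annotations: gene product -> set of terms.
ANNOTATIONS = {
    "g1": {"dna_binding"},
    "g2": {"rna_binding"},
    "g3": {"catalysis"},
    "g4": {"dna_binding"},
}


def ancestors(term):
    """Return the term itself plus all of its ancestors in the DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """IC(t) = -log p(t), where p(t) is the share of gene products
    annotated to t or to any descendant of t."""
    count = sum(
        1 for terms in ANNOTATIONS.values()
        if any(term in ancestors(t) for t in terms)
    )
    if count == 0:
        return math.inf  # a term with no instances carries unbounded IC
    return -math.log(count / len(ANNOTATIONS))


def resnik(term_a, term_b):
    """Information content of the most informative common ancestor."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(information_content(t) for t in common)


similarity = resnik("dna_binding", "rna_binding")  # scored via "binding"
```

Extending such a term-level score to whole annotations by taking the min, max or average over term pairs is the aggregation step that the abstract argues introduces unwarranted assumptions.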

Sheehan, Brendan; Quigley, Aaron; Gaudin, Benoit; Dobson, Simon

2008-01-01

73

Comprehensive annotation of secondary metabolite biosynthetic genes and gene clusters of Aspergillus nidulans, A. fumigatus, A. niger and A. oryzae  

PubMed Central

Background Secondary metabolite production, a hallmark of filamentous fungi, is an expanding area of research for the Aspergilli. These compounds are potent chemicals, ranging from deadly toxins to therapeutic antibiotics to potential anti-cancer drugs. The genome sequences for multiple Aspergilli have been determined, and provide a wealth of predictive information about secondary metabolite production. Sequence analysis and gene overexpression strategies have enabled the discovery of novel secondary metabolites and the genes involved in their biosynthesis. The Aspergillus Genome Database (AspGD) provides a central repository for gene annotation and protein information for Aspergillus species. These annotations include Gene Ontology (GO) terms, phenotype data, gene names and descriptions and they are crucial for interpreting both small- and large-scale data and for aiding in the design of new experiments that further Aspergillus research. Results We have manually curated Biological Process GO annotations for all genes in AspGD with recorded functions in secondary metabolite production, adding new GO terms that specifically describe each secondary metabolite. We then leveraged these new annotations to predict roles in secondary metabolism for genes lacking experimental characterization. As a starting point for manually annotating Aspergillus secondary metabolite gene clusters, we used antiSMASH (antibiotics and Secondary Metabolite Analysis SHell) and SMURF (Secondary Metabolite Unknown Regions Finder) algorithms to identify potential clusters in A. nidulans, A. fumigatus, A. niger and A. oryzae, which we subsequently refined through manual curation. Conclusions This set of 266 manually curated secondary metabolite gene clusters will facilitate the investigation of novel Aspergillus secondary metabolites. PMID:23617571

2013-01-01

74

AIGO: Towards a unified framework for the Analysis and the Inter-comparison of GO functional annotations  

PubMed Central

Background In response to the rapid growth of available genome sequences, efforts have been made to develop automatic inference methods to functionally characterize them. Pipelines that infer functional annotation are now routinely used to produce new annotations at a genome scale and for a broad variety of species. These pipelines differ widely in their inference algorithms, confidence thresholds and data sources for reasoning. This heterogeneity makes a comparison of the relative merits of each approach extremely complex. The evaluation of the quality of the resultant annotations is also challenging given there is often no existing gold-standard against which to evaluate precision and recall. Results In this paper, we present a pragmatic approach to the study of functional annotations. An ensemble of 12 metrics, describing various aspects of functional annotations, is defined and implemented in a unified framework, which facilitates their systematic analysis and inter-comparison. The use of this framework is demonstrated on three illustrative examples: analysing the outputs of state-of-the-art inference pipelines, comparing electronic versus manual annotation methods, and monitoring the evolution of publicly available functional annotations. The framework is part of the AIGO library (http://code.google.com/p/aigo) for the Analysis and the Inter-comparison of the products of Gene Ontology (GO) annotation pipelines. The AIGO library also provides functionalities to easily load, analyse, manipulate and compare functional annotations and also to plot and export the results of the analysis in various formats. Conclusions This work is a step toward developing a unified framework for the systematic study of GO functional annotations. This framework has been designed so that new metrics on GO functional annotations can be added in a very straightforward way. PMID:22054122

2011-01-01

75

Assessing identity, redundancy and confounds in Gene Ontology annotations over time  

PubMed Central

Motivation: The Gene Ontology (GO) is heavily used in systems biology, but the potential for redundancy, confounds with other data sources and problems with stability over time have been little explored. Results: We report that GO annotations are stable over short periods, with 3% of genes not being most semantically similar to themselves between monthly GO editions. However, we find that genes can alter their ‘functional identity’ over time, with 20% of genes not matching to themselves (by semantic similarity) after 2 years. We further find that annotation bias in GO, in which some genes are more characterized than others, has declined in yeast, but generally increased in humans. Finally, we discovered that many entries in protein interaction databases are owing to the same published reports that are used for GO annotations, with 66% of assessed GO groups exhibiting this confound. We provide a case study to illustrate how this information can be used in analyses of gene sets and networks. Availability: Data available at http://chibi.ubc.ca/assessGO. Contact: paul@chibi.ubc.ca Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23297035

Gillis, Jesse; Pavlidis, Paul

2013-01-01

76

iGepros: an integrated gene and protein annotation server for biological nature exploration  

PubMed Central

Background In the post-genomic era, transcriptomics and proteomics provide important information for understanding genomes. With the fast development of high-throughput technology, transcriptomics and proteomics data are being generated at an unprecedented rate. This creates a need for software to annotate these omics data and explore their biological nature. In the past decade, some pioneering tools were presented to address this issue, but limitations still exist. For example, some of these tools offer only a command-line interface, which is not suitable for users with little or no experience in programming. In addition, some tools don’t support large-scale gene and protein analysis. Results To overcome these limitations, an integrated gene and protein annotation server named iGepros has been developed. The server provides user-friendly interfaces and detailed on-line examples, so most researchers, even those with little or no programming experience, can use it smoothly. Moreover, the server provides many functionalities to compare transcriptomics and proteomics data. Notably, the server is constructed under a model-view-control framework, which makes it easy to incorporate more functions into the server in the future. Conclusions In this paper, we present a server with powerful capability not only for gene and protein functional annotation, but also for transcriptomics and proteomics data comparison. Researchers can survey biological characters behind gene and protein datasets and accelerate their investigation of transcriptome and proteome by applying the server. The server is publicly available at http://www.biosino.org/iGepros/. PMID:22373022

2011-01-01

77

Functional modelling of an equine bronchoalveolar lavage fluid proteome provides experimental confirmation and functional annotation of equine genome sequences.  

PubMed

The equine genome sequence enables the use of high-throughput genomic technologies in equine research, but accurate identification of expressed gene products and interpreting their biological relevance require additional structural and functional genome annotation. Here, we employ the equine genome sequence to identify predicted and known proteins using proteomics and model these proteins into biological pathways, identifying 582 proteins in normal cell-free equine bronchoalveolar lavage fluid (BALF). We improved structural and functional annotation by directly confirming the in vivo expression of 558 (96%) proteins, which were computationally predicted previously, and adding Gene Ontology (GO) annotations for 174 proteins, 108 of which lacked functional annotation. Bronchoalveolar lavage is commonly used to investigate equine respiratory disease, leading us to model the associated proteome and its biological functions. Modelling of protein functions using Ingenuity Pathway Analysis identified carbohydrate metabolism, cell-to-cell signalling, cellular function, inflammatory response, organ morphology, lipid metabolism and cellular movement as key biological processes in normal equine BALF. Comparative modelling of protein functions in normal cell-free bronchoalveolar lavage proteomes from horse, human, and mouse, performed by grouping GO terms sharing common ancestor terms, confirms conservation of functions across species. Ninety-one of 92 human GO categories and 105 of 109 mouse GO categories were conserved in the horse. Our approach confirms the utility of the equine genome sequence to characterize protein networks without antibodies or mRNA quantification, highlights the need for continued structural and functional annotation of the equine genome and provides a framework for equine researchers to aid in the annotation effort. PMID:21749422

Bright, L A; Mujahid, N; Nanduri, B; McCarthy, F M; Costa, L R R; Burgess, S C; Swiderski, C E

2011-08-01

78

Manual Gene Ontology annotation workflow at the Mouse Genome Informatics Database  

PubMed Central

The Mouse Genome Database, the Gene Expression Database and the Mouse Tumor Biology database are integrated components of the Mouse Genome Informatics (MGI) resource (http://www.informatics.jax.org). The MGI system presents both a consensus view and an experimental view of the knowledge concerning the genetics and genomics of the laboratory mouse. From genotype to phenotype, this information resource integrates information about genes, sequences, maps, expression analyses, alleles, strains and mutant phenotypes. Comparative mammalian data are also presented particularly in regards to the use of the mouse as a model for the investigation of molecular and genetic components of human diseases. These data are collected from literature curation as well as downloads of large datasets (SwissProt, LocusLink, etc.). MGI is one of the founding members of the Gene Ontology (GO) and uses the GO for functional annotation of genes. Here, we discuss the workflow associated with manual GO annotation at MGI, from literature collection to display of the annotations. Peer-reviewed literature is collected mostly from a set of journals available electronically. Selected articles are entered into a master bibliography and indexed to one of eight areas of interest such as ‘GO’ or ‘homology’ or ‘phenotype’. Each article is then either indexed to a gene already contained in the database or funneled through a separate nomenclature database to add genes. The master bibliography and associated indexing provide information for various curator-reports such as ‘papers selected for GO that refer to genes with NO GO annotation’. Once indexed, curators who have expertise in appropriate disciplines enter pertinent information. MGI makes use of several controlled vocabularies that ensure uniform data encoding, enable robust analysis and support the construction of complex queries. These vocabularies range from pick-lists to structured vocabularies such as the GO. 
All data associations are supported with statements of evidence as well as access to source publications. PMID:23110975

Drabkin, Harold J.; Blake, Judith A.

2012-01-01

79

Functional annotations of diabetes nephropathy susceptibility loci through analysis of genome-wide renal gene expression in rat models of diabetes mellitus  

Microsoft Academic Search

BACKGROUND: Hyperglycaemia in diabetes mellitus (DM) alters gene expression regulation in various organs and contributes to long term vascular and renal complications. We aimed to generate novel renal genome-wide gene transcription data in rat models of diabetes in order to test the responsiveness to hyperglycaemia and renal structural changes of positional candidate genes at selected diabetic nephropathy (DN) susceptibility loci.

Yaomin Hu; Pamela J Kaisaki; Karène Argoud; Steven P Wilder; Karin J Wallace; Peng Y Woon; Christine Blancher; Lise Tarnow; Per-Henrik Groop; Samy Hadjadj; Michel Marre; Hans-Henrik Parving; Martin Farrall; Roger D Cox; Mark Lathrop; Nathalie Vionnet; Marie-Thérèse Bihoreau; Dominique Gauguier

2009-01-01

80

A New Strategy to Identify and Annotate Human RPE-Specific Gene Expression  

Microsoft Academic Search

Background To identify and functionally annotate cell type-specific gene expression in the human retinal pigment epithelium (RPE), a key tissue involved in age-related macular degeneration and retinitis pigmentosa. Methodology RPE, photoreceptor and choroidal cells were isolated from selected freshly frozen healthy human donor eyes using laser microdissection. RNA isolation, amplification and hybridization to 44 k microarrays was carried out according to Agilent specifications.

Judith C. Booij; Jacoline B. Ten Brink; Sigrid M. A. Swagemakers; Annemieke J. M. H. Verkerk; Anke H. W. Essing; Peter J. van der Spek; Arthur A. B. Bergen; Thomas A. Reh

2010-01-01

81

Synergistic use of plant-prokaryote comparative genomics for functional annotations  

Microsoft Academic Search

Background Identifying functions for all gene products in all sequenced organisms is a central challenge of the post-genomic era. However, at least 30-50% of the proteins encoded by any given genome are of unknown or vaguely known function, and a large number are wrongly annotated. Many of these ‘unknown’ proteins are common to prokaryotes and plants. We set out to predict

Svetlana Gerdes; Basma El Yacoubi; Marc Bailly; Ian K Blaby; Crysten E Blaby-Haas; Linda Jeanguenin; Aurora Lara-Núñez; Anne Pribat; Jeffrey C Waller; Andreas Wilke; Ross Overbeek; Andrew D Hanson; Valérie de Crécy-Lagard

2011-01-01

82

ChipInfo: software for extracting gene annotation and gene ontology information for microarray analysis  

Microsoft Academic Search

To date, assembling comprehensive annotation information for all probe sets of any Affymetrix microarrays remains a time-consuming, error-prone and challenging task. ChipInfo is designed for retrieving annotation information from online databases such as NetAffx and Gene Ontology and organizing such information into easily interpretable tabular format outputs. As companion software to dChip and GoSurfer, ChipInfo enables users to independently

Sheng Zhong; Cheng Li; Wing Hung Wong

2003-01-01

83

The evolutionary analysis of "orphans" from the Drosophila genome identifies rapidly diverging and incorrectly annotated genes.  

PubMed

In genome projects of eukaryotic model organisms, a large number of novel genes of unknown function and evolutionary history ("orphans") are being identified. Since many orphans have no known homologs in distant species, it is unclear whether they are restricted to certain taxa or evolve rapidly, either because of a lack of constraints or positive Darwinian selection. Here we use three criteria for the selection of putatively rapidly evolving genes from a single sequence of Drosophila melanogaster. Thirteen candidate genes were chosen from the Adh region on the second chromosome and 1 from the tip of the X chromosome. We succeeded in obtaining sequence from 6 of these in the closely related species D. simulans and D. yakuba. Only 1 of the 6 genes showed a large number of amino acid replacements and in-frame insertions/deletions. A population survey of this gene suggests that its rapid evolution is due to the fixation of many neutral or nearly neutral mutations. Two other genes showed "normal" levels of divergence between species. Four genes had insertions/deletions that destroy the putative reading frame within exons, suggesting that these exons have been incorrectly annotated. The evolutionary analysis of orphan genes in closely related species is useful for the identification of both rapidly evolving and incorrectly annotated genes. PMID:11606536

Schmid, K J; Aquadro, C F

2001-10-01

84

High-throughput comparison, functional annotation, and metabolic modeling of plant genomes using the PlantSEED resource.  

PubMed

The increasing number of sequenced plant genomes is placing new demands on the methods applied to analyze, annotate, and model these genomes. Today's annotation pipelines result in inconsistent gene assignments that complicate comparative analyses and prevent efficient construction of metabolic models. To overcome these problems, we have developed the PlantSEED, an integrated, metabolism-centric database to support subsystems-based annotation and metabolic model reconstruction for plant genomes. PlantSEED combines SEED subsystems technology, first developed for microbial genomes, with refined protein families and biochemical data to assign fully consistent functional annotations to orthologous genes, particularly those encoding primary metabolic pathways. Seamless integration with its parent, the prokaryotic SEED database, makes PlantSEED a unique environment for cross-kingdom comparative analysis of plant and bacterial genomes. The consistent annotations imposed by PlantSEED permit rapid reconstruction and modeling of primary metabolism for all plant genomes in the database. This feature opens the unique possibility of model-based assessment of the completeness and accuracy of gene annotation and thus allows computational identification of genes and pathways that are restricted to certain genomes or need better curation. We demonstrate the PlantSEED system by producing consistent annotations for 10 reference genomes. We also produce a functioning metabolic model for each genome, gapfilling to identify missing annotations and proposing gene candidates for missing annotations. Models are built around an extended biomass composition representing the most comprehensive published to date. To our knowledge, our models are the first to be published for seven of the genomes analyzed. PMID:24927599

Seaver, Samuel M D; Gerdes, Svetlana; Frelin, Océane; Lerma-Ortiz, Claudia; Bradbury, Louis M T; Zallot, Rémi; Hasnain, Ghulam; Niehaus, Thomas D; El Yacoubi, Basma; Pasternak, Shiran; Olson, Robert; Pusch, Gordon; Overbeek, Ross; Stevens, Rick; de Crécy-Lagard, Valérie; Ware, Doreen; Hanson, Andrew D; Henry, Christopher S

2014-07-01

85

Optimizing high performance computing workflow for protein functional annotation  

PubMed Central

Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.

Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene

2014-01-01

86

Annotation of functional variation in personal genomes using RegulomeDB.  

PubMed

As the sequencing of healthy and disease genomes becomes more commonplace, detailed annotation provides interpretation for individual variation responsible for normal and disease phenotypes. Current approaches focus on direct changes in protein coding genes, particularly nonsynonymous mutations that directly affect the gene product. However, most individual variation occurs outside of genes and, indeed, most markers generated from genome-wide association studies (GWAS) identify variants outside of coding segments. Identification of potential regulatory changes that perturb these sites will lead to a better localization of truly functional variants and interpretation of their effects. We have developed a novel approach and database, RegulomeDB, which guides interpretation of regulatory variants in the human genome. RegulomeDB includes high-throughput, experimental data sets from ENCODE and other sources, as well as computational predictions and manual annotations to identify putative regulatory potential and identify functional variants. These data sources are combined into a powerful tool that scores variants to help separate functional variants from a large pool and provides a small set of putative sites with testable hypotheses as to their function. We demonstrate the applicability of this tool to the annotation of noncoding variants from 69 fully sequenced genomes as well as that of a personal genome, where thousands of functionally associated variants were identified. Moreover, we demonstrate a GWAS where the database is able to quickly identify the known associated functional variant and provide a hypothesis as to its function. Overall, we expect this approach and resource to be valuable for the annotation of human genome sequences. PMID:22955989

Boyle, Alan P; Hong, Eurie L; Hariharan, Manoj; Cheng, Yong; Schaub, Marc A; Kasowski, Maya; Karczewski, Konrad J; Park, Julie; Hitz, Benjamin C; Weng, Shuai; Cherry, J Michael; Snyder, Michael

2012-09-01

87

Towards integrative gene functional similarity measurement  

PubMed Central

Background: In Gene Ontology, the "Molecular Function" (MF) categorization is a widely used knowledge framework for gene function comparison and prediction. Its structure and annotation provide a convenient way to compare gene functional similarities at the molecular level. The existing gene similarity measures, however, rely solely on one or a few aspects of MF without utilizing all the rich information available, including structure, annotation, common terms, and lowest common parents. Results: We introduce a rank-based gene semantic similarity measure called InteGO by synergistically integrating the state-of-the-art gene-to-gene similarity measures. By integrating three GO based seed measures, InteGO significantly improves the performance by about two-fold in all the three species studied (yeast, Arabidopsis and human). Conclusions: InteGO is a systematic and novel method to study gene functional associations. The software and description are available at http://www.msu.edu/~jinchen/InteGO. PMID:24564710

2014-01-01

88

Drosophila Gene Expression Pattern Annotation Using Sparse Features and Term-Term Interactions.  

PubMed

The Drosophila gene expression pattern images document the spatial and temporal dynamics of gene expression and they are valuable tools for explicating the gene functions, interaction, and networks during Drosophila embryogenesis. To provide text-based pattern searching, the images in the Berkeley Drosophila Genome Project (BDGP) study are annotated with ontology terms manually by human curators. We present a systematic approach for automating this task, because the number of images needing text descriptions is now rapidly increasing. We consider both improved feature representation and novel learning formulation to boost the annotation performance. For feature representation, we adapt the bag-of-words scheme commonly used in visual recognition problems so that the image group information in the BDGP study is retained. Moreover, images from multiple views can be integrated naturally in this representation. To reduce the quantization error caused by the bag-of-words representation, we propose an improved feature representation scheme based on the sparse learning technique. In the design of learning formulation, we propose a local regularization framework that can incorporate the correlations among terms explicitly. We further show that the resulting optimization problem admits an analytical solution. Experimental results show that the representation based on sparse learning outperforms the bag-of-words representation significantly. Results also show that incorporation of the term-term correlations improves the annotation performance consistently. PMID:21614142

Ji, Shuiwang; Yuan, Lei; Li, Ying-Xin; Zhou, Zhi-Hua; Kumar, Sudhir; Ye, Jieping

2009-06-28

89

Analysis of CATMA transcriptome data identifies hundreds of novel functional genes and improves gene models in the Arabidopsis genome  

Microsoft Academic Search

BACKGROUND: Since the finishing of the sequencing of the Arabidopsis thaliana genome, the Arabidopsis community and the annotator centers have been working on the improvement of gene annotation at the structural and functional levels. In this context, we have used the large CATMA resource on the Arabidopsis transcriptome to search for genes missed by different annotation processes. Probes on the

Sébastien Aubourg; Marie-Laure Martin-Magniette; Véronique Brunaud; Ludivine Taconnat; Frédérique Bitton; Sandrine Balzergue; Pauline E Jullien; Mathieu Ingouff; Vincent Thareau; Thomas Schiex; Alain Lecharny; Jean-Pierre Renou

2007-01-01

90

Functional annotation of the human chromosome 7 "missing" proteins: a bioinformatics approach.  

PubMed

The chromosome-centric human proteome project aims to systematically map all human proteins, chromosome by chromosome, in a gene-centric manner through dedicated efforts from national and international teams. This mapping will lead to a knowledge-based resource defining the full set of proteins encoded in each chromosome and laying the foundation for the development of a standardized approach to analyze the massive proteomic data sets currently being generated. The neXtProt database lists 946 proteins as the human proteome of chromosome 7. However, 170 (18%) proteins of human chromosome 7 have no evidence at the proteomic, antibody, or structural levels and are considered "missing" in this study as they lack experimental support. We have developed a protocol for the functional annotation of these "missing" proteins by integrating several bioinformatics analysis and annotation tools, sequential BLAST homology searches, protein domain/motif and gene ontology (GO) mapping, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis. Using the BLAST search strategy, homologues for reviewed non-human mammalian proteins with protein evidence were identified for 90 "missing" proteins while another 38 had reviewed non-human mammalian homologues. Putative functional annotations were assigned to 27 of the remaining 43 novel proteins. Proteotypic peptides have been computationally generated to facilitate rapid identification of these proteins. Four of the "missing" chromosome 7 proteins have been substantiated by the ENCODE proteogenomic peptide data. PMID:23308364

Ranganathan, Shoba; Khan, Javed M; Garg, Gagan; Baker, Mark S

2013-06-01

91

Protein function annotation by local binding site surface similarity.  

PubMed

Hundreds of protein crystal structures exist for proteins whose function cannot be confidently determined from sequence similarity. Surflex-PSIM, a previously reported surface-based protein similarity algorithm, provides an alternative method for hypothesizing function for such proteins. The method now supports fully automatic binding site detection and is fast enough to screen comprehensive databases of protein binding sites. The binding site detection methodology was validated on apo/holo cognate protein pairs, correctly identifying 91% of ligand binding sites in holo structures and 88% in apo structures where corresponding sites existed. For correctly detected apo binding sites, the cognate holo site was the most similar binding site 87% of the time. PSIM was used to screen a set of proteins that had poorly characterized functions at the time of crystallization, but were later biochemically annotated. Using a fully automated protocol, this set of 8 proteins was screened against ~60,000 ligand binding sites from the PDB. PSIM correctly identified functional matches that predated query protein biochemical annotation for five out of the eight query proteins. A panel of 12 currently unannotated proteins was also screened, resulting in a large number of statistically significant binding site matches, some of which suggest likely functions for the poorly characterized proteins. PMID:24166661

Spitzer, Russell; Cleves, Ann E; Varela, Rocco; Jain, Ajay N

2014-04-01

92

Protein structure prediction and structure-based protein function annotation  

E-print Network

Nature tends to modify rather than invent function of protein molecules, and the log of the modifications is encrypted in the gene sequence. Analysis of these modification events in evolutionarily related genes is important ...

Roy, Ambrish

2011-12-31

93

Combining heterogeneous data sources for accurate functional annotation of proteins.  

PubMed

Combining heterogeneous sources of data is essential for accurate prediction of protein function. The task is complicated by the fact that while sequence-based features can be readily compared across species, most other data are species-specific. In this paper, we present a multi-view extension to GOstruct, a structured-output framework for function annotation of proteins. The extended framework can learn from disparate data sources, with each data source provided to the framework in the form of a kernel. Our empirical results demonstrate that the multi-view framework is able to utilize all available information, yielding better performance than sequence-based models trained across species and models trained from collections of data within a given species. This version of GOstruct participated in the recent Critical Assessment of Functional Annotations (CAFA) challenge; since then we have significantly improved the natural language processing component of the method, which now provides performance that is on par with that provided by sequence information. The GOstruct framework is available for download at http://strut.sourceforge.net. PMID:23514123

Sokolov, Artem; Funk, Christopher; Graim, Kiley; Verspoor, Karin; Ben-Hur, Asa

2013-01-01

94

Insect Innate Immunity Database (IIID): An Annotation Tool for Identifying Immune Genes in Insect Genomes  

E-print Network

Insect Innate Immunity Database (IIID): An Annotation Tool for Identifying Immune Genes in Insect'' project, which aims to sequence 5,000 insect genomes by 2016, many novel insect genomes will soon become publicly available, yet few annotation resources are currently available for insects. Thus, we developed

Bordenstein, Seth

95

Functional Annotation of Putative Regulatory Elements at Cancer Susceptibility Loci  

PubMed Central

Most cancer-associated genetic variants identified from genome-wide association studies (GWAS) do not obviously change protein structure, leading to the hypothesis that the associations are attributable to regulatory polymorphisms. Translating genetic associations into mechanistic insights can be facilitated by knowledge of the causal regulatory variant (or variants) responsible for the statistical signal. Experimental validation of candidate functional variants is onerous, making bioinformatic approaches necessary to prioritize candidates for laboratory analysis. Thus, a systematic approach for recognizing functional (and, therefore, likely causal) variants in noncoding regions is an important step toward interpreting cancer risk loci. This review provides a detailed introduction to current regulatory variant annotations, followed by an overview of how to leverage these resources to prioritize candidate functional polymorphisms in regulatory regions.

Rosse, Stephanie A; Auer, Paul L; Carlson, Christopher S

2014-01-01

96

Towards Experimental Annotation of Genes by High Throughput Sequencing  

SciTech Connect

Andrew Bradbury of Los Alamos National Laboratory discusses turning annotation into a sequencing pipeline on June 3, 2010 at the "Sequencing, Finishing, Analysis in the Future" meeting in Santa Fe, NM

Bradbury, Andrew [Los Alamos National Laboratory]

2010-06-03

97

BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources  

PubMed Central

Online gene annotation resources are indispensable for analysis of genomics data. However, the landscape of these online resources is highly fragmented, and scientists often visit dozens of these sites for each gene in a candidate gene list. Here, we introduce BioGPS (http://biogps.gnf.org), a centralized gene portal for aggregating distributed gene annotation resources. Moreover, BioGPS embraces the principle of community intelligence, enabling any user to easily and directly contribute to the BioGPS platform. PMID:19919682

2009-01-01

98

ShortStack: Comprehensive annotation and quantification of small RNA genes  

PubMed Central

Small RNA sequencing allows genome-wide discovery, categorization, and quantification of genes producing regulatory small RNAs. Many tools have been described for annotation and quantification of microRNA loci (MIRNAs) from small RNA-seq data. However, in many organisms and tissue types, MIRNA genes comprise only a small fraction of all small RNA-producing genes. ShortStack is a stand-alone application that analyzes reference-aligned small RNA-seq data and performs comprehensive de novo annotation and quantification of the inferred small RNA genes. ShortStack’s output reports multiple parameters of direct relevance to small RNA gene annotation, including RNA size distributions, repetitiveness, strandedness, hairpin-association, MIRNA annotation, and phasing. In this study, ShortStack is demonstrated to perform accurate annotations and useful descriptions of diverse small RNA genes from four plants (Arabidopsis, tomato, rice, and maize) and three animals (Drosophila, mice, and humans). ShortStack efficiently processes very large small RNA-seq data sets using modest computational resources, and its performance compares favorably to previously described tools. Annotation of MIRNA loci by ShortStack is highly specific in both plants and animals. ShortStack is freely available under a GNU General Public License. PMID:23610128

Axtell, Michael J.

2013-01-01

99

GeneSense: a new approach for human gene annotation integrated with protein-protein interaction networks.  

PubMed

Virtually all cellular functions involve protein-protein interactions (PPIs). As an increasing number of PPIs are identified and vast amount of information accumulated, researchers are finding different ways to interrogate the data and understand the interactions in context. However, it is widely recognized that a significant portion of the data is scattered, redundant, not considered high quality, and not readily accessible to researchers in a systematic fashion. In addition, it is challenging to identify the optimal protein targets in the current PPI networks. The GeneSense server was developed to integrate gene annotation and PPI networks in an expandable architecture that incorporates selected databases with the aim to assemble, analyze, evaluate and disseminate protein-protein association information in a comprehensive and user-friendly manner. Three network models including nodenet, leafnet and loopnet are used to identify the optimal protein targets in the complex networks. GeneSense is freely available at www.biomedsense.org/genesense.php. PMID:24667292

Chen, Zhongzhong; Zhang, Tianhong; Lin, Jun; Yan, Zidan; Wang, Yongren; Zheng, Weiqiang; Weng, Kevin C

2014-01-01

100

GeneSense: a new approach for human gene annotation integrated with protein-protein interaction networks  

PubMed Central

Virtually all cellular functions involve protein-protein interactions (PPIs). As an increasing number of PPIs are identified and vast amount of information accumulated, researchers are finding different ways to interrogate the data and understand the interactions in context. However, it is widely recognized that a significant portion of the data is scattered, redundant, not considered high quality, and not readily accessible to researchers in a systematic fashion. In addition, it is challenging to identify the optimal protein targets in the current PPI networks. The GeneSense server was developed to integrate gene annotation and PPI networks in an expandable architecture that incorporates selected databases with the aim to assemble, analyze, evaluate and disseminate protein-protein association information in a comprehensive and user-friendly manner. Three network models including nodenet, leafnet and loopnet are used to identify the optimal protein targets in the complex networks. GeneSense is freely available at www.biomedsense.org/genesense.php. PMID:24667292

Chen, Zhongzhong; Zhang, Tianhong; Lin, Jun; Yan, Zidan; Wang, Yongren; Zheng, Weiqiang; Weng, Kevin C.

2014-01-01

101

Whole genome shotgun sequencing of Brassica oleracea and its application to gene discovery and annotation in Arabidopsis.  

PubMed

Through comparative studies of the model organism Arabidopsis thaliana and its close relative Brassica oleracea, we have identified conserved regions that represent potentially functional sequences overlooked by previous Arabidopsis genome annotation methods. A total of 454,274 whole genome shotgun sequences covering 283 Mb (0.44×) of the estimated 650 Mb Brassica genome were searched against the Arabidopsis genome, and conserved Arabidopsis genome sequences (CAGSs) were identified. Of these 229,735 conserved regions, 167,357 fell within or intersected existing gene models, while 60,378 were located in previously unannotated regions. After removal of sequences matching known proteins, CAGSs that were close to one another were chained together as potentially comprising portions of the same functional unit. This resulted in 27,347 chains of which 15,686 were sufficiently distant from existing gene annotations to be considered a novel conserved unit. Of 192 conserved regions examined, 58 were found to be expressed in our cDNA populations. Rapid amplification of cDNA ends (RACE) was used to obtain potentially full-length transcripts from these 58 regions. The resulting sequences led to the creation of 21 gene models at 17 new Arabidopsis loci and the addition of splice variants or updates to another 19 gene structures. In addition, CAGSs overlapping already annotated genes in Arabidopsis can provide guidance for manual improvement of existing gene models. Published genome-wide expression data based on whole genome tiling arrays and massively parallel signature sequencing were overlaid on the Brassica-Arabidopsis conserved sequences, and 1399 regions of intersection were identified. Collectively our results and these data sets suggest that several thousand new Arabidopsis genes remain to be identified and annotated. PMID:15805490

Ayele, Mulu; Haas, Brian J; Kumar, Nikhil; Wu, Hank; Xiao, Yongli; Van Aken, Susan; Utterback, Teresa R; Wortman, Jennifer R; White, Owen R; Town, Christopher D

2005-04-01
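
The chaining step described in the abstract above (conserved regions that lie close to one another are grouped into one putative functional unit) can be illustrated with a minimal sketch. This is not the authors' implementation: the distance threshold `max_gap`, the function name `chain_regions`, and the example coordinates are all hypothetical, and the abstract does not state what distance was actually used.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) on one chromosome, start <= end


def chain_regions(regions: List[Interval], max_gap: int) -> List[List[Interval]]:
    """Group regions into chains whose neighbours are at most max_gap apart."""
    chains: List[List[Interval]] = []
    chain_end = 0  # furthest end coordinate seen in the current chain
    for region in sorted(regions):
        if chains and region[0] - chain_end <= max_gap:
            # Close enough to the current chain: extend it.
            chains[-1].append(region)
            chain_end = max(chain_end, region[1])
        else:
            # Too far away (or first region): start a new chain.
            chains.append([region])
            chain_end = region[1]
    return chains


# Hypothetical coordinates and threshold, for illustration only.
example = [(100, 180), (220, 300), (5000, 5100), (5150, 5300), (9000, 9050)]
chains = chain_regions(example, max_gap=200)
```

Sorting first means a single pass suffices; a real pipeline would also chain per chromosome and per strand, which this sketch omits.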

102

Whole genome shotgun sequencing of Brassica oleracea and its application to gene discovery and annotation in Arabidopsis  

PubMed Central

Through comparative studies of the model organism Arabidopsis thaliana and its close relative Brassica oleracea, we have identified conserved regions that represent potentially functional sequences overlooked by previous Arabidopsis genome annotation methods. A total of 454,274 whole genome shotgun sequences covering 283 Mb (0.44×) of the estimated 650 Mb Brassica genome were searched against the Arabidopsis genome, and conserved Arabidopsis genome sequences (CAGSs) were identified. Of these 229,735 conserved regions, 167,357 fell within or intersected existing gene models, while 60,378 were located in previously unannotated regions. After removal of sequences matching known proteins, CAGSs that were close to one another were chained together as potentially comprising portions of the same functional unit. This resulted in 27,347 chains of which 15,686 were sufficiently distant from existing gene annotations to be considered a novel conserved unit. Of 192 conserved regions examined, 58 were found to be expressed in our cDNA populations. Rapid amplification of cDNA ends (RACE) was used to obtain potentially full-length transcripts from these 58 regions. The resulting sequences led to the creation of 21 gene models at 17 new Arabidopsis loci and the addition of splice variants or updates to another 19 gene structures. In addition, CAGSs overlapping already annotated genes in Arabidopsis can provide guidance for manual improvement of existing gene models. Published genome-wide expression data based on whole genome tiling arrays and massively parallel signature sequencing were overlaid on the Brassica–Arabidopsis conserved sequences, and 1399 regions of intersection were identified. Collectively our results and these data sets suggest that several thousand new Arabidopsis genes remain to be identified and annotated. PMID:15805490

Ayele, Mulu; Haas, Brian J.; Kumar, Nikhil; Wu, Hank; Xiao, Yongli; Van Aken, Susan; Utterback, Teresa R.; Wortman, Jennifer R.; White, Owen R.; Town, Christopher D.

2005-01-01

103

Large-scale collection and annotation of gene models for date palm (Phoenix dactylifera, L.).  

PubMed

The date palm (Phoenix dactylifera L.), famed for its sugar-rich fruits (dates) and cultivated by humans since 4,000 B.C., is an economically important crop in the Middle East, Northern Africa, and increasingly other places where climates are suitable. Despite a long history of human cultivation, the understanding of P. dactylifera genetics and molecular biology is rather limited, hindered by a lack of high-quality basic data from genomics and transcriptomics. Here we report a large-scale effort in generating gene models (assembled expressed sequence tags or ESTs and mapped to a genome assembly) for P. dactylifera, using the long-read pyrosequencing platform (Roche/454 GS FLX Titanium) in high coverage. We built fourteen cDNA libraries from different P. dactylifera tissues (cultivar Khalas) and acquired 15,778,993 raw sequencing reads (about one million sequencing reads per library), and the pooled sequences were assembled into 67,651 non-redundant contigs and 301,978 singletons. We annotated 52,725 contigs based on the plant databases and 45 contigs based on functional domains referencing to the Pfam database. From the annotated contigs, we assigned GO (Gene Ontology) terms to 36,086 contigs and KEGG pathways to 7,032 contigs. Our comparative analysis showed that 70.6% (47,930), 69.4% (47,089), 68.4% (46,441), and 69.3% (47,048) of the P. dactylifera gene models are shared with rice, sorghum, Arabidopsis, and grapevine, respectively. We also assigned our gene models into house-keeping and tissue-specific genes based on their tissue specificity. PMID:22736259

Zhang, Guangyu; Pan, Linlin; Yin, Yuxin; Liu, Wanfei; Huang, Dawei; Zhang, Tongwu; Wang, Lei; Xin, Chengqi; Lin, Qiang; Sun, Gaoyuan; Ba Abdullah, Mohammed M; Zhang, Xiaowei; Hu, Songnian; Al-Mssallem, Ibrahim S; Yu, Jun

2012-08-01

104

Optimization of gene set annotations via entropy minimization over variable clusters (EMVC)  

PubMed Central

Motivation: Gene set enrichment has become a critical tool for interpreting the results of high-throughput genomic experiments. Inconsistent annotation quality and lack of annotation specificity, however, limit the statistical power of enrichment methods and make it difficult to replicate enrichment results across biologically similar datasets. Results: We propose a novel algorithm for optimizing gene set annotations to best match the structure of specific empirical data sources. Our proposed method, entropy minimization over variable clusters (EMVC), filters the annotations for each gene set to minimize a measure of entropy across disjoint gene clusters computed for a range of cluster sizes over multiple bootstrap resampled datasets. As shown using simulated gene sets with simulated data and Molecular Signatures Database collections with microarray gene expression data, the EMVC algorithm accurately filters annotations unrelated to the experimental outcome resulting in increased gene set enrichment power and better replication of enrichment results. Availability and implementation: http://cran.r-project.org/web/packages/EMVC/index.html. Contact: jason.h.moore@dartmouth.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24574114

Frost, H. Robert; Moore, Jason H.

2014-01-01
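
The entropy measure mentioned in the abstract above can be illustrated with a small sketch. It shows only the underlying idea, Shannon entropy of how a gene set's members spread across disjoint gene clusters, and is not the EMVC algorithm itself, which per the abstract also varies cluster sizes and uses bootstrap resampling. The function name `cluster_entropy`, the gene identifiers, and the cluster labels are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def cluster_entropy(gene_set: Iterable[str], cluster_of: Dict[str, int]) -> float:
    """Shannon entropy (bits) of how a gene set's members spread over clusters."""
    # Count set members per cluster, ignoring genes with no cluster label.
    counts = Counter(cluster_of[g] for g in gene_set if g in cluster_of)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical genes and cluster labels, for illustration only.
clusters = {"g1": 0, "g2": 0, "g3": 1, "g4": 1, "g5": 2}
concentrated = cluster_entropy(["g1", "g2"], clusters)  # both genes in one cluster
spread = cluster_entropy(["g1", "g3"], clusters)        # genes split over two clusters
```

Under this reading, a gene set whose members fall in a single cluster has lower entropy than one scattered across clusters, so removing annotations that raise the entropy would tighten the set around the structure of the data.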

105

The SOFG Anatomy Entry List (SAEL): An Annotation Tool for Functional Genomics Data  

PubMed Central

A great deal of data in functional genomics studies needs to be annotated with low-resolution anatomical terms. For example, gene expression assays based on manually dissected samples (microarray, SAGE, etc.) need high-level anatomical terms to describe sample origin. First-pass annotation in high-throughput assays (e.g. large-scale in situ gene expression screens or phenotype screens) and bibliographic applications, such as selection of keywords, would also benefit from a minimum set of standard anatomical terms. Although only simple terms are required, the researcher faces serious practical problems of inconsistency and confusion, given the different aims and the range of complexity of existing anatomy ontologies. A Standards and Ontologies for Functional Genomics (SOFG) group therefore initiated discussions between several of the major anatomical ontologies for higher vertebrates. As we report here, one result of these discussions is a simple, accessible, controlled vocabulary of gross anatomical terms, the SOFG Anatomy Entry List (SAEL). The SAEL is available from http://www.sofg.org and is intended as a resource for biologists, curators, bioinformaticians and developers of software supporting functional genomics. It can be used directly for annotation in the contexts described above. Importantly, each term is linked to the corresponding term in each of the major anatomy ontologies. Where the simple list does not provide enough detail or sophistication, therefore, the researcher can use the SAEL to choose the appropriate ontology and move directly to the relevant term as an entry point. The SAEL links will also be used to support computational access to the respective ontologies. PMID:18629134

Parkinson, Helen; Aitken, Stuart; Baldock, Richard A.; Bard, Jonathan B. L.; Burger, Albert; Hayamizu, Terry F.; Rector, Alan; Ringwald, Martin; Rogers, Jeremy; Rosse, Cornelius; Stoeckert, Christian J.

2004-01-01

106

Prediction of yeast protein-protein interaction network: insights from the Gene Ontology and annotations  

PubMed Central

A map of protein–protein interactions provides valuable insight into the cellular function and machinery of a proteome. By measuring the similarity between two Gene Ontology (GO) terms with a relative specificity semantic relation, here, we proposed a new method of reconstructing a yeast protein–protein interaction map that is solely based on the GO annotations. The method was validated using high-quality interaction datasets for its effectiveness. Based on a Z-score analysis, a positive dataset and a negative dataset for protein–protein interactions were derived. Moreover, a gold standard positive (GSP) dataset with the highest level of confidence that covered 78% of the high-quality interaction dataset and a gold standard negative (GSN) dataset with the lowest level of confidence were derived. In addition, we assessed four high-throughput experimental interaction datasets using the positives and the negatives as well as GSPs and GSNs. Our predicted network reconstructed from GSPs consists of 40,753 interactions among 2259 proteins, and forms 16 connected components. We mapped all of the MIPS complexes except for homodimers onto the predicted network. As a result, ~35% of complexes were identified interconnected. For seven complexes, we also identified some nonmember proteins that may be functionally related to the complexes concerned. This analysis is expected to provide a new approach for predicting the protein–protein interaction maps from other completely sequenced genomes with high-quality GO-based annotations. PMID:16641319

Wu, Xiaomei; Zhu, Lei; Guo, Jie; Zhang, Da-Yong; Lin, Kui

2006-01-01

107

Re-annotation of the CAZy genes of Trichoderma reesei and transcription in the presence of lignocellulosic substrates  

PubMed Central

Background Trichoderma reesei is a soft rot Ascomycota fungus utilised for industrial production of secreted enzymes, especially lignocellulose degrading enzymes. About 30 carbohydrate active enzymes (CAZymes) of T. reesei have been biochemically characterised. Genome sequencing has revealed a large number of novel candidates for CAZymes, thus increasing the potential for identification of enzymes with novel activities and properties. Plenty of data exists on the carbon source dependent regulation of the characterised hydrolytic genes. However, information on the expression of the novel CAZyme genes, especially on complex biomass material, is very limited. Results In this study, the CAZyme gene content of the T. reesei genome was updated and the annotations of the genes refined using both computational and manual approaches. Phylogenetic analysis was done to assist the annotation and to identify functionally diversified CAZymes. The analyses identified 201 glycoside hydrolase genes, 22 carbohydrate esterase genes and five polysaccharide lyase genes. Updated or novel functional predictions were assigned to 44 genes, and the phylogenetic analysis indicated further functional diversification within enzyme families or groups of enzymes. GH3 β-glucosidases, GH27 α-galactosidases and GH18 chitinases were especially functionally diverse. The expression of the lignocellulose degrading enzyme system of T. reesei was studied by cultivating the fungus in the presence of different inducing substrates and by subjecting the cultures to transcriptional profiling. The substrates included both defined and complex lignocellulose related materials, such as pretreated bagasse, wheat straw, spruce, xylan, Avicel cellulose and sophorose. The analysis revealed co-regulated groups of CAZyme genes, such as genes induced in all the conditions studied and also genes induced preferentially by a certain set of substrates. Conclusions In this study, the CAZyme content of the T. reesei genome was updated, the discrepancies between the different genome versions and published literature were removed and the annotation of many of the genes was refined. Expression analysis of the genes gave information on the enzyme activities potentially induced by the presence of the different substrates. Comparison of the expression profiles of the CAZyme genes under the different conditions identified co-regulated groups of genes, suggesting common regulatory mechanisms for the gene groups. PMID:23035824

2012-01-01

108

Incorporating functional annotation information in prioritizing disease associated SNPs from genome wide association studies.  

PubMed

With recent advances in genotyping and sequencing technologies, many disease susceptibility loci have been identified. However, much of the genetic heritability remains unexplained and the replication rate between independent studies is still low. Meanwhile, there have been increasing efforts on functional annotations of the entire human genome, such as the Encyclopedia of DNA Elements (ENCODE) project and other similar projects. It has been shown that incorporating these functional annotations to prioritize genome wide association signals may help identify true association signals. However, to our knowledge, the extent of the improvement when functional annotation data are considered has not been studied in the literature. In this article, we propose a statistical framework to estimate the improvement in replication rate with annotation data, and apply it to Crohn's disease and DNase I hypersensitive sites. The results show that with cell line specific functional annotations, the expected replication rate is improved, but only at modest level. PMID:25326070

Hou, Lin; Ma, TianZhou; Zhao, HongYu

2014-11-01

109

M-Finder: Uncovering functionally associated proteins from interactome data integrated with GO annotations  

PubMed Central

Background Protein-protein interactions (PPIs) play a key role in understanding the mechanisms of cellular processes. The availability of interactome data has catalyzed the development of computational approaches to elucidate functional behaviors of proteins on a system level. Gene Ontology (GO) and its annotations are a significant resource for functional characterization of proteins. Because of wide coverage, GO data have often been adopted as a benchmark for protein function prediction on the genomic scale. Results We propose a computational approach, called M-Finder, for functional association pattern mining. This method employs semantic analytics to integrate the genome-wide PPIs with GO data. We also introduce an interactive web application tool that visualizes a functional association network linked to a protein specified by a user. The proposed approach comprises two major components. First, the PPIs that have been generated by high-throughput methods are weighted in terms of their functional consistency using GO and its annotations. We assess two advanced semantic similarity metrics which quantify the functional association level of each interacting protein pair. We demonstrate that these measures outperform the other existing methods by evaluating their agreement to other biological features, such as sequence similarity, the presence of common Pfam domains, and core PPIs. Second, the information flow-based algorithm is employed to discover a set of proteins functionally associated with the protein in a query and their links efficiently. This algorithm reconstructs a functional association network of the query protein. The output network size can be flexibly determined by parameters. Conclusions M-Finder provides a useful framework to investigate functional association patterns with any protein. This software will also allow users to perform further systematic analysis of a set of proteins for any specific function. 
It is available online at http://bionet.ecs.baylor.edu/mfinder PMID:24565382

2013-01-01

110

Functional annotation of the transcriptome of Sorghum bicolor in response to osmotic stress and abscisic acid  

PubMed Central

Background Higher plants exhibit remarkable phenotypic plasticity allowing them to adapt to an extensive range of environmental conditions. Sorghum is a cereal crop that exhibits exceptional tolerance to adverse conditions, in particular, water-limiting environments. This study utilized next generation sequencing (NGS) technology to examine the transcriptome of sorghum plants challenged with osmotic stress and exogenous abscisic acid (ABA) in order to elucidate genes and gene networks that contribute to sorghum's tolerance to water-limiting environments with a long-term aim of developing strategies to improve plant productivity under drought. Results RNA-Seq results revealed transcriptional activity of 28,335 unique genes from sorghum root and shoot tissues subjected to polyethylene glycol (PEG)-induced osmotic stress or exogenous ABA. Differential gene expression analyses in response to osmotic stress and ABA revealed a strong interplay among various metabolic pathways including abscisic acid and 13-lipoxygenase, salicylic acid, jasmonic acid, and plant defense pathways. Transcription factor analysis indicated that groups of genes may be co-regulated by similar regulatory sequences to which the expressed transcription factors bind. We successfully exploited the data presented here in conjunction with published transcriptome analyses for rice, maize, and Arabidopsis to discover more than 50 differentially expressed, drought-responsive gene orthologs for which no function had been previously ascribed. Conclusions The present study provides an initial assemblage of sorghum genes and gene networks regulated by osmotic stress and hormonal treatment. We are providing an RNA-Seq data set and an initial collection of transcription factors, which offer a preliminary look into the cascade of global gene expression patterns that arise in a drought tolerant crop subjected to abiotic stress. 
These resources will allow scientists to query gene expression and functional annotation in response to drought. PMID:22008187

2011-01-01

111

Functional annotation from the genome sequence of the giant panda.  

PubMed

The giant panda is one of the most critically endangered species due to the fragmentation and loss of its habitat. Studying the functions of proteins in this animal, especially specific trait-related proteins, is therefore necessary to protect the species. In this work, the functions of these proteins were investigated using the genome sequence of the giant panda. Data on 21,001 proteins and their functions were stored in the Giant Panda Protein Database, in which the proteins were divided into two groups: 20,179 proteins whose functions can be predicted by GeneScan formed the known-function group, whereas 822 proteins whose functions cannot be predicted by GeneScan comprised the unknown-function group. For the known-function group, we further classified the proteins by molecular function, biological process, cellular component, and tissue specificity. For the unknown-function group, we developed a strategy in which the proteins were filtered by cross-Blast to identify panda-specific proteins under the assumption that proteins related to the panda-specific traits in the unknown-function group exist. After this filtering procedure, we identified 32 proteins (2 of which are membrane proteins) specific to the giant panda genome as compared against the dog and horse genomes. Based on their amino acid sequences, these 32 proteins were further analyzed by functional classification using SVM-Prot, motif prediction using MyHits, and interacting protein prediction using the Database of Interacting Proteins. Nineteen proteins were predicted to be zinc-binding proteins, thus affecting the activities of nucleic acids. The 32 panda-specific proteins will be further investigated by structural and functional analysis. PMID:22865348

Huo, Tong; Zhang, Yinjie; Lin, Jianping

2012-08-01

112

Mining locus tags in PubMed Central to improve microbial gene annotation  

PubMed Central

Background The scientific literature contains millions of microbial gene identifiers within the full text and tables, but these annotations rarely get incorporated into public sequence databases. We propose to utilize the Open Access (OA) subset of PubMed Central (PMC) as a gene annotation database and have developed an R package called pmcXML to automatically mine and extract locus tags from full text, tables and supplements. Results We mined locus tags from 1835 OA publications in ten microbial genomes and extracted tags mentioned in 30,891 sentences in main text and 20,489 rows in tables. We identified locus tag pairs marking the start and end of a region such as an operon or genomic island and expanded these ranges to add another 13,043 tags. We also searched for locus tags in supplementary tables and publications outside the OA subset in Burkholderia pseudomallei K96243 for comparison. There were 168 publications containing 48,470 locus tags and 83% of mentions were from supplementary materials and 9% from publications outside the OA subset. Conclusions B. pseudomallei locus tags within the full text and tables of OA publications represent only a small fraction of the total mentions in the literature. For microbial genomes with very few functionally characterized proteins, the locus tags mentioned in supplementary tables and within ranges like genomic islands contain the majority of locus tags. Significantly, the functions in the R package provide access to additional resources in the OA subset that are not currently indexed or returned by searching PMC. PMID:24499370

2014-01-01
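
The range expansion mentioned in the record above (a pair of locus tags marking the start and end of an operon or genomic island, expanded to every tag in between) can be sketched as follows. This is a minimal Python illustration, not the pmcXML package (which is written in R) and not the authors' code. It assumes tags consist of an alphabetic prefix plus a zero-padded number that increases by one per gene; the tag shape `BPSL0001` and the function name `expand_tag_range` are used here only as examples.

```python
import re

# Assumed tag layout: alphabetic prefix followed by a zero-padded number.
# Real locus tags differ between genomes (suffixes, steps of 5 or 10, ...),
# so this pattern is an illustration and would need adjusting.
TAG_RE = re.compile(r"^([A-Za-z_]+)(\d+)$")


def expand_tag_range(start_tag, end_tag):
    """Return every locus tag from start_tag to end_tag, inclusive."""
    start = TAG_RE.match(start_tag)
    end = TAG_RE.match(end_tag)
    if not start or not end:
        raise ValueError("tags must look like PREFIX followed by digits")
    if start.group(1) != end.group(1):
        raise ValueError("start and end tags must share a prefix")
    prefix = start.group(1)
    width = len(start.group(2))  # keep the original zero padding
    lo, hi = int(start.group(2)), int(end.group(2))
    if lo > hi:  # tolerate ranges written in reverse order
        lo, hi = hi, lo
    return [f"{prefix}{n:0{width}d}" for n in range(lo, hi + 1)]
```

A range written in a paper as "BPSL0001-BPSL0004" would be split on the hyphen and passed in as two tags; tags with different prefixes are rejected rather than guessed at.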

113

RNA-Seq Analysis of Quercus pubescens Leaves: De Novo Transcriptome Assembly, Annotation and Functional Markers Development  

PubMed Central

Quercus pubescens Willd., a species distributed from Spain to southwest Asia, ranks high for drought tolerance among European oaks. Q. pubescens performs a role of outstanding significance in most Mediterranean forest ecosystems, but few mechanistic studies have been conducted to explore its response to environmental constraints, due to the lack of genomic resources. In our study, we performed a deep transcriptomic sequencing in Q. pubescens leaves, including de novo assembly, functional annotation and the identification of new molecular markers. Our results are a pre-requisite for undertaking molecular functional studies, and may give support in population and association genetic studies. 254,265,700 clean reads were generated by the Illumina HiSeq 2000 platform, with an average length of 98 bp. De novo assembly, using CLC Genomics, produced 96,006 contigs, having a mean length of 618 bp. Sequence similarity analyses against seven public databases (Uniprot, NR, RefSeq and KOGs at NCBI, Pfam, InterPro and KEGG) resulted in 83,065 transcripts annotated with gene descriptions, conserved protein domains, or gene ontology terms. These annotations and local BLAST allowed us to identify genes specifically associated with mechanisms of drought avoidance. Finally, 14,202 microsatellite markers and 18,425 single nucleotide polymorphisms (SNPs) were discovered in silico in assembled and annotated sequences. We completed a successful global analysis of the Q. pubescens leaf transcriptome using RNA-seq. The assembled and annotated sequences together with newly discovered molecular markers provide genomic information for functional genomic studies in Q. pubescens, with special emphasis on response mechanisms to the severe constraints of the Mediterranean climate. Our tools enable comparative genomics studies on other Quercus species taking advantage of large intra-specific ecophysiological differences. PMID:25393112

Torre, Sara; Tattini, Massimiliano; Brunetti, Cecilia; Fineschi, Silvia; Fini, Alessio; Ferrini, Francesco; Sebastiani, Federico

2014-01-01
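
The record above reports microsatellite markers discovered in silico but does not state the search criteria. As a hedged illustration of what such a scan can look like, the Python sketch below reports tandem repeats of a 2 to 6 bp motif. The thresholds (motif length, at least five copies) and the function name `find_ssrs` are assumptions made for this example and are not taken from the study.

```python
import re

# Assumed criteria: a motif of 2-6 bases repeated at least 5 times in tandem.
# The lazy quantifier makes the regex prefer the shortest repeating unit.
SSR_RE = re.compile(r"([ACGT]{2,6}?)\1{4,}")


def find_ssrs(sequence):
    """Return (start, motif, copies) for each simple sequence repeat found."""
    hits = []
    for match in SSR_RE.finditer(sequence.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAAAA
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits
```

Positions are 0-based offsets into the contig; dedicated SSR tools apply per-motif-length thresholds and handle compound repeats, which this sketch does not attempt.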

114

Annotation-Modules: a tool for finding significant combinations of multisource annotations for gene lists  

Microsoft Academic Search

Motivation: The ontological analysis of the gene lists obtained from DNA microarray experiments constitutes an important step in understanding the underlying biology of the analyzed system. Over the last years, many other high-throughput techniques emerged, covering now basically all

Michael Hackenberg; Rune Matthiesen

2008-01-01

115

Overcoming function annotation errors in the Gram-positive pathogen Streptococcus suis by a proteomics-driven approach  

PubMed Central

Background Annotation of protein-coding genes is a key step in sequencing projects. Protein functions are mainly assigned on the basis of the amino acid sequence alone by searching of homologous proteins. However, fully automated annotation processes often lead to wrong prediction of protein functions, and therefore time-intensive manual curation is often essential. Here we describe a fast and reliable way to correct function annotation in sequencing projects, focusing on surface proteomes. We use a proteomics approach, previously proven to be very powerful for identifying new vaccine candidates against Gram-positive pathogens. It consists of shaving the surface of intact cells with two proteases, the specific cleavage-site trypsin and the unspecific proteinase K, followed by LC/MS/MS analysis of the resulting peptides. The identified proteins are contrasted by computational analysis and their sequences are inspected to correct possible errors in function prediction. Results When applied to the zoonotic pathogen Streptococcus suis, of which two strains have been recently sequenced and annotated, we identified a set of surface proteins without cytoplasmic contamination: all the proteins identified had exporting or retention signals towards the outside and/or the cell surface, and viability of protease-treated cells was not affected. The combination of both experimental evidences and computational methods allowed us to determine that two of these proteins are putative extracellular new adhesins that had been previously attributed a wrong cytoplasmic function. One of them is a putative component of the pilus of this bacterium. Conclusion We illustrate the complementary nature of laboratory-based and computational methods to examine in concert the localization of a set of proteins in the cell, and demonstrate the utility of this proteomics-based strategy to experimentally correct function annotation errors in sequencing projects. 
This approach also contributes to provide strong experimental evidences that can be used to annotate those proteins for which a Gene Ontology (GO) term has not been assigned so far. Function annotation correction would then improve the identification of surface-associated proteins in bacterial pathogens, thus accelerating the discovery of new vaccines in infectious disease research. PMID:19061494

Rodriguez-Ortega, Manuel J; Luque, Inmaculada; Tarradas, Carmen; Barcena, Jose A

2008-01-01

116

The Alignment of the Medical Subject Headings to the Gene Ontology and Its Application in Gene Annotation  

Microsoft Academic Search

The Gene Ontology (GO) is a controlled vocabulary used for annotation of genes. Assigning such terms to uncategorized genes is time-consuming work, and a recurring task in biomedicine. The biomedical citations of the literature database MEDLINE are indexed with terms from the Medical Subject Headings (MeSH). We studied whether MeSH terms from gene-related MEDLINE entries could be translated to GO,

Henrik Tveit; Torulf Mollestad; Astrid Lægreid

2004-01-01

117

A statistical framework for improving genomic annotations of prokaryotic essential genes.  

PubMed

Large-scale systematic analysis of gene essentiality is an important step closer toward unraveling the complex relationship between genotypes and phenotypes. Such analysis cannot be accomplished without unbiased and accurate annotations of essential genes. In current genomic databases, most of the essential gene annotations are derived from whole-genome transposon mutagenesis (TM), the most frequently used experimental approach for determining essential genes in microorganisms under defined conditions. However, there are substantial systematic biases associated with TM experiments. In this study, we developed a novel Poisson model-based statistical framework to simulate the TM insertion process and subsequently correct the experimental biases. We first quantitatively assessed the effects of major factors that potentially influence the accuracy of TM and subsequently incorporated relevant factors into the framework. Through iteratively optimizing parameters, we inferred the actual insertion events occurred and described each gene's essentiality on probability measure. Evaluated by the definite mapping of essential gene profile in Escherichia coli, our model significantly improved the accuracy of original TM datasets, resulting in more accurate annotations of essential genes. Our method also showed encouraging results in improving subsaturation level TM datasets. To test our model's broad applicability to other bacteria, we applied it to Pseudomonas aeruginosa PAO1 and Francisella tularensis novicida TM datasets. We validated our predictions by literature as well as allelic exchange experiments in PAO1. Our model was correct on six of the seven tested genes. Remarkably, among all three cases that our predictions contradicted the TM assignments, experimental validations supported our predictions. In summary, our method will be a promising tool in improving genomic annotations of essential genes and enabling large-scale explorations of gene essentiality. 
Our contribution is timely considering the rapidly increasing essential gene sets. A Webserver has been set up to provide convenient access to this tool. All results and source codes are available for download upon publication at http://research.cchmc.org/essentialgene/. PMID:23520492

Deng, Jingyuan; Su, Shengchang; Lin, Xiaodong; Hassett, Daniel J; Lu, Long Jason

2013-01-01
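
The abstract above describes a Poisson model-based framework for transposon mutagenesis (TM) data but gives no equations. The Python sketch below is not the authors' model; it only illustrates the basic reasoning under strong simplifying assumptions: insertions land uniformly at a constant rate per base pair, a non-essential gene of length L therefore receives a Poisson-distributed number of insertions with mean rate × L, and an essential gene never shows an insertion. All function and parameter names are hypothetical.

```python
import math


def prob_no_insertion(gene_length_bp, insertions_per_bp):
    """P(zero insertions) in a non-essential gene under a Poisson model."""
    return math.exp(-insertions_per_bp * gene_length_bp)


def posterior_essential(gene_length_bp, insertions_per_bp, prior_essential):
    """P(essential | zero insertions observed), by Bayes' rule.

    Assumes an essential gene shows zero insertions with probability 1.
    """
    p_zero_if_dispensable = prob_no_insertion(gene_length_bp, insertions_per_bp)
    evidence = prior_essential + (1.0 - prior_essential) * p_zero_if_dispensable
    return prior_essential / evidence
```

The sketch shows why short genes are a source of bias at subsaturating insertion densities: a short non-essential gene can escape insertion by chance, so an empty gene is weaker evidence of essentiality than an empty long gene.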

118

SelenoDB 2.0: annotation of selenoprotein genes in animals and their genetic diversity in humans  

PubMed Central

SelenoDB (http://www.selenodb.org) aims to provide high-quality annotations of selenoprotein genes, proteins and SECIS elements. Selenoproteins are proteins that contain the amino acid selenocysteine (Sec) and the first release of the database included annotations for eight species. Since the release of SelenoDB 1.0 many new animal genomes have been sequenced. The annotations of selenoproteins in new genomes usually contain many errors in major databases. For this reason, we have now fully annotated selenoprotein genes in 58 animal genomes. We provide manually curated annotations for human selenoproteins, whereas we use an automatic annotation pipeline to annotate selenoprotein genes in other animal genomes. In addition, we annotate the homologous genes containing cysteine (Cys) instead of Sec. Finally, we have surveyed genetic variation in the annotated genes in humans. We use exon capture and resequencing approaches to identify single-nucleotide polymorphisms in more than 50 human populations around the world. We thus present a detailed view of the genetic divergence of Sec- and Cys-containing genes in animals and their diversity in humans. The addition of these datasets into the second release of the database provides a valuable resource for addressing medical and evolutionary questions in selenium biology. PMID:24194593

Romagne, Frederic; Santesmasses, Didac; White, Louise; Sarangi, Gaurab K.; Mariotti, Marco; Hubler, Ron; Weihmann, Antje; Parra, Genis; Gladyshev, Vadim N.; Guigo, Roderic; Castellano, Sergi

2014-01-01

119

Protein annotation as term categorization in the gene ontology using word proximity networks  

PubMed Central

Background We participated in the BioCreAtIvE Task 2, which addressed the annotation of proteins into the Gene Ontology (GO) based on the text of a given document and the selection of evidence text from the document justifying that annotation. We approached the task utilizing several combinations of two distinct methods: an unsupervised algorithm for expanding words associated with GO nodes, and an annotation methodology which treats annotation as categorization of terms from a protein's document neighborhood into the GO. Results The evaluation results indicate that the method for expanding words associated with GO nodes is quite powerful; we were able to successfully select appropriate evidence text for a given annotation in 38% of Task 2.1 queries by building on this method. The term categorization methodology achieved a precision of 16% for annotation within the correct extended family in Task 2.2, though we show through subsequent analysis that this can be improved with a different parameter setting. Our architecture proved not to be very successful on the evidence text component of the task, in the configuration used to generate the submitted results. Conclusion The initial results show promise for both of the methods we explored, and we are planning to integrate the methods more closely to achieve better results overall. PMID:15960833

Verspoor, Karin; Cohn, Judith; Joslyn, Cliff; Mniszewski, Sue; Rechtsteiner, Andreas; Rocha, Luis M; Simas, Tiago

2005-01-01

120

Identification of novel biomass-degrading enzymes from genomic dark matter: Populating genomic sequence space with functional annotation.  

PubMed

Although recent nucleotide sequencing technologies have significantly enhanced our understanding of microbial genomes, the function of ~35% of genes identified in a genome currently remains unknown. To improve the understanding of microbial genomes and consequently of microbial processes it will be crucial to assign a function to this "genomic dark matter." Due to the urgent need for additional carbohydrate-active enzymes for improved production of transportation fuels from lignocellulosic biomass, we screened the genomes of more than 5,500 microorganisms for hypothetical proteins that are located in the proximity of already known cellulases. We identified, synthesized and expressed a total of 17 putative cellulase genes with insufficient sequence similarity to currently known cellulases to be identified as such using traditional sequence annotation techniques that rely on significant sequence similarity. The recombinant proteins of the newly identified putative cellulases were subjected to enzymatic activity assays to verify their hydrolytic activity towards cellulose and lignocellulosic biomass. Eleven (65%) of the tested enzymes had significant activity towards at least one of the substrates. This high success rate highlights that a gene context-based approach can be used to assign function to genes that are otherwise categorized as "genomic dark matter" and to identify biomass-degrading enzymes that have little sequence similarity to already known cellulases. The ability to assign function to genes that have no related sequence representatives with functional annotation will be important to enhance our understanding of microbial processes and to identify microbial proteins for a wide range of applications. PMID:24728961

Piao, Hailan; Froula, Jeff; Du, Changbin; Kim, Tae-Wan; Hawley, Erik R; Bauer, Stefan; Wang, Zhong; Ivanova, Nathalia; Clark, Douglas S; Klenk, Hans-Peter; Hess, Matthias

2014-08-01

121

Comparison of assembly algorithms for improving rate of metatranscriptomic functional annotation  

PubMed Central

Background Microbiome-wide gene expression profiling through high-throughput RNA sequencing (‘metatranscriptomics’) offers a powerful means to functionally interrogate complex microbial communities. Key to successful exploitation of these datasets is the ability to confidently match relatively short sequence reads to known bacterial transcripts. In the absence of reference genomes, such annotation efforts may be enhanced by assembling reads into longer contiguous sequences (‘contigs’), prior to database search strategies. Since reads from homologous transcripts may derive from several species, represented at different abundance levels, it is not clear how well current assembly pipelines perform for metatranscriptomic datasets. Here we evaluate the performance of four currently employed assemblers including de novo transcriptome assemblers - Trinity and Oases; the metagenomic assembler - Metavelvet; and the recently developed metatranscriptomic assembler IDBA-MT. Results We evaluated the performance of the assemblers on a previously published dataset of single-end RNA sequence reads derived from the large intestine of an inbred non-obese diabetic mouse model of type 1 diabetes. We found that Trinity performed best as judged by contigs assembled, reads assigned to contigs, and number of reads that could be annotated to a known bacterial transcript. Only 15.5% of RNA sequence reads could be annotated to a known transcript in contrast to 50.3% with Trinity assembly. Paired-end reads generated from the same mouse samples resulted in modest performance gains. A database search estimated that the assemblies are unlikely to erroneously merge multiple unrelated genes sharing a region of similarity (<2% of contigs). A simulated dataset based on ten species confirmed these findings. A more complex simulated dataset based on 72 species found that greater assembly errors were introduced than is expected by sequencing quality. 
Through the detailed evaluation of assembly performance, the insights provided by this study will help drive the design of future metatranscriptomic analyses. Conclusion Assembly of metatranscriptome datasets greatly improved read annotation. Of the four assemblers evaluated, Trinity provided the best performance. For more complex datasets, reads generated from transcripts sharing considerable sequence similarity can be a source of significant assembly error, suggesting a need to collate reads on the basis of common taxonomic origin prior to assembly. PMID:25411636

2014-01-01

122

A semi-automated genome annotation comparison and integration scheme  

PubMed Central

Background Different genome annotation services have been developed in recent years and widely used. However, the functional annotation results from different services are often not the same and a scheme to obtain consensus functional annotations by integrating different results is in demand. Results This article presents a semi-automated scheme that is capable of comparing functional annotations from different sources and consequently obtaining a consensus genome functional annotation result. In this study, we used four automated annotation services to annotate a newly sequenced genome--Arcobacter butzleri ED-1. Our scheme is divided into annotation comparison and annotation determination sections. In the functional annotation comparison section, we employed gene synonym lists to tackle term difference problems. Multiple techniques from information retrieval were used to preprocess the functional annotations. Based on the functional annotation comparison results, we designed a decision tree to obtain a consensus functional annotation result. Experimental results show that our approach can greatly reduce the workload of manual comparison by automatically comparing 87% of the functional annotations. In addition, it automatically determined 87% of the functional annotations, leaving only 13% of the genes for manual curation. We applied this approach across six phylogenetically different genomes in order to assess the performance consistency. The results showed that our scheme is able to automatically perform, on average, 73% and 86% of the annotation comparison and determination tasks, respectively. Conclusions We propose a semi-automatic and effective scheme to compare and determine genome functional annotations. It greatly reduces the manual work required in genome functional annotation. As this scheme does not require any specific biological knowledge, it is readily applicable for genome annotation comparison and genome re-annotation projects. PMID:23725374

2013-01-01
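
The record above combines synonym lists, text preprocessing and a decision tree to reach a consensus annotation from several services. The Python sketch below is a deliberately simpler stand-in, a normalise-then-majority-vote rule, and should not be read as the authors' decision tree. The synonym table, the `min_votes` threshold and the function names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical synonym list mapping variant wordings to one preferred form;
# a real list would come from a curated resource.
SYNONYMS = {"atp synthase f1 alpha subunit": "atp synthase subunit alpha"}


def normalize(annotation):
    """Lowercase, strip punctuation, collapse spaces, then apply synonyms."""
    text = re.sub(r"[^a-z0-9 ]+", " ", annotation.lower())
    text = " ".join(text.split())
    return SYNONYMS.get(text, text)


def consensus(annotations, min_votes=2):
    """Return the agreed annotation, or None if manual curation is needed."""
    counts = Counter(normalize(a) for a in annotations)
    ranked = counts.most_common()
    if not ranked:
        return None
    label, votes = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == votes
    if votes >= min_votes and not tied:
        return label
    return None
```

Genes for which `consensus` returns None correspond to the share left for manual curation; the rule is intentionally conservative and refuses to pick a winner on a tie.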

123

Identification and computational annotation of genes differentially expressed in pulp development of Cocos nucifera L. by suppression subtractive hybridization  

PubMed Central

Background Coconut (Cocos nucifera L.) is one of the world’s most versatile, economically important tropical crops. Little is known about the physiological and molecular basis of coconut pulp (endosperm) development and only a few coconut genes and gene product sequences are available in public databases. This study identified genes that were differentially expressed during development of coconut pulp and functionally annotated these identified genes using bioinformatics analysis. Results Pulp from three different coconut developmental stages was collected. Four suppression subtractive hybridization (SSH) libraries were constructed (forward and reverse libraries A and B between stages 1 and 2, and C and D between stages 2 and 3), and identified sequences were computationally annotated using Blast2GO software. A total of 1272 clones were obtained for analysis from four SSH libraries with 63% showing similarity to known proteins. Pairwise comparing of stage-specific gene ontology ids from libraries B-D, A-C, B-C and A-D showed that 32 genes were continuously upregulated and seven downregulated; 28 were transiently upregulated and 23 downregulated. KEGG (Kyoto Encyclopedia of Genes and Genomes) analysis showed that 1-acyl-sn-glycerol-3-phosphate acyltransferase (LPAAT), phospholipase D, acetyl-CoA carboxylase carboxyltransferase beta subunit, 3-hydroxyisobutyryl-CoA hydrolase-like and pyruvate dehydrogenase E1 β subunit were associated with fatty acid biosynthesis or metabolism. Triose phosphate isomerase, cellulose synthase and glucan 1,3-β-glucosidase were related to carbohydrate metabolism, and phosphoenolpyruvate carboxylase was related to both fatty acid and carbohydrate metabolism. 
Of 737 unigenes, 103 encoded enzymes were involved in fatty acid and carbohydrate biosynthesis and metabolism, and a number of transcription factors and other interesting genes with stage-specific expression were confirmed by real-time PCR, with validation of the SSH results as high as 66.6%. Based on determination of coconut endosperm fatty acids content by gas chromatography–mass spectrometry, a number of candidate genes in fatty acid anabolism were selected for further study. Conclusion Functional annotation of genes differentially expressed in coconut pulp development helped determine the molecular basis of coconut endosperm development. The SSH method identified genes related to fatty acids, carbohydrate and secondary metabolites. The results will be important for understanding gene functions and regulatory networks in coconut fruit. PMID:25084812

2014-01-01

124

Annotation of Genes Having Candidate Somatic Mutations in Acute Myeloid Leukemia with Whole-Exome Sequencing Using Concept Lattice Analysis  

PubMed Central

In cancer genome studies, the annotation of newly detected oncogene/tumor suppressor gene candidates is a challenging process. We propose using concept lattice analysis for the annotation and interpretation of genes having candidate somatic mutations in whole-exome sequencing in acute myeloid leukemia (AML). We selected 45 highly mutated genes with whole-exome sequencing in 10 normal matched samples of the AML-M2 subtype. To evaluate these genes, we performed concept lattice analysis and annotated these genes with existing knowledge databases. PMID:23613681

Lee, Kye Hwa; Lim, Jae Hyeun

2013-01-01

125

A computational approach to candidate gene prioritization for X-linked mental retardation using annotation-based binary filtering and motif-based linear discriminatory analysis  

Microsoft Academic Search

Background Several computational candidate gene selection and prioritization methods have recently been developed. These in silico selection and prioritization techniques are usually based on two central approaches - the examination of similarities to known disease genes and/or the evaluation of functional annotation of genes. Each of these approaches has its own caveats. Here we employ a previously described method of candidate

Zané Lombard; Kateryna D Makova; Michèle Ramsay

2011-01-01

126

Gene3D: Multi-domain annotations for protein sequence and comparative genome analysis  

PubMed Central

Gene3D (http://gene3d.biochem.ucl.ac.uk) is a database of protein domain structure annotations for protein sequences. Domains are predicted using a library of profile HMMs from 2738 CATH superfamilies. Gene3D assigns domain annotations to Ensembl and UniProt sequence sets including >6000 cellular genomes and >20 million unique protein sequences. This represents an increase of 45% in the number of protein sequences since our last publication. Thanks to improvements in the underlying data and pipeline, we see large increases in the domain coverage of sequences. We have expanded this coverage by integrating Pfam and SUPERFAMILY domain annotations, and we now resolve domain overlaps to provide highly comprehensive composite multi-domain architectures. To make these data more accessible for comparative genome analyses, we have developed novel search algorithms for searching genomes to identify related multi-domain architectures. In addition to providing domain family annotations, we have now developed a pipeline for 3D homology modelling of domains in Gene3D. This has been applied to the human genome and will be rolled out to other major organisms over the next year. PMID:24270792

Lees, Jonathan G.; Lee, David; Studer, Romain A.; Dawson, Natalie L.; Sillitoe, Ian; Das, Sayoni; Yeats, Corin; Dessailly, Benoit H.; Rentzsch, Robert; Orengo, Christine A.

2014-01-01

127

Functional-Network-Based Gene Set Analysis Using Gene-Ontology  

PubMed Central

To account for the functional non-equivalence among a set of genes within a biological pathway when performing gene set analysis, we introduce GOGANPA, a network-based gene set analysis method, which up-weights genes with functions relevant to the gene set of interest. Each gene is weighted according to its degree within a genome-scale functional network constructed using the functional annotations available from the gene ontology database. By benchmarking GOGANPA using a well-studied P53 data set and three breast cancer data sets, we demonstrate the power and reproducibility of our proposed method over traditional unweighted approaches and a competing network-based approach that involves a complex integrated network. GOGANPA’s sole reliance on gene ontology further allows GOGANPA to be widely applicable to the analysis of any gene-ontology-annotated genome. PMID:23418449
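
As a rough illustration of degree-based weighting, the sketch below links genes that share at least one GO term, counts each gene's degree, and uses the degrees as weights in a gene-set score. The linking rule, the weighted-mean statistic, and all gene names, GO identifiers, and scores are assumptions made for the example; the abstract does not specify how GOGANPA builds its network or computes its test statistic.

```python
from collections import defaultdict
from itertools import combinations

def build_network(go_annotations):
    """Link two genes when they share at least one GO term (assumed rule)."""
    edges = set()
    for g1, g2 in combinations(sorted(go_annotations), 2):
        if go_annotations[g1] & go_annotations[g2]:
            edges.add((g1, g2))
    return edges

def degrees(edges):
    """Count how many network neighbours each gene has."""
    deg = defaultdict(int)
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg

def weighted_set_score(gene_set, scores, deg):
    """Degree-weighted mean of per-gene scores; falls back to the
    plain mean when every gene in the set has degree zero."""
    weights = {g: deg.get(g, 0) for g in gene_set}
    total = sum(weights.values())
    if total == 0:
        return sum(scores[g] for g in gene_set) / len(gene_set)
    return sum(weights[g] * scores[g] for g in gene_set) / total

# Hypothetical GO annotations and per-gene scores.
go = {"A": {"GO:1", "GO:2"}, "B": {"GO:1"}, "C": {"GO:2"}, "D": {"GO:3"}}
gene_scores = {"A": 2.0, "B": 1.0, "C": 1.0, "D": 4.0}
deg = degrees(build_network(go))
print(weighted_set_score(["A", "B", "C", "D"], gene_scores, deg))
```

In this toy case gene D shares no GO term with the others, so its score does not contribute to the weighted result, whereas an unweighted mean would count it equally.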

Chang, Billy; Kustra, Rafal; Tian, Weidong

2013-01-01

128

Comparative Analysis of Chloroplast Genomes: Functional Annotation, Genome-Based Phylogeny, and Deduced Evolutionary Patterns  

PubMed Central

All protein sequences from 19 complete chloroplast genomes (cpDNA) have been studied using a new computational method able to analyze functional correlations among series of protein sequences contained in complete proteomes. First, all open reading frames (ORFs) from the cpDNAs, comprising a total of 2266 protein sequences, were compared against the 3168 proteins from Synechocystis PCC6803 complete genome to find functionally related orthologous proteins. Additionally, all cpDNA genomes were pairwise compared to find orthologous groups not present in cyanobacteria. Annotations in the cluster of orthologous proteins database and CyanoBase were used as reference for the functional assignments. Following this protocol, new functional assignments were made for ORFs of unknown function and for ycfs (hypothetical chloroplast frames), which still lack a functional assignment. Using this information, a matrix of functional relationships was derived from profiles of the presence and/or absence of orthologous proteins; the matrix included 1837 proteins in 277 orthologous clusters. A factor analysis study of this matrix, followed by cluster analysis, allowed us to obtain accurate phylogenetic reconstructions and the detection of genes probably involved in speciation as phylogenetic correlates. Finally, by grouping common evolutionary patterns, we show that it is possible to determine functionally linked protein networks. This has allowed us to suggest putative associations for some unknown ORFs. PMID:11932241
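
The presence/absence matrix mentioned above can be pictured with a small sketch. It builds a binary genome-by-cluster matrix and compares two genomes with a Jaccard distance. The distance measure is a stand-in chosen for brevity (the study itself reports factor analysis followed by cluster analysis), and the genome labels and cluster memberships are hypothetical.

```python
def presence_matrix(genome_clusters, clusters):
    """Return one 0/1 row per genome, with one column per orthologous cluster."""
    return {
        genome: [1 if c in present else 0 for c in clusters]
        for genome, present in genome_clusters.items()
    }

def jaccard_distance(u, v):
    """1 - (shared clusters / clusters present in either genome)."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 0.0 if either == 0 else 1.0 - both / either

# Hypothetical genomes and the orthologous clusters found in each.
clusters = ["psbA", "rbcL", "ycf1"]
genomes = {
    "cp_genome_1": {"psbA", "rbcL", "ycf1"},
    "cp_genome_2": {"psbA", "rbcL"},
    "cp_genome_3": {"psbA"},
}
matrix = presence_matrix(genomes, clusters)
print(jaccard_distance(matrix["cp_genome_1"], matrix["cp_genome_2"]))
```

A full pairwise distance table computed this way could feed any standard clustering routine; columns with identical patterns across genomes are the "common evolutionary patterns" that the abstract uses to propose functional links.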

Rivas, Javier De Las; Lozano, Juan Jose; Ortiz, Angel R.

2002-01-01

129

IDconverter and IDClight: Conversion and annotation of gene and protein IDs  

Microsoft Academic Search

BACKGROUND: Researchers involved in the annotation of large numbers of gene, clone or protein identifiers are usually required to perform a one-by-one conversion for each identifier. When the field of research is one such as microarray experiments, this number may be around 30,000. RESULTS: To help researchers map accession numbers and identifiers among clones, genes, proteins and chromosomal positions, we

Andreu Alibés; Patricio Yankilevich; Ramón Díaz-Uriarte

2007-01-01

130

CDD: specific functional annotation with the Conserved Domain Database  

Microsoft Academic Search

NCBI's Conserved Domain Database (CDD) is a collection of multiple sequence alignments and derived database search models, which represent protein domains conserved in molecular evolution. The collection can be accessed at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml, and is also part of NCBI's Entrez query and retrieval system, cross-linked to numerous other resources. CDD provides annotation of domain footprints and conserved

Aron Marchler-Bauer; John B. Anderson; Farideh Chitsaz; Myra K. Derbyshire; Carol DeWeese-Scott; Jessica H. Fong; Lewis Y. Geer; Renata C. Geer; Noreen R. Gonzales; Marc Gwadz; Siqian He; David I. Hurwitz; John D. Jackson; Zhaoxi Ke; Christopher J. Lanczycki; Cynthia A. Liebert; Chunlei Liu; Fu Lu; Shennan Lu; Gabriele H. Marchler; Mikhail Mullokandov; James S. Song; Asba Tasneem; Narmada Thanki; Roxanne A. Yamashita; Dachuan Zhang; Naigong Zhang; Stephen H. Bryant

2009-01-01

131

Phylogeny, Functional Annotation, and Protein Interaction Network Analyses of the Xenopus tropicalis Basic Helix-Loop-Helix Transcription Factors  

PubMed Central

A previous survey identified 70 basic helix-loop-helix (bHLH) proteins, but it proved to be incomplete, and the functional information and regulatory networks of frog bHLH transcription factors were not fully known. Therefore, we conducted an updated genome-wide survey of the Xenopus tropicalis genome project databases and identified 105 bHLH sequences. Phylogenetic analyses of the 105 retrieved sequences revealed that 103 bHLH proteins belonged to 43 families or subfamilies with 46, 26, 11, 3, 15, and 4 members in the corresponding supergroups. Next, gene ontology (GO) enrichment analyses showed 65 significant GO annotations of biological processes and molecular functions and KEGG pathways counted in frequency. To explore the functional pathways, regulatory gene networks, and/or related gene groups coding for Xenopus tropicalis bHLH proteins, the identified bHLH genes were submitted to the KOBAS and STRING databases to obtain pathway signaling information and protein interaction networks based on available public databases and known protein interactions. From the genome annotation and pathway analysis using KOBAS, we identified 16 pathways in the Xenopus tropicalis genome. From the STRING interaction analysis, 68 hub proteins were identified, and many hub proteins created a tight network or a functional module within the protein families. PMID:24312906

Chen, Deyu

2013-01-01

132

Annotation and comparative analysis of the glycoside hydrolase genes in Brachypodium distachyon  

SciTech Connect

Background Glycoside hydrolases cleave the bond between a carbohydrate and another carbohydrate, a protein, lipid or other moiety. Genes encoding glycoside hydrolases are found in a wide range of organisms, from archaea to animals, and are relatively abundant in plant genomes. In plants, these enzymes are involved in diverse processes, including starch metabolism, defense, and cell-wall remodeling. Glycoside hydrolase genes have been previously cataloged for Oryza sativa (rice), the model dicotyledonous plant Arabidopsis thaliana, and the fast-growing tree Populus trichocarpa (poplar). To improve our understanding of glycoside hydrolases in plants generally and in grasses specifically, we annotated the glycoside hydrolase genes in the grasses Brachypodium distachyon (an emerging monocotyledonous model) and Sorghum bicolor (sorghum). We then compared the glycoside hydrolases across species, both at the whole-genome level and at the level of individual glycoside hydrolase families. Results We identified 356 glycoside hydrolase genes in Brachypodium and 404 in sorghum. The corresponding proteins fell into the same 34 families that are represented in rice, Arabidopsis, and poplar, helping to define a glycoside hydrolase family profile which may be common to flowering plants. Examination of individual glycoside hydrolase families (GH5, GH13, GH18, GH19, GH28, and GH51) revealed both similarities and distinctions between monocots and dicots, as well as between species. Shared evolutionary histories appear to be modified by lineage-specific expansions or deletions. Within families, the Brachypodium and sorghum proteins generally cluster with those from other monocots. Conclusions This work provides the foundation for further comparative and functional analyses of plant glycoside hydrolases. Defining the Brachypodium glycoside hydrolases sets the stage for Brachypodium to be a monocot model for investigations of these enzymes and their diverse roles in planta. 
Insights gained from Brachypodium will inform translational research studies, with applications for the improvement of cereal crops and bioenergy grasses.

Tyler, Ludmila [United States Department of Agriculture (USDA), Western Regional Research Center (WRRC), Albany; Bragg, Jennifer [United States Department of Agriculture (USDA), Western Regional Research Center (WRRC), Albany; Wu, Jiajie [United States Department of Agriculture (USDA), Western Regional Research Center (WRRC), Albany; Yang, Xiaohan [ORNL; Tuskan, Gerald A [ORNL; Vogel, John [United States Department of Agriculture (USDA), Western Regional Research Center (WRRC), Albany

2010-01-01

133

Genome Annotation of Burkholderia sp. SJ98 with Special Focus on Chemotaxis Genes  

PubMed Central

Burkholderia sp. strain SJ98 exhibits chemotactic activity towards nitroaromatic and chloronitroaromatic compounds. Recently our group published the draft genome of strain SJ98. In this study, we further sequence and annotate the genome of strain SJ98 to exploit the potential of this bacterium. We specifically annotate its chemotaxis genes and methyl accepting chemotaxis proteins. The genome of Burkholderia sp. SJ98 was annotated using the PGAAP pipeline, which predicts 7,268 CDSs, 52 tRNAs and 3 rRNAs. Our analysis based on phylogenetic and comparative genomics suggests that Burkholderia sp. YI23 is the closest neighbor of strain SJ98. The genes involved in the chemotaxis of strain SJ98 were compared with genes of closely related Burkholderia strains (i.e. YI23, CCGE 1001, CCGE 1002, CCGE 1003) and with the well-characterized bacterium E. coli K12. It was found that strain SJ98 has 37 che genes, including 19 methyl accepting chemotaxis proteins that are involved in sensing different attractants. Chemotaxis genes were found in a cluster along with the flagellar motor proteins. We also developed a web resource that provides comprehensive information on strain SJ98, including all analysis data (http://crdd.osdd.net/raghava/genomesrs/burkholderia/). PMID:23940608

Kumar, Shailesh; Vikram, Surendra; Raghava, Gajendra Pal Singh

2013-01-01

134

An integrated gene annotation and transcriptional profiling approach towards the full gene content of the Drosophila genome  

PubMed Central

Background While the genome sequences for a variety of organisms are now available, the precise number of the genes encoded is still a matter of debate. For the human genome several stringent annotation approaches have resulted in the same number of potential genes, but a careful comparison revealed only limited overlap. This indicates that only the combination of different computational prediction methods and experimental evaluation of such in silico data will provide more complete genome annotations. In order to get a more complete gene content of the Drosophila melanogaster genome, we based our new D. melanogaster whole-transcriptome microarray, the Heidelberg FlyArray, on the combination of the Berkeley Drosophila Genome Project (BDGP) annotation and a novel ab initio gene prediction of lower stringency using the Fgenesh software. Results Here we provide evidence for the transcription of approximately 2,600 additional genes predicted by Fgenesh. Validation of the developmental profiling data by RT-PCR and in situ hybridization indicates a lower limit of 2,000 novel annotations, thus substantially raising the number of genes that make a fly. Conclusions The successful design and application of this novel Drosophila microarray on the basis of our integrated in silico/wet biology approach confirms our expectation that in silico approaches alone will always tend to be incomplete. The identification of at least 2,000 novel genes highlights the importance of gathering experimental evidence to discover all genes within a genome. Moreover, as such an approach is independent of homology criteria, it will allow the discovery of novel genes unrelated to known protein families or those that have not been strictly conserved between species. PMID:14709175

Hild, M; Beckmann, B; Haas, SA; Koch, B; Solovyev, V; Busold, C; Fellenberg, K; Boutros, M; Vingron, M; Sauer, F; Hoheisel, JD; Paro, R

2004-01-01

135

Likelihood-Based Gene Annotations for Gap Filling and Quality Assessment in Genome-Scale Metabolic Models  

PubMed Central

Genome-scale metabolic models provide a powerful means to harness information from genomes to deepen biological insights. With exponentially increasing sequencing capacity, there is an enormous need for automated reconstruction techniques that can provide more accurate models in a short time frame. Current methods for automated metabolic network reconstruction rely on gene and reaction annotations to build draft metabolic networks and algorithms to fill gaps in these networks. However, automated reconstruction is hampered by database inconsistencies, incorrect annotations, and gap filling largely without considering genomic information. Here we develop an approach for applying genomic information to predict alternative functions for genes and estimate their likelihoods from sequence homology. We show that computed likelihood values were significantly higher for annotations found in manually curated metabolic networks than those that were not. We then apply these alternative functional predictions to estimate reaction likelihoods, which are used in a new gap filling approach called likelihood-based gap filling to predict more genomically consistent solutions. To validate the likelihood-based gap filling approach, we applied it to models where essential pathways were removed, finding that likelihood-based gap filling identified more biologically relevant solutions than parsimony-based gap filling approaches. We also demonstrate that models gap filled using likelihood-based gap filling provide greater coverage and genomic consistency with metabolic gene functions compared to parsimony-based approaches. Interestingly, despite these findings, we found that likelihoods did not significantly affect consistency of gap filled models with Biolog and knockout lethality data. 
This indicates that the phenotype data alone cannot necessarily be used to discriminate between alternative solutions for gap filling and therefore, that the use of other information is necessary to obtain a more accurate network. All described workflows are implemented as part of the DOE Systems Biology Knowledgebase (KBase) and are publicly available via API or command-line web interface. PMID:25329157

Benedict, Matthew N.; Mundy, Michael B.; Henry, Christopher S.; Chia, Nicholas; Price, Nathan D.

2014-01-01

136

Likelihood-based gene annotations for gap filling and quality assessment in genome-scale metabolic models.  

PubMed

Genome-scale metabolic models provide a powerful means to harness information from genomes to deepen biological insights. With exponentially increasing sequencing capacity, there is an enormous need for automated reconstruction techniques that can provide more accurate models in a short time frame. Current methods for automated metabolic network reconstruction rely on gene and reaction annotations to build draft metabolic networks and algorithms to fill gaps in these networks. However, automated reconstruction is hampered by database inconsistencies, incorrect annotations, and gap filling largely without considering genomic information. Here we develop an approach for applying genomic information to predict alternative functions for genes and estimate their likelihoods from sequence homology. We show that computed likelihood values were significantly higher for annotations found in manually curated metabolic networks than those that were not. We then apply these alternative functional predictions to estimate reaction likelihoods, which are used in a new gap filling approach called likelihood-based gap filling to predict more genomically consistent solutions. To validate the likelihood-based gap filling approach, we applied it to models where essential pathways were removed, finding that likelihood-based gap filling identified more biologically relevant solutions than parsimony-based gap filling approaches. We also demonstrate that models gap filled using likelihood-based gap filling provide greater coverage and genomic consistency with metabolic gene functions compared to parsimony-based approaches. Interestingly, despite these findings, we found that likelihoods did not significantly affect consistency of gap filled models with Biolog and knockout lethality data. 
This indicates that the phenotype data alone cannot necessarily be used to discriminate between alternative solutions for gap filling and therefore, that the use of other information is necessary to obtain a more accurate network. All described workflows are implemented as part of the DOE Systems Biology Knowledgebase (KBase) and are publicly available via API or command-line web interface. PMID:25329157

Benedict, Matthew N; Mundy, Michael B; Henry, Christopher S; Chia, Nicholas; Price, Nathan D

2014-10-01

137

Structural and functional analysis of Rv3214 from Mycobacterium tuberculosis, a protein with conflicting functional annotations, leads to its characterization as a phosphatase.  

PubMed

The availability of complete genome sequences has highlighted the problems of functional annotation of the many gene products that have only limited sequence similarity with proteins of known function. The predicted protein encoded by open reading frame Rv3214 from the Mycobacterium tuberculosis H37Rv genome was originally annotated as EntD through sequence similarity with the Escherichia coli EntD, a 4'-phosphopantetheinyl transferase implicated in siderophore biosynthesis. An alternative annotation, based on slightly higher sequence identity, grouped Rv3214 with proteins of the cofactor-dependent phosphoglycerate mutase (dPGM) family. The crystal structure of this protein has been solved by single-wavelength anomalous dispersion methods and refined at 2.07-Angstroms resolution (R = 0.229; R(free) = 0.245). The protein is dimeric, with a monomer fold corresponding to the classical dPGM alpha/beta structure, albeit with some variations. Closer comparisons of structure and sequence indicate that it most closely corresponds with a broad-spectrum phosphatase subfamily within the dPGM superfamily. This functional annotation has been confirmed by biochemical assays which show negligible mutase activity but acid phosphatase activity with a pH optimum of 5.4 and suggests that Rv3214 may be important for mycobacterial phosphate metabolism in vivo. Despite its weak sequence similarity with the 4'-phosphopantetheinyl transferases (EntD homologues), there is little evidence to support this function. PMID:16672613

Watkins, Harriet A; Baker, Edward N

2006-05-01

138

Global protein function annotation through mining genome-scale data in yeast Saccharomyces cerevisiae  

PubMed Central

As we are moving into the post genome-sequencing era, various high-throughput experimental techniques have been developed to characterize biological systems on the genomic scale. Discovering new biological knowledge from the high-throughput biological data is a major challenge to bioinformatics today. To address this challenge, we developed a Bayesian statistical method together with Boltzmann machine and simulated annealing for protein functional annotation in the yeast Saccharomyces cerevisiae through integrating various high-throughput biological data, including yeast two-hybrid data, protein complexes and microarray gene expression profiles. In our approach, we quantified the relationship between functional similarity and high-throughput data, and coded the relationship into a ‘functional linkage graph’, where each node represents one protein and the weight of each edge is characterized by the Bayesian probability of function similarity between two proteins. We also integrated the evolution information and protein subcellular localization information into the prediction. Based on our method, 1802 out of 2280 unannotated proteins in yeast were assigned functions systematically. PMID:15585665
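
One way to picture an edge weight in such a functional linkage graph is a posterior probability obtained by combining several evidence sources. The sketch below multiplies prior odds by one likelihood ratio per evidence type, which assumes the evidence sources are independent; that independence assumption, the prior, the likelihood ratios, and the protein names are all illustrative and are not taken from the paper, whose method also involves a Boltzmann machine and simulated annealing that are not shown here.

```python
def posterior_same_function(prior, likelihood_ratios):
    """Combine a prior probability with likelihood ratios from
    independent evidence sources and return the posterior probability."""
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)

def build_linkage_graph(evidence, prior):
    """Map each protein pair to an edge weight (posterior probability)."""
    return {
        pair: posterior_same_function(prior, ratios)
        for pair, ratios in evidence.items()
    }

# Hypothetical likelihood ratios, e.g. from two-hybrid and co-expression data.
evidence = {
    ("protein_x", "protein_y"): [9.0],
    ("protein_x", "protein_z"): [3.0, 3.0],
    ("protein_y", "protein_z"): [],
}
graph = build_linkage_graph(evidence, prior=0.1)
for pair, weight in graph.items():
    print(pair, round(weight, 3))
```

A pair with no supporting evidence keeps the prior as its weight, so edges only become strong when the data actually favour shared function.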

Chen, Yu; Xu, Dong

2004-01-01

139

Whole-genome annotation by using evidence integration in functional-linkage networks  

E-print Network

Whole-genome annotation by using evidence integration in functional-linkage networks Ulas Karaoz are ``hypothetical'' proteins (2). Several research groups (3, 4) have popularized the framework of a ``functional-linkage between proteins. In a typical functional-linkage graph, each node corresponds to a protein, and an edge

140

Functional Annotation of Proteomic Data from Chicken Heterophils and Macrophages Induced by Carbon Nanotube Exposure  

PubMed Central

With the expanding applications of carbon nanotubes (CNT) in biomedicine and agriculture, questions about the toxicity and biocompatibility of CNT in humans and domestic animals are becoming matters of serious concern. This study used proteomic methods to profile gene expression in chicken macrophages and heterophils in response to CNT exposure. Two-dimensional gel electrophoresis identified 12 proteins in macrophages and 15 in heterophils, with differential expression patterns in response to CNT co-incubation (0, 1, 10, and 100 µg/mL of CNT for 6 h) (p < 0.05). Gene ontology analysis showed that most of the differentially expressed proteins are associated with protein interactions, cellular metabolic processes, and cell mobility, suggesting activation of innate immune functions. Western blot analysis with heat shock protein 70, high mobility group protein, and peptidylprolyl isomerase A confirmed the alterations of the profiled proteins. The functional annotations were further confirmed by effective cell migration, promoted interleukin-1? secretion, and more cell death in both macrophages and heterophils exposed to CNT (p < 0.05). In conclusion, results of this study suggest that CNT exposure affects protein expression, leading to activation of macrophages and heterophils, resulting in altered cytoskeleton remodeling, cell migration, and cytokine production, and thereby mediates tissue immune responses. PMID:24823882

Li, Yun-Ze; Cheng, Chung-Shi; Chen, Chao-Jung; Li, Zi-Lin; Lin, Yao-Tung; Chen, Shuen-Ei; Huang, San-Yuan

2014-01-01

141

Functional annotation of proteomic data from chicken heterophils and macrophages induced by carbon nanotube exposure.  

PubMed

With the expanding applications of carbon nanotubes (CNT) in biomedicine and agriculture, questions about the toxicity and biocompatibility of CNT in humans and domestic animals are becoming matters of serious concern. This study used proteomic methods to profile gene expression in chicken macrophages and heterophils in response to CNT exposure. Two-dimensional gel electrophoresis identified 12 proteins in macrophages and 15 in heterophils, with differential expression patterns in response to CNT co-incubation (0, 1, 10, and 100 µg/mL of CNT for 6 h) (p < 0.05). Gene ontology analysis showed that most of the differentially expressed proteins are associated with protein interactions, cellular metabolic processes, and cell mobility, suggesting activation of innate immune functions. Western blot analysis with heat shock protein 70, high mobility group protein, and peptidylprolyl isomerase A confirmed the alterations of the profiled proteins. The functional annotations were further confirmed by effective cell migration, promoted interleukin-1? secretion, and more cell death in both macrophages and heterophils exposed to CNT (p < 0.05). In conclusion, results of this study suggest that CNT exposure affects protein expression, leading to activation of macrophages and heterophils, resulting in altered cytoskeleton remodeling, cell migration, and cytokine production, and thereby mediates tissue immune responses. PMID:24823882

Li, Yun-Ze; Cheng, Chung-Shi; Chen, Chao-Jung; Li, Zi-Lin; Lin, Yao-Tung; Chen, Shuen-Ei; Huang, San-Yuan

2014-01-01

142

Rapid Annotation of Anonymous Sequences from Genome Projects Using Semantic Similarities and a Weighting Scheme in Gene Ontology  

PubMed Central

Background Large-scale sequencing projects have now become routine lab practice and this has led to the development of a new generation of tools involving function prediction methods, bringing the latter back to the fore. The advent of Gene Ontology, with its structured vocabulary and paradigm, has provided computational biologists with an appropriate means for this task. Methodology We present here a novel method called ARGOT (Annotation Retrieval of Gene Ontology Terms) that is able to process quickly thousands of sequences for functional inference. The tool exploits for the first time an integrated approach which combines clustering of GO terms, based on their semantic similarities, with a weighting scheme which assesses retrieved hits sharing a certain number of biological features with the sequence to be annotated. These hits may be obtained by different methods and in this work we have based ARGOT processing on BLAST results. Conclusions The extensive benchmark involved 10,000 protein sequences, the complete S. cerevisiae genome and a small subset of proteins for purposes of comparison with other available tools. The algorithm was proven to outperform existing methods and to be suitable for function prediction of single proteins due to its high degree of sensitivity, specificity and coverage. PMID:19247487

Fontana, Paolo; Cestaro, Alessandro; Velasco, Riccardo; Formentin, Elide; Toppo, Stefano

2009-01-01

143

Annotation and retrieval system of CAD models based on functional semantics  

NASA Astrophysics Data System (ADS)

CAD model retrieval based on functional semantics is more significant than content-based 3D model retrieval during the mechanical conceptual design phase. However, relevant research is still not fully discussed. Therefore, a functional semantic-based CAD model annotation and retrieval method is proposed to support mechanical conceptual design and design reuse, inspire designer creativity through existing CAD models, shorten the design cycle, and reduce costs. Firstly, the CAD model functional semantic ontology is constructed to formally represent the functional semantics of CAD models and describe the mechanical conceptual design space comprehensively and consistently. Secondly, an approach to represent CAD models as attributed adjacency graphs (AAG) is proposed. In this method, the geometry and topology data are extracted from STEP models. On the basis of AAG, the functional semantics of CAD models are annotated semi-automatically by matching CAD models that contain the partial features of which functional semantics have been annotated manually, thereby constructing a CAD model repository that supports model retrieval based on functional semantics. Thirdly, a CAD model retrieval algorithm that supports multi-function extended retrieval is proposed to explore more potential creative design knowledge at the semantic level. Finally, a prototype system, called Functional Semantic-based CAD Model Annotation and Retrieval System (FSMARS), is implemented. A case demonstrates that FSMARS can successfully obtain multiple potential CAD models that conform to the desired function. The proposed research addresses actual needs and presents a new way to acquire CAD models in the mechanical conceptual design phase.

Wang, Zhansong; Tian, Ling; Duan, Wenrui

2014-10-01

144

TriAnnot: A Versatile and High Performance Pipeline for the Automated Annotation of Plant Genomes.  

PubMed

In support of the international effort to obtain a reference sequence of the bread wheat genome and to provide plant communities dealing with large and complex genomes with a versatile, easy-to-use online automated tool for annotation, we have developed the TriAnnot pipeline. Its modular architecture allows for the annotation and masking of transposable elements, the structural, and functional annotation of protein-coding genes with an evidence-based quality indexing, and the identification of conserved non-coding sequences and molecular markers. The TriAnnot pipeline is parallelized on a 712 CPU computing cluster that can run a 1-Gb sequence annotation in less than 5 days. It is accessible through a web interface for small scale analyses or through a server for large scale annotations. The performance of TriAnnot was evaluated in terms of sensitivity, specificity, and general fitness using curated reference sequence sets from rice and wheat. In less than 8 h, TriAnnot was able to predict more than 83% of the 3,748 CDS from rice chromosome 1 with a fitness of 67.4%. On a set of 12 reference Mb-sized contigs from wheat chromosome 3B, TriAnnot predicted and annotated 93.3% of the genes among which 54% were perfectly identified in accordance with the reference annotation. It also allowed the curation of 12 genes based on new biological evidences, increasing the percentage of perfect gene prediction to 63%. TriAnnot systematically showed a higher fitness than other annotation pipelines that are not improved for wheat. As it is easily adaptable to the annotation of other plant genomes, TriAnnot should become a useful resource for the annotation of large and complex genomes in the future. PMID:22645565

Leroy, Philippe; Guilhot, Nicolas; Sakai, Hiroaki; Bernard, Aurélien; Choulet, Frédéric; Theil, Sébastien; Reboux, Sébastien; Amano, Naoki; Flutre, Timothée; Pelegrin, Céline; Ohyanagi, Hajime; Seidel, Michael; Giacomoni, Franck; Reichstadt, Mathieu; Alaux, Michael; Gicquello, Emmanuelle; Legeai, Fabrice; Cerutti, Lorenzo; Numa, Hisataka; Tanaka, Tsuyoshi; Mayer, Klaus; Itoh, Takeshi; Quesneville, Hadi; Feuillet, Catherine

2012-01-01

145

BioBuilder as a database development and functional annotation platform for proteins  

PubMed Central

Background The explosion in biological information creates the need for databases that are easy to develop, easy to maintain and can be easily manipulated by annotators who are most likely to be biologists. However, deployment of scalable and extensible databases is not an easy task and generally requires substantial expertise in database development. Results BioBuilder is a Zope-based software tool that was developed to facilitate intuitive creation of protein databases. Protein data can be entered and annotated through web forms along with the flexibility to add customized annotation features to protein entries. A built-in review system permits a global team of scientists to coordinate their annotation efforts. We have already used BioBuilder to develop Human Protein Reference Database, a comprehensive annotated repository of the human proteome. The data can be exported in the extensible markup language (XML) format, which is rapidly becoming the standard format for data exchange. Conclusions As the proteomic data for several organisms begins to accumulate, BioBuilder will prove to be an invaluable platform for functional annotation and development of customizable protein centric databases. BioBuilder is open source and is available under the terms of LGPL. PMID:15099404

Navarro, J Daniel; Talreja, Naveen; Peri, Suraj; Vrushabendra, BM; Rashmi, BP; Padma, N; Surendranath, Vineeth; Jonnalagadda, Chandra Kiran; Kousthub, PS; Deshpande, Nandan; Shanker, K; Pandey, Akhilesh

2004-01-01

146

A Fugu-Human Genome Synteny Viewer: web software for graphical display and annotation reports of synteny between Fugu genomic sequence and human genes  

Microsoft Academic Search

A web server has been developed to access annotation and graphical reports of synteny and gene order between the Fugu genome and human genes. In this system, the assembled Fugu genomic sequences (also known as scaffolds) are annotated. The annotations for each Fugu scaffold are computed, stored and made publicly available. The annotations describe matches to human homo-

Mark Halling-Brown; Clare Sansom; David S. Moss; Greg Elgar; Yvonne J. K. Edwards

2004-01-01

147

The proteome of Toxoplasma gondii: integration with the genome provides novel insights into gene expression and annotation  

PubMed Central

Background Although the genomes of many of the most important human and animal pathogens have now been sequenced, our understanding of the actual proteins expressed by these genomes and how well they predict protein sequence and expression is still deficient. We have used three complementary approaches (two-dimensional electrophoresis, gel-liquid chromatography linked tandem mass spectrometry and MudPIT) to analyze the proteome of Toxoplasma gondii, a parasite of medical and veterinary significance, and have developed a public repository for these data within ToxoDB, making for the first time proteomics data an integral part of this key genome resource. Results The draft genome for Toxoplasma predicts around 8,000 genes with varying degrees of confidence. Our data demonstrate how proteomics can inform these predictions and help discover new genes. We have identified nearly one-third (2,252) of all the predicted proteins, with 2,477 intron-spanning peptides providing supporting evidence for correct splice site annotation. Functional predictions for each protein and key pathways were determined from the proteome. Importantly, we show evidence for many proteins that match alternative gene models, or previously unpredicted genes. For example, approximately 15% of peptides matched more convincingly to alternative gene models. We also compared our data with existing transcriptional data in which we highlight apparent discrepancies between gene transcription and protein expression. Conclusion Our data demonstrate the importance of protein data in expression profiling experiments and highlight the necessity of integrating proteomic with genomic data so that iterative refinements of both annotation and expression models are possible. PMID:18644147

Xia, Dong; Sanderson, Sanya J; Jones, Andrew R; Prieto, Judith H; Yates, John R; Bromley, Elizabeth; Tomley, Fiona M; Lal, Kalpana; Sinden, Robert E; Brunk, Brian P; Roos, David S; Wastling, Jonathan M

2008-01-01

148

Implications of functional similarity for gene regulatory interactions  

PubMed Central

If one gene regulates another, those two genes are likely to be involved in many of the same biological functions. Conversely, shared biological function may be suggestive of the existence and nature of a regulatory interaction. With this in mind, we develop a measure of functional similarity between genes based on annotations made to the Gene Ontology in which the magnitude of their functional relationship is also indicative of a regulatory relationship. In contrast to other measures that have previously been used to quantify the functional similarity between genes, our measure scales the strength of any shared functional annotation by the frequency of that function's appearance across the entire set of annotations. We apply our method to both Escherichia coli and Saccharomyces cerevisiae gene annotations and find that the strength of our scaled similarity measure is more predictive of known regulatory interactions than previously published measures of functional similarity. In addition, we observe that the strength of the scaled similarity measure is correlated with the structural importance of links in the known regulatory network. By contrast, other measures of functional similarity are not indicative of any structural importance in the regulatory network. We therefore conclude that adequately adjusting for the frequency of shared biological functions is important in the construction of a functional similarity measure aimed at elucidating the existence and nature of regulatory interactions. We also compare the performance of the scaled similarity with a high-throughput method for determining regulatory interactions from gene expression data and observe that the ontology-based approach identifies a different subset of regulatory interactions compared with the gene expression approach. 
We show that combining predictions from the scaled similarity with those from the reconstruction algorithm leads to a significant improvement in the accuracy of the reconstructed network. PMID:22298814

Glass, Kimberly; Ott, Edward; Losert, Wolfgang; Girvan, Michelle

2012-01-01
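
The frequency scaling described in the abstract above (record 148) can be illustrated with a small sketch. This is not the authors' published formula, which the abstract does not give; it assumes, purely for illustration, that each shared annotation term contributes a weight equal to the inverse of the number of genes carrying that term. The function name, gene names, and term labels are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def scaled_similarity(annotations):
    """Score every gene pair by its shared terms, down-weighting common terms.

    annotations: dict mapping gene name -> set of annotation terms.
    Assumption (not from the source): a shared term counts 1 / (number of
    genes annotated with it), so rare shared functions dominate the score.
    """
    # How many genes carry each term across the whole annotation set.
    term_counts = Counter(t for terms in annotations.values() for t in set(terms))
    scores = {}
    for a, b in combinations(sorted(annotations), 2):
        shared = set(annotations[a]) & set(annotations[b])
        scores[(a, b)] = sum(1.0 / term_counts[t] for t in shared)
    return scores


# Toy data with made-up term labels: one rare term and one common term.
toy = {
    "geneA": {"GO:rare", "GO:common"},
    "geneB": {"GO:rare", "GO:common"},
    "geneC": {"GO:common"},
    "geneD": {"GO:common"},
}
scores = scaled_similarity(toy)
```

In this toy set, geneA and geneB share the rare term and therefore score higher than any pair that shares only the common term, which is the qualitative behaviour the abstract attributes to the scaled measure.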

149

Integrative Annotation of 21,037 Human Genes Validated by Full-Length cDNA Clones  

SciTech Connect

The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult task, as the varieties of transcripts of a gene may vary to a great extent. We thus performed an exhaustive integrative characterization of 41,118 full-length cDNAs that capture the gene transcripts as complete functional cassettes, providing an unequivocal report of structural and functional diversity at the gene level. Our international collaboration has validated 21,037 human gene candidates by analysis of high-quality full-length cDNA clones through curation using unified criteria. This led to the identification of 5,155 new gene candidates. It also manifested the most reliable way to control the quality of the cDNA clones. We have developed a human gene database, called the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/). It provides the following: integrative annotation of human genes, description of gene structures, details of novel alternative splicing isoforms, non-protein-coding RNAs, functional domains, subcellular localizations, metabolic pathways, predictions of protein three-dimensional structure, mapping of known single nucleotide polymorphisms (SNPs), identification of polymorphic microsatellite repeats within human genes, and comparative results with mouse full-length cDNAs. The H-InvDB analysis has shown that up to 4 percent of the human genome sequence (National Center for Biotechnology Information build 34 assembly) may contain misassembled or missing regions. 
We found that 6.5 percent of the human gene candidates (1,377 loci) did not have a good protein-coding open reading frame, of which 296 loci are strong candidates for nonprotein-coding RNA genes. In addition, among 72,027 uniquely mapped SNPs and insertions/deletions localized within human genes, 13,215 nonsynonymous SNPs, 315 nonsense SNPs, and 452 indels occurred in coding regions. Together with 25 polymorphic microsatellite repeats present in coding regions, they may alter protein structure, causing phenotypic effects or resulting in disease. The H-InvDB platform represents a substantial contribution to resources needed for the exploration of human biology and pathology.

Imanishi, Tadashi; Itoh, Takeshi; Suzuki, Yutaka; O'Donovan, Claire; Fukuchi, Satoshi; Koyanagi, Kanako O.; Barrero, Roberto A.; Tamura, Takuro; Yamaguchi-Kabata, Yumi; Tanino, Motohiko; Yura, Kei; Miyazaki, Satoru; Ikeo, Kazuho; Homma, Keiichi; Kasprzyk, Arek; Nishikawa, Tetsuo; Hirakawa, Mika; Thierry-Mieg, Jean; Thierry-Mieg, Danielle; Ashurst, Jennifer; Jia, Libin; Nakao, Mitsuteru; Thomas, Michael A.; Mulder, Nicola; Karavidopoulou, Youla; Jin, Lihua; Kim, Sangsoo; Yasuda, Tomohiro; Lenhard, Boris; Eveno, Eric; Suzuki, Yoshiyuki; Yamasaki, Chisato; Takeda, Jun-ichi; Gough, Craig; Hilton, Phillip; Fujii, Yasuyuki; Sakai, Hiroaki; Tanaka, Susumu; Amid, Clara; Bellgard, Matthew; de Fatima Bonaldo, Maria; Bono, Hidemasa; Bromberg, Susan K.; Brookes, Anthony J.; Bruford, Elspeth; Carninci, Piero; Chelala, Claude; Couillault, Christine; de Souza, Sandro J.; Debily, Marie-Anne; Devignes, Marie-Dominique; Dubchak, Inna; Endo, Toshinori; Estreicher, Anne; Eyras, Eduardo; Fukami-Kobayashi, Kaoru; Gopinath, Gopal R.; Graudens, Esther; Hahn, Yoonsoo; Han, Michael; Han, Ze-Guang; Hanada, Kousuke; Hanaoka, Hideki; Harada, Erimi; Hashimoto, Katsuyuki; Hinz, Ursula; Hirai, Momoki; Hishiki, Teruyoshi; Hopkinson, Ian; Imbeaud, Sandrine; Inoko, Hidetoshi; Kanapin, Alexander; Kaneko, Yayoi; Kasukawa, Takeya; Kelso, Janet; Kersey, Paul; Kikuno, Reiko; Kimura, Kouichi; Korn, Bernhard; Kuryshev, Vladimir; Makalowska, Izabela; Makino, Takashi; Mano, Shuhei; Mariage-Samson, Regine; Mashima, Jun; Matsuda, Hideo; Mewes, Hans-Werner; Minoshima, Shinsei; Nagai, Keiichi; Nagasaki, Hideki; Nagata, Naoki; Nigam, Rajni; Ogasawara, Osamu; Ohara, Osamu; Ohtsubo, Masafumi; Okada, Norihiro; Okido, Toshihisa; Oota, Satoshi; Ota, Motonori; Ota, Toshio; Otsuki, Tetsuji; Piatier-Tonneau, Dominique; Poustka, Annemarie; Ren, Shuang-Xi; Saitou, Naruya; Sakai, Katsunaga; Sakamoto, Shigetaka; Sakate, Ryuichi; Schupp, Ingo; Servant, Florence; Sherry, Stephen; Shiba, Rie; et al.

2004-01-15

150

Canine candidate genes for dilated cardiomyopathy: annotation of and polymorphic markers for 14 genes  

Microsoft Academic Search

BACKGROUND: Dilated cardiomyopathy is a myocardial disease occurring in humans and domestic animals and is characterized by dilatation of the left ventricle, reduced systolic function and increased sphericity of the left ventricle. Dilated cardiomyopathy has been observed in several, mostly large and giant, dog breeds, such as the Dobermann and the Great Dane. A number of genes have been identified,

Anje C Wiersma; Peter AJ Leegwater; Bernard A van Oost; William E Ollier; Joanna Dukes-McEwan

2007-01-01

151

Proteomics and transcriptomics of the BABA-induced resistance response in potato using a novel functional annotation approach  

PubMed Central

Background Induced resistance (IR) can be part of a sustainable plant protection strategy against important plant diseases. β-aminobutyric acid (BABA) can induce resistance in a wide range of plants against several types of pathogens, including potato infected with Phytophthora infestans. However, the molecular mechanisms behind this are unclear and seem to be dependent on the system studied. To elucidate the defence responses activated by BABA in potato, a genome-wide transcript microarray analysis in combination with label-free quantitative proteomics analysis of the apoplast secretome were performed two days after treatment of the leaf canopy with BABA at two concentrations, 1 and 10 mM. Results Over 5000 transcripts were differentially expressed and over 90 secretome proteins changed in abundance indicating a massive activation of defence mechanisms with 10 mM BABA, the concentration effective against late blight disease. To aid analysis, we present a more comprehensive functional annotation of the microarray probes and gene models by retrieving information from orthologous gene families across 26 sequenced plant genomes. The new annotation provided GO terms to 8616 previously un-annotated probes. Conclusions BABA at 10 mM affected several processes related to plant hormones and amino acid metabolism. A major accumulation of PR proteins was also evident, and in the mevalonate pathway, genes involved in sterol biosynthesis were down-regulated, whereas several enzymes involved in the sesquiterpene phytoalexin biosynthesis were up-regulated. Interestingly, abscisic acid (ABA) responsive genes were not as clearly regulated by BABA in potato as previously reported in Arabidopsis. Together these findings provide candidates and markers for improved resistance in potato, one of the most important crops in the world. PMID:24773703

2014-01-01

152

Improved structural annotation of protein-coding genes in the Meloidogyne hapla genome using RNA-Seq  

PubMed Central

As high-throughput cDNA sequencing (RNA-Seq) is increasingly applied to hypothesis-driven biological studies, the prediction of protein-coding genes based on these data is usurping strictly in silico approaches. Compared with computationally derived gene predictions, structural annotation is more accurate when based on biological evidence, particularly RNA-Seq data. Here, we refine the current genome annotation for the Meloidogyne hapla genome utilizing RNA-Seq data. Published structural annotation defines 14,420 protein-coding genes in the M. hapla genome. Of these, 25% (3751) were found to exhibit some incongruence with RNA-Seq data. Manual annotation enabled these discrepancies to be resolved. Our analysis revealed 544 new gene models that were missing from the prior annotation. Additionally, 1457 transcribed regions were newly identified on the ends of as-yet-unjoined contigs. We also searched for trans-spliced leaders, and based on RNA-Seq data, identified genes that appear to be trans-spliced. Four 22-bp trans-spliced leaders were identified using our pipeline, including the known trans-spliced leader, which is the M. hapla ortholog of SL1. In silico predictions of trans-splicing were validated by comparison with earlier results derived from an independent cDNA library constructed to capture trans-spliced transcripts. The new annotation, which we term HapPep5, is publicly available at www.hapla.org.

Guo, Yuelong; Bird, David McK; Nielsen, Dahlia M

2014-01-01

153

A meta-approach for improving the prediction and the functional annotation of ortholog groups  

PubMed Central

Background In comparative genomics, orthologs are used to transfer annotation from genes already characterized to newly sequenced genomes. Many methods have been developed for finding orthologs in sets of genomes. However, the application of different methods on the same proteome set can lead to distinct orthology predictions. Methods We developed a method based on a meta-approach that is able to combine the results of several methods for orthologous group prediction. The purpose of this method is to produce better quality results by using the overlapping results obtained from several individual orthologous gene prediction procedures. Our method proceeds in two steps. The first aims to construct seeds for groups of orthologous genes; these seeds correspond to the exact overlaps between the results of all or several methods. In the second step, these seed groups are expanded by using HMM profiles. Results We evaluated our method on two standard reference benchmarks, OrthoBench and Orthology Benchmark Service. Our method presents a higher level of accurately predicted groups than the individual input methods of orthologous group prediction. Moreover, our method increases the number of annotated orthologous pairs without decreasing the annotation quality compared to twelve state-of-the-art methods. Conclusions The meta-approach based method appears to be a reliable procedure for predicting orthologous groups. Since a large number of methods for predicting groups of orthologous genes exist, it is quite conceivable to apply this meta-approach to several combinations of different methods.

2014-01-01
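
The first step of the meta-approach described in the abstract above (record 153), building seeds from exact overlaps between the outputs of several orthology methods, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the `min_methods` threshold parameter, and the gene identifiers are hypothetical, and the second step (expanding seeds with HMM profiles) is not shown.

```python
from collections import Counter


def seed_groups(predictions, min_methods):
    """Return groups predicted identically by at least `min_methods` methods.

    predictions: list with one entry per method; each entry is an iterable
    of predicted ortholog groups (each group an iterable of gene IDs).
    Assumption (hypothetical): "exact overlap" means identical membership.
    """
    counts = Counter()
    for groups in predictions:
        # Count each distinct group once per method, ignoring member order.
        counts.update({frozenset(g) for g in groups})
    return {g for g, n in counts.items() if n >= min_methods}


# Toy predictions from three hypothetical methods.
method_a = [{"h1", "m1"}, {"h2", "m2", "f2"}]
method_b = [{"h1", "m1"}, {"h2", "m2"}]
method_c = [{"m1", "h1"}, {"h2", "m2", "f2"}]

seeds_all = seed_groups([method_a, method_b, method_c], min_methods=3)
seeds_two = seed_groups([method_a, method_b, method_c], min_methods=2)
```

Requiring agreement from all three methods keeps only the group they all report identically; lowering the threshold to two also admits the three-member group that two of the methods share, mirroring the abstract's "all or several methods".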

154

Gene networks in Drosophila melanogaster: integrating experimental data to predict gene function  

PubMed Central

Background Discovering the functions of all genes is a central goal of contemporary biomedical research. Despite considerable effort, we are still far from achieving this goal in any metazoan organism. Collectively, the growing body of high-throughput functional genomics data provides evidence of gene function, but remains difficult to interpret. Results We constructed the first network of functional relationships for Drosophila melanogaster by integrating most of the available, comprehensive sets of genetic interaction, protein-protein interaction, and microarray expression data. The complete integrated network covers 85% of the currently known genes, which we refined to a high confidence network that includes 20,000 functional relationships among 5,021 genes. An analysis of the network revealed a remarkable concordance with prior knowledge. Using the network, we were able to infer a set of high-confidence Gene Ontology biological process annotations on 483 of the roughly 5,000 previously unannotated genes. We also show that this approach is a means of inferring annotations on a class of genes that cannot be annotated based solely on sequence similarity. Lastly, we demonstrate the utility of the network through reanalyzing gene expression data to both discover clusters of coregulated genes and compile a list of candidate genes related to specific biological processes. Conclusions Here we present the first genome-wide functional gene network in D. melanogaster. The network enables the exploration, mining, and reanalysis of experimental data, as well as the interpretation of new data. The inferred annotations provide testable hypotheses of previously uncharacterized genes. PMID:19758432

Costello, James C; Dalkilic, Mehmet M; Beason, Scott M; Gehlhausen, Jeff R; Patwardhan, Rupali; Middha, Sumit; Eads, Brian D; Andrews, Justen R

2009-01-01

155

Comprehensive functional annotation of 18 missense mutations found in suspected hemochromatosis type 4 patients.  

PubMed

Hemochromatosis type 4 is a rare form of primary iron overload transmitted as an autosomal dominant trait caused by mutations in the gene encoding the iron transport protein ferroportin 1 (SLC40A1). SLC40A1 mutations fall into two functional categories (loss- versus gain-of-function) underlying two distinct clinical entities (hemochromatosis type 4A versus type 4B). However, the vast majority of SLC40A1 mutations are rare missense variations, with only a few showing strong evidence of causality. The present study reports the results of an integrated approach collecting genetic and phenotypic data from 44 suspected hemochromatosis type 4 patients, with comprehensive structural and functional annotations. Causality was demonstrated for 10 missense variants, showing a clear dichotomy between the two hemochromatosis type 4 subtypes. Two subgroups of loss-of-function mutations were distinguished: one impairing cell-surface expression and one altering only iron egress. Additionally, a new gain-of-function mutation was identified, and the degradation of ferroportin on hepcidin binding was shown to probably depend on the integrity of a large extracellular loop outside of the hepcidin-binding domain. Eight further missense variations, on the other hand, were shown to have no discernible effects at either protein or RNA level; these were found in apparently isolated patients and were associated with a less severe phenotype. The present findings illustrate the importance of combining in silico and biochemical approaches to fully distinguish pathogenic SLC40A1 mutations from benign variants. This has profound implications for patient management. PMID:24714983

Callebaut, Isabelle; Joubrel, Rozenn; Pissard, Serge; Kannengiesser, Caroline; Gérolami, Victoria; Ged, Cécile; Cadet, Estelle; Cartault, François; Ka, Chandran; Gourlaouen, Isabelle; Gourhant, Lénaick; Oudin, Claire; Goossens, Michel; Grandchamp, Bernard; De Verneuil, Hubert; Rochette, Jacques; Férec, Claude; Le Gac, Gérald

2014-09-01

156

Improved systematic tRNA gene annotation allows new insights into the evolution of mitochondrial tRNA structures and into the mechanisms of mitochondrial genome rearrangements  

PubMed Central

Transfer RNAs (tRNAs) are present in all types of cells as well as in organelles. tRNAs of animal mitochondria show a low level of primary sequence conservation and exhibit ‘bizarre’ secondary structures, lacking complete domains of the common cloverleaf. Such sequences are hard to detect and hence frequently missed in computational analyses and mitochondrial genome annotation. Here, we introduce an automatic annotation procedure for mitochondrial tRNA genes in Metazoa based on sequence and structural information in manually curated covariance models. The method, applied to re-annotate 1876 available metazoan mitochondrial RefSeq genomes, makes it possible to distinguish between remaining functional genes and degrading ‘pseudogenes’, even at early stages of divergence. The subsequent analysis of a comprehensive set of mitochondrial tRNA genes gives new insights into the evolution of structures of mitochondrial tRNA sequences as well as into the mechanisms of genome rearrangements. We find frequent losses of tRNA genes concentrated in basal Metazoa, frequent independent losses of individual parts of tRNA genes, particularly in Arthropoda, and widespread conserved overlaps of tRNAs in opposite reading direction. Direct evidence for several recent Tandem Duplication-Random Loss events is gained, demonstrating that this mechanism has an impact on the appearance of new mitochondrial gene orders. PMID:22139921

Juhling, Frank; Putz, Joern; Bernt, Matthias; Donath, Alexander; Middendorf, Martin; Florentz, Catherine; Stadler, Peter F.

2012-01-01

157

Proteome Analyst Transparent High-throughput Protein Annotation: Function, Localization and Custom Predictors  

E-print Network

Proteome Analyst - Transparent High-throughput Protein Annotation: Function, Localization be easily examinable by anyone that wishes to use the prediction. Proteome Analyst (PA) is a web-based system for predicting the properties of each protein in a proteome. PA has three interesting features

Lu, Paul

158

De Novo Assembly, Gene Annotation, and Marker Discovery in Stored-Product Pest Liposcelis entomophila (Enderlein) Using Transcriptome Sequences  

PubMed Central

Background As a major stored-product pest insect, Liposcelis entomophila has developed high levels of resistance to various insecticides in grain storage systems. However, the molecular mechanisms underlying resistance and environmental stress have not been characterized. To date, there is a lack of genomic information for this species. Therefore, studies aimed at profiling the L. entomophila transcriptome would provide a better understanding of the biological functions at the molecular level. Methodology/Principal Findings We applied Illumina sequencing technology to sequence the transcriptome of L. entomophila. A total of 54,406,328 clean reads were obtained and de novo assembled into 54,220 unigenes, with an average length of 571 bp. Through a similarity search, 33,404 (61.61%) unigenes were matched to known proteins in the NCBI non-redundant (Nr) protein database. These unigenes were further functionally annotated with gene ontology (GO), cluster of orthologous groups of proteins (COG), and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. A large number of genes potentially involved in insecticide resistance were manually curated, including 68 putative cytochrome P450 genes, 37 putative glutathione S-transferase (GST) genes, 19 putative carboxyl/cholinesterase (CCE) genes, and 126 other transcripts containing target site sequences or encoding detoxification genes representing eight types of resistance enzymes. Furthermore, to gain insight into the molecular basis of the L. entomophila response to thermal stresses, 25 heat shock protein (Hsp) genes were identified. In addition, 1,100 SSRs and 57,757 SNPs were detected and 231 pairs of SSR primers were designed for investigating genetic diversity in the future. Conclusions/Significance We developed a comprehensive transcriptomic database for L. entomophila. 
These sequences and putative molecular markers would further promote our understanding of the molecular mechanisms underlying insecticide resistance or environmental stress, and will facilitate studies on population genetics for psocids, as well as providing useful information for functional genomic research in the future. PMID:24244605

Wei, Dan-Dan; Chen, Er-Hu; Ding, Tian-Bo; Chen, Shi-Chun; Dou, Wei; Wang, Jin-Jun

2013-01-01

159

Gene predictions and annotations Roderic Guigó (Institut Municipal d'Investigació Mèdica,  

E-print Network

the nucleus into the cytoplasm for translation into the gene product - a functional protein. An example, and coding sequences (CDSs) that code for the gene product (protein). There are also RNA genes such as mi offer even better prediction than ab initio methods for homologous genes. Although gene prediction tools

160

Developmental Gene Discovery in a Hemimetabolous Insect: De Novo Assembly and Annotation of a Transcriptome for the Cricket Gryllus bimaculatus  

PubMed Central

Most genomic resources available for insects represent the Holometabola, which are insects that undergo complete metamorphosis like beetles and flies. In contrast, the Hemimetabola (direct developing insects), representing the basal branches of the insect tree, have very few genomic resources. We have therefore created a large and publicly available transcriptome for the hemimetabolous insect Gryllus bimaculatus (cricket), a well-developed laboratory model organism whose potential for functional genetic experiments is currently limited by the absence of genomic resources. cDNA was prepared using mRNA obtained from adult ovaries containing all stages of oogenesis, and from embryo samples on each day of embryogenesis. Using 454 Titanium pyrosequencing, we sequenced over four million raw reads, and assembled them into 21,512 isotigs (predicted transcripts) and 120,805 singletons with an average coverage per base pair of 51.3. We annotated the transcriptome manually for over 400 conserved genes involved in embryonic patterning, gametogenesis, and signaling pathways. BLAST comparison of the transcriptome against the NCBI non-redundant protein database (nr) identified significant similarity to nr sequences for 55.5% of transcriptome sequences, and suggested that the transcriptome may contain 19,874 unique transcripts. For predicted transcripts without significant similarity to known sequences, we assessed their similarity to other orthopteran sequences, and determined that these transcripts contain recognizable protein domains, largely of unknown function. We created a searchable, web-based database to allow public access to all raw, assembled and annotated data. This database is to our knowledge the largest de novo assembled and annotated transcriptome resource available for any hemimetabolous insect. We therefore anticipate that these data will contribute significantly to more effective and higher-throughput deployment of molecular analysis tools in Gryllus. 
PMID:23671567

Zeng, Victor; Ewen-Campen, Ben; Horch, Hadley W.; Roth, Siegfried; Mito, Taro; Extavour, Cassandra G.

2013-01-01

161

PIPA: A High-Throughput Pipeline for Protein Function Annotation.  

National Technical Information Service (NTIS)

Traditional experimental methods to determine the functions of proteins encoded in genomic sequences cannot keep pace with the avalanche of sequence data produced by new high-throughput sequencing technologies. This prompted the development of numerous bi...

C. Yu, J. Reifman, N. Zavaljevski, V. Desai

2008-01-01

162

Comparison of structure- and threading-based approaches to protein functional annotation  

PubMed Central

To exploit the vast amount of sequence information provided by the Genomic revolution, the biological function of these sequences must be identified. As a practical matter, this is often accomplished by functional inference. Purely sequence-based approaches, particularly in the “twilight zone” of low sequence similarity levels, are complicated by many factors. For proteins, structure-based techniques aim to overcome these problems; however, most require high-quality crystal structures and suffer from complex and equivocal relations between protein fold and function. In this study, in extensive benchmarking, we consider a number of aspects of structure-based functional annotation: binding pocket detection, molecular function assignment and ligand-based virtual screening. We demonstrate that protein threading driven by a strong sequence profile component greatly improves the quality of purely structure-based functional annotation in the “twilight zone”. By detecting evolutionarily related proteins, it considerably reduces the high false positive rate of function inference derived on the basis of global structure similarity alone. Combined evolution/structure-based function assignment emerges as a powerful technique that can make a significant contribution to comprehensive proteome annotation. PMID:19731377

Brylinski, Michal; Skolnick, Jeffrey

2009-01-01

163

Annotation Transfer for Genomics: Measuring Functional Divergence in Multi-Domain Proteins  

PubMed Central

Annotation transfer is a principal process in genome annotation. It involves “transferring” structural and functional annotation to uncharacterized open reading frames (ORFs) in a newly completed genome from experimentally characterized proteins similar in sequence. To prevent errors in genome annotation, it is important that this process be robust and statistically well-characterized, especially with regard to how it depends on the degree of sequence similarity. Previously, we and others have analyzed annotation transfer in single-domain proteins. Multi-domain proteins, which make up the bulk of the ORFs in eukaryotic genomes, present more complex issues in functional conservation. Here we present a large-scale survey of annotation transfer in these proteins, using scop superfamilies to define domain folds and a thesaurus based on SWISS-PROT keywords to define functional categories. Our survey reveals that multi-domain proteins have significantly less functional conservation than single-domain ones, except when they share the exact same combination of domain folds. In particular, we find that for multi-domain proteins, approximate function can be accurately transferred with only 35% certainty for pairs of proteins sharing one structural superfamily. In contrast, this value is 67% for pairs of single-domain proteins sharing the same structural superfamily. On the other hand, if two multi-domain proteins contain the same combination of two structural superfamilies, the probability of their sharing the same function increases to 80%; in the case of complete coverage along the full length of both proteins, this value increases further to >90%. Moreover, we found that only 70 of the current total of 455 structural superfamilies are found in both single and multi-domain proteins and only 14 of these were associated with the same function in both categories of proteins. 
We also investigated the degree to which function could be transferred between pairs of multi-domain proteins with respect to the degree of sequence similarity between them, finding that functional divergence at a given amount of sequence similarity is always about two-fold greater for pairs of multi-domain proteins (sharing similarity over a single domain) in comparison to pairs of single-domain ones, though the overall shape of the relationship is quite similar. Further information is available at http://partslist.org/func or http://bioinfo.mbb.yale.edu/partslist/func. PMID:11591640

Hegyi, Hedi; Gerstein, Mark

2001-01-01

164

Genetic Analysis Workshop 18 single-nucleotide variant prioritization based on protein impact, sequence conservation, and gene annotation  

PubMed Central

Grouping variants based on gene mapping can augment the power of rare variant association tests. Weighting or sorting variants based on their expected functional impact can provide additional benefit. We defined groups of prioritized variants based on systematic annotation of Genetic Analysis Workshop 18 (GAW18) single-nucleotide variants; we focused on variants detected by whole genome sequencing, specifically on the high-quality subset presented in the genotype files. First, we divided variants between coding and noncoding. Coding variants are fewer than 1% of the total and are more likely to have a biological effect than noncoding variants. Coding variants were further stratified into protein changing and protein damaging groups based on the effect on protein amino acid sequence. In particular, missense variants predicted to be damaging, splice-site alterations, and stop gains were assigned to the protein damaging category. Impact of noncoding variants is more difficult to predict. We decided to rely solely on conservation: we combined (a) the mammalian phastCons Conserved Element and (b) the PhyloP score, which identify conserved intervals and the single-nucleotide position, respectively. This reduced the noncoding variants to a number comparable to coding variants. Finally, using gene structure definition from the widely used RefSeq database, we mapped variants to genes to support association tests that require collapsing rare variants to genes. Companion GAW18 papers used these variant priority groups and gene mapping; one of these papers specifically found evidence of a stronger association signal for protein damaging variants.

2014-01-01
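
The stratification described in the record above can be illustrated with a small sketch. The consequence labels, the PhyloP cutoff, and the rule that a noncoding variant must satisfy both conservation criteria are assumptions made for illustration only; the abstract names the categories but gives no thresholds or vocabulary.

```python
from dataclasses import dataclass

# Hypothetical consequence labels (not taken from the GAW18 paper).
DAMAGING = {"missense_damaging", "splice_site", "stop_gain"}
PROTEIN_CHANGING = DAMAGING | {"missense_tolerated", "stop_loss"}


@dataclass
class Variant:
    is_coding: bool
    consequence: str = ""
    in_conserved_element: bool = False  # overlaps a phastCons conserved element
    phylop: float = 0.0                 # per-position PhyloP score


def priority_group(v: Variant, phylop_cutoff: float = 2.0) -> str:
    """Assign a variant to a priority group (cutoff value is an assumption)."""
    if v.is_coding:
        if v.consequence in DAMAGING:
            return "protein_damaging"
        if v.consequence in PROTEIN_CHANGING:
            return "protein_changing"
        return "coding_other"
    # Noncoding variants: rely on conservation only (assumed AND rule).
    if v.in_conserved_element and v.phylop >= phylop_cutoff:
        return "noncoding_conserved"
    return "noncoding_other"
```

Groups assigned this way could then be used to weight or filter variants before collapsing them to genes for an association test.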

165

A novel genetic island of meningitic Escherichia coli K1 containing the ibeA invasion gene (GimA): functional annotation and carbon-source-regulated invasion of human brain microvascular endothelial cells  

Microsoft Academic Search

The IbeA (ibe10) gene is an invasion determinant contributing to E. coli K1 invasion of the blood-brain barrier. This gene has been cloned and characterized from the chromosome of an invasive cerebrospinal fluid isolate of E. coli K1, strain RS218 (O18:K1:H7). In the present study, a genetic island of meningitic E. coli containing ibeA (GimA) has been identified.

Sheng-He Huang; Yu-Hua Chen; Guoying Kong; Steven H. M. Chen; John Besemer; Mark Borodovsky; Ambrose Jong

2001-01-01

166

Evaluating the accuracy of a functional SNP annotation system.  

PubMed

Many common and chronic diseases are influenced at some level by genetic variation. Research done in population genetics, specifically in the area of single nucleotide polymorphisms (SNPs), is critical to understanding human genetic variation. A key element in assessing the role of a given SNP is determining whether the variation is likely to result in a change in function. The SNP Integration Tool (SNPit) is a comprehensive tool that integrates diverse, existing predictors of SNP functionality, providing the user with information for improved association study analysis. To evaluate the SNPit system, we developed an alternative gold standard to measure accuracy using sensitivity and specificity. The results of our evaluation demonstrated that our alternative gold standard produced encouraging results. PMID:19761565

Shen, Terry H; Carlson, Christopher S; Tarczy-Hornoch, Peter

2009-01-01
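
The record above evaluates accuracy through sensitivity and specificity against a gold standard. As a reminder of how those two quantities are computed, here is a minimal sketch; the function name and the boolean encoding (True means "predicted or known to be functional") are illustrative assumptions and are not part of SNPit.

```python
def sensitivity_specificity(predicted, truth):
    """Return (sensitivity, specificity) for paired boolean calls.

    predicted[i] is the tool's call for SNP i; truth[i] is the gold-standard label.
    """
    pairs = list(zip(predicted, truth))
    tp = sum(1 for p, t in pairs if p and t)
    fn = sum(1 for p, t in pairs if not p and t)
    tn = sum(1 for p, t in pairs if not p and not t)
    fp = sum(1 for p, t in pairs if p and not t)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("gold standard needs both positive and negative cases")
    return tp / (tp + fn), tn / (tn + fp)
```

Sensitivity is the fraction of truly functional SNPs that are flagged; specificity is the fraction of non-functional SNPs that are correctly left unflagged.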

167

Annotated genetic linkage maps of Pinus pinaster Ait. from a Central Spain population using microsatellite and gene based markers  

PubMed Central

Background Pinus pinaster Ait. is a major resin producing species in Spain. Genetic linkage mapping can facilitate marker-assisted selection (MAS) through the identification of Quantitative Trait Loci and selection of allelic variants of interest in breeding populations. In this study, we report annotated genetic linkage maps for two individuals (C14 and C15) belonging to a breeding program aiming to increase resin production. We use different types of DNA markers, including last-generation molecular markers. Results We obtained 13 and 14 linkage groups for C14 and C15 maps, respectively. A total of 211 and 215 markers were positioned on each map and estimated genome length was between 1,870 and 2,166 cM respectively, which represents nearly 65% genome coverage. Comparative mapping with previously developed genetic linkage maps for P. pinaster based on about 60 common markers enabled aligning linkage groups to this reference map. The comparison of our annotated linkage maps and linkage maps reporting QTL information revealed 11 annotated SNPs in candidate genes that co-localized with previously reported QTLs for wood properties and water use efficiency. Conclusions This study provides genetic linkage maps from a Spanish population that shows high levels of genetic divergence with French populations from which segregating progenies have been previously mapped. These genetic maps will be of interest to construct a reliable consensus linkage map for the species. The importance of developing functional genetic linkage maps is highlighted, especially when working with breeding populations for its future application in MAS for traits of interest. PMID:23036012

2012-01-01

168

SNPit: a federated data integration system for the purpose of functional SNP annotation  

PubMed Central

Genome wide association studies can potentially identify the genetic causes behind the majority of human diseases. With the advent of more advanced genotyping techniques, there is now an explosion of data gathered on single nucleotide polymorphisms (SNPs). The need exists for an integrated system that can provide up-to-date functional annotation information on SNPs. We have developed the SNP Integration Tool (SNPit) system to address this need. Built upon a federated data integration system, SNPit provides current information on a comprehensive list of SNP data sources. Additional logical inference analysis was included through an inference engine plug in. The SNPit web servlet is available online for use. SNPit allows users to go to one source for up-to-date information on the functional annotation of SNPs. A tool that can help to integrate and analyze the potential functional significance of SNPs is important for understanding the results from genome wide association studies. PMID:19327864

Shen, Terry H; Carlson, Christopher S; Tarczy-Hornoch, Peter

2009-01-01

169

Generation, analysis and functional annotation of expressed sequence tags from the ectoparasitic mite Psoroptes ovis  

PubMed Central

Background Sheep scab is caused by Psoroptes ovis and is arguably the most important ectoparasitic disease affecting sheep in the UK. The disease is highly contagious and causes considerable pruritus and irritation and is therefore a major welfare concern. Current methods of treatment are unsustainable and in order to elucidate novel methods of disease control a more comprehensive understanding of the parasite is required. To date, no full genomic DNA sequence or large scale transcript datasets are available and prior to this study only 484 P. ovis expressed sequence tags (ESTs) were accessible in public databases. Results In order to further expand upon the transcriptomic coverage of P. ovis thus facilitating novel insights into the mite biology we undertook a larger scale EST approach, incorporating newly generated and previously described P. ovis transcript data and representing the largest collection of P. ovis ESTs to date. We sequenced 1,574 ESTs and assembled these along with 484 previously generated P. ovis ESTs, which resulted in the identification of 1,545 unique P. ovis sequences. BLASTX searches identified 961 ESTs with significant hits (E-value < 1E-04) and 584 novel P. ovis ESTs. Gene Ontology (GO) analysis allowed the functional annotation of 880 ESTs and included predictions of signal peptide and transmembrane domains; allowing the identification of potential P. ovis excreted/secreted factors, and mapping of metabolic pathways. Conclusions This dataset currently represents the largest collection of P. ovis ESTs, all of which are publicly available in the GenBank EST database (dbEST) (accession numbers FR748230 - FR749648). Functional analysis of this dataset identified important homologues, including house dust mite allergens and tick salivary factors. These findings offer new insights into the underlying biology of P. ovis, facilitating further investigations into mite biology and the identification of novel methods of intervention. PMID:21781297

2011-01-01
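
The record above splits ESTs into those with a significant BLASTX hit (E-value < 1E-04) and those without. A minimal sketch of that split follows, assuming the hits are available as BLAST tabular output (the standard 12-column layout, with the E-value in the eleventh column); the function name and the choice to keep the best E-value per query are illustrative assumptions, not the authors' pipeline.

```python
def split_by_evalue(all_queries, blast_tab_lines, cutoff=1e-4):
    """Return (annotated, novel) sets of query IDs.

    A query counts as annotated when its best (lowest) E-value is below cutoff.
    """
    best = {}
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        query, evalue = fields[0], float(fields[10])
        if query not in best or evalue < best[query]:
            best[query] = evalue
    annotated = {q for q, e in best.items() if e < cutoff}
    novel = set(all_queries) - annotated
    return annotated, novel
```

Queries with no hit at all, or only hits above the cutoff, land in the "novel" set, which mirrors how the abstract counts sequences without significant similarity.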

170

tagtog: interactive and text-mining-assisted annotation of gene mentions in PLOS full-text articles  

PubMed Central

The breadth and depth of biomedical literature are increasing year upon year. To keep abreast of these increases, FlyBase, a database for Drosophila genomic and genetic information, is constantly exploring new ways to mine the published literature to increase the efficiency and accuracy of manual curation and to automate some aspects, such as triaging and entity extraction. Toward this end, we present the ‘tagtog’ system, a web-based annotation framework that can be used to mark up biological entities (such as genes) and concepts (such as Gene Ontology terms) in full-text articles. tagtog leverages manual user annotation in combination with automatic machine-learned annotation to provide accurate identification of gene symbols and gene names. As part of the BioCreative IV Interactive Annotation Task, FlyBase has used tagtog to identify and extract mentions of Drosophila melanogaster gene symbols and names in full-text biomedical articles from the PLOS stable of journals. We show here the results of three experiments with different sized corpora and assess gene recognition performance and curation speed. We conclude that tagtog-named entity recognition improves with a larger corpus and that tagtog-assisted curation is quicker than manual curation. Database URL: www.tagtog.net, www.flybase.org PMID:24715220

Cejuela, Juan Miguel; McQuilton, Peter; Ponting, Laura; Marygold, Steven J.; Stefancsik, Raymund; Millburn, Gillian H.; Rost, Burkhard

2014-01-01

171

Next-Generation Annotation of Prokaryotic Genomes with EuGene-P: Application to Sinorhizobium meliloti 2011  

PubMed Central

The availability of next-generation sequences of transcripts from prokaryotic organisms offers the opportunity to design a new generation of automated genome annotation tools not yet available for prokaryotes. In this work, we designed EuGene-P, the first integrative prokaryotic gene finder tool which combines a variety of high-throughput data, including oriented RNA-Seq data, directly into the prediction process. This enables the automated prediction of coding sequences (CDSs), untranslated regions, transcription start sites (TSSs) and non-coding RNA (ncRNA, sense and antisense) genes. EuGene-P was used to comprehensively and accurately annotate the genome of the nitrogen-fixing bacterium Sinorhizobium meliloti strain 2011, leading to the prediction of 6308 CDSs as well as 1876 ncRNAs. Among them, 1280 appeared as antisense to a CDS, which supports recent findings that antisense transcription activity is widespread in bacteria. Moreover, 4077 TSSs upstream of protein-coding or non-coding genes were precisely mapped providing valuable data for the study of promoter regions. By looking for RpoE2-binding sites upstream of annotated TSSs, we were able to extend the S. meliloti RpoE2 regulon by ~3-fold. Altogether, these observations demonstrate the power of EuGene-P to produce a reliable and high-resolution automatic annotation of prokaryotic genomes. PMID:23599422

Sallet, Erika; Roux, Brice; Sauviac, Laurent; Jardinaud, Marie-Françoise; Carrere, Sebastien; Faraut, Thomas; de Carvalho-Niebel, Fernanda; Gouzy, Jerome; Gamas, Pascal; Capela, Delphine; Bruand, Claude; Schiex, Thomas

2013-01-01

172

tagtog: interactive and text-mining-assisted annotation of gene mentions in PLOS full-text articles.  

PubMed

The breadth and depth of biomedical literature are increasing year upon year. To keep abreast of these increases, FlyBase, a database for Drosophila genomic and genetic information, is constantly exploring new ways to mine the published literature to increase the efficiency and accuracy of manual curation and to automate some aspects, such as triaging and entity extraction. Toward this end, we present the 'tagtog' system, a web-based annotation framework that can be used to mark up biological entities (such as genes) and concepts (such as Gene Ontology terms) in full-text articles. tagtog leverages manual user annotation in combination with automatic machine-learned annotation to provide accurate identification of gene symbols and gene names. As part of the BioCreative IV Interactive Annotation Task, FlyBase has used tagtog to identify and extract mentions of Drosophila melanogaster gene symbols and names in full-text biomedical articles from the PLOS stable of journals. We show here the results of three experiments with different sized corpora and assess gene recognition performance and curation speed. We conclude that tagtog-named entity recognition improves with a larger corpus and that tagtog-assisted curation is quicker than manual curation. DATABASE URL: www.tagtog.net, www.flybase.org. PMID:24715220

Cejuela, Juan Miguel; McQuilton, Peter; Ponting, Laura; Marygold, Steven J; Stefancsik, Raymund; Millburn, Gillian H; Rost, Burkhard

2014-01-01

173

The IGS Standard Operating Procedure for Automated Prokaryotic Annotation.  

PubMed

The Institute for Genome Sciences (IGS) has developed a prokaryotic annotation pipeline that is used for coding gene/RNA prediction and functional annotation of Bacteria and Archaea. The fully automated pipeline accepts one or many genomic sequences as input and produces output in a variety of standard formats. Functional annotation is primarily based on similarity searches and motif finding combined with a hierarchical rule based annotation system. The output annotations can also be loaded into a relational database and accessed through visualization tools. PMID:21677861

Galens, Kevin; Orvis, Joshua; Daugherty, Sean; Creasy, Heather H; Angiuoli, Sam; White, Owen; Wortman, Jennifer; Mahurkar, Anup; Giglio, Michelle Gwinn

2011-04-29

174

Predicting gene function using similarity learning  

PubMed Central

Background Computational methods that make use of heterogeneous biological datasets to predict gene function provide a cost-effective and rapid way for annotating genomes. A common framework shared by many such methods is to construct a combined functional association network from multiple networks representing different sources of data, and use this combined network as input to network-based or kernel-based learning algorithms. In these methods, a key factor contributing to the prediction accuracy is the network quality, which is the ability of the network to reflect the functional relatedness of gene pairs. To improve the network quality, a large effort has been spent on developing methods for network integration. These methods, however, produce networks, which then remain unchanged, and nearly no effort has been made to optimize the networks after their construction. Results Here, we propose an alternative method to improve the network quality. The proposed method takes as input a combined network produced by an existing network integration algorithm, and reconstructs this network to better represent the co-functionality relationships between gene pairs. At the core of the method is a learning algorithm that can learn a measure of functional similarity between genes, which we then use to reconstruct the input network. In experiments with yeast and human, the proposed method produced improved networks and achieved more accurate results than two other leading gene function prediction approaches. Conclusions The results show that it is possible to improve the accuracy of network-based gene function prediction methods by optimizing combined networks with appropriate similarity measures learned from data. The proposed learning procedure can handle noisy training data and scales well to large genomes. PMID:24266903

2013-01-01

175

Automated pipeline for atlas-based annotation of gene expression patterns: application to postnatal day 7 mouse brain.  

PubMed

Massive amounts of image data have been collected and continue to be generated for representing cellular gene expression throughout the mouse brain. Critical to exploiting this key effort of the post-genomic era is the ability to place these data into a common spatial reference that enables rapid interactive queries, analysis, data sharing, and visualization. In this paper, we present a set of automated protocols for generating and annotating gene expression patterns suitable for the establishment of a database. The steps include imaging tissue slices, detecting cellular gene expression levels, spatial registration with an atlas, and textual annotation. Using high-throughput in situ hybridization to generate serial sets of tissues displaying gene expression, this process was applied toward the establishment of a database representing over 200 genes in the postnatal day 7 mouse brain. These data using this protocol are now well-suited for interactive comparisons, analysis, queries, and visualization. PMID:19698790

Carson, James; Ju, Tao; Bello, Musodiq; Thaller, Christina; Warren, Joe; Kakadiaris, Ioannis A; Chiu, Wah; Eichele, Gregor

2010-02-01

176

Genome-wide annotation, expression profiling, and protein interaction studies of the core cell-cycle genes in Phalaenopsis aphrodite.  

PubMed

Orchidaceae is one of the most abundant and diverse families in the plant kingdom and its unique developmental patterns have drawn the attention of many evolutionary biologists. Particular areas of interest have included the co-evolution of pollinators and distinct floral structures, and symbiotic relationships with mycorrhizal flora. However, comprehensive studies to decipher the molecular basis of growth and development in orchids remain scarce. Cell proliferation governed by cell-cycle regulation is fundamental to growth and development of the plant body. We took advantage of recently released transcriptome information to systematically isolate and annotate the core cell-cycle regulators in the moth orchid Phalaenopsis aphrodite. Our data verified that Phalaenopsis cyclin-dependent kinase A (CDKA) is an evolutionarily conserved CDK. Expression profiling studies suggested that core cell-cycle genes functioning during the G1/S, S, and G2/M stages were preferentially enriched in the meristematic tissues that have high proliferation activity. In addition, subcellular localization and pairwise interaction analyses of various combinations of CDKs and cyclins, and of E2 promoter-binding factors and dimerization partners confirmed interactions of the functional units. Furthermore, our data showed that expression of the core cell-cycle genes was coordinately regulated during pollination-induced reproductive development. The data obtained establish a fundamental framework for study of the cell-cycle machinery in Phalaenopsis orchids. PMID:24222213

Lin, Hsiang-Yin; Chen, Jhun-Chen; Wei, Miao-Ju; Lien, Yi-Chen; Li, Huang-Hsien; Ko, Swee-Suak; Liu, Zin-Huang; Fang, Su-Chiung

2014-01-01

177

Functional Characterization of Two M42 Aminopeptidases Erroneously Annotated as Cellulases  

PubMed Central

Several aminopeptidases of the M42 family have been described as tetrahedral-shaped dodecameric (TET) aminopeptidases. A current hypothesis suggests that these enzymes are involved, along with the tricorn peptidase, in degrading peptides produced by the proteasome. Yet the M42 family remains ill defined, as some members have been annotated as cellulases because of their homology with CelM, formerly described as an endoglucanase of Clostridium thermocellum. Here we describe the catalytic functions and substrate profiles of CelM and of TmPep1050, the latter having been annotated as an endoglucanase of Thermotoga maritima. Both enzymes were shown to catalyze hydrolysis of nonpolar aliphatic L-amino acid-pNA substrates, the L-leucine derivative appearing as the best substrate. No significant endoglucanase activity was measured, either for TmPep1050 or CelM. Addition of cobalt ions enhanced the activity of both enzymes significantly, while both the chelating agent EDTA and bestatin, a specific inhibitor of metalloaminopeptidases, proved inhibitory. Our results strongly suggest that one should avoid annotating members of the M42 aminopeptidase family as cellulases. In an updated assessment of the distribution of M42 aminopeptidases, we found TET aminopeptidases to be distributed widely amongst archaea and bacteria. We additionally observed that several phyla lack both TET and tricorn. This suggests that other complexes may act downstream from the proteasome. PMID:23226342

Dutoit, Raphael; Brandt, Nathalie; Legrain, Christianne; Bauvois, Cedric

2012-01-01

178

The Gene Wiki in 2011: community intelligence applied to human gene annotation  

PubMed Central

The Gene Wiki is an open-access and openly editable collection of Wikipedia articles about human genes. Initiated in 2008, it has grown to include articles about more than 10,000 genes that, collectively, contain more than 1.4 million words of gene-centric text with extensive citations back to the primary scientific literature. This growing body of useful, gene-centric content is the result of the work of thousands of individuals throughout the scientific community. Here, we describe recent improvements to the automated system that keeps the structured data presented on Gene Wiki articles in sync with the data from trusted primary databases. We also describe the expanding contents, editors and users of the Gene Wiki. Finally, we introduce a new automated system, called WikiTrust, which can effectively compute the quality of Wikipedia articles, including Gene Wiki articles, at the word level. All articles in the Gene Wiki can be freely accessed and edited at Wikipedia, and additional links and information can be found at the project's Wikipedia portal page: http://en.wikipedia.org/wiki/Portal:Gene_Wiki. PMID:22075991

Good, Benjamin M.; Clarke, Erik L.; de Alfaro, Luca; Su, Andrew I.

2012-01-01

179

The Gene Wiki in 2011: community intelligence applied to human gene annotation.  

PubMed

The Gene Wiki is an open-access and openly editable collection of Wikipedia articles about human genes. Initiated in 2008, it has grown to include articles about more than 10,000 genes that, collectively, contain more than 1.4 million words of gene-centric text with extensive citations back to the primary scientific literature. This growing body of useful, gene-centric content is the result of the work of thousands of individuals throughout the scientific community. Here, we describe recent improvements to the automated system that keeps the structured data presented on Gene Wiki articles in sync with the data from trusted primary databases. We also describe the expanding contents, editors and users of the Gene Wiki. Finally, we introduce a new automated system, called WikiTrust, which can effectively compute the quality of Wikipedia articles, including Gene Wiki articles, at the word level. All articles in the Gene Wiki can be freely accessed and edited at Wikipedia, and additional links and information can be found at the project's Wikipedia portal page: http://en.wikipedia.org/wiki/Portal:Gene_Wiki. PMID:22075991

Good, Benjamin M; Clarke, Erik L; de Alfaro, Luca; Su, Andrew I

2012-01-01

180

Development and evaluation of an automated annotation pipeline and cDNA annotation system.  

PubMed

Manual curation has long been held to be the "gold standard" for functional annotation of DNA sequence. Our experience with the annotation of more than 20,000 full-length cDNA sequences revealed problems with this approach, including inaccurate and inconsistent assignment of gene names, as well as many good assignments that were difficult to reproduce using only computational methods. For the FANTOM2 annotation of more than 60,000 cDNA clones, we developed a number of methods and tools to circumvent some of these problems, including an automated annotation pipeline that provides high-quality preliminary annotation for each sequence by introducing an "uninformative filter" that eliminates uninformative annotations, controlled vocabularies to accurately reflect both the functional assignments and the evidence supporting them, and a highly refined, Web-based manual annotation tool that allows users to view a wide array of sequence analyses and to assign gene names and putative functions using a consistent nomenclature. The ultimate utility of our approach is reflected in the low rate of reassignment of automated assignments by manual curation. Based on these results, we propose a new standard for large-scale annotation, in which the initial automated annotations are manually investigated and then computational methods are iteratively modified and improved based on the results of manual curation. PMID:12819153

Kasukawa, Takeya; Furuno, Masaaki; Nikaido, Itoshi; Bono, Hidemasa; Hume, David A; Bult, Carol; Hill, David P; Baldarelli, Richard; Gough, Julian; Kanapin, Alexander; Matsuda, Hideo; Schriml, Lynn M; Hayashizaki, Yoshihide; Okazaki, Yasushi; Quackenbush, John

2003-06-01
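
The "uninformative filter" mentioned in the record above can be pictured as skipping hit descriptions that carry no functional information and keeping the first one that does. The sketch below is only an illustration of that idea: the pattern list and the function name are assumptions and do not reproduce the FANTOM2 rules.

```python
import re

# Hypothetical list of phrases treated as uninformative (an assumption).
UNINFORMATIVE = re.compile(
    r"\b(hypothetical|unknown|unnamed|uncharacterized|cDNA clone)\b",
    re.IGNORECASE,
)


def first_informative(descriptions):
    """Return the first description without an uninformative phrase, or None."""
    for description in descriptions:
        if not UNINFORMATIVE.search(description):
            return description
    return None
```

With hits ordered by decreasing similarity, this would return the best-scoring description that still says something about function.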

181

Mouse Genetics: Determining gene function  

E-print Network

Mouse Genetics: Determining gene function An International Centre for Mouse Genetics Mammalian Genetics Unit Determining gene function · Mutagenesis approaches · Gene-driven, phenotype for Mouse Genetics Mammalian Genetics Unit An International Centre for Mouse Genetics Mammalian Genetics

Goldschmidt, Christina

182

From Gene Networks to Gene Function  

PubMed Central

We propose a novel method to identify functionally related genes based on comparisons of neighborhoods in gene networks. This method does not rely on gene sequence or protein structure homologies, and it can be applied to any organism and a wide variety of experimental data sets. The character of the predicted gene relationships depends on the underlying networks; they concern biological processes rather than the molecular function. We used the method to analyze gene networks derived from genome-wide chromatin immunoprecipitation experiments, a large-scale gene deletion study, and from the genomic positions of consensus binding sites for transcription factors of the yeast Saccharomyces cerevisiae. We identified 816 functional relationships between 159 genes and show that these relationships correspond to protein–protein interactions, co-occurrence in the same protein complexes, and/or co-occurrence in abstracts of scientific articles. Our results suggest functions for seven previously uncharacterized yeast genes: KIN3 and YMR269W may be involved in biological processes related to cell growth and/or maintenance, whereas IES6, YEL008W, YEL033W, YHL029C, YMR010W, and YMR031W-A are likely to have metabolic functions. PMID:14656964

Schlitt, Thomas; Palin, Kimmo; Rung, Johan; Dietmann, Sabine; Lappe, Michael; Ukkonen, Esko; Brazma, Alvis

2003-01-01
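
The record above relates genes by comparing their neighborhoods in a network. The abstract does not state which comparison statistic is used, so the sketch below substitutes a plain Jaccard overlap of neighbor sets as a simplified stand-in; the function names and the toy gene labels in the usage note are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def neighborhood_overlap(adj, a, b):
    """Jaccard overlap of the neighbor sets of a and b (a and b excluded)."""
    na = adj.get(a, set()) - {a, b}
    nb = adj.get(b, set()) - {a, b}
    union = na | nb
    return len(na & nb) / len(union) if union else 0.0
```

For example, two genes bound by largely the same transcription factors in a ChIP-derived network would score close to 1, while genes with disjoint regulators would score 0.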

183

Integrating genome annotation and QTL position to identify candidate genes for productivity, architecture and water-use efficiency in Populus spp  

PubMed Central

Background Hybrid poplar species are candidates for biomass production but breeding efforts are needed to combine productivity and water use efficiency in improved cultivars. The understanding of the genetic architecture of growth in poplar by a Quantitative Trait Loci (QTL) approach can help us to elucidate the molecular basis of such integrative traits but identifying candidate genes underlying these QTLs remains difficult. Nevertheless, the increase of genomic information, together with access to a reference genome sequence (Populus trichocarpa Nisqually-1), makes it possible to bridge QTL information on genetic maps and the physical location of candidate genes on the genome. The objective of the study is to identify QTLs controlling productivity, architecture and leaf traits in a P. deltoides x P. trichocarpa F1 progeny and to identify candidate genes underlying QTLs based on the anchoring of genetic maps on the genome and the gene ontology information linked to genome annotation. The strategy to explore genome annotation was to use Gene Ontology enrichment tools to test if some functional categories are statistically over-represented in QTL regions. Results Four leaf traits and 7 growth traits were measured on 330 F1 P. deltoides x P. trichocarpa progeny. A total of 77 QTLs controlling 11 traits were identified explaining from 1.8 to 17.2% of the variation of traits. For 58 QTLs, confidence intervals could be projected on the genome. An extended functional annotation was built based on data retrieved from the plant genome database Phytozome and from an inference of function using homology between Populus and the model plant Arabidopsis. Genes located within QTL confidence intervals were retrieved and enrichments in gene ontology (GO) terms were determined using different methods. Significant enrichments were found for all traits. Particularly relevant biological processes GO terms were identified for QTLs controlling number of sylleptic branches: intervals were enriched in GO terms of biological process like ‘ripening’ and ‘adventitious roots development’. Conclusion Beyond the simple identification of QTLs, this study is the first to use a global approach of GO terms enrichment analysis to fully explore gene function under QTLs confidence intervals in plants. This global approach may lead to identification of new candidate genes for traits of interest. PMID:23013168

2012-01-01
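
The record above tests whether GO categories are over-represented among genes in QTL confidence intervals. The abstract says only that "different methods" were used, so the sketch below shows one common choice, a one-sided hypergeometric test, as an assumed example rather than the authors' procedure; the function and parameter names are illustrative.

```python
from math import comb


def hypergeom_enrichment_p(genome_genes, genome_with_term, interval_genes, interval_with_term):
    """P(X >= interval_with_term) when interval_genes are drawn without
    replacement from genome_genes, of which genome_with_term carry the GO term."""
    upper = min(genome_with_term, interval_genes)
    total = comb(genome_genes, interval_genes)
    tail = sum(
        comb(genome_with_term, i)
        * comb(genome_genes - genome_with_term, interval_genes - i)
        for i in range(interval_with_term, upper + 1)
    )
    return tail / total
```

In practice one p-value is computed per GO term and per QTL interval, so a multiple-testing correction would be needed before calling any term enriched.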

184

Semantic Particularity Measure for Functional Characterization of Gene Sets Using Gene Ontology  

PubMed Central

Background Genetic and genomic data analyses are outputting large sets of genes. Functional comparison of these gene sets is a key part of the analysis, as it identifies their shared functions, and the functions that distinguish each set. The Gene Ontology (GO) initiative provides a unified reference for analyzing the genes' molecular functions, biological processes and cellular components. Numerous semantic similarity measures have been developed to systematically quantify the weight of the GO terms shared by two genes. We studied how gene set comparisons can be improved by considering gene set particularity in addition to gene set similarity. Results We propose a new approach to compute gene set particularities based on the information conveyed by GO terms. A GO term's informativeness can be computed using either its information content based on the term frequency in a corpus, or a function of the term's distance to the root. We defined the semantic particularity of a set of GO terms Sg1 compared to another set of GO terms Sg2. We combined our particularity measure with a similarity measure to compare gene sets. We demonstrated that the combination of semantic similarity and semantic particularity measures was able to identify genes with particular functions from among similar genes. This differentiation was not recognized using only a semantic similarity measure. Conclusion Semantic particularity should be used in conjunction with semantic similarity to perform functional analysis of GO-annotated gene sets. The principle is generalizable to other ontologies. PMID:24489737

Bettembourg, Charles; Diot, Christian; Dameron, Olivier

2014-01-01
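
The abstract of record 184 mentions that a GO term's informativeness can be derived from its information content, based on term frequency in a corpus. A minimal sketch of that idea follows, assuming the common definition IC(t) = -log p(t), where p(t) is the fraction of genes annotated with t. The function name, the toy corpus and the assumption that annotations are already propagated to ancestor terms are illustrative only; the particularity measure proposed in the paper is not reproduced here.

```python
import math
from collections import Counter


def information_content(annotations):
    """Return IC(t) = -log(p(t)) for every GO term in an annotation corpus.

    `annotations` maps a gene identifier to the set of GO terms annotating it
    (assumed here to be already propagated to ancestor terms).
    p(t) is the fraction of genes in the corpus annotated with term t.
    """
    n_genes = len(annotations)
    if n_genes == 0:
        raise ValueError("annotation corpus is empty")
    counts = Counter(term for terms in annotations.values() for term in set(terms))
    return {term: -math.log(count / n_genes) for term, count in counts.items()}


# Hypothetical toy corpus: gene names are made up for illustration.
corpus = {
    "geneA": {"GO:0008150", "GO:0009987"},
    "geneB": {"GO:0008150"},
    "geneC": {"GO:0008150", "GO:0009987", "GO:0006950"},
    "geneD": {"GO:0008150"},
}
ic = information_content(corpus)
```

A term carried by every gene (here the root term) gets an information content of zero, while rarer terms score higher, which is why frequency-based informativeness favours specific annotations.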

185

Assessing the impact of comparative genomic sequence data on the functional annotation of the Drosophila genome  

Microsoft Academic Search

Background It is widely accepted that comparative sequence data can aid the functional annotation of genome sequences; however, the most informative species and features of genome evolution for comparison remain to be determined. Results We analyzed conservation in eight genomic regions (apterous, even-skipped, fushi tarazu, twist, and Rhodopsins 1, 2, 3 and 4) from four Drosophila species (D. erecta, D. pseudoobscura, D.

Casey M Bergman; Barret D Pfeiffer; Diego E Rincón-Limas; Roger A Hoskins; Andreas Gnirke; Chris J Mungall; Adrienne M Wang; Brent Kronmiller; Joanne Pacleb; Soo Park; Mark Stapleton; Kenneth Wan; Reed A George; Pieter J de Jong; Juan Botas; Gerald M Rubin; Susan E Celniker

2002-01-01

186

Parallel-META 2.0: Enhanced Metagenomic Data Analysis with Functional Annotation, High Performance Computing and Advanced Visualization  

PubMed Central

The metagenomic method directly sequences and analyses genome information from microbial communities. The main computational tasks for metagenomic analyses include taxonomical and functional structure analysis for all genomes in a microbial community (also referred to as a metagenomic sample). With the advancement of Next Generation Sequencing (NGS) techniques, the number of metagenomic samples and the data size for each sample are increasing rapidly. Current metagenomic analysis is both data- and computation- intensive, especially when there are many species in a metagenomic sample, and each has a large number of sequences. As such, metagenomic analyses require extensive computational power. The increasing analytical requirements further augment the challenges for computation analysis. In this work, we have proposed Parallel-META 2.0, a metagenomic analysis software package, to cope with such needs for efficient and fast analyses of taxonomical and functional structures for microbial communities. Parallel-META 2.0 is an extended and improved version of Parallel-META 1.0, which enhances the taxonomical analysis using multiple databases, improves computation efficiency by optimized parallel computing, and supports interactive visualization of results in multiple views. Furthermore, it enables functional analysis for metagenomic samples including short-reads assembly, gene prediction and functional annotation. Therefore, it could provide accurate taxonomical and functional analyses of the metagenomic samples in high-throughput manner and on large scale. PMID:24595159

Song, Baoxing; Xu, Jian; Ning, Kang

2014-01-01

187

Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores

Microsoft Academic Search

Measuring in a quantitative, statistical sense the degree to which structural and functional information can be "transferred" between pairs of related protein sequences at various levels of similarity is an essential prerequisite for robust genome annotation. To this end, we performed pairwise sequence, structure and function comparisons on 30,000 pairs of protein domains with known structure and function. Our domain

Cyrus A. Wilson; Julia Kreychman; Mark Gerstein

2000-01-01

188

Annotation extension through protein family annotation coherence metrics  

PubMed Central

Protein functional annotation consists in associating proteins with textual descriptors elucidating their biological roles. The bulk of annotation is done via automated procedures that ultimately rely on annotation transfer. Despite a large number of existing protein annotation procedures the ever growing protein space is never completely annotated. One of the facets of annotation incompleteness derives from annotation uncertainty. Often when protein function cannot be predicted with enough specificity it is instead conservatively annotated with more generic terms. In a scenario of protein families or functionally related (or even dissimilar) sets this leads to a more difficult task of using annotations to compare the extent of functional relatedness among all family or set members. However, we postulate that identifying sub-sets of functionally coherent proteins annotated at a very specific level, can help the annotation extension of other incompletely annotated proteins within the same family or functionally related set. As an example we analyse the status of annotation of a set of CAZy families belonging to the Polysaccharide Lyase class. We show that through the use of visualization methods and semantic similarity based metrics it is possible to identify families and respective annotation terms within them that are suitable for possible annotation extension. Based on our analysis we then propose a semi-automatic methodology leading to the extension of single annotation terms within these partially annotated protein sets or families. PMID:24130572

Bastos, Hugo P.; Clarke, Luka A.; Couto, Francisco M.

2013-01-01

189

Escherichia coli K-12: a cooperatively developed annotation snapshot--2005  

Microsoft Academic Search

The goal of this group project has been to coordinate and bring up-to-date information on all genes of Escherichia coli K-12. Annotation of the genome of an organism entails identification of genes, the boundaries of genes in terms of precise start and end sites, and description of the gene products. Known and predicted functions were assigned to each gene product

Monica Riley; Takashi Abe; Martha B. Arnaud; Mary K. B. Berlyn; Frederick R. Blattner; Roy R. Chaudhuri; Jeremy D. Glasner; Takashi Horiuchi; Ingrid M. Keseler; Takehide Kosuge; Hirotada Mori; Nicole T. Perna; Guy Plunkett; Kenneth E. Rudd; Margrethe H. Serres; Gavin H. Thomas; Nicholas R. Thomson; David Wishart; Barry L. Wanner

2006-01-01

190

Investigating Semantic Similarity Measures Across the Gene Ontology: The Relationship Between Sequence and Annotation  

Microsoft Academic Search

Motivation: Many bioinformatics data resources not only hold data in the form of sequences, but also as annotation. In the majority of cases, annotation is written as scientific natural language: this is suitable for humans, but not particularly useful for machine processing. Ontologies offer a mechanism by which knowledge can be represented in a form capable of such processing.

Phillip W. Lord; Robert D. Stevens; Andy Brass; Carole A. Goble

2003-01-01

191

The Zebrafish GenomeWiki: a crowdsourcing approach to connect the long tail for zebrafish gene annotation  

PubMed Central

A large repertoire of gene-centric data has been generated in the field of zebrafish biology. Although the bulk of these data are available in the public domain, most of them are not readily accessible or available in nonstandard formats. One major challenge is to unify and integrate these widely scattered data sources. We tested the hypothesis that active community participation could be a viable option to address this challenge. We present here our approach to create standards for assimilation and sharing of information and a system of open standards for database intercommunication. We have attempted to address this challenge by creating a community-centric solution for zebrafish gene annotation. The Zebrafish GenomeWiki is a ‘wiki’-based resource, which aims to provide an altruistic shared environment for collective annotation of the zebrafish genes. The Zebrafish GenomeWiki has features that enable users to comment, annotate, edit and rate this gene-centric information. The credits for contributions can be tracked through a transparent microattribution system. In contrast to other wikis, the Zebrafish GenomeWiki is a ‘structured wiki’ or rather a ‘semantic wiki’. The Zebrafish GenomeWiki implements a semantically linked data structure, which in the future would be amenable to semantic search. Database URL: http://genome.igib.res.in/twiki PMID:24578356

Singh, Meghna; Bhartiya, Deeksha; Maini, Jayant; Sharma, Meenakshi; Singh, Angom Ramcharan; Kadarkaraisamy, Subburaj; Rana, Rajiv; Sabharwal, Ankit; Nanda, Srishti; Ramachandran, Aravindhakshan; Mittal, Ashish; Kapoor, Shruti; Sehgal, Paras; Asad, Zainab; Kaushik, Kriti; Vellarikkal, Shamsudheen Karuthedath; Jagga, Divya; Muthuswami, Muthulakshmi; Chauhan, Rajendra K.; Leonard, Elvin; Priyadarshini, Ruby; Halimani, Mahantappa; Malhotra, Sunny; Patowary, Ashok; Vishwakarma, Harinder; Joshi, Prateek; Bhardwaj, Vivek; Bhaumik, Arijit; Bhatt, Bharat; Jha, Aamod; Kumar, Aalok; Budakoti, Prerna; Lalwani, Mukesh Kumar; Meli, Rajeshwari; Jalali, Saakshi; Joshi, Kandarp; Pal, Koustav; Dhiman, Heena; Laddha, Saurabh V.; Jadhav, Vaibhav; Singh, Naresh; Pandey, Vikas; Sachidanandan, Chetana; Ekker, Stephen C.; Klee, Eric W.; Scaria, Vinod; Sivasubbu, Sridhar

2014-01-01

192

An Innovative Plant Genomics and Gene Annotation Program for High School, Community College, and University Faculty  

PubMed Central

Today's biology educators face the challenge of training their students in modern molecular biology techniques including genomics and bioinformatics. The Dolan DNA Learning Center (DNALC) of Cold Spring Harbor Laboratory has developed and disseminated a bench- and computer-based plant genomics curriculum for biology faculty. In 2007, a five-day “Plant Genomics and Gene Annotation” workshop was held at Florida A&M University in Tallahassee, FL, to enhance participants' knowledge and understanding of plant molecular genetics and assist them in developing and honing their laboratory and computer skills. Florida A&M University is a historically black university with over 95% African-American student enrollment. Sixteen participants, including high school (56%) and community college faculty (25%), attended the workshop. Participants carried out in vitro and in silico experiments with maize, Arabidopsis, soybean, and food products to determine the genotype of the samples. Benefits of the workshop included increased awareness of plant biology research for high school and college level students. Participants completed pre- and postworkshop evaluations for the measurement of effectiveness. Participants demonstrated an overall improvement in their postworkshop evaluation scores. This article provides a detailed description of workshop activities, as well as assessment and long-term support for broad classroom implementation. PMID:18765753

Hilgert, Uwe; Nash, E. Bruce; Micklos, David A.

2008-01-01

193

Functional Annotation of Conserved Hypothetical Proteins from Haemophilus influenzae Rd KW20  

PubMed Central

Haemophilus influenzae is a Gram-negative bacterium of the family Pasteurellaceae that causes bacteremia, pneumonia and acute bacterial meningitis in infants. The emergence of multi-drug resistant H. influenzae strains in clinical isolates demands the development of new or better drugs against this pathogen. Our study combines a number of bioinformatics tools to predict the function of previously unassigned proteins in the genome of H. influenzae. Extensive analysis of this genome identified 1,657 functional proteins, of which 429 have unknown function and are termed hypothetical proteins (HPs). The amino acid sequences of all 429 HPs were extensively annotated, and we assigned a function to 296 HPs with high confidence. We also characterized the function of 124 HPs precisely, but with less confidence. We believe that the sequence of a protein can be used as a framework to explain known functional properties. Here we have combined the latest versions of protein family databases, protein motifs, intrinsic features from the amino acid sequence, and pathway and genome context methods to assign a precise function to hypothetical proteins for which no experimental information is available. We found that these HPs belong to various classes of proteins such as enzymes, transporters, carriers, receptors, signal transducers, binding proteins, virulence and other proteins. The outcome of this work will be helpful for a better understanding of the mechanism of pathogenesis and in finding novel therapeutic targets for H. influenzae. PMID:24391926

Shahbaaz, Mohd; Hassan, Md. Imtaiyaz; Ahmad, Faizan

2013-01-01

194

PIPA: A High-Throughput Pipeline for Protein Function Annotation Chenggang Yu, Valmik Desai, Nela Zavaljevski, and Jaques Reifman  

E-print Network

Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, MD 21702, USA {cyu of sequence data produced by new high-throughput sequencing technologies. This prompted the development of numerous bioinformatics approaches for automated protein function annotation. However, different function

195

Gene Set Enrichment in eQTL Data Identifies Novel Annotations and Pathway Regulators  

PubMed Central

Genome-wide gene expression profiling has been extensively used to generate biological hypotheses based on differential expression. Recently, many studies have used microarrays to measure gene expression levels across genetic mapping populations. These gene expression phenotypes have been used for genome-wide association analyses, an analysis referred to as expression QTL (eQTL) mapping. Here, eQTL analysis was performed in adipose tissue from 28 inbred strains of mice. We focused our analysis on “trans-eQTL bands”, defined as instances in which the expression patterns of many genes were all associated to a common genetic locus. Genes comprising trans-eQTL bands were screened for enrichments in functional gene sets representing known biological pathways, and genes located at associated trans-eQTL band loci were considered candidate transcriptional modulators. We demonstrate that these patterns were enriched for previously characterized relationships between known upstream transcriptional regulators and their downstream target genes. Moreover, we used this strategy to identify both novel regulators and novel members of known pathways. Finally, based on a putative regulatory relationship identified in our analysis, we identified and validated a previously uncharacterized role for cyclin H in the regulation of oxidative phosphorylation. We believe that the specific molecular hypotheses generated in this study will reveal many additional pathway members and regulators, and that the analysis approaches described herein will be broadly applicable to other eQTL data sets. PMID:18464898

Wu, Chunlei; Delano, David L.; Mitro, Nico; Su, Stephen V.; Janes, Jeff; McClurg, Phillip; Batalov, Serge; Welch, Genevieve L.; Zhang, Jie; Orth, Anthony P.; Walker, John R.; Glynne, Richard J.; Cooke, Michael P.; Takahashi, Joseph S.; Shimomura, Kazuhiro; Kohsaka, Akira; Bass, Joseph; Saez, Enrique; Wiltshire, Tim; Su, Andrew I.

2008-01-01
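
Record 195 describes screening the genes in trans-eQTL bands for enrichment in functional gene sets. The abstract does not name the statistical test used; the sketch below assumes a one-sided hypergeometric test, a common choice for this kind of over-representation analysis. The function name and the example counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, set_size, list_size, overlap):
    """Return P(X >= overlap) under the hypergeometric distribution.

    universe_size: number of genes that could have been selected
    set_size:      number of universe genes belonging to the functional gene set
    list_size:     number of genes in the list under study (e.g. one eQTL band)
    overlap:       number of list genes that belong to the gene set
    """
    max_overlap = min(set_size, list_size)
    total = comb(universe_size, list_size)
    # Sum the probabilities of observing `overlap` or more set members.
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: 40 of 200 band genes fall in a 500-gene pathway
# drawn from a 15,000-gene universe.
p_value = hypergeom_enrichment_p(15_000, 500, 200, 40)
```

When many gene sets are screened, the resulting p-values would still need a multiple-testing correction; that step is omitted here.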

196

Reannotation and extended community resources for the genome of the non-seed plant Physcomitrella patens provide insights into the evolution of plant gene structures and functions  

PubMed Central

Background The moss Physcomitrella patens as a model species provides an important reference for early-diverging lineages of plants and the release of the genome in 2008 opened the doors to genome-wide studies. The usability of a reference genome greatly depends on the quality of the annotation and the availability of centralized community resources. Therefore, in the light of accumulating evidence for missing genes, fragmentary gene structures, false annotations and a low rate of functional annotations on the original release, we decided to improve the moss genome annotation. Results Here, we report the complete moss genome re-annotation (designated V1.6) incorporating the increased transcript availability from a multitude of developmental stages and tissue types. We demonstrate the utility of the improved P. patens genome annotation for comparative genomics and new extensions to the cosmoss.org resource as a central repository for this plant “flagship” genome. The structural annotation of 32,275 protein-coding genes results in 8387 additional loci including 1456 loci with known protein domains or homologs in Plantae. This is the first release to include information on transcript isoforms, suggesting alternative splicing events for at least 10.8% of the loci. Furthermore, this release now also provides information on non-protein-coding loci. Functional annotations were improved regarding quality and coverage, resulting in 58% annotated loci (previously: 41%) that comprise also 7200 additional loci with GO annotations. Access and manual curation of the functional and structural genome annotation is provided via the http://www.cosmoss.org model organism database. Conclusions Comparative analysis of gene structure evolution along the green plant lineage provides novel insights, such as a comparatively high number of loci with 5’-UTR introns in the moss. 
Comparative analysis of functional annotations reveals expansions of moss house-keeping and metabolic genes and further possibly adaptive, lineage-specific expansions and gains including at least 13% orphan genes. PMID:23879659

2013-01-01

197

Gene Annotation and Drug Target Discovery in Candida albicans with a Tagged Transposon Mutant Collection  

Microsoft Academic Search

Candida albicans is the most common human fungal pathogen, causing infections that can be lethal in immunocompromised patients. Although Saccharomyces cerevisiae has been used as a model for C. albicans, it lacks C. albicans' diverse morphogenic forms and is primarily non-pathogenic. Comprehensive genetic analyses that have been instrumental for determining gene function in S. cerevisiae are hampered in C. albicans,

Julia Oh; Eula Fung; Ulrich Schlecht; Ronald W. Davis; Guri Giaever; Robert P. St. Onge; Adam Deutschbauer; Corey Nislow

2010-01-01

198

De Novo Assembly and Annotation of the Transcriptome of the Agricultural Weed Ipomoea purpurea Uncovers Gene Expression Changes Associated with Herbicide Resistance  

PubMed Central

Human-mediated selection can lead to rapid evolution in very short time scales, and the evolution of herbicide resistance in agricultural weeds is an excellent example of this phenomenon. The common morning glory, Ipomoea purpurea, is resistant to the herbicide glyphosate, but genetic investigations of this trait have been hampered by the lack of genomic resources for this species. Here, we present the annotated transcriptome of the common morning glory, Ipomoea purpurea, along with an examination of whole genome expression profiling to assess potential gene expression differences between three artificially selected herbicide resistant lines and three susceptible lines. The assembled Ipomoea transcriptome reported in this work contains 65,459 assembled transcripts, ~28,000 of which were functionally annotated by assignment to Gene Ontology categories. Our RNA-seq survey using this reference transcriptome identified 19 differentially expressed genes associated with resistance—one of which, a cytochrome P450, belongs to a large plant family of genes involved in xenobiotic detoxification. The differentially expressed genes also broadly implicated receptor-like kinases, which were down-regulated in the resistant lines, and other growth and defense genes, which were up-regulated in resistant lines. Interestingly, the target of glyphosate—EPSP synthase—was not overexpressed in the resistant Ipomoea lines as in other glyphosate resistant weeds. Overall, this work identifies potential candidate resistance loci for future investigations and dramatically increases genomic resources for this species. The assembled transcriptome presented herein will also provide a valuable resource to the Ipomoea community, as well as to those interested in utilizing the close relationship between the Convolvulaceae and the Solanaceae for phylogenetic and comparative genomics examinations. PMID:25155274

Leslie, Trent; Baucom, Regina S.

2014-01-01

200

Connectionist Approaches for Predicting Mouse Gene Function from Gene Expression  

E-print Network

Therapy. Identifying gene function based on gene expression data is much easier in prokaryotes than ways, especially in Gene Therapy [5]. Identifying gene function in prokaryotes is much easier thanConnectionist Approaches for Predicting Mouse Gene Function from Gene Expression Emad Andrews

Bonner, Anthony

201

The Aspergillus Genome Database: multispecies curation and incorporation of RNA-Seq data to improve structural gene annotations  

PubMed Central

The Aspergillus Genome Database (AspGD; http://www.aspgd.org) is a freely available web-based resource that was designed for Aspergillus researchers and is also a valuable source of information for the entire fungal research community. In addition to being a repository and central point of access to genome, transcriptome and polymorphism data, AspGD hosts a comprehensive comparative genomics toolbox that facilitates the exploration of precomputed orthologs among the 20 currently available Aspergillus genomes. AspGD curators perform gene product annotation based on review of the literature for four key Aspergillus species: Aspergillus nidulans, Aspergillus oryzae, Aspergillus fumigatus and Aspergillus niger. We have iteratively improved the structural annotation of Aspergillus genomes through the analysis of publicly available transcription data, mostly expressed sequence tags, as described in a previous NAR Database article (Arnaud et al. 2012). In this update, we report substantive structural annotation improvements for A. nidulans, A. oryzae and A. fumigatus genomes based on recently available RNA-Seq data. Over 26 000 loci were updated across these species; although those primarily comprise the addition and extension of untranslated regions (UTRs), the new analysis also enabled over 1000 modifications affecting the coding sequence of genes in each target genome. PMID:24194595

Cerqueira, Gustavo C.; Arnaud, Martha B.; Inglis, Diane O.; Skrzypek, Marek S.; Binkley, Gail; Simison, Matt; Miyasato, Stuart R.; Binkley, Jonathan; Orvis, Joshua; Shah, Prachi; Wymore, Farrell; Sherlock, Gavin; Wortman, Jennifer R.

2014-01-01

202

Characterization of Liaoning Cashmere Goat Transcriptome: Sequencing, De Novo Assembly, Functional Annotation and Comparative Analysis  

PubMed Central

Background The Liaoning cashmere goat is a goat breed famous for its cashmere wool. In order to increase the transcriptome data and accelerate genetic improvement for this breed, we performed de novo transcriptome sequencing to generate the first expressed sequence tag dataset for the Liaoning cashmere goat, using next-generation sequencing technology. Results Transcriptome sequencing of Liaoning cashmere goat on a Roche 454 platform yielded 804,601 high-quality reads. Clustering and assembly of these reads produced a non-redundant set of 117,854 unigenes, comprising 13,194 isotigs and 104,660 singletons. Based on similarity searches with known proteins, 17,356 unigenes were assigned to 6,700 GO categories, and the terms were summarized into three main GO categories and 59 sub-categories. 3,548 and 46,778 unigenes had significant similarity to existing sequences in the KEGG and COG databases, respectively. Comparative analysis revealed that 42,254 unigenes were aligned to 17,532 different sequences in NCBI non-redundant nucleotide databases. 97,236 (82.51%) unigenes were mapped to the 30 goat chromosomes. 35,551 (30.17%) unigenes were matched to 11,438 reported goat protein-coding genes. The remaining non-matched unigenes were further compared with cattle and human reference genes, and 67 putative new goat genes were discovered. Additionally, 2,781 potential simple sequence repeats were initially identified from all unigenes. Conclusion The transcriptome of Liaoning cashmere goat was deeply sequenced, de novo assembled, and annotated, providing abundant data to better understand the Liaoning cashmere goat transcriptome. The potential simple sequence repeats provide a material basis for future genetic linkage and quantitative trait loci analyses. PMID:24130835

Liu, Hongliang; Wang, Tingting; Wang, Jinke; Quan, Fusheng; Zhang, Yong

2013-01-01
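
The counts quoted in the abstract of record 202 can be cross-checked with a few lines of arithmetic. The sketch below only recomputes figures already stated in the abstract (isotigs plus singletons, and the two percentages of the unigene set); the variable names are illustrative.

```python
# Counts taken from the abstract of record 202.
isotigs, singletons = 13_194, 104_660
unigenes = isotigs + singletons          # reported as 117,854

mapped_to_chromosomes = 97_236           # reported as 82.51% of unigenes
matched_to_goat_genes = 35_551           # reported as 30.17% of unigenes

pct_mapped = round(100 * mapped_to_chromosomes / unigenes, 2)
pct_matched = round(100 * matched_to_goat_genes / unigenes, 2)

print(f"unigenes: {unigenes:,}")
print(f"mapped to chromosomes: {pct_mapped}%")
print(f"matched to goat protein-coding genes: {pct_matched}%")
```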

203

Comparison of lists of genes based on functional profiles  

PubMed Central

Background How to compare studies on the basis of their biological significance is a problem of central importance in high-throughput genomics. Many methods for performing such comparisons are based on the information in databases of functional annotation, such as those that form the Gene Ontology (GO). Typically, they consist of analyzing gene annotation frequencies in some pre-specified GO classes, in a class-by-class way, followed by p-value adjustment for multiple testing. Enrichment analysis, where a list of genes is compared against a wider universe of genes, is the most common example. Results A new global testing procedure and a method incorporating it are presented. Instead of testing separately for each GO class, a single global test for all classes under consideration is performed. The test is based on the distance between the functional profiles, defined as the joint frequencies of annotation in a given set of GO classes. These classes may be chosen at one or more GO levels. The new global test is more powerful and accurate with respect to type I errors than the usual class-by-class approach. When applied to some real datasets, the results suggest that the method may also provide useful information that complements the tests performed using a class-by-class approach if gene counts are sparse in some classes. An R library, goProfiles, implements these methods and is available from Bioconductor, http://bioconductor.org/packages/release/bioc/html/goProfiles.html. Conclusions The method provides an inferential basis for deciding whether two lists are functionally different. For global comparisons it is preferable to the global chi-square test of homogeneity. Furthermore, it may provide additional information if used in conjunction with class-by-class methods. PMID:21999355

2011-01-01
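
Record 203 bases its global test on the distance between functional profiles, that is, the frequencies of annotation of a gene list in a chosen set of GO classes. The abstract does not state which distance is used, so the sketch below assumes a squared Euclidean distance between relative-frequency profiles. It is not the goProfiles API; the function names, class labels and gene lists are hypothetical, and the inferential step (the test itself) is left out.

```python
def functional_profile(gene_to_classes, classes):
    """Relative frequency of annotation in each GO class for one gene list.

    `gene_to_classes` maps each gene to the set of GO classes annotating it.
    A gene may count in several classes, so frequencies need not sum to 1.
    """
    n_genes = len(gene_to_classes)
    if n_genes == 0:
        raise ValueError("gene list is empty")
    return [
        sum(1 for annotated in gene_to_classes.values() if cls in annotated) / n_genes
        for cls in classes
    ]


def squared_euclidean(profile_a, profile_b):
    """Squared Euclidean distance between two profiles of equal length."""
    return sum((a - b) ** 2 for a, b in zip(profile_a, profile_b))


# Hypothetical GO classes and gene lists, for illustration only.
classes = ["metabolism", "transport", "signaling"]
list_a = {"g1": {"metabolism", "transport"}, "g2": {"metabolism"}}
list_b = {"g3": {"transport"}, "g4": {"transport", "signaling"}}

profile_a = functional_profile(list_a, classes)   # [1.0, 0.5, 0.0]
profile_b = functional_profile(list_b, classes)   # [0.0, 1.0, 0.5]
distance = squared_euclidean(profile_a, profile_b)
```

A single distance over all classes is what allows one global comparison of two lists instead of a separate test per class followed by p-value adjustment.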

204

BABELOMICS: a systems biology perspective in the functional annotation of genome-scale experiments.  

PubMed

We present a new version of Babelomics, a complete suite of web tools for functional analysis of genome-scale experiments, with new and improved tools. New functionally relevant terms have been included such as CisRed motifs or bioentities obtained by text-mining procedures. An improved indexing has considerably speeded up several of the modules. An improved version of the FatiScan method for studying the coordinate behaviour of groups of functionally related genes is presented, along with a similar tool, the Gene Set Enrichment Analysis. Babelomics is now more oriented to test systems biology inspired hypotheses. Babelomics can be found at http://www.babelomics.org. PMID:16845052

Al-Shahrour, Fátima; Minguez, Pablo; Tárraga, Joaquín; Montaner, David; Alloza, Eva; Vaquerizas, Juan M; Conde, Lucía; Blaschke, Christian; Vera, Javier; Dopazo, Joaquín

2006-07-01

205

A software framework for microarray and gene expression object model (MAGE-OM) array design annotation  

PubMed Central

Background The MIAME and MAGE-OM standards defined by the MGED society provide a specification and implementation of a software infrastructure to facilitate the submission and sharing of data from microarray studies via public repositories. However, although the MAGE object model is flexible enough to support different annotation strategies, the annotation of array descriptions can be complex. Results We have developed a graphical Java-based application (Adamant) to assist with submission of microarray designs to public repositories. Output of the application is fully compliant with the standards prescribed by the various public data repositories. Conclusion Adamant will allow researchers to annotate and submit their own array designs to public repositories without requiring programming expertise or knowledge of MAGE-OM or XML. The application has been used to submit a number of ArrayDesigns to the ArrayExpress database. PMID:18366695

Qureshi, Matloob; Ivens, Alasdair

2008-01-01

206

Annotation of Protein Domains Reveals Remarkable Conservation in the Functional Make up of Proteomes Across Superkingdoms  

PubMed Central

The functional repertoire of a cell is largely embodied in its proteome, the collection of proteins encoded in the genome of an organism. The molecular functions of proteins are the direct consequence of their structure, and structure can be inferred from sequence using hidden Markov models of structural recognition. Here we analyze the functional annotation of protein domain structures in almost a thousand sequenced genomes, exploring the functional and structural diversity of proteomes. We find there is a remarkable conservation in the distribution of domains with respect to the molecular functions they perform in the three superkingdoms of life. In general, most of the protein repertoire is spent in functions related to metabolic processes, but there are significant differences in the usage of domains for regulatory and extra-cellular processes both within and between superkingdoms. Our results support the hypotheses that the proteomes of superkingdom Eukarya evolved via genome expansion mechanisms that were directed towards innovating new domain architectures for regulatory and extra/intracellular process functions needed, for example, to maintain the integrity of multicellular structure or to interact with environmental biotic and abiotic factors (e.g., cell signaling and adhesion, immune responses, and toxin production). Proteomes of microbial superkingdoms Archaea and Bacteria retained fewer domains and maintained simpler and smaller protein repertoires. Viruses appear to play an important role in the evolution of superkingdoms. Finally, we identify a few genomic outliers that deviate significantly from the conserved functional design. These include Nanoarchaeum equitans, proteobacterial symbionts of insects with extremely reduced genomes, Tenericutes and Guillardia theta. These organisms spend most of their domains on information functions, including translation and transcription, rather than on metabolism and harbor a domain repertoire characteristic of parasitic organisms. In contrast, the functional repertoire of the proteomes of the Planctomycetes-Verrucomicrobia-Chlamydiae superphylum was no different from that of the rest of bacteria, failing to support claims that they represent a separate superkingdom. In turn, Protista and Bacteria shared similar functional distribution patterns, suggesting an ancestral evolutionary link between these groups. PMID:24710297

Nasir, Arshan; Naeem, Aisha; Khan, Muhammad Jawad; Lopez-Nicora, Horacio D.; Caetano-Anolles, Gustavo

2011-01-01

207

A semantic analysis of the annotations of the human genome  

PubMed Central

The correct interpretation of any biological experiment depends in an essential way on the accuracy and consistency of the existing annotation databases. Such databases are ubiquitous and used by all life scientists in most experiments. However, it is well known that such databases are incomplete and many annotations may also be incorrect. In this paper we describe a technique that can be used to analyze the semantic content of such annotation databases. Our approach is able to extract implicit semantic relationships between genes and functions. This ability allows us to discover novel functions for known genes. This approach is able to identify missing and inaccurate annotations in existing annotation databases, and thus help improve their accuracy. We used our technique to analyze the current annotations of the human genome. From this body of annotations, we were able to predict 212 additional gene–function assignments. A subsequent literature search found that 138 of these gene–function assignments are supported by existing peer-reviewed papers. An additional 23 assignments have been confirmed in the meantime by the addition of the respective annotations in later releases of the Gene Ontology database. Overall, the 161 confirmed assignments represent 75.95% of the proposed gene–function assignments. Only one of our predictions (0.4%) was contradicted by the existing literature. We could not find any relevant articles for 50 of our predictions (23.58%). The method is independent of the organism and can be used to analyze and improve the quality of the data of any public or private annotation database. Availability: http://vortex.cs.wayne.edu/papers/semantic_analysis_bioinfo.pdf Contact: sod@cs.wayne.edu PMID:15955782

Khatri, Purvesh; Done, Bogdan; Rao, Archana; Done, Arina

2008-01-01

208

Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation  

E-print Network

) that are concatenated for efficient sequencing and mapped to genome sequences to demarcate the transcription boundaries technological advance for genome annotation. With the completion of sequencing of the human1–3 and other 5′ and 3′ short tag sequences for each transcript, map these terminal 'signatures' to the genome

Cai, Long

209

The mammalian gene function resource: the International Knockout Mouse Consortium.  

PubMed

In 2007, the International Knockout Mouse Consortium (IKMC) made the ambitious promise to generate mutations in virtually every protein-coding gene of the mouse genome in a concerted worldwide action. Now, 5 years later, the IKMC members have developed high-throughput gene trapping and, in particular, gene-targeting pipelines and generated more than 17,400 mutant murine embryonic stem (ES) cell clones and more than 1,700 mutant mouse strains, most of them conditional. A common IKMC web portal (www.knockoutmouse.org) has been established, allowing easy access to this unparalleled biological resource. The IKMC materials considerably enhance functional gene annotation of the mammalian genome and will have a major impact on future biomedical research. PMID:22968824

Bradley, Allan; Anastassiadis, Konstantinos; Ayadi, Abdelkader; Battey, James F; Bell, Cindy; Birling, Marie-Christine; Bottomley, Joanna; Brown, Steve D; Bürger, Antje; Bult, Carol J; Bushell, Wendy; Collins, Francis S; Desaintes, Christian; Doe, Brendan; Economides, Aris; Eppig, Janan T; Finnell, Richard H; Fletcher, Colin; Fray, Martin; Frendewey, David; Friedel, Roland H; Grosveld, Frank G; Hansen, Jens; Hérault, Yann; Hicks, Geoffrey; Hörlein, Andreas; Houghton, Richard; Hrabé de Angelis, Martin; Huylebroeck, Danny; Iyer, Vivek; de Jong, Pieter J; Kadin, James A; Kaloff, Cornelia; Kennedy, Karen; Koutsourakis, Manousos; Lloyd, K C Kent; Marschall, Susan; Mason, Jeremy; McKerlie, Colin; McLeod, Michael P; von Melchner, Harald; Moore, Mark; Mujica, Alejandro O; Nagy, Andras; Nefedov, Mikhail; Nutter, Lauryl M; Pavlovic, Guillaume; Peterson, Jane L; Pollock, Jonathan; Ramirez-Solis, Ramiro; Rancourt, Derrick E; Raspa, Marcello; Remacle, Jacques E; Ringwald, Martin; Rosen, Barry; Rosenthal, Nadia; Rossant, Janet; Ruiz Noppinger, Patricia; Ryder, Ed; Schick, Joel Zupicich; Schnütgen, Frank; Schofield, Paul; Seisenberger, Claudia; Selloum, Mohammed; Simpson, Elizabeth M; Skarnes, William C; Smedley, Damian; Stanford, William L; Stewart, A Francis; Stone, Kevin; Swan, Kate; Tadepally, Hamsa; Teboul, Lydia; Tocchini-Valentini, Glauco P; Valenzuela, David; West, Anthony P; Yamamura, Ken-ichi; Yoshinaga, Yuko; Wurst, Wolfgang

2012-10-01

210

A novel method to quantify gene set functional association based on gene ontology  

PubMed Central

Numerous gene sets have been used as molecular signatures for exploring the genetic basis of complex disorders. These gene sets are distinct but related to each other in many cases; therefore, efforts have been made to compare gene sets for studies such as those evaluating the reproducibility of different experiments. Comparison in terms of biological function has been demonstrated to be helpful to biologists. We improved the measurement of semantic similarity to quantify the functional association between gene sets in the context of gene ontology and developed a web toolkit named Gene Set Functional Similarity (GSFS; http://bioinfo.hrbmu.edu.cn/GSFS). Validation based on protein complexes for which the functional associations are known demonstrated that the GSFS scores tend to be correlated with sequence similarity scores and that complexes with high GSFS scores tend to be involved in the same functional catalogue. Compared with the pairwise method and the annotation method, the GSFS shows better discrimination and more accurately reflects the known functional catalogues shared between complexes. Case studies comparing differentially expressed genes of prostate tumour samples from different microarray platforms and identifying coronary heart disease susceptibility pathways revealed that the method could contribute to future studies exploring the molecular basis of complex disorders. PMID:21998111

Lv, Sali; Li, Yan; Wang, Qianghu; Ning, Shangwei; Huang, Teng; Wang, Peng; Sun, Jie; Zheng, Yan; Liu, Weisha; Ai, Jing; Li, Xia

2012-01-01

211

Automated annotation of functional imaging experiments via multi-label classification  

PubMed Central

Identifying the experimental methods in human neuroimaging papers is important for grouping meaningfully similar experiments for meta-analyses. Currently, this can only be done by human readers. We present the performance of common machine learning (text mining) methods applied to the problem of automatically classifying or labeling this literature. Labeling terms are from the Cognitive Paradigm Ontology (CogPO), the text corpora are abstracts of published functional neuroimaging papers, and the methods use the performance of a human expert as training data. We aim to replicate the expert's annotation of multiple labels per abstract identifying the experimental stimuli, cognitive paradigms, response types, and other relevant dimensions of the experiments. We use several standard machine learning methods: naive Bayes (NB), k-nearest neighbor, and support vector machines (specifically SMO or sequential minimal optimization). Exact match performance ranged from only 15% in the worst cases to 78% in the best cases. NB methods combined with binary relevance transformations performed strongly and were robust to overfitting. This collection of results demonstrates what can be achieved with off-the-shelf software components and little to no pre-processing of raw text. PMID:24409112

Turner, Matthew D.; Chakrabarti, Chayan; Jones, Thomas B.; Xu, Jiawei F.; Fox, Peter T.; Luger, George F.; Laird, Angela R.; Turner, Jessica A.

2013-01-01
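
The exact-match figures quoted in the record above (15% to 78%) count an abstract as correct only when the full predicted label set equals the expert's label set. A minimal sketch of that metric follows; the function name and the toy labels are assumptions made for illustration and are not taken from the paper or from CogPO.

```python
def exact_match_ratio(expert_labels, predicted_labels):
    """Fraction of documents whose predicted label set equals the expert's set."""
    if len(expert_labels) != len(predicted_labels):
        raise ValueError("both lists must cover the same documents")
    if not expert_labels:
        return 0.0
    # A document counts as a hit only if every label matches and none is extra.
    hits = sum(set(e) == set(p) for e, p in zip(expert_labels, predicted_labels))
    return hits / len(expert_labels)


# Hypothetical labels for three abstracts (placeholders, not real CogPO terms).
expert = [{"visual", "button_press"}, {"auditory"}, {"visual", "speech"}]
predicted = [{"button_press", "visual"}, {"auditory", "speech"}, {"visual"}]
score = exact_match_ratio(expert, predicted)  # only the first abstract matches exactly
```

Because a single missing or extra label makes the whole abstract a miss, this measure is much stricter than per-label accuracy, which may explain the wide range reported.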

212

Comparative annotation of functional regions in the human genome using epigenomic data.  

PubMed

Epigenetic regulation is dynamic and cell-type dependent. The recently available epigenomic data in multiple cell types provide an unprecedented opportunity for a comparative study of the epigenetic landscape. We developed a machine-learning method called ChroModule to annotate the epigenetic states in eight ENCyclopedia Of DNA Elements cell types. The trained model successfully captured the characteristic histone-modification patterns associated with regulatory elements, such as promoters and enhancers, and showed superior performance in identifying enhancers compared with state-of-the-art methods. In addition, given the fixed number of epigenetic states in the model, ChroModule allows straightforward illustration of epigenetic variability in multiple cell types. Using this feature, we found that invariable and variable epigenetic states across cell types correspond to housekeeping functions and stimulus response, respectively. In particular, we observed that enhancers, but not the other regulatory elements, dictate cell specificity, as similar cell types share common enhancers, and cell-type-specific enhancers are often bound by transcription factors playing critical roles in that cell type. More interestingly, we found that some genomic regions are dormant in one cell type but primed to become active in other cell types. These observations highlight the usefulness of ChroModule in comparative analysis and interpretation of multiple epigenomes. PMID:23482391

Won, Kyoung-Jae; Zhang, Xian; Wang, Tao; Ding, Bo; Raha, Debasish; Snyder, Michael; Ren, Bing; Wang, Wei

2013-04-01

213

Comparative annotation of functional regions in the human genome using epigenomic data  

PubMed Central

Epigenetic regulation is dynamic and cell-type dependent. The recently available epigenomic data in multiple cell types provide an unprecedented opportunity for a comparative study of the epigenetic landscape. We developed a machine-learning method called ChroModule to annotate the epigenetic states in eight ENCyclopedia Of DNA Elements cell types. The trained model successfully captured the characteristic histone-modification patterns associated with regulatory elements, such as promoters and enhancers, and showed superior performance in identifying enhancers compared with state-of-the-art methods. In addition, given the fixed number of epigenetic states in the model, ChroModule allows straightforward illustration of epigenetic variability in multiple cell types. Using this feature, we found that invariable and variable epigenetic states across cell types correspond to housekeeping functions and stimulus response, respectively. In particular, we observed that enhancers, but not the other regulatory elements, dictate cell specificity, as similar cell types share common enhancers, and cell-type–specific enhancers are often bound by transcription factors playing critical roles in that cell type. More interestingly, we found that some genomic regions are dormant in one cell type but primed to become active in other cell types. These observations highlight the usefulness of ChroModule in comparative analysis and interpretation of multiple epigenomes. PMID:23482391

Won, Kyoung-Jae; Zhang, Xian; Wang, Tao; Ding, Bo; Raha, Debasish; Snyder, Michael; Ren, Bing; Wang, Wei

2013-01-01

214

A transcriptomic analysis of striped catfish (Pangasianodon hypophthalmus) in response to salinity adaptation: De novo assembly, gene annotation and marker discovery.  

PubMed

The striped catfish (Pangasianodon hypophthalmus) culture industry in the Mekong Delta in Vietnam has developed rapidly over the past decade. The industry now, however, faces some significant challenges, especially those related to climate change impacts, notably the predicted extensive saltwater intrusion into many low-lying coastal provinces across the Mekong Delta. This problem highlights a need for development of culture stocks that can tolerate more saline culture environments as a response to expansion of saline water-intruded land. While a traditional artificial selection program can potentially address this need, understanding the genomic basis of salinity tolerance can assist development of more productive culture lines. The current study applied a transcriptomic approach using Ion PGM technology to generate expressed sequence tag (EST) resources from the intestine and swim bladder of striped catfish reared at a salinity level of 9 ppt, which showed the best growth performance. Total sequence data generated was 467.8 Mbp, consisting of 4,116,424 reads with an average length of 112 bp. De novo assembly generated 51,188 contigs and allowed identification of 16,116 putative genes based on the GenBank non-redundant database. GO annotation, KEGG pathway mapping, and functional annotation of the EST sequences recovered a wide diversity of biological functions and processes. In addition, more than 11,600 simple sequence repeats were detected. This is the first comprehensive analysis of a striped catfish transcriptome, and provides a valuable genomic resource for future selective breeding programs and functional or evolutionary studies of genes that influence salinity tolerance in this important culture species. PMID:24841517

Thanh, Nguyen Minh; Jung, Hyungtaek; Lyons, Russell E; Chand, Vincent; Tuan, Nguyen Viet; Thu, Vo Thi Minh; Mather, Peter

2014-06-01

215

Systematic Learning of Gene Functional Classes From DNA Array Expression Data by Using  

E-print Network

10598, USA Recent advances in microarray technology have opened new ways for functional annotation negatives). A closer analysis reveals that false positives (and negatives) in a machine-learning context positives and negatives and contains genes that are biologically related to the original class, allowing

Gerstein, Mark

216

Integrative Annotation of 21,037 Human Genes Validated by Full-Length cDNA Clones  

Microsoft Academic Search

The human genome sequence defines our inherent biological potential; the realization of the biology encoded therein requires knowledge of the function of each gene. Currently, our knowledge in this area is still limited. Several lines of investigation have been used to elucidate the structure and function of the genes in the human genome. Even so, gene prediction remains a difficult

Tadashi Imanishi; Takeshi Itoh; Yutaka Suzuki; Claire ODonovan; Satoshi Fukuchi; Kanako O. Koyanagi; Roberto A. Barrero; Takuro Tamura; Yumi Yamaguchi-Kabata; Motohiko Tanino; Kei Yura; Satoru Miyazaki; Kazuho Ikeo; Keiichi Homma; Arek Kasprzyk; Tetsuo Nishikawa; Mika Hirakawa; Jean Thierry-Mieg; Danielle Thierry-Mieg; Jennifer Ashurst; Libin Jia; Mitsuteru Nakao; Michael A. Thomas; Nicola Mulder; Youla Karavidopoulou; Lihua Jin; Sangsoo Kim; Tomohiro Yasuda; Boris Lenhard; Eric Eveno; Yoshiyuki Suzuki; Chisato Yamasaki; Jun-ichi Takeda; Craig Gough; Phillip Hilton; Yasuyuki Fujii; Hiroaki Sakai; Susumu Tanaka; Clara Amid; Matthew Bellgard; Maria de Fatima Bonaldo; Hidemasa Bono; Susan K. Bromberg; Anthony J. Brookes; Elspeth Bruford; Piero Carninci; Claude Chelala; Christine Couillault; Sandro J. de Souza; Marie-Anne Debily; Marie-Dominique Devignes; Inna Dubchak; Toshinori Endo; Anne Estreicher; Eduardo Eyras; Kaoru Fukami-Kobayashi; Gopal R. Gopinath; Esther Graudens; Yoonsoo Hahn; Michael Han; Ze-Guang Han; Kousuke Hanada; Hideki Hanaoka; Erimi Harada; Katsuyuki Hashimoto; Ursula Hinz; Momoki Hirai; Teruyoshi Hishiki; Ian Hopkinson; Sandrine Imbeaud; Hidetoshi Inoko; Alexander Kanapin; Yayoi Kaneko; Takeya Kasukawa; Janet Kelso; Paul Kersey; Reiko Kikuno; Kouichi Kimura; Bernhard Korn; Vladimir Kuryshev; Izabela Makalowska; Takashi Makino; Shuhei Mano; Regine Mariage-Samson; Jun Mashima; Hideo Matsuda; Hans-Werner Mewes; Shinsei Minoshima; Keiichi Nagai; Hideki Nagasaki; Naoki Nagata; Rajni Nigam; Osamu Ogasawara; Osamu Ohara; Masafumi Ohtsubo; Norihiro Okada; Toshihisa Okido; Satoshi Oota; Motonori Ota; Toshio Ota; Tetsuji Otsuki; Dominique Piatier-Tonneau; Annemarie Poustka; Shuang-Xi Ren; Naruya Saitou; Katsunaga Sakai; Shigetaka Sakamoto; Ryuichi Sakate; Ingo Schupp; Florence Servant; Stephen Sherry; Rie Shiba; Nobuyoshi Shimizu; Mary Shimoyama; Andrew J Simpson; Bento Soares; Charles Steward; Makiko Suwa; Mami Suzuki; Aiko Takahashi; Gen Tamiya; Hiroshi 
Tanaka; Todd Taylor; Joseph D Terwilliger; Per Unneberg; Vamsi Veeramachaneni; Shinya Watanabe; Laurens Wilming; Norikazu Yasuda; Hyang-Sook Yoo; Marvin Stodolsky; Wojciech Makalowski; Mitiko Go; Kenta Nakai; Toshihisa Takagi; Minoru Kanehisa; Yoshiyuki Sakaki; John Quackenbush; Yasushi Okazaki; Yoshihide Hayashizaki; Winston Hide; Ranajit Chakraborty; Ken Nishikawa; Hideaki Sugawara; Yoshio Tateno; Zhu Chen; Michio Oishi; Peter Tonellato; Rolf Apweiler; Kousaku Okubo; Lukas Wagner; Stefan Wiemann; Robert L Strausberg; Takao Isogai; Charles Auffray; Nobuo Nomura; Takashi Gojobori; Sumio Sugano

2004-01-01

217

Construction and accessibility of a cross-species phenotype ontology along with gene annotations for biomedical research.  

PubMed

Phenotype analyses, e.g. investigating metabolic processes, tissue formation, or organism behavior, are an important element of most biological and medical research activities. Biomedical researchers are making increased use of ontological standards and methods to capture the results of such analyses, with one focus being the comparison and analysis of phenotype information between species. We have generated a cross-species phenotype ontology for human, mouse and zebrafish that contains classes from the Human Phenotype Ontology, Mammalian Phenotype Ontology, and generated classes for zebrafish phenotypes. We also provide up-to-date annotation data connecting human genes to phenotype classes from the generated ontology. We have included the data generation pipeline into our continuous integration system ensuring stable and up-to-date releases. This article describes the data generation process and is intended to help interested researchers access both the phenotype annotation data and the associated cross-species phenotype ontology. The resource described here can be used in sophisticated semantic similarity and gene set enrichment analyses for phenotype data across species. The stable releases of this resource can be obtained from http://purl.obolibrary.org/obo/hp/uberpheno/. PMID:24358873

Köhler, Sebastian; Doelken, Sandra C; Ruef, Barbara J; Bauer, Sebastian; Washington, Nicole; Westerfield, Monte; Gkoutos, George; Schofield, Paul; Smedley, Damian; Lewis, Suzanna E; Robinson, Peter N; Mungall, Christopher J

2013-01-01

218

Construction and accessibility of a cross-species phenotype ontology along with gene annotations for biomedical research  

PubMed Central

Phenotype analyses, e.g. investigating metabolic processes, tissue formation, or organism behavior, are an important element of most biological and medical research activities. Biomedical researchers are making increased use of ontological standards and methods to capture the results of such analyses, with one focus being the comparison and analysis of phenotype information between species. We have generated a cross-species phenotype ontology for human, mouse and zebrafish that contains classes from the Human Phenotype Ontology, Mammalian Phenotype Ontology, and generated classes for zebrafish phenotypes. We also provide up-to-date annotation data connecting human genes to phenotype classes from the generated ontology. We have included the data generation pipeline into our continuous integration system ensuring stable and up-to-date releases. This article describes the data generation process and is intended to help interested researchers access both the phenotype annotation data and the associated cross-species phenotype ontology. The resource described here can be used in sophisticated semantic similarity and gene set enrichment analyses for phenotype data across species. The stable releases of this resource can be obtained from http://purl.obolibrary.org/obo/hp/uberpheno/. PMID:24358873

Köhler, Sebastian; Mungall, Christopher J

2014-01-01

219

Comprehensive Functional Annotation of Seventy-One Breast Cancer Risk Loci  

PubMed Central

Breast Cancer (BCa) genome-wide association studies revealed allelic frequency differences between cases and controls at index single nucleotide polymorphisms (SNPs). To date, 71 loci have thus been identified and replicated. More than 320,000 SNPs at these loci define BCa risk due to linkage disequilibrium (LD). We propose that BCa risk resides in a subgroup of SNPs that functionally affects breast biology. Such a shortlist will aid in framing hypotheses to prioritize a manageable number of likely disease-causing SNPs. We extracted all the SNPs residing in 1 Mb windows around each breast cancer risk index SNP from the 1000 Genomes Project to find correlated SNPs. We used FunciSNP, an R/Bioconductor package developed in-house, to identify potentially functional SNPs at 71 risk loci by coinciding them with chromatin biofeatures. We identified 1,005 SNPs in LD with the index SNPs (r² ≥ 0.5) in three categories: 21 in exons of 18 genes, 76 in transcription start site (TSS) regions of 25 genes, and 921 in enhancers. Thirteen SNPs were found in more than one category. We found two correlated and predicted non-benign coding variants (rs8100241 in exon 2 and rs8108174 in exon 3) of the gene ANKLE1. Most putative functional LD SNPs, however, were found in either epigenetically defined enhancers or in gene TSS regions. Fifty-five percent of these non-coding SNPs are likely functional, since they affect response element (RE) sequences of transcription factors. Functionality of these SNPs was assessed by expression quantitative trait loci (eQTL) analysis and allele-specific enhancer assays. Unbiased analyses of SNPs at BCa risk loci revealed new and overlooked mechanisms that may affect risk of the disease, thereby providing a valuable resource for follow-up studies. PMID:23717510

Rhie, Suhn Kyong; Coetzee, Simon G.; Noushmehr, Houtan; Yan, Chunli; Kim, Jae Mun; Haiman, Christopher A.; Coetzee, Gerhard A.

2013-01-01

220

Transcriptome sequencing and annotation of the microalgae Dunaliella tertiolecta: Pathway description and gene discovery for production of next-generation biofuels  

PubMed Central

Background Biodiesel or ethanol derived from lipids or starch produced by microalgae may overcome many of the sustainability challenges previously ascribed to petroleum-based fuels and first generation plant-based biofuels. The paucity of microalgae genome sequences, however, limits gene-based biofuel feedstock optimization studies. Here we describe the sequencing and de novo transcriptome assembly for the non-model microalgae species, Dunaliella tertiolecta, and identify pathways and genes of importance related to biofuel production. Results Next generation DNA pyrosequencing technology applied to D. tertiolecta transcripts produced 1,363,336 high quality reads with an average length of 400 bases. Following quality and size trimming, ~45% of the high quality reads were assembled into 33,307 isotigs with a 31-fold coverage and 376,482 singletons. Assembled sequences and singletons were subjected to BLAST similarity searches and annotated with Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology (KO) identifiers. These analyses identified the majority of lipid and starch biosynthesis and catabolism pathways in D. tertiolecta. Conclusions The construction of metabolic pathways involved in the biosynthesis and catabolism of fatty acids, triacylglycerols, and starch in D. tertiolecta as well as the assembled transcriptome provide a foundation for the molecular genetics and functional genomics required to direct metabolic engineering efforts that seek to enhance the quantity and character of microalgae-based biofuel feedstock. PMID:21401935

2011-01-01

221

Genes of the antioxidant system of the honey bee: annotation and phylogeny  

PubMed Central

Antioxidant enzymes perform a variety of vital functions including the reduction of life-shortening oxidative damage. We used the honey bee genome sequence to identify the major components of the honey bee antioxidant system. A comparative analysis of honey bee with Drosophila melanogaster and Anopheles gambiae shows that although the basic components of the antioxidant system are conserved, there are important species differences in the number of paralogs. These include the duplication of thioredoxin reductase and the expansion of the thioredoxin family in fly; lack of expansion of the Theta, Delta and Omega GST classes in bee and no expansion of the Sigma class in dipteran species. The differential expansion of antioxidant gene families among honey bees and dipteran species might reflect the marked differences in life history and ecological niches between social and solitary species. PMID:17069640

Corona, M; Robinson, G E

2006-01-01

222

Managing the data deluge: data-driven GO category assignment improves while complexity of functional annotation increases.  

PubMed

The available curated data lag behind current biological knowledge contained in the literature. Text mining can assist biologists and curators to locate and access this knowledge, for instance by characterizing the functional profile of publications. Gene Ontology (GO) category assignment in free text already supports various applications, such as powering ontology-based search engines, finding curation-relevant articles (triage) or helping the curator to identify and encode functions. Popular text mining tools for GO classification are based on so-called thesaurus-based (or dictionary-based) approaches, which exploit similarities between the input text and GO terms themselves. But their effectiveness remains limited owing to the complex nature of GO terms, which rarely occur in text. In contrast, machine learning approaches exploit similarities between the input text and already curated instances contained in a knowledge base to infer a functional profile. GO Annotations (GOA) and MEDLINE make it possible to exploit a growing amount of curated abstracts (97 000 in November 2012) for populating this knowledge base. Our study compares a state-of-the-art thesaurus-based system with a machine learning system (based on a k-Nearest Neighbours algorithm) for the task of proposing a functional profile for unseen MEDLINE abstracts, and shows how resources and performances have evolved. Systems are evaluated on their ability to propose for a given abstract the GO terms (2.8 on average) used for curation in GOA. We show that since 2006, although a massive effort was put into adding synonyms in GO (+300%), the effectiveness of our thesaurus-based system has remained rather constant, going from 0.28 to 0.31 for Recall at 20 (R20). In contrast, thanks to the growth of its knowledge base, our machine learning system has steadily improved, going from 0.38 in 2006 to 0.56 for R20 in 2012. Integrated in semi-automatic workflows or in fully automatic pipelines, such systems are increasingly effective at assisting biologists. DATABASE URL: http://eagl.unige.ch/GOCat/ PMID:23842461

Gobeill, Julien; Pasche, Emilie; Vishnyakova, Dina; Ruch, Patrick

2013-01-01
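
The record above evaluates both systems with Recall at 20 (R20): the share of the curated GO terms for an abstract that appear among the first 20 terms a system proposes. A minimal sketch of that measure follows, assuming each system returns a ranked list of term identifiers; the function names and the placeholder identifiers are illustrative and do not come from GOCat.

```python
def recall_at_k(ranked_terms, curated_terms, k=20):
    """Share of curated terms found among the top-k proposed terms."""
    curated = set(curated_terms)
    if not curated:
        raise ValueError("at least one curated term is required")
    top_k = set(ranked_terms[:k])
    return len(top_k & curated) / len(curated)


def mean_recall_at_k(examples, k=20):
    """Average recall at k over (ranked_terms, curated_terms) pairs."""
    scores = [recall_at_k(ranked, curated, k) for ranked, curated in examples]
    return sum(scores) / len(scores)


# Placeholder identifiers, not real GO terms.
ranked = ["t1", "t2", "t3", "t4"]
curated = {"t2", "t9"}
r3 = recall_at_k(ranked, curated, k=3)  # "t2" is retrieved, "t9" is not
```

With only 2.8 curated terms per abstract on average, each abstract contributes a coarse value (for example 0, 1/3, 2/3 or 1 for three terms), so the reported R20 is best read as an average over many abstracts.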

223

Solving the Problem: Genome Annotation Standards before the Data Deluge  

PubMed Central

The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries. PMID:22180819

Klimke, William; O'Donovan, Claire; White, Owen; Brister, J. Rodney; Clark, Karen; Fedorov, Boris; Mizrachi, Ilene; Pruitt, Kim D.; Tatusova, Tatiana

2011-01-01

224

Re-annotating the Mycoplasma pneumoniae genome sequence: adding value, function and reading frames  

Microsoft Academic Search

Four years after the original sequence submission, we have re-annotated the genome of Mycoplasma pneumoniae to incorporate novel data. The total number of ORFs has been increased from 677 to 688 (10 new proteins were predicted in intergenic regions, two further were newly identified by mass spectrometry and one protein ORF was dismissed) and the number of RNAs from 39

Thomas Dandekar; Martijn Huynen; Jörg Thomas Regula; Barbara Ueberle; Carl Ulrich Zimmermann; Miguel A. Andrade; Tobias Doerks; Luis Sánchez-Pulido; Berend Snel; Mikita Suyama; P. Yuan; Richard Herrmann; Peer Bork

2000-01-01

225

Functional Cohesion of Gene Sets Determined by Latent Semantic Indexing of PubMed Abstracts  

PubMed Central

High-throughput genomic technologies enable researchers to identify genes that are co-regulated with respect to specific experimental conditions. Numerous statistical approaches have been developed to identify differentially expressed genes. Because each approach can produce distinct gene sets, it is difficult for biologists to determine which statistical approach yields biologically relevant gene sets and is appropriate for their study. To address this issue, we implemented Latent Semantic Indexing (LSI) to determine the functional coherence of gene sets. An LSI model was built using over 1 million Medline abstracts for over 20,000 mouse and human genes annotated in Entrez Gene. The gene-to-gene LSI-derived similarities were used to calculate a literature cohesion p-value (LPv) for a given gene set using a Fisher's exact test. We tested this method against genes in more than 6,000 functional pathways annotated in Gene Ontology (GO) and found that approximately 75% of gene sets in GO biological process category and 90% of the gene sets in GO molecular function and cellular component categories were functionally cohesive (LPv<0.05). These results indicate that the LPv methodology is both robust and accurate. Application of this method to previously published microarray datasets demonstrated that LPv can be helpful in selecting the appropriate feature extraction methods. To enable real-time calculation of LPv for mouse or human gene sets, we developed a web tool called Gene-set Cohesion Analysis Tool (GCAT). GCAT can complement other gene set enrichment approaches by determining the overall functional cohesion of data sets, taking into account both explicit and implicit gene interactions reported in the biomedical literature. Availability GCAT is freely available at http://binf1.memphis.edu/gcat PMID:21533142

Xu, Lijing; Furlotte, Nicholas; Lin, Yunyue; Heinrich, Kevin; Berry, Michael W.; George, Ebenezer O.; Homayouni, Ramin

2011-01-01
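
The literature cohesion p-value in the record above is described only at a high level, so the following is a minimal sketch rather than the GCAT implementation. It assumes that gene pairs are split by a similarity threshold and that a one-sided Fisher's exact test asks whether pairs inside the gene set are enriched for high similarity. The function names, the threshold, and the form of the 2x2 table are illustrative assumptions.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b        # pairs inside the gene set
    col1 = a + c        # pairs at or above the similarity threshold
    n = a + b + c + d   # all scored pairs
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, row1)


def cohesion_p_value(similarity, gene_set, threshold):
    """Hypothetical cohesion score for a gene set.

    similarity maps (gene1, gene2) tuples to a similarity value,
    for example a cosine similarity from an LSI model (assumed input).
    """
    members = set(gene_set)
    a = b = c = d = 0
    for (g1, g2), score in similarity.items():
        inside = g1 in members and g2 in members
        strong = score >= threshold
        if inside and strong:
            a += 1      # inside the set, high similarity
        elif inside:
            b += 1      # inside the set, low similarity
        elif strong:
            c += 1      # outside the set, high similarity
        else:
            d += 1      # outside the set, low similarity
    return fisher_exact_greater(a, b, c, d)
```

A small p-value under these assumptions would suggest that the genes in the set are more closely related in the literature than background pairs are.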

226

Mining and gene ontology based annotation of SSR markers from expressed sequence tags of Humulus lupulus  

PubMed Central

Humulus lupulus, commonly known as hops, is a member of the family Moraceae. Many ongoing projects are accumulating voluminous genomic and expressed sequence tag (EST) sequences in public databases. The genetically characterized domains in these databases are limited due to the non-availability of reliable molecular markers. A large body of EST sequence data is available for hops. In the present study, simple sequence repeat (SSR) markers extracted from EST data are used as molecular markers for genetic characterization. 25,495 EST sequences were examined and assembled to get full-length sequences. Mononucleotide SSR motifs showed the maximum frequency, i.e. 60.44% in contigs and 62.16% in singletons, whereas the minimum frequencies were observed for hexanucleotide SSRs in contigs (0.09%) and pentanucleotide SSRs in singletons (0.12%). The most frequent trinucleotide motifs code for glutamic acid (GAA), while AT/TA were the most frequent repeats among dinucleotide SSRs. Flanking primer pairs were designed in silico for the SSR-containing sequences. Functional categorization of SSR-containing sequences was done through gene ontology terms such as biological process, cellular component and molecular function. PMID:22368382

Singh, Swati; Gupta, Sanchita; Mani, Ashutosh; Chaturvedi, Anoop

2012-01-01
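
The SSR mining step in the record above can be illustrated with a short regular-expression scan. This is a sketch under stated assumptions, not the tool used in the study: the minimum repeat counts per motif length are hypothetical defaults (the abstract does not give thresholds), and only perfect repeats are reported.

```python
import re

# Assumed minimum repeat counts per motif length (mono- to hexanucleotide).
DEFAULT_MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq, min_repeats=None):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    if min_repeats is None:
        min_repeats = DEFAULT_MIN_REPEATS
    seq = seq.upper()
    hits = []
    for size, reps in min_repeats.items():
        # A motif of `size` bases followed by at least reps - 1 more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, reps - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. "AA" or "ATAT"); the shorter motif class covers them.
            if any(motif == motif[:k] * (size // k)
                   for k in range(1, size) if size % k == 0):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // size))
    return sorted(hits)
```

Applied to assembled contigs and singletons, the motif lengths in the returned tuples could be tallied to give frequency distributions of the kind reported in the abstract.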

227

Towards precise classification of cancers based on robust gene functional expression profiles  

PubMed Central

Background Development of robust and efficient methods for analyzing and interpreting high-dimensional gene expression profiles continues to be a focus in computational biology. The accumulated experimental evidence supports the assumption that genes express and perform their functions in a modular fashion in cells. Therefore, there is room for the development of timely and relevant computational algorithms that use robust functional expression profiles towards precise classification of complex human diseases at the modular level. Results Inspired by the insight that genes act as a module to carry out a highly integrated cellular function, we define a low-dimensional functional expression profile for data reduction. After annotating each individual gene to functional categories defined in a proper gene function classification system, such as the Gene Ontology applied in this study, we identify those functional categories enriched with differentially expressed genes. For each functional category or functional module, we compute a summary measure (s) for the raw expression values of the annotated genes to capture the overall activity level of the module. In this way, we can treat the gene expressions within a functional module as an integrative data point to replace the multiple values of individual genes. We compare the classification performance of decision trees based on functional expression profiles with that of trees based on conventional gene expression profiles using four publicly available datasets; the comparison indicates that precise classification of tumour types and improved interpretation can be achieved with the reduced functional expression profiles. Conclusion This modular approach is demonstrated to be a powerful alternative approach to analyzing high-dimensional microarray data and is robust to high measurement noise and intrinsic biological variance inherent in microarray data. Furthermore, efficient integration with current biological knowledge has facilitated the interpretation of the underlying molecular mechanisms for complex human diseases at the modular level. PMID:15774002

Guo, Zheng; Zhang, Tianwen; Li, Xia; Wang, Qi; Xu, Jianzhen; Yu, Hui; Zhu, Jing; Wang, Haiyun; Wang, Chenguang; Topol, Eric J; Wang, Qing; Rao, Shaoqi

2005-01-01
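
The data reduction described in the record above, where each functional module is replaced by one summary value per sample, can be sketched as follows. The abstract does not say which summary measure was used, so the arithmetic mean here is an assumption, as are the function name and input layout. The enrichment step that selects modules is omitted.

```python
from statistics import mean


def functional_profile(expression, annotations, summary=mean):
    """Collapse per-gene expression into one value per functional module.

    expression:  dict mapping gene -> list of expression values, one per sample
    annotations: dict mapping module (e.g. a GO category) -> iterable of genes
    summary:     function reducing a list of values to one number (assumed: mean)
    Returns a dict mapping module -> list of summary values, one per sample.
    """
    n_samples = len(next(iter(expression.values())))
    profile = {}
    for module, genes in annotations.items():
        measured = [g for g in genes if g in expression]
        if not measured:
            continue  # no annotated gene was measured for this module
        profile[module] = [
            summary([expression[g][s] for g in measured])
            for s in range(n_samples)
        ]
    return profile
```

The resulting module-by-sample table has far fewer rows than the gene-by-sample table and could be passed to any classifier, such as the decision trees mentioned in the abstract.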

228

Analysis and functional annotation of expressed sequence tags from the fall armyworm Spodoptera frugiperda  

PubMed Central

Background Little is known about the genome sequences of lepidopteran insects, although this group of insects has been studied extensively in the fields of endocrinology, development, immunity, and pathogen-host interactions. In addition, cell lines derived from Spodoptera frugiperda and other lepidopteran insects are routinely used for baculovirus foreign gene expression. This study reports the results of an expressed sequence tag (EST) sequencing project in cells from the lepidopteran insect S. frugiperda, the fall armyworm. Results We have constructed an EST database using two cDNA libraries from the S. frugiperda-derived cell line, SF-21. The database consists of 2,367 ESTs which were assembled into 244 contigs and 951 singlets for a total of 1,195 unique sequences. Conclusion S. frugiperda is an agriculturally important pest insect and genomic information will be instrumental for establishing initial transcriptional profiling and gene function studies, and for obtaining information about genes manipulated during infections by insect pathogens such as baculoviruses. PMID:17052344

Deng, Youping; Dong, Yinghua; Thodima, Venkata; Clem, Rollie J; Passarelli, A Lorena

2006-01-01

229

Bioinformatic approaches for functional annotation and pathway inference in metagenomics data  

PubMed Central

Metagenomic approaches are increasingly recognized as a baseline for understanding the ecology and evolution of microbial ecosystems. The development of methods for pathway inference from metagenomics data is of paramount importance to link a phenotype to a cascade of events stemming from a series of connected sets of genes or proteins. Biochemical and regulatory pathways have until recently been thought of and modelled within one cell type, one organism, one species. This vision is being dramatically changed by the advent of whole microbiome sequencing studies, revealing the role of symbiotic microbial populations in fundamental biochemical functions. The new landscape we face requires a clear picture of the potentialities of existing tools and development of new tools to characterize, reconstruct and model biochemical and regulatory pathways as the result of integration of function in complex symbiotic interactions of ontologically and evolutionarily distinct cell types. PMID:23175748

De Filippo, Carlotta; Ramazzotti, Matteo; Fontana, Paolo; Cavalieri, Duccio

2012-01-01

230

Variation ontology: annotator guide  

PubMed Central

Background Systematic representation of information related to genetic and non-genetic variations is required to allow large scale studies, data mining and data integration, and to make it possible to reveal novel relationships between genotype and phenotype. Although a large amount of variation data is available, it is often difficult to use due to a lack of systematics. Results A novel ontology, Variation Ontology (VariO http://variationontology.org), was developed for annotation of effects, consequences and mechanisms of variations. In this article instructions are provided on how VariO annotations are made. The major levels for description are the three molecules, namely DNA, RNA and protein. They are further divided into four major sublevels: variation type, function, structure, and property, and further into up to eight sublevels. VariO annotation summarizes existing knowledge about a variation and its effects and formalizes it so that computational analyses are efficient. The annotations should be made on as many levels as possible. VariO annotations are made in reference to normal states, which vary for each data item including e.g. reference sequences, wild type properties, and activities. Conclusions Detailed instructions together with examples are provided to indicate how VariO can be used for annotation of variations and their effects. A dedicated tool has been developed for annotation and will be further developed to cover also evidence for the annotations. VariO is suitable for annotation of data in many types of databases. As several different kinds of databases are in a process of adapting VariO annotations it is important to have guidelines to guarantee consistent annotation. PMID:24533660

2014-01-01

231

Integration of Cluster Ensemble and EM based Text Mining for Microarray Gene Cluster Identification and Annotation  

E-print Network

In this paper, we design and develop a unified system, GE-Miner (Gene Expression Miner), to integrate cluster ensemble, text clustering and multi-document summarization and provide an environment for comprehensive gene expression data analysis. We present a novel cluster ensemble approach to generate high-quality gene clusters. In our text summarization module, given a gene cluster, our Expectation Maximization (EM) based algorithm can automatically identify subtopics and extract the most probable terms for each topic. Then, the extracted top k topical terms from each subtopic are combined to form the biological explanation of each gene cluster. Experimental results demonstrate that our system can obtain high-quality clusters and provide informative key terms for the gene clusters.

Xiaohua Hu

232

Annotation of sheep keratin intermediate filament genes and their patterns of expression.  

PubMed

Keratin IF (KRT) and keratin-associated protein genes encode the majority of wool and hair proteins. We have identified cDNA sequences representing nine novel sheep KRT genes, increasing the known active genes from eight to 17, a number comparable to that in the human. However, the absence of KRT37 in the type I family and the discovery of type II KRT87 in sheep exemplify species-specific compositional differences in hair KRT genes. Phylogenetic analysis of hair KRT genes within type I and type II families in the sheep, cattle and human genomes revealed a high degree of consistency in their sequence conservation and grouping. However, there were differences in the fibre compartmentalisation and keratinisation zones for the expression of six ovine KRT genes compared with their human orthologs. Transcripts of three genes (KRT40, KRT82 and KRT84) were only present in the fibre cuticle. KRT32, KRT35 and KRT85 were expressed in both the cuticle and the fibre cortex. The remaining 11 genes (KRT31, KRT33A, KRT33B, KRT34, KRT36, KRT38-39, KRT81, KRT83 and KRT86-87) were expressed only in the cortex. Species-specific differences in the expressed keratin gene sets, their relative expression levels and compartmentalisation are discussed in the context of their underlying roles in wool and hair developmental programmes and the distinctive characteristics of the fibres produced. PMID:21554405

Yu, Zhidong; Wildermoth, Janet E; Wallace, Olivia A M; Gordon, Steven W; Maqbool, Nauman J; Maclean, Paul H; Nixon, Allan J; Pearson, Allan J

2011-07-01

233

Automation of Drosophila gene expression pattern image annotation : development of web-based image annotation tool and application of machine learning methods  

E-print Network

Large-scale in situ hybridization screens are providing an abundance of spatio-temporal patterns of gene expression data that is valuable for understanding the mechanisms of gene regulation. Drosophila gene expression ...

Ayuso, Anna Maria E

2011-01-01

234

Predictive screening for regulators of conserved functional gene modules (gene batteries) in mammals  

PubMed Central

Background The expression of gene batteries, genomic units of functionally linked genes which are activated by similar sets of cis- and trans-acting regulators, has been proposed as a major determinant of cell specialization in metazoans. We developed a predictive procedure to screen the mouse and human genomes and transcriptomes for cases of gene-battery-like regulation. Results In a screen that covered ~40 per cent of all annotated protein-coding genes, we identified 21 co-expressed gene clusters with statistically supported sharing of cis-regulatory sequence elements. 66 predicted cases of over-represented transcription factor binding motifs were validated against the literature and fell into three categories: (i) previously described cases of gene battery-like regulation, (ii) previously unreported cases of gene battery-like regulation with some support in a limited number of genes, and (iii) predicted cases that currently lack experimental support. The novel predictions include for example Sox 17 and RFX transcription factor binding sites that were detected in ~10% of all testis specific genes, and HNF-1 and 4 binding sites that were detected in ~30% of all kidney specific genes respectively. The results are publicly available at . Conclusion 21 co-expressed gene clusters were enriched for a total of 66 shared cis-regulatory sequence elements. A majority of these predictions represent novel cases of potential co-regulation of functionally coupled proteins. Critical technical parameters were evaluated, and the results and the methods provide a valuable resource for future experimental design. PMID:15882449

Nelander, Sven; Larsson, Erik; Kristiansson, Erik; Mansson, Robert; Nerman, Olle; Sigvardsson, Mikael; Mostad, Petter; Lindahl, Per

2005-01-01

235

Annotated embryonic CNS expression patterns of 5000 GMR GAL4 lines: a resource for manipulating gene expression and analyzing cis-regulatory modules  

PubMed Central

Here we describe the embryonic CNS expression of 5,000 GAL4 lines made using molecularly defined cis-regulatory DNA inserted into a single attP genomic location. We document and annotate the patterns in early embryos when neurogenesis is at its peak, and in older embryos where there is maximal neuronal diversity and the first neural circuits are established. We note expression in other tissues such as the lateral body wall (muscle, sensory neurons, trachea) and viscera. Companion papers report on the adult brain and larval imaginal discs, and the integrated datasets are available online (www.janelia.org/flylight/gal4-gen1). This collection of embryonically-expressed GAL4 lines will be valuable for determining neuronal morphology and function; the 1862 lines expressed in small subsets of neurons (<20/segment) will be especially valuable for characterizing interneuronal diversity and function, as interneurons comprise the majority of all CNS neurons, yet their gene expression profile and function remain virtually unexplored. PMID:23063363

Manning, Laurina; Heckscher, Ellie S.; Purice, Maria D.; Roberts, Jourdain; Bennett, Alysha L.; Kroll, Jason R.; Pollard, Jill L.; Strader, Marie E.; Lupton, Josh R.; Dyukareva, Anna V.; Doan, Phuong Nam; Bauer, David M.; Wilbur, Allison N.; Tanner, Stephanie; Kelly, Jimmy J.; Lai, Sen-Lin; Tran, Khoa D.; Kohwi, Minoree; Laverty, Todd R.; Pearson, Joseph C.; Crews, Stephen T.; Rubin, Gerald M.; Doe, Chris Q.

2012-01-01

236

Accumulation, functional annotation, and comparative analysis of expressed sequence tags in eggplant (Solanum melongena L.), the third pole of the genus Solanum species after tomato and potato.  

PubMed

Eggplant (Solanum melongena L.) is a widely grown vegetable crop that belongs to the genus Solanum, which is comprised of more than 1000 species of wide genetic and phenotypic variation. Unlike tomato and potato, Solanum crops that belong to subgenus Potatoe and have been targets for comprehensive genomic studies, eggplant is endemic to the Old World and belongs to a different subgenus, Leptostemonum, and therefore, would be a unique member for comparative molecular biology in Solanum. In this study, more than 60,000 eggplant cDNA clones from various tissues and treatments were sequenced from both the 5'- and 3'-ends, and a unigene set consisting of 16,245 unique sequences was constructed. Functional annotations based on sequence similarity to known plant reference datasets revealed a distribution of functional categories almost similar to that of tomato, while 1316 unigenes were suggested to be eggplant-specific. Sequence-based comparative analysis using putative orthologous gene groups setup by reciprocal sequence comparison among six solanaceous species suggested that eggplant and its wild ally Solanum torvum were clustered separately from subgenus Potatoe species, and then, all Solanum species were clustered separately from the genus Capsicum. Microsatellite motif distribution was different among species and likely to be coincident with the phylogenetic relationships. Furthermore, the eggplant unigene dataset exhibited its utility in transcriptome analysis by the SAGE strategy where a considerable number of short tag sequences of interest were successfully assigned to unigenes and their functional annotations. The eggplant ESTs and 16k unigene set developed in this study would be a useful resource not only for molecular genetics and breeding in eggplant itself, but for expanding the scope of comparative biology in Solanum species. PMID:19857557

Fukuoka, Hiroyuki; Yamaguchi, Hirotaka; Nunome, Tsukasa; Negoro, Satomi; Miyatake, Koji; Ohyama, Akio

2010-01-15

237

De Novo Whole-Genome Sequence and Genome Annotation of Lichtheimia ramosa  

PubMed Central

We report the annotated draft genome sequence of Lichtheimia ramosa (JMRC FSU:6197). It has been reported to be a causative organism of mucormycosis, a rare but rapidly progressive infection in immunocompromised humans. The functionally annotated genomic sequence consists of 74 scaffolds with a total number of 11,510 genes. PMID:25212617

Linde, Jorg; Schwartze, Volker; Binder, Ulrike; Lass-Florl, Cornelia

2014-01-01

238

DAVID: Database for Annotation, Visualization, and Integrated Discovery  

Microsoft Academic Search

BACKGROUND: Functional annotation of differentially expressed genes is a necessary and critical step in the analysis of microarray data. The distributed nature of biological knowledge frequently requires researchers to navigate through numerous web-accessible databases gathering information one gene at a time. A more judicious approach is to provide query-based access to an integrated database that disseminates biologically rich information across

Glynn Dennis Jr; Brad T Sherman; Douglas A Hosack; Jun Yang; Wei Gao; Richard A Lempicki

2003-01-01

239

Function of the DISC1 Gene  

NSDL National Science Digital Library

As a result of the human genome project, we now know largely where our genes are, and what structure they have. The search to uncover each gene's function, on the other hand, is only in its infancy. Functional genomics is an area of research dedicated to studying what protein is produced by a gene, and what happens in the body when it is activated. Understanding gene function is the next major hurdle in genomic research, which holds the key to developing revolutionary therapeutics.

2009-04-14

240

De Novo Assembly, Gene Annotation and Marker Development Using Illumina Paired-End Transcriptome Sequences in Celery (Apium graveolens L.)  

PubMed Central

Background Celery is an increasingly popular vegetable species, but limited transcriptome and genomic data hinder research on it. In addition, a lack of celery molecular markers limits the process of molecular genetic breeding. High-throughput transcriptome sequencing is an efficient method to generate a large transcriptome sequence dataset for gene discovery, molecular marker development and marker-assisted selection breeding. Principal Findings Celery transcriptomes from four tissues were sequenced using Illumina paired-end sequencing technology. De novo assembly was performed to generate a collection of 42,280 unigenes (average length of 502.6 bp) that represent the first transcriptome of the species. 78.43% and 48.93% of the unigenes had significant similarity with proteins in the National Center for Biotechnology Information (NCBI) non-redundant protein database (Nr) and Swiss-Prot database respectively, and 10,473 (24.77%) unigenes were assigned to Clusters of Orthologous Groups (COG). 21,126 (49.97%) unigenes harboring InterPro domains were annotated, of which 15,409 (36.45%) were assigned to Gene Ontology (GO) categories. Additionally, 7,478 unigenes were mapped onto 228 pathways using the Kyoto Encyclopedia of Genes and Genomes Pathway database (KEGG). Large numbers of simple sequence repeats (SSRs) were identified, and then the rates of successful amplification and polymorphism were investigated among 31 celery accessions. Conclusions This study demonstrates the feasibility of generating a large scale of sequence information by Illumina paired-end sequencing and efficient assembly. Our results provide a valuable resource for celery research. The developed molecular markers are the foundation of further genetic linkage analysis and gene localization, and they will be essential to accelerate the process of breeding. PMID:23469050

Fu, Nan; Wang, Qian; Shen, Huo-Lin

2013-01-01

241

Novel semantic similarity measure improves an integrative approach to predicting gene functional associations  

PubMed Central

Background Elucidation of the direct/indirect protein interactions and gene associations is required to fully understand the workings of the cell. This can be achieved through the use of both low- and high-throughput biological experiments and in silico methods. We present GAP (Gene functional Association Predictor), an integrative method for predicting and characterizing gene functional associations. GAP integrates different biological features using a novel taxonomy-based semantic similarity measure in predicting and prioritizing high-quality putative gene associations. The proposed similarity measure increases information gain from the available gene annotations. The annotation information is incorporated from several public pathway databases, Gene Ontology annotations as well as drug and disease associations from the scientific literature. Results We evaluated GAP by comparing its prediction performance with several other well-known functional interaction prediction tools over a comprehensive dataset of known direct and indirect interactions, and observed significantly better prediction performance. We also selected a small set of GAP’s highly-scored novel predicted pairs (i.e., currently not found in any known database or dataset), and by manually searching the literature for experimental evidence accessible in the public domain, we confirmed different categories of predicted functional associations with available evidence of interaction. We also provided extra supporting evidence for a subset of the predicted functionally-associated pairs using an expert curated database of genes associated to autism spectrum disorders. Conclusions GAP’s predicted “functional interactome” contains ~1M highly-scored predicted functional associations out of which about 90% are novel (i.e., not experimentally validated). GAP’s novel predictions connect disconnected components and singletons to the main connected component of the known interactome. It can, therefore, be a valuable resource for biologists by providing corroborating evidence for and facilitating the prioritization of potential direct or indirect interactions for experimental validation. GAP is freely accessible through a web portal: http://ophid.utoronto.ca/gap. PMID:23497449

2013-01-01

242

In Situ Proteomic Analysis of Human Breast Cancer Epithelial Cells Using Laser Capture Microdissection: Annotation by Protein Set Enrichment Analysis and Gene Ontology*  

PubMed Central

Identification of molecular signatures that allow detection of the transition from normal breast epithelial cells to malignant invasive cells is a critical component in the development of diagnostic, therapeutic, and preventative strategies for human breast cancer. Substantial efforts have been devoted to deciphering breast cancer etiology at the genome level, but only a limited number of studies have appeared at the proteome level. In this work, we compared individual in situ proteome profiles of nine nonpatient-matched, noncancerous, normal breast epithelial (NBE) samples with nine estrogen receptor (ER)-positive (luminal subtype), invasive malignant breast epithelial (MBE) samples by combining laser capture microdissection (LCM) and quantitative shotgun proteomics. A total of 12,970 unique peptides were identified from the 18 samples, and 1623 proteins were selected for quantitative analysis using spectral index (SpI) as a measure of protein abundance. A total of 298 proteins were differentially expressed between NBE and MBE at the 95% confidence level, and this differential expression correlated well with immunohistochemistry (IHC) results reported in the Human Protein Atlas (HPA) database. To assess pathway level patterns in the observed expression changes, we developed protein set enrichment analysis (PSEA), a modification of a well-known approach in gene expression analysis, Gene Set Enrichment Analysis (GSEA). Unlike single gene-based functional term enrichment analyses that only examine pathway overrepresentation of proteins above a given significance threshold, PSEA applies a weighted running sum statistic to the entire expression data to discover significantly enriched protein groups. Application of PSEA to the expression data in this study revealed not only well-known ER-dependent and cellular morphology-dependent protein abundance changes, but also significant alterations of downstream targets for multiple transcription factors (TFs), suggesting a role for specific gene regulatory pathways in breast tumorigenesis. A parallel GOMiner analysis revealed both confirmatory and complementary data to PSEA. The combination of the two annotation approaches yielded extensive biological feature mapping for in-depth analysis of the quantitative proteomic data. PMID:20739354

Cha, Sangwon; Imielinski, Marcin B.; Rejtar, Tomas; Richardson, Elizabeth A.; Thakur, Dipak; Sgroi, Dennis C.; Karger, Barry L.

2010-01-01
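
The weighted running sum statistic mentioned in the record above can be illustrated with the general GSEA-style enrichment score. This is a minimal sketch assuming proteins are already ranked by a signed score. The exact weighting used by PSEA is not given in the abstract, so the `weight` parameter and the function name are assumptions, and the permutation step that would assign significance is left out.

```python
def enrichment_score(ranked, protein_set, weight=1.0):
    """GSEA-style running-sum enrichment score.

    ranked:      list of (protein, score) tuples sorted by descending score
    protein_set: proteins belonging to the pathway or group of interest
    weight:      exponent applied to |score| for set members (assumed default 1.0)
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    members = set(protein_set)
    hit_total = sum(abs(s) ** weight for p, s in ranked if p in members)
    n_miss = sum(1 for p, _ in ranked if p not in members)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("need set members with nonzero score and non-members")

    running = 0.0
    best = 0.0
    for protein, score in ranked:
        if protein in members:
            # Step up in proportion to the member's weighted score.
            running += abs(score) ** weight / hit_total
        else:
            # Step down by a constant amount for each non-member.
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best
```

A score near +1 indicates that the set members cluster at the top of the ranking, and a score near -1 indicates that they cluster at the bottom.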

243

Interferome v2.0: an updated database of annotated interferon-regulated genes.  

PubMed

Interferome v2.0 (http://interferome.its.monash.edu.au/interferome/) is an update of an earlier version of the Interferome DB published in the 2009 NAR database edition. Vastly improved computational infrastructure now enables more complex and faster queries, and supports more data sets from types I, II and III interferon (IFN)-treated cells, mice or humans. Quantitative, MIAME compliant data are collected, subjected to thorough, standardized, quantitative and statistical analyses and then significant changes in gene expression are uploaded. Comprehensive manual collection of metadata in v2.0 allows flexible, detailed search capacity including the parameters: range of -fold change, IFN type, concentration and time, and cell/tissue type. There is no limit to the number of genes that can be used to search the database in a single query. Secondary analysis such as gene ontology, regulatory factors, chromosomal location or tissue expression plots of IFN-regulated genes (IRGs) can be performed in Interferome v2.0, or data can be downloaded in convenient text formats compatible with common secondary analysis programs. Given the importance of IFN to innate immune responses in infectious, inflammatory diseases and cancer, this upgrade of the Interferome to version 2.0 will facilitate the identification of gene signatures of importance in the pathogenesis of these diseases. PMID:23203888

Rusinova, Irina; Forster, Sam; Yu, Simon; Kannan, Anitha; Masse, Marion; Cumming, Helen; Chapman, Ross; Hertzog, Paul J

2013-01-01

244

Automatic Annotation Techniques for Gene Expression Images of the Fruit Fly Embryo  

E-print Network

Institute, ASU, provided support for this research. background removal, edge fitting and resizing oper melanogaster), a model organism to study gene interaction. The aim is to determine the view (lateral versus of what is meant by "view", "orientation", and "stage" of development for the images in which the fruit

Kumar, Sudhir

245

Interestingness measures and strategies for mining multi-ontology multi-level association rules from gene ontology annotations for the discovery of new GO relationships.  

PubMed

The Gene Ontology (GO), a set of three sub-ontologies, is one of the most popular bio-ontologies used for describing gene product characteristics. GO annotation data containing terms from multiple sub-ontologies and at different levels in the ontologies is an important source of implicit relationships between terms from the three sub-ontologies. Data mining techniques such as association rule mining that are tailored to mine from multiple ontologies at multiple levels of abstraction are required for effective knowledge discovery from GO annotation data. We present a data mining approach, Multi-ontology data mining at All Levels (MOAL) that uses the structure and relationships of the GO to mine multi-ontology multi-level association rules. We introduce two interestingness measures: Multi-ontology Support (MOSupport) and Multi-ontology Confidence (MOConfidence) customized to evaluate multi-ontology multi-level association rules. We also describe a variety of post-processing strategies for pruning uninteresting rules. We use publicly available GO annotation data to demonstrate our methods with respect to two applications (1) the discovery of co-annotation suggestions and (2) the discovery of new cross-ontology relationships. PMID:23850840

Manda, Prashanti; McCarthy, Fiona; Bridges, Susan M

2013-10-01
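
The abstract above does not define MOSupport or MOConfidence, so the sketch below shows only the standard support and confidence measures they customize, computed after each gene's annotations are expanded with ancestor terms so that rules can be found at several levels. The term names, the `parents` mapping, and the function names are hypothetical.

```python
def with_ancestors(terms, parents):
    """Expand a set of terms with all of their ancestors.

    parents maps a term to the set of its direct parent terms (assumed input).
    """
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed


def support(transactions, items):
    """Fraction of transactions (annotation sets) containing every item."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Standard confidence of the rule antecedent -> consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base
```

Here each transaction is the ancestor-expanded annotation set of one gene product, so a rule may pair a molecular function term with a biological process term at any depth.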

246

Developmental Stage Annotation of Drosophila Gene Expression Pattern Images via an Entire Solution Path for LDA.  

PubMed

Gene expression in a developing embryo occurs in particular cells (spatial patterns) in a time-specific manner (temporal patterns), which leads to the differentiation of cell fates. Images of a Drosophila melanogaster embryo at a given developmental stage, showing a particular gene expression pattern revealed by a gene-specific probe, can be compared for spatial overlaps. The comparison is fundamentally important to formulating and testing gene interaction hypotheses. Expression pattern comparison is most biologically meaningful when images from a similar time point (developmental stage) are compared. In this paper, we present LdaPath, a novel formulation of Linear Discriminant Analysis (LDA) for automatic developmental stage range classification. It employs multivariate linear regression with the L(1)-norm penalty controlled by a regularization parameter for feature extraction and visualization. LdaPath computes an entire solution path for all values of regularization parameter with essentially the same computational cost as fitting one LDA model. Thus, it facilitates efficient model selection. It is based on the equivalence relationship between LDA and the least squares method for multi-class classifications. This equivalence relationship is established under a mild condition, which we show empirically to hold for many high-dimensional datasets, such as expression pattern images. Our experiments on a collection of 2705 expression pattern images show the effectiveness of the proposed algorithm. Results also show that the LDA model resulting from LdaPath is sparse, and irrelevant features may be removed. Thus, LdaPath provides a general framework for simultaneous feature selection and feature extraction. PMID:18769656

Ye, Jieping; Chen, Jianhui; Janardan, Ravi; Kumar, Sudhir

2008-03-01

247

Progress and challenges in the computational prediction of gene function using networks  

PubMed Central

In this opinion piece, we attempt to unify recent arguments we have made that serious confounds affect the use of network data to predict and characterize gene function. The development of computational approaches to determine gene function is a major strand of computational genomics research. However, progress beyond using BLAST to transfer annotations has been surprisingly slow. We have previously argued that a large part of the reported success in using "guilt by association" in network data is due to the tendency of methods to simply assign new functions to already well-annotated genes. While such predictions will tend to be correct, they are generic; it is true, but not very helpful, that a gene with many functions is more likely to have any function. We have also presented evidence that much of the remaining performance in cross-validation cannot be usefully generalized to new predictions, making progressive improvement in analysis difficult to engineer. Here we summarize our findings about how these problems will affect network analysis, discuss some ongoing responses within the field to these issues, and consolidate some recommendations and speculation, which we hope will modestly increase the reliability and specificity of gene function prediction. PMID:23936626

Pavlidis, Paul; Gillis, Jesse

2012-01-01
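
The multifunctionality confound described in this record can be illustrated with a small sketch. It is not the authors' code: the gene names, annotation sets, and held-out positives are invented for illustration. The sketch ranks genes purely by how many functions they already carry, ignoring any network data, and scores that ranking against a held-out function with the area under the ROC curve.

```python
def multifunctionality_ranking(annotations):
    """Rank genes by number of known functions, most annotated first."""
    counts = {gene: len(funcs) for gene, funcs in annotations.items()}
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(counts, key=lambda g: (-counts[g], g))


def auroc(ranking, positives):
    """AUROC of a strict ranking (best first) for a set of positive genes."""
    flags = [gene in positives for gene in ranking]
    n_pos = sum(flags)
    n_neg = len(flags) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative gene")
    correct_pairs = 0
    negatives_below = n_neg  # negatives ranked below the current position
    for is_positive in flags:
        if is_positive:
            correct_pairs += negatives_below
        else:
            negatives_below -= 1
    return correct_pairs / (n_pos * n_neg)


# Hypothetical annotations with the held-out function already removed.
TOY_ANNOTATIONS = {
    "geneA": {"f1", "f2", "f3", "f4"},
    "geneB": {"f1", "f2", "f3"},
    "geneC": {"f2"},
    "geneD": {"f1"},
    "geneE": set(),
}
HELD_OUT_POSITIVES = {"geneA", "geneC"}  # hypothetical true members

ranking = multifunctionality_ranking(TOY_ANNOTATIONS)
baseline_auroc = auroc(ranking, HELD_OUT_POSITIVES)
```

On this toy data the annotation-count baseline already scores well above 0.5 without using any gene-specific evidence, which is the kind of generic prediction the abstract cautions against.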

248

Automatic annotation of organellar genomes with DOGMA  

SciTech Connect

Dual Organellar GenoMe Annotator (DOGMA) automates the annotation of extra-nuclear organellar (chloroplast and animal mitochondrial) genomes. It is a web-based package that allows the use of comparative BLAST searches to identify and annotate genes in a genome. DOGMA presents a list of putative genes to the user in a graphical format for viewing and editing. Annotations are stored on our password-protected server. Complete annotations can be extracted for direct submission to GenBank. Furthermore, intergenic regions of specified length can be extracted, as well as the nucleotide sequences and amino acid sequences of the genes.

Wyman, Stacia; Jansen, Robert K.; Boore, Jeffrey L.

2004-06-01

249

FunGene: the functional gene pipeline and repository  

PubMed Central

Ribosomal RNA genes have become the standard molecular markers for microbial community analysis for good reasons, including universal occurrence in cellular organisms, availability of large databases, and ease of rRNA gene region amplification and analysis. As markers, however, rRNA genes have some significant limitations. The rRNA genes are often present in multiple copies, unlike most protein-coding genes. The slow rate of change in rRNA genes means that multiple species sometimes share identical 16S rRNA gene sequences, while many more species share identical sequences in the short 16S rRNA regions commonly analyzed. In addition, the genes involved in many important processes are not distributed in a phylogenetically coherent manner, potentially due to gene loss or horizontal gene transfer. While rRNA genes remain the most commonly used markers, key genes in ecologically important pathways, e.g., those involved in carbon and nitrogen cycling, can provide important insights into community composition and function not obtainable through rRNA analysis. However, working with ecofunctional gene data requires some tools beyond those required for rRNA analysis. To address this, our Functional Gene Pipeline and Repository (FunGene; http://fungene.cme.msu.edu/) offers databases of many common ecofunctional genes and proteins, as well as integrated tools that allow researchers to browse these collections and choose subsets for further analysis, build phylogenetic trees, test primers and probes for coverage, and download aligned sequences. Additional FunGene tools are specialized to process coding gene amplicon data. For example, FrameBot produces frameshift-corrected protein and DNA sequences from raw reads while finding the most closely related protein reference sequence. These tools can help provide better insight into microbial communities by directly studying key genes involved in important ecological processes. PMID:24101916

Fish, Jordan A.; Chai, Benli; Wang, Qiong; Sun, Yanni; Brown, C. Titus; Tiedje, James M.; Cole, James R.

2013-01-01

250

A new set of ESTs and cDNA clones from full-length and normalized libraries for gene discovery and functional characterization in citrus  

Microsoft Academic Search

BACKGROUND: Interpretation of ever-increasing raw sequence information generated by modern genome sequencing technologies faces multiple challenges, such as gene function analysis and genome annotation. Indeed, nearly 40% of genes in plants encode proteins of unknown function. Functional characterization of these genes is one of the main challenges in modern biology. In this regard, the availability of full-length cDNA clones may

M. Carmen Marques; Hugo Alonso-Cantabrana; Javier Forment; Raquel Arribas; Santiago Alamar; Vicente Conejero; Miguel A Perez-Amador

2009-01-01

251

Syntactic Annotation of a German Newspaper Corpus  

Microsoft Academic Search

We report on the syntactic annotation of a German newspaper corpus. The annotations consist of context-free structures, additionally allowing crossing branches, with labeled nodes (phrases) and edges (grammatical functions). Furthermore, we present a new, interactive semi-automatic annotation process that allows efficient and reliable annotations. The annotation process is sped up by incrementally presenting structures and by automatically highlighting unreliable assignments.

Thorsten Brants; Wojciech Skut; Hans Uszkoreit

252

Genome annotation assessment in Drosophila melanogaster.  

PubMed

Computational methods for automated genome annotation are critical to our community's ability to make full use of the large volume of genomic sequence being generated and released. To explore the accuracy of these automated feature prediction tools in the genomes of higher organisms, we evaluated their performance on a large, well-characterized sequence contig from the Adh region of Drosophila melanogaster. This experiment, known as the Genome Annotation Assessment Project (GASP), was launched in May 1999. Twelve groups, applying state-of-the-art tools, contributed predictions for features including gene structure, protein homologies, promoter sites, and repeat elements. We evaluated these predictions using two standards, one based on previously unreleased high-quality full-length cDNA sequences and a second based on the set of annotations generated as part of an in-depth study of the region by a group of Drosophila experts. Although these standard sets only approximate the unknown distribution of features in this region, we believe that when taken in context the results of an evaluation based on them are meaningful. The results were presented as a tutorial at the conference on Intelligent Systems in Molecular Biology (ISMB-99) in August 1999. Over 95% of the coding nucleotides in the region were correctly identified by the majority of the gene finders, and the correct intron/exon structures were predicted for >40% of the genes. Homology-based annotation techniques recognized and associated functions with almost half of the genes in the region; the remainder were only identified by the ab initio techniques. This experiment also presents the first assessment of promoter prediction techniques for a significant number of genes in a large contiguous region. We discovered that the promoter predictors' high false-positive rates make their predictions difficult to use. Integrating gene finding and cDNA/EST alignments with promoter predictions decreases the number of false-positive classifications but discovers less than one-third of the promoters in the region. We believe that by establishing standards for evaluating genomic annotations and by assessing the performance of existing automated genome annotation tools, this experiment establishes a baseline that contributes to the value of ongoing large-scale annotation projects and should guide further research in genome informatics. PMID:10779488

Reese, M G; Hartzell, G; Harris, N L; Ohler, U; Abril, J F; Lewis, S E

2000-04-01

253

Developmental Gene Discovery in a Hemimetabolous Insect: De Novo Assembly and Annotation of a  

E-print Network

insects), representing the basal branches of the insect tree, have very few genomic resources. We have bimaculatus (cricket), a well-developed laboratory model organism whose potential for functional genetic and analysis, decision to publish, or preparation of the manuscript. Competing Interests: The authors have

Extavour, Cassandra

254

The functional diversity of essential genes required for mammalian cardiac development  

PubMed Central

Genes required for an organism to develop to maturity (for which no other gene can compensate) are considered essential. The continuing functional annotation of the mouse genome has enabled the identification of many essential genes required for specific developmental processes including cardiac development. Patterns are now emerging regarding the functional nature of genes required at specific points throughout gestation. Essential genes required for development beyond cardiac progenitor cell migration and induction include a small and functionally homogenous group encoding transcription factors, ligands and receptors. Actions of core cardiogenic transcription factors from the Gata, Nkx, Mef, Hand, and Tbx families trigger a marked expansion in the functional diversity of essential genes from midgestation onwards. As the embryo grows in size and complexity, genes required to maintain a functional heartbeat and to provide muscular strength and regulate blood flow are well represented. These essential genes regulate further specialization and polarization of cell types along with proliferative, migratory, adhesive, contractile, and structural processes. The identification of patterns regarding the functional nature of essential genes across numerous developmental systems may aid prediction of further essential genes and those important to development and/or progression of disease. genesis 52:713–737, 2014. PMID:24866031

Clowes, Christopher; Boylan, Michael GS; Ridge, Liam A; Barnes, Emma; Wright, Jayne A; Hentges, Kathryn E

2014-01-01

255

Sma3s: A Three-Step Modular Annotator for Large Sequence Datasets  

PubMed Central

Automatic sequence annotation is an essential component of modern ‘omics’ studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ~85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes. PMID:24501397

Munoz-Merida, Antonio; Viguera, Enrique; Claros, M. Gonzalo; Trelles, Oswaldo; Perez-Pulido, Antonio J.

2014-01-01

256

Discovery of Tumor Suppressor Gene Function.  

ERIC Educational Resources Information Center

This is an update of a 1991 review on tumor suppressor genes written at a time when understanding of how the genes work was limited. A recent major breakthrough in the understanding of the function of tumor suppressor genes is discussed. (LZ)

Oppenheimer, Steven B.

1995-01-01

257

Functional Genes and Proteins of Clonorchis sinensis  

PubMed Central

During the past several decades, research on parasite genetics has progressed from biochemical and serodiagnostic studies to protein chemistry, molecular biology, and functional gene studies. Nowadays, bioinformatics, genomics, and proteomics approaches are being applied by Korean parasitology researchers. As for Clonorchis sinensis, investigations have been carried out to identify its functional genes using forward and reverse genetic approaches and to characterize the biochemical and biological properties of its gene products. The authors review the proteins of cloned genes, which include antigenic proteins, physiologic and metabolic enzymes, and the gene expression profile of Clonorchis sinensis. PMID:19885336

Kim, Tae Im; Na, Byoung-Kuk

2009-01-01

258

Functional genomics annotation of a statistical epistasis network associated with bladder cancer susceptibility  

PubMed Central

Background Several different genetic and environmental factors have been identified as independent risk factors for bladder cancer in population-based studies. Recent studies have turned to understanding the role of gene-gene and gene-environment interactions in determining risk. We previously developed the bioinformatics framework of statistical epistasis networks (SEN) to characterize the global structure of interacting genetic factors associated with a particular disease or clinical outcome. By applying SEN to a population-based study of bladder cancer among Caucasians in New Hampshire, we were able to identify a set of connected genetic factors with strong and significant interaction effects on bladder cancer susceptibility. Findings To support our statistical findings using networks, in the present study, we performed pathway enrichment analyses on the set of genes identified using SEN, and found that they are associated with the carcinogen benzo[a]pyrene, a component of tobacco smoke. We further carried out an mRNA expression microarray experiment to validate statistical genetic interactions, and to determine if the set of genes identified in the SEN were differentially expressed in a normal bladder cell line and a bladder cancer cell line in the presence or absence of benzo[a]pyrene. Significant nonrandom sets of genes from the SEN were found to be differentially expressed in response to benzo[a]pyrene in both the normal bladder cells and the bladder cancer cells. In addition, the patterns of gene expression were significantly different between these two cell types. Conclusions The enrichment analyses and the gene expression microarray results support the idea that SEN analysis of bladder in population-based studies is able to identify biologically meaningful statistical patterns. These results bring us a step closer to a systems genetic approach to understanding cancer susceptibility that integrates population and laboratory-based studies. PMID:24725556

2014-01-01

259

Surrogate Splicing for Functional Analysis of Sesquiterpene Synthase Genes  

PubMed Central

A method for the recovery of full-length cDNAs from predicted terpene synthase genes containing introns is described. The approach utilizes Agrobacterium-mediated transient expression coupled with a reverse transcription-polydeoxyribonucleotide chain reaction assay to facilitate expression cloning of processed transcripts. Subsequent expression of intronless cDNAs in a suitable prokaryotic host provides for direct functional testing of the encoded gene product. The method was optimized by examining the expression of an intron-containing β-glucuronidase gene agroinfiltrated into petunia (Petunia hybrida) leaves, and its utility was demonstrated by defining the function of two previously uncharacterized terpene synthases. A tobacco (Nicotiana tabacum) terpene synthase-like gene containing six predicted introns was characterized as having 5-epi-aristolochene synthase activity, while an Arabidopsis (Arabidopsis thaliana) gene previously annotated as a terpene synthase was shown to possess a novel sesquiterpene synthase activity for α-barbatene, thujopsene, and β-chamigrene biosynthesis. PMID:15965019

Wu, Shuiqin; Schoenbeck, Mark A.; Greenhagen, Bryan T.; Takahashi, Shunji; Lee, Sungbeom; Coates, Robert M.; Chappell, Joseph

2005-01-01

260

Systematic Learning of Gene Functional Classes From DNA Array Expression Data by Using Multilayer Perceptrons  

PubMed Central

Recent advances in microarray technology have opened new ways for functional annotation of previously uncharacterised genes on a genomic scale. This has been demonstrated by unsupervised clustering of co-expressed genes and, more importantly, by supervised learning algorithms. Using prior knowledge, these algorithms can assign functional annotations based on more complex expression signatures found in existing functional classes. Previously, support vector machines (SVMs) and other machine-learning methods have been applied to a limited number of functional classes for this purpose. Here we present, for the first time, the comprehensive application of supervised neural networks (SNNs) for functional annotation. Our study is novel in that we report systematic results for ~100 classes in the Munich Information Center for Protein Sequences (MIPS) functional catalog. We found that only ~10% of these are learnable (based on the rate of false negatives). A closer analysis reveals that false positives (and negatives) in a machine-learning context are not necessarily “false” in a biological sense. We show that the high degree of interconnections among functional classes confounds the signatures that ought to be learned for a unique class. We term this the “Borges effect” and introduce two new numerical indices for its quantification. Our analysis indicates that classification systems with a lower Borges effect are better suited for machine learning. Furthermore, we introduce a learning procedure for combining false positives with the original class. We show that in a few iterations this process converges to a gene set that is learnable with considerably low rates of false positives and negatives and contains genes that are biologically related to the original class, allowing for a coarse reconstruction of the interactions between associated biological pathways. We exemplify this methodology using the well-studied tricarboxylic acid cycle. PMID:12421757

Mateos, Alvaro; Dopazo, Joaquin; Jansen, Ronald; Tu, Yuhai; Gerstein, Mark; Stolovitzky, Gustavo

2002-01-01

261

NetAffx: Affymetrix probesets and annotations  

Microsoft Academic Search

NetAffx (http://www.affymetrix.com) details and annotates probesets on Affymetrix GeneChip microarrays. These annotations include (i) static information specific to the probeset composition; (ii) sequence annotations extracted from public databases; and (iii) protein sequence-level annotations derived from public domain programs, as well as libraries of hidden Markov models (HMMs) developed at Affymetrix. For each probeset, NetAffx lists the

Guoying Liu; Ann E. Loraine; Ron Shigeta; Melissa S. Cline; Jill Cheng; Venu Valmeekam; Shaw Sun; David Kulp; Michael A. Siani-rose

2003-01-01

262

The development of PIPA: an integrated and automated pipeline for genome-wide protein function annotation  

Microsoft Academic Search

BACKGROUND: Automated protein function prediction methods are needed to keep pace with high-throughput sequencing. With the existence of many programs and databases for inferring different protein functions, a pipeline that properly integrates these resources will benefit from the advantages of each method. However, integrated systems usually do not provide mechanisms to generate customized databases to predict particular protein functions. Here,

Chenggang Yu; Nela Zavaljevski; Valmik Desai; Seth Johnson; Fred J. Stevens; Jaques Reifman

2008-01-01

263

Disease candidate gene identification and prioritization using protein interaction networks  

Microsoft Academic Search

BACKGROUND: Although most of the current disease candidate gene identification and prioritization methods depend on functional annotations, the coverage of the gene functional annotations is a limiting factor. In the current study, we describe a candidate gene prioritization method that is entirely based on protein-protein interaction network (PPIN) analyses. RESULTS: For the first time, extended versions of the PageRank and

Jing Chen; Bruce J. Aronow; Anil G. Jegga

2009-01-01
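
As a rough illustration of network-based candidate prioritization, the sketch below runs a seed-biased (personalized) PageRank on a small undirected interaction network. It is a generic textbook variant, not the extended algorithms from this record, whose details are cut off in the abstract above. The network, the seed genes, and the function name `personalized_pagerank` are all hypothetical.

```python
def personalized_pagerank(edges, seeds, damping=0.85, iterations=100):
    """Score nodes of an undirected graph by random walk with restart at seeds."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    nodes = sorted(neighbors)
    seeds_in_graph = [s for s in seeds if s in neighbors]
    if not seeds_in_graph:
        raise ValueError("no seed gene is present in the network")
    # Restart distribution: uniform over the known disease (seed) genes.
    restart = {n: 0.0 for n in nodes}
    for s in seeds_in_graph:
        restart[s] = 1.0 / len(seeds_in_graph)
    score = dict(restart)
    for _ in range(iterations):
        updated = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            # Each node splits its damped score evenly among its neighbors.
            share = damping * score[n] / len(neighbors[n])
            for m in neighbors[n]:
                updated[m] += share
        score = updated
    return score


# Hypothetical network: two seed genes (S1, S2) and three candidates.
TOY_EDGES = [("S1", "C1"), ("S2", "C1"), ("C1", "C2"), ("C2", "C3")]
TOY_SEEDS = {"S1", "S2"}

scores = personalized_pagerank(TOY_EDGES, TOY_SEEDS)
candidates = sorted((g for g in scores if g not in TOY_SEEDS),
                    key=lambda g: -scores[g])
```

Candidates closer to the seed genes in the network receive higher scores, so `candidates` lists them in order of decreasing priority. No functional annotation is consulted at any point.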

264

Annotated Bibliography  

NSDL National Science Digital Library

Annotations are short and cannot give detailed information, but they should cover these points: 1. The general contents of the work. What does it discuss and how detailed is it? This is the main portion of the annotation. 2. The author's qualifications. Is the writer a trained scholar? A journalist? Someone relating a personal experience? 3. An evaluation of the reliability. Is the information given reliable? Are facts or opinions stressed? 4. The intended audience. Is it for a general reader or a specialist? How much, if any, background knowledge is needed to understand it? Was it easy or difficult to read?

Davis, Leslie

265

H-InvDB in 2013: an omics study platform for human functional gene and transcript discovery  

PubMed Central

H-InvDB (http://www.h-invitational.jp/) is a comprehensive human gene database started in 2004. In the latest version, H-InvDB 8.0, a total of 244 709 human complementary DNA was mapped onto the hg19 reference genome and 43 829 gene loci, including nonprotein-coding ones, were identified. Of these loci, 35 631 were identified as potential protein-coding genes, and 22 898 of these were identical to known genes. In our analysis, 19 309 annotated genes were specific to H-InvDB and not found in RefSeq and Ensembl. In fact, 233 genes of the 19 309 turned out to have protein functions in this version of H-InvDB; they were annotated as unknown protein functions in the previous version. Furthermore, 11 genes were identified as known Mendelian disorder genes. It is advantageous that many biologically functional genes are hidden in the H-InvDB unique genes. As large-scale proteomic projects have been conducted to elucidate the functions of all human proteins, we have enhanced the proteomic information with an advanced protein view and new subdatabase of protein complexes (Protein Complex Database with quality index). We propose that H-InvDB is an important resource for finding novel candidate targets for medical care and drug development. PMID:23197657

Takeda, Jun-ichi; Yamasaki, Chisato; Murakami, Katsuhiko; Nagai, Yoko; Sera, Miho; Hara, Yuichiro; Obi, Nobuo; Habara, Takuya; Gojobori, Takashi; Imanishi, Tadashi

2013-01-01

266

Antagonistic functional duality of cancer genes.  

PubMed

Cancer evolution is a stochastic process both at the genome and gene levels. Most tumors contain multiple genetic subclones, evolving either in succession or in parallel, in either a linear or branching manner, with heterogeneous genome and gene alterations, extensively rewired signaling networks, and addiction to multiple oncogenes that easily switch with each other during cancer progression and medical intervention. Hundreds of discovered cancer genes are classified according to whether they function in a dominant (oncogenes) or recessive (tumor suppressor genes) manner in a cancer cell. However, there are many cancer "gene-chameleons", which behave in opposite ways in different experimental settings, showing antagonistic duality. In contrast to the widely accepted view that mutant NADP(+)-dependent isocitrate dehydrogenases 1/2 (IDH1/2) and associated metabolite 2-hydroxyglutarate (R)-enantiomer are intrinsically "the drivers" of tumourigenesis, mutant IDH1/2 inhibited, promoted or had no effect on cell proliferation, growth and tumorigenicity in diverse experiments. Similar behavior was evidenced for dozens of cancer genes. Gene function depends on the genetic network, which is defined by the genome context. The overall changes in karyotype can result in alterations of the role and function of the same genes and pathways. Diverse cell lines and tumor samples have been used in experiments for proving gene tumor promoting/suppressive activity. They all display heterogeneous individual karyotypes and disturbed signaling networks. Consequently, the effect and function of the gene under investigation can be opposite and versatile in cells with different genomes, which may explain the antagonistic duality of cancer genes and the cell type- or cellular genetic context-dependent response to the same protein. Antagonistic duality of cancer genes might contribute to failure of chemotherapy. Instructive examples of unexpected activity of cancer genes and "paradoxical" effects of different anticancer drugs depending on the cellular genetic context/signaling network are discussed. PMID:23933273

Stepanenko, A A; Vassetzky, Y S; Kavsan, V M

2013-10-25

267

PSSP-RFE: accurate prediction of protein structural class by recursive feature extraction from PSI-BLAST profile, physical-chemical property and functional annotations.  

PubMed

Protein structure prediction is critical to functional annotation of the massively accumulated biological sequences, which prompts an imperative need for the development of high-throughput technologies. As a first and key step in protein structure prediction, protein structural class prediction becomes an increasingly challenging task. Amongst most homological-based approaches, the accuracies of protein structural class prediction are sufficiently high for high similarity datasets, but still far from being satisfactory for low similarity datasets, i.e., below 40% in pairwise sequence similarity. Therefore, we present a novel method for accurate and reliable protein structural class prediction for both high and low similarity datasets. This method is based on Support Vector Machine (SVM) in conjunction with integrated features from position-specific score matrix (PSSM), PROFEAT and Gene Ontology (GO). A feature selection approach, SVM-RFE, is also used to rank the integrated feature vectors through recursively removing the feature with the lowest ranking score. The definitive top features selected by SVM-RFE are input into the SVM engines to predict the structural class of a query protein. To validate our method, jackknife tests were applied to seven widely used benchmark datasets, reaching overall accuracies between 84.61% and 99.79%, which are significantly higher than those achieved by state-of-the-art tools. These results suggest that our method could serve as an accurate and cost-effective alternative to existing methods in protein structural classification, especially for low similarity datasets. PMID:24675610

Li, Liqi; Cui, Xiang; Yu, Sanjiu; Zhang, Yuan; Luo, Zhong; Yang, Hua; Zhou, Yue; Zheng, Xiaoqi

2014-01-01

268

Novel cardiovascular gene functions revealed via systematic phenotype prediction in zebrafish.  

PubMed

Comprehensive functional annotation of vertebrate genomes is fundamental to biological discovery. Reverse genetic screening has been highly useful for determination of gene function, but is untenable as a systematic approach in vertebrate model organisms given the number of surveyable genes and observable phenotypes. Unbiased prediction of gene-phenotype relationships offers a strategy to direct finite experimental resources towards likely phenotypes, thus maximizing de novo discovery of gene functions. Here we prioritized genes for phenotypic assay in zebrafish through machine learning, predicting the effect of loss of function of each of 15,106 zebrafish genes on 338 distinct embryonic anatomical processes. Focusing on cardiovascular phenotypes, the learning procedure predicted known knockdown and mutant phenotypes with high precision. In proof-of-concept studies we validated 16 high-confidence cardiac predictions using targeted morpholino knockdown and initial blinded phenotyping in embryonic zebrafish, confirming a significant enrichment for cardiac phenotypes as compared with morpholino controls. Subsequent detailed analyses of cardiac function confirmed these results, identifying novel physiological defects for 11 tested genes. Among these we identified tmem88a, a recently described attenuator of Wnt signaling, as a discrete regulator of the patterning of intercellular coupling in the zebrafish cardiac epithelium. Thus, we show that systematic prioritization in zebrafish can accelerate the pace of developmental gene function discovery. PMID:24346703

Musso, Gabriel; Tasan, Murat; Mosimann, Christian; Beaver, John E; Plovie, Eva; Carr, Logan A; Chua, Hon Nian; Dunham, Julie; Zuberi, Khalid; Rodriguez, Harold; Morris, Quaid; Zon, Leonard; Roth, Frederick P; MacRae, Calum A

2014-01-01

269

Novel cardiovascular gene functions revealed via systematic phenotype prediction in zebrafish  

PubMed Central

Comprehensive functional annotation of vertebrate genomes is fundamental to biological discovery. Reverse genetic screening has been highly useful for determination of gene function, but is untenable as a systematic approach in vertebrate model organisms given the number of surveyable genes and observable phenotypes. Unbiased prediction of gene-phenotype relationships offers a strategy to direct finite experimental resources towards likely phenotypes, thus maximizing de novo discovery of gene functions. Here we prioritized genes for phenotypic assay in zebrafish through machine learning, predicting the effect of loss of function of each of 15,106 zebrafish genes on 338 distinct embryonic anatomical processes. Focusing on cardiovascular phenotypes, the learning procedure predicted known knockdown and mutant phenotypes with high precision. In proof-of-concept studies we validated 16 high-confidence cardiac predictions using targeted morpholino knockdown and initial blinded phenotyping in embryonic zebrafish, confirming a significant enrichment for cardiac phenotypes as compared with morpholino controls. Subsequent detailed analyses of cardiac function confirmed these results, identifying novel physiological defects for 11 tested genes. Among these we identified tmem88a, a recently described attenuator of Wnt signaling, as a discrete regulator of the patterning of intercellular coupling in the zebrafish cardiac epithelium. Thus, we show that systematic prioritization in zebrafish can accelerate the pace of developmental gene function discovery. PMID:24346703

Musso, Gabriel; Tasan, Murat; Mosimann, Christian; Beaver, John E.; Plovie, Eva; Carr, Logan A.; Chua, Hon Nian; Dunham, Julie; Zuberi, Khalid; Rodriguez, Harold; Morris, Quaid; Zon, Leonard; Roth, Frederick P.; MacRae, Calum A.

2014-01-01

270

LeARN: a platform for detecting, clustering and annotating non-coding RNAs  

PubMed Central

Background In the last decade, sequencing projects have led to the development of a number of annotation systems dedicated to the structural and functional annotation of protein-coding genes. These annotation systems manage the annotation of the non-protein coding genes (ncRNAs) in a very crude way, allowing neither the edition of the secondary structures nor the clustering of ncRNA genes into families which are crucial for appropriate annotation of these molecules. Results LeARN is a flexible software package which handles the complete process of ncRNA annotation by integrating the layers of automatic detection and human curation. Conclusion This software provides the infrastructure to deal properly with ncRNAs in the framework of any annotation project. It fills the gap between existing prediction software, that detect independent ncRNA occurrences, and public ncRNA repositories, that do not offer the flexibility and interactivity required for annotation projects. The software is freely available from the download section of the website PMID:18194551

Noirot, Celine; Gaspin, Christine; Schiex, Thomas; Gouzy, Jerome

2008-01-01

271

DaGO-Fun: tool for Gene Ontology-based functional analysis using term information content measures  

PubMed Central

Background The use of Gene Ontology (GO) data in protein analyses has largely contributed to the improved outcomes of these analyses. Several GO semantic similarity measures have been proposed in recent years and provide tools that allow the integration of biological knowledge embedded in the GO structure into different biological analyses. There is a need for a unified tool that provides the scientific community with the opportunity to explore these different GO similarity measure approaches and their biological applications. Results We have developed DaGO-Fun, an online tool available at http://web.cbio.uct.ac.za/ITGOM, which incorporates many different GO similarity measures for exploring, analyzing and comparing GO terms and proteins within the context of GO. It uses GO data and UniProt proteins with their GO annotations as provided by the Gene Ontology Annotation (GOA) project to precompute GO term information content (IC), enabling rapid response to user queries. Conclusions The DaGO-Fun online tool presents the advantage of integrating all the relevant IC-based GO similarity measures, including topology- and annotation-based approaches to facilitate effective exploration of these measures, thus enabling users to choose the most relevant approach for their application. Furthermore, this tool includes several biological applications related to GO semantic similarity scores, including the retrieval of genes based on their GO annotations, the clustering of functionally related genes within a set, and term enrichment analysis. PMID:24067102

2013-01-01

272

Predicting gene function from images of cells  

E-print Network

This dissertation shows that biologically meaningful predictions can be made by analyzing images of cells. In particular, groups of related genes and their biological functions can be predicted using images from large ...

Jones, Thouis Raymond, 1971-

2007-01-01

273

Injectors and Annotations  

NASA Technical Reports Server (NTRS)

In a previous paper, we presented the Object Infrastructure Framework. The goal of that system is to simplify the creation of distributed applications. The primary claim of that work is that non-functional 'ilities' could be achieved by controlling and manipulating the communications between components, thereby simplifying the development of distributed systems. A secondary element of that paper is to argue for extending the conventional distributed objects model in two important ways: 1) The ability to insert injectors (filters, wrappers) into the communication path between components; 2) The ability to annotate communications with additional information, and to propagate these annotations through an application. Here we express the descriptions of that paper.
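
The two extensions listed above can be sketched in a few lines. The following is a minimal illustration, not the Object Infrastructure Framework itself: the `Request` class, the `with_injectors` helper and both example injectors are hypothetical names invented for this sketch. It shows injectors wrapped around a call path and annotations travelling with each request.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Request:
    """A call between components, carrying annotations alongside its arguments."""
    operation: str
    args: Tuple[Any, ...]
    annotations: Dict[str, Any] = field(default_factory=dict)


Handler = Callable[[Request], Any]
Injector = Callable[[Request, Handler], Any]


def with_injectors(target: Handler, injectors: List[Injector]) -> Handler:
    """Wrap target so that each injector sees the request before the next one."""
    handler = target
    for injector in reversed(injectors):
        # Bind the current injector and the handler built so far.
        handler = (lambda inj, nxt: lambda req: inj(req, nxt))(injector, handler)
    return handler


def logging_injector(log: list) -> Injector:
    """Record every request and its annotations, then pass it on unchanged."""
    def inject(request: Request, next_handler: Handler) -> Any:
        log.append((request.operation, dict(request.annotations)))
        return next_handler(request)
    return inject


def require_annotation(key: str) -> Injector:
    """Reject requests that do not carry the given annotation."""
    def inject(request: Request, next_handler: Handler) -> Any:
        if key not in request.annotations:
            raise PermissionError(f"missing annotation: {key}")
        return next_handler(request)
    return inject


def add_service(request: Request) -> Any:
    """The functional component; it knows nothing about logging or access checks."""
    return sum(request.args)


log: list = []
call = with_injectors(add_service, [logging_injector(log), require_annotation("user")])
print(call(Request("add", (1, 2), {"user": "alice"})))
```

The point of the design is that `add_service` stays purely functional, while non-functional concerns are added or removed by editing the injector list.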

Filman, Robert E.

2004-01-01

274

Structural proteomics: a tool for genome annotation.  

PubMed

In any newly sequenced genome, 30% to 50% of genes encode proteins with unknown molecular or cellular function. Fortunately, structural genomics is emerging as a powerful approach of functional annotation. Because of recent developments in high-throughput technologies, ongoing structural genomics projects are generating new structures at an unprecedented rate. In the past year, structural studies have identified many new structural motifs involved in enzymatic catalysis or in binding ligands or other macromolecules (DNA, RNA, protein). The efficiency by which function is deduced from structure can be further improved by the integration of structure with bioinformatics and other experimental approaches, such as screening for enzymatic activity or ligand binding. PMID:15036155

Yakunin, Alexander F; Yee, Adelinda A; Savchenko, Alexei; Edwards, Aled M; Arrowsmith, Cheryl H

2004-02-01

275

The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST)  

PubMed Central

In 2004, the SEED (http://pubseed.theseed.org/) was created to provide consistent and accurate genome annotations across thousands of genomes and as a platform for discovering and developing de novo annotations. The SEED is a constantly updated integration of genomic data with a genome database, web front end, API and server scripts. It is used by many scientists for predicting gene functions and discovering new pathways. In addition to being a powerful database for bioinformatics research, the SEED also houses subsystems (collections of functionally related protein families) and their derived FIGfams (protein families), which represent the core of the RAST annotation engine (http://rast.nmpdr.org/). When a new genome is submitted to RAST, genes are called and their annotations are made by comparison to the FIGfam collection. If the genome is made public, it is then housed within the SEED and its proteins populate the FIGfam collection. This annotation cycle has proven to be a robust and scalable solution to the problem of annotating the exponentially increasing number of genomes. To date, >12 000 users worldwide have annotated >60 000 distinct genomes using RAST. Here we describe the interconnectedness of the SEED database and RAST, the RAST annotation pipeline and updates to both resources. PMID:24293654

Overbeek, Ross; Olson, Robert; Pusch, Gordon D.; Olsen, Gary J.; Davis, James J.; Disz, Terry; Edwards, Robert A.; Gerdes, Svetlana; Parrello, Bruce; Shukla, Maulik; Vonstein, Veronika; Wattam, Alice R.; Xia, Fangfang; Stevens, Rick

2014-01-01

276

Assessment of community-submitted ontology annotations from a novel database-journal partnership  

PubMed Central

As the scientific literature grows, leading to an increasing volume of published experimental data, so does the need to access and analyze this data using computational tools. The most commonly used method to convert published experimental data on gene function into controlled vocabulary annotations relies on a professional curator, employed by a model organism database or a more general resource such as UniProt, to read published articles and compose annotation statements based on the articles' contents. A more cost-effective and scalable approach capable of capturing gene function data across the whole range of biological research organisms in computable form is urgently needed. We have analyzed a set of ontology annotations generated through collaborations between the Arabidopsis Information Resource and several plant science journals. Analysis of the submissions entered using the online submission tool shows that most community annotations were well supported and the ontology terms chosen were at an appropriate level of specificity. Of the 503 individual annotations that were submitted, 97% were approved and community submissions captured 72% of all possible annotations. This new method for capturing experimental results in a computable form provides a cost-effective way to greatly increase the available body of annotations without sacrificing annotation quality. Database URL: www.arabidopsis.org PMID:22859749

Berardini, Tanya Z.; Li, Donghui; Muller, Robert; Chetty, Raymond; Ploetz, Larry; Singh, Shanker; Wensel, April; Huala, Eva

2012-01-01

277

The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST).  

PubMed

In 2004, the SEED (http://pubseed.theseed.org/) was created to provide consistent and accurate genome annotations across thousands of genomes and as a platform for discovering and developing de novo annotations. The SEED is a constantly updated integration of genomic data with a genome database, web front end, API and server scripts. It is used by many scientists for predicting gene functions and discovering new pathways. In addition to being a powerful database for bioinformatics research, the SEED also houses subsystems (collections of functionally related protein families) and their derived FIGfams (protein families), which represent the core of the RAST annotation engine (http://rast.nmpdr.org/). When a new genome is submitted to RAST, genes are called and their annotations are made by comparison to the FIGfam collection. If the genome is made public, it is then housed within the SEED and its proteins populate the FIGfam collection. This annotation cycle has proven to be a robust and scalable solution to the problem of annotating the exponentially increasing number of genomes. To date, >12 000 users worldwide have annotated >60 000 distinct genomes using RAST. Here we describe the interconnectedness of the SEED database and RAST, the RAST annotation pipeline and updates to both resources. PMID:24293654

Overbeek, Ross; Olson, Robert; Pusch, Gordon D; Olsen, Gary J; Davis, James J; Disz, Terry; Edwards, Robert A; Gerdes, Svetlana; Parrello, Bruce; Shukla, Maulik; Vonstein, Veronika; Wattam, Alice R; Xia, Fangfang; Stevens, Rick

2014-01-01

278

The functional landscape of mouse gene expression  

Microsoft Academic Search

ABSTRACT: BACKGROUND: Large-scale quantitative analysis of transcriptional co-expression has been used to dissect regulatory networks and to predict the functions of new genes discovered by genome sequencing in model organisms such as yeast. Although the idea that tissue-specific expression is indicative of gene function in mammals is widely accepted, it has not been objectively tested nor compared with the related

Wen Zhang; Quaid D Morris; Richard Chang; Ofer Shai; Malina A Bakowski; Nicholas Mitsakakis; Naveed Mohammad; Mark D Robinson; Ralph Zirngibl; Eszter Somogyi; Nancy Laurin; Eftekhar Eftekharpour; Eric Sat; Jörg Grigull; Qun Pan; Wen-Tao Peng; Nevan Krogan; Jack Greenblatt; Michael Fehlings; Derek van der Kooy; Jane Aubin; Benoit G Bruneau; Janet Rossant; Benjamin J Blencowe; Brendan J Frey; Timothy R Hughes

2004-01-01

279

Protein surface analysis for function annotation in high-throughput structural genomics pipeline  

Microsoft Academic Search

Structural genomics (SG) initiatives are expanding the universe of protein fold space by rapidly determining structures of proteins that were intentionally selected on the basis of low sequence similarity to proteins of known structure. Often these proteins have no associated biochemical or cellular functions. The SG success has resulted in an accelerated deposition of novel structures. In some cases the

T. Andrew Binkowski; Andrzej Joachimiak; Jie Liang

2005-01-01

280

Ensemble learning prediction of protein-protein interactions using proteins functional annotations.  

PubMed

Protein-protein interactions are important for the majority of biological processes. A significant number of computational methods have been developed to predict protein-protein interactions using protein sequence, structural and genomic data. Vast experimental data is publicly available on the Internet, but it is scattered across numerous databases. This fact motivated us to create and evaluate new high-throughput datasets of interacting proteins. We extracted interaction data from DIP, MINT, BioGRID and IntAct databases. Then we constructed descriptive features for machine learning purposes based on data from Gene Ontology and DOMINE. Thereafter, four well-established machine learning methods: Support Vector Machine, Random Forest, Decision Tree and Naïve Bayes, were used on these datasets to build an Ensemble Learning method based on majority voting. In cross-validation experiment, sensitivity exceeded 80% and classification/prediction accuracy reached 90% for the Ensemble Learning method. We extended the experiment to a bigger and more realistic dataset maintaining sensitivity over 70%. These results confirmed that our datasets are suitable for performing PPI prediction and Ensemble Learning method is well suited for this task. Both the processed PPI datasets and the software are available at . PMID:24469380
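
Majority voting over several base classifiers can be sketched as follows. The per-classifier predictions and the truth labels are made-up values, and the tie-breaking rule (fall back to the earliest classifier among the tied labels) is an assumption of this sketch; the abstract does not say how the authors resolve ties among four voters.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists (equal length) by majority vote.

    Ties are broken in favour of the label voted by the earliest classifier,
    which is an assumption of this sketch.
    """
    combined = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        combined.append(next(v for v in votes if counts[v] == top))
    return combined


def sensitivity(truth, predicted):
    """True positives divided by all actual positives (label 1)."""
    tp = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 0)
    return tp / (tp + fn)


def accuracy(truth, predicted):
    """Fraction of pairs classified correctly."""
    return sum(1 for t, p in zip(truth, predicted) if t == p) / len(truth)


# Hypothetical outputs for four protein pairs (1 = interacting, 0 = not).
svm = [1, 1, 0, 0]
random_forest = [1, 0, 0, 1]
decision_tree = [0, 1, 0, 1]
naive_bayes = [1, 1, 1, 0]

ensemble = majority_vote([svm, random_forest, decision_tree, naive_bayes])
truth = [1, 1, 0, 1]
print(ensemble, sensitivity(truth, ensemble), accuracy(truth, ensemble))
```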

Saha, Indrajit; Zubek, Julian; Klingström, Tomas; Forsberg, Simon; Wikander, Johan; Kierczak, Marcin; Maulik, Ujjwal; Plewczynski, Dariusz

2014-04-01

281

ClustScan: an integrated program package for the semi-automatic annotation of modular biosynthetic gene clusters and in silico prediction of novel chemical structures  

PubMed Central

The program package ‘ClustScan’ (Cluster Scanner) is designed for rapid, semi-automatic, annotation of DNA sequences encoding modular biosynthetic enzymes including polyketide synthases (PKS), non-ribosomal peptide synthetases (NRPS) and hybrid (PKS/NRPS) enzymes. The program displays the predicted chemical structures of products as well as allowing export of the structures in a standard format for analyses with other programs. Recent advances in understanding of enzyme function are incorporated to make knowledge-based predictions about the stereochemistry of products. The program structure allows easy incorporation of additional knowledge about domain specificities and function. The results of analyses are presented to the user in a graphical interface, which also allows easy editing of the predictions to incorporate user experience. The versatility of this program package has been demonstrated by annotating biochemical pathways in microbial, invertebrate animal and metagenomic datasets. The speed and convenience of the package allows the annotation of all PKS and NRPS clusters in a complete Actinobacteria genome in 2–3 man hours. The open architecture of ClustScan allows easy integration with other programs, facilitating further analyses of results, which is useful for a broad range of researchers in the chemical and biological sciences. PMID:18978015

Starcevic, Antonio; Zucko, Jurica; Simunkovic, Jurica; Long, Paul F.; Cullum, John; Hranueli, Daslav

2008-01-01

282

Mining a database of single amplified genomes from Red Sea brine pool extremophiles-improving reliability of gene function prediction using a profile and pattern matching algorithm (PPMA).  

PubMed

Reliable functional annotation of genomic data is the key-step in the discovery of novel enzymes. Intrinsic sequencing data quality problems of single amplified genomes (SAGs) and poor homology of novel extremophile's genomes pose significant challenges for the attribution of functions to the coding sequences identified. The anoxic deep-sea brine pools of the Red Sea are a promising source of novel enzymes with unique evolutionary adaptation. Sequencing data from Red Sea brine pool cultures and SAGs are annotated and stored in the Integrated Data Warehouse of Microbial Genomes (INDIGO) data warehouse. Low sequence homology of annotated genes (no similarity for 35% of these genes) may translate into false positives when searching for specific functions. The Profile and Pattern Matching (PPM) strategy described here was developed to eliminate false positive annotations of enzyme function before progressing to labor-intensive hyper-saline gene expression and characterization. It utilizes InterPro-derived Gene Ontology (GO)-terms (which represent enzyme function profiles) and annotated relevant PROSITE IDs (which are linked to an amino acid consensus pattern). The PPM algorithm was tested on 15 protein families, which were selected based on scientific and commercial potential. An initial list of 2577 enzyme commission (E.C.) numbers was translated into 171 GO-terms and 49 consensus patterns. A subset of INDIGO-sequences consisting of 58 SAGs from six different taxons of bacteria and archaea were selected from six different brine pool environments. Those SAGs code for 74,516 genes, which were independently scanned for the GO-terms (profile filter) and PROSITE IDs (pattern filter). Following stringent reliability filtering, the non-redundant hits (106 profile hits and 147 pattern hits) are classified as reliable, if at least two relevant descriptors (GO-terms and/or consensus patterns) are present. Scripts for annotation, as well as for the PPM algorithm, are available through the INDIGO website. PMID:24778629
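
The reliability rule stated in the abstract (a hit counts as reliable when at least two relevant descriptors, GO terms and/or consensus patterns, are present) can be sketched as a simple set filter. This is not the authors' script: the function name, the gene identifiers and the descriptor strings below are hypothetical placeholders, not real GO or PROSITE accessions.

```python
def reliable_hits(gene_descriptors, relevant_go, relevant_patterns, min_descriptors=2):
    """Keep genes carrying at least min_descriptors relevant descriptors.

    gene_descriptors maps a gene ID to the set of GO terms and pattern IDs
    annotated on it; relevant_go is the profile filter and relevant_patterns
    the pattern filter.
    """
    reliable = {}
    for gene, descriptors in gene_descriptors.items():
        matched = (descriptors & relevant_go) | (descriptors & relevant_patterns)
        if len(matched) >= min_descriptors:
            reliable[gene] = sorted(matched)
    return reliable


# Placeholder identifiers for illustration only.
relevant_go = {"GO:term_a", "GO:term_b"}
relevant_patterns = {"PS:pattern_x"}

annotations = {
    "gene_1": {"GO:term_a", "PS:pattern_x", "GO:unrelated"},  # two relevant descriptors
    "gene_2": {"GO:term_b"},                                  # only one: filtered out
    "gene_3": {"GO:term_a", "GO:term_b"},                     # two GO terms also qualify
    "gene_4": {"GO:unrelated"},                               # nothing relevant
}

print(reliable_hits(annotations, relevant_go, relevant_patterns))
```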

Grötzinger, Stefan W; Alam, Intikhab; Ba Alawi, Wail; Bajic, Vladimir B; Stingl, Ulrich; Eppinger, Jörg

2014-01-01

283

TreeQ-VISTA: An Interactive Tree Visualization Tool with Functional Annotation Query Capabilities  

SciTech Connect

Summary: We describe a general multiplatform exploratory tool called TreeQ-Vista, designed for presenting functional annotations in a phylogenetic context. Traits, such as phenotypic and genomic properties, are interactively queried from a relational database with a user-friendly interface which provides a set of tools for users with or without SQL knowledge. The query results are projected onto a phylogenetic tree and can be displayed in multiple color groups. A rich set of browsing, grouping and query tools are provided to facilitate trait exploration, comparison and analysis. Availability: The program, detailed tutorial and examples are available online at http://genome-test.lbl.gov/vista/TreeQVista.

Gu, Shengyin; Anderson, Iain; Kunin, Victor; Cipriano, Michael; Minovitsky, Simon; Weber, Gunther; Amenta, Nina; Hamann, Bernd; Dubchak, Inna

2007-05-07

284

Functional Study of Genes Essential for Autogamy and Nuclear Reorganization in Paramecium  

PubMed Central

Like all ciliates, Paramecium tetraurelia is a unicellular eukaryote that harbors two kinds of nuclei within its cytoplasm. At each sexual cycle, a new somatic macronucleus (MAC) develops from the germ line micronucleus (MIC) through a sequence of complex events, which includes meiosis, karyogamy, and assembly of the MAC genome from MIC sequences. The latter process involves developmentally programmed genome rearrangements controlled by noncoding RNAs and a specialized RNA interference machinery. We describe our first attempts to identify genes and biological processes that contribute to the progression of the sexual cycle. Given the high percentage of unknown genes annotated in the P. tetraurelia genome, we applied a global strategy to monitor gene expression profiles during autogamy, a self-fertilization process. We focused this pilot study on the genes carried by the largest somatic chromosome and designed dedicated DNA arrays covering 484 genes from this chromosome (1.2% of all genes annotated in the genome). Transcriptome analysis revealed four major patterns of gene expression, including two successive waves of gene induction. Functional analysis of 15 upregulated genes revealed four that are essential for vegetative growth, one of which is involved in the maintenance of MAC integrity and another in cell division or membrane trafficking. Two additional genes, encoding a MIC-specific protein and a putative RNA helicase localizing to the old and then to the new MAC, are specifically required during sexual processes. Our work provides a proof of principle that genes essential for meiosis and nuclear reorganization can be uncovered following genome-wide transcriptome analysis. PMID:21257794

Nowak, Jacek K.; Gromadka, Robert; Juszczuk, Marek; Jerka-Dziadosz, Maria; Maliszewska, Kamila; Mucchielli, Marie-Helene; Gout, Jean-Francois; Arnaiz, Olivier; Agier, Nicolas; Tang, Thomas; Aggerbeck, Lawrence P.; Cohen, Jean; Delacroix, Herve; Sperling, Linda; Herbert, Christopher J.; Zagulski, Marek; Betermier, Mireille

2011-01-01

285

Relieving the cardiometabolic disease burden: a perspective on phytometabolite functional and chemical annotation for diabetes management.  

PubMed

Type 2 diabetes (T2D) is both a complex, multifactorial disease state and an unsolved, intensifying public-health problem. To help reduce disease burden, some T2D patients have embraced plant-derived substances for use with - if not in place of - prescription medicines, a trend based mainly upon historical precedent and anecdotal observations of human health benefit. Preclinical research has emphasized phytometabolite interactions with purported T2D pathogenic targets and the effects of botanical preparations on experimental T2D symptomology as induced in laboratory animals. More holistic, systems-oriented profiling of phytochemicals with functional-biology, omics, and chemical-fingerprinting tools now appears necessary to increase our appreciation of phytometabolite actions potentially beneficial to the T2D patient. The resultant, multidimensional view of phytometabolite pharmacology should help provide a more rational basis for evaluating the potential of natural plant products as T2D pharmacotherapy. Such information may also help substantiate and legitimize (pre)clinical demonstrations of phytochemical health benefits, advance our understanding of T2D pathogenesis, and offer scope for better T2D medicines. Public-private partnerships are invoked for conducting this research with the ultimate aim of improving the global cardiometabolic profile. PMID:24156826

Janero, David R

2014-01-01

286

Linking genes of unknown function with abiotic stress responses by high-throughput phenotype screening.  

PubMed

Over 13% of all genes in the Arabidopsis thaliana genome encode for proteins classified as having a completely unknown function, with the function of >30% of the Arabidopsis proteome poorly characterized. Although empirical data in the form of mRNA and proteome profiling experiments suggest that many of these proteins play an important role in different biological processes, their functional characterization remains one of the major challenges in modern biology. To expand the annotation of genes with unknown function involved in the response of Arabidopsis to different environmental stress conditions, we selected 1007 such genes and tested the response of their corresponding homozygous T-DNA insertional mutants to salinity, oxidative, osmotic, heat, cold and hypoxia stresses. Depending on the specific abiotic stresses tested, 12-31% of mutants had an altered stress-response phenotype. Interestingly, 832 out of 1007 mutants showed tolerance or sensitivity to more than one abiotic stress treatment, suggesting that genes of unknown function could play an important role in abiotic stress-response signaling, or general acclimation mechanisms. Further analysis of multiple stress-response phenotypes within different populations of mutants revealed interesting links between acclimation to heat, cold and oxidative stresses, as well as between sensitivity to ABA, osmotic, salinity, oxidative and hypoxia stresses. Our findings provide a significant contribution to the biological characterization of genes with unknown function in Arabidopsis and demonstrate that many of these genes play a key role in the response of plants to abiotic stresses. PMID:23517122

Luhua, Song; Hegie, Alicia; Suzuki, Nobuhiro; Shulaev, Elena; Luo, Xiaozhong; Cenariu, Diana; Ma, Vincent; Kao, Stephanie; Lim, Jennie; Gunay, Meryem Betul; Oosumi, Teruko; Lee, Seung Cho; Harper, Jeffery; Cushman, John; Gollery, Martin; Girke, Thomas; Bailey-Serres, Julia; Stevenson, Rebecca A; Zhu, Jian-Kang; Mittler, Ron

2013-07-01

287

Annotation of Differentially Expressed Genes in the Somatic Embryogenesis of Musa and Their Location in the Banana Genome  

PubMed Central

Analysis of cDNA-AFLP was used to study the genes expressed in zygotic and somatic embryogenesis of Musa acuminata Colla ssp. malaccensis, and a comparison was made between their differential transcribed fragments (TDFs) and the sequenced genome of the double haploid- (DH-) Pahang of the malaccensis subspecies that is available in the network. A total of 253 transcript-derived fragments (TDFs) were detected with apparent size of 100–4000 bp using 5 pairs of AFLP primers, of which 21 were differentially expressed during the different stages of banana embryogenesis; 15 of the sequences have matched DH-Pahang chromosomes, with 7 of them being homologous to gene sequences encoding either known or putative protein domains of higher plants. Four TDF sequences were located in all Musa chromosomes, while the rest were located in one or two chromosomes. Their putative individual function is briefly reviewed based on published information, and the potential roles of these genes in embryo development are discussed. Thus the availability of the genome of Musa and the information of TDFs sequences presented here opens new possibilities for an in-depth study of the molecular and biochemical research of zygotic and somatic embryogenesis of Musa. PMID:24027442

Maldonado-Borges, Josefina Ines; Ku-Cauich, Jose Roberto; Escobedo-GraciaMedrano, Rosa Maria

2013-01-01

288

Large-scale prediction of long non-coding RNA functions in a coding–non-coding gene co-expression network  

PubMed Central

Although accumulating evidence has provided insight into the various functions of long-non-coding RNAs (lncRNAs), the exact functions of the majority of such transcripts are still unknown. Here, we report the first computational annotation of lncRNA functions based on public microarray expression profiles. A coding–non-coding gene co-expression (CNC) network was constructed from re-annotated Affymetrix Mouse Genome Array data. Probable functions for altogether 340 lncRNAs were predicted based on topological or other network characteristics, such as module sharing, association with network hubs and combinations of co-expression and genomic adjacency. The functions annotated to the lncRNAs mainly involve organ or tissue development (e.g. neuron, eye and muscle development), cellular transport (e.g. neuronal transport and sodium ion, acid or lipid transport) or metabolic processes (e.g. involving macromolecules, phosphocreatine and tyrosine). PMID:21247874
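
A much-simplified version of function prediction from co-expression is the "guilt by association" vote sketched below: an lncRNA inherits the most common function among the coding genes whose expression profiles correlate with it. The profiles, gene names, function labels and the 0.9 correlation threshold are hypothetical, and the study itself combines several network characteristics (module sharing, hub association, genomic adjacency) that this sketch does not attempt to reproduce.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def predict_function(lnc_profile, coding_profiles, coding_functions, threshold=0.9):
    """Return the most frequent function among co-expressed coding genes, or None."""
    votes = Counter()
    for gene, profile in coding_profiles.items():
        if pearson(lnc_profile, profile) >= threshold:
            votes[coding_functions[gene]] += 1
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical expression values across four samples.
lnc_profile = [1.0, 2.0, 3.0, 4.0]
coding_profiles = {
    "gene_a": [2.0, 4.0, 6.0, 8.0],
    "gene_b": [1.0, 2.0, 3.0, 5.0],
    "gene_c": [4.0, 3.0, 2.0, 1.0],
}
coding_functions = {
    "gene_a": "neuron development",
    "gene_b": "neuron development",
    "gene_c": "lipid transport",
}

print(predict_function(lnc_profile, coding_profiles, coding_functions))
```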

Liao, Qi; Liu, Changning; Yuan, Xiongying; Kang, Shuli; Miao, Ruoyu; Xiao, Hui; Zhao, Guoguang; Luo, Haitao; Bu, Dechao; Zhao, Haitao; Skogerbø, Geir; Wu, Zhongdao; Zhao, Yi

2011-01-01

289

CoCiter: An Efficient Tool to Infer Gene Function by Assessing the Significance of Literature Co-Citation  

PubMed Central

A routine approach to inferring functions for a gene set is by using function enrichment analysis based on GO, KEGG or other curated terms and pathways. However, such analysis requires the existence of overlapping genes between the query gene set and those annotated by GO/KEGG. Furthermore, GO/KEGG databases only maintain a very restricted vocabulary. Here, we have developed a tool called “CoCiter” based on literature co-citations to address the limitations in conventional function enrichment analysis. Co-citation analysis is widely used in ranking articles and predicting protein-protein interactions (PPIs). Our algorithm can further assess the co-citation significance of a gene set with any other user-defined gene sets, or with free terms. We show that compared with the traditional approaches, CoCiter is a more accurate and flexible function enrichment analysis method. CoCiter is freely available at www.picb.ac.cn/hanlab/cociter/. PMID:24086311
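
The conventional enrichment analysis that the abstract contrasts CoCiter with is commonly implemented as a one-sided hypergeometric test on the overlap between the query set and an annotated set. The sketch below shows that conventional test only; it is not CoCiter's co-citation algorithm, and the function name and the example numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, query, overlap):
    """P(X >= overlap) when drawing `query` genes from `population`,
    of which `annotated` carry the term (hypergeometric upper tail)."""
    upper = min(annotated, query)
    total = comb(population, query)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, query - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 10 genes in total, 3 annotated with a term, 3 in the query.
print(enrichment_p_value(10, 3, 3, 3))  # every query gene is annotated
print(enrichment_p_value(10, 3, 3, 0))  # no overlap at all
```

With zero overlapping genes the p-value is 1, so the test carries no signal; this is the limitation named in the abstract that motivates scoring co-citation instead.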

Naveed, Hammad; Green, Christopher D.; Han, Jing-Dong J.

2013-01-01

290

CoCiter: an efficient tool to infer gene function by assessing the significance of literature co-citation.  

PubMed

A routine approach to inferring functions for a gene set is by using function enrichment analysis based on GO, KEGG or other curated terms and pathways. However, such analysis requires the existence of overlapping genes between the query gene set and those annotated by GO/KEGG. Furthermore, GO/KEGG databases only maintain a very restricted vocabulary. Here, we have developed a tool called "CoCiter" based on literature co-citations to address the limitations in conventional function enrichment analysis. Co-citation analysis is widely used in ranking articles and predicting protein-protein interactions (PPIs). Our algorithm can further assess the co-citation significance of a gene set with any other user-defined gene sets, or with free terms. We show that compared with the traditional approaches, CoCiter is a more accurate and flexible function enrichment analysis method. CoCiter is freely available at www.picb.ac.cn/hanlab/cociter/. PMID:24086311

Qiao, Nan; Huang, Yi; Naveed, Hammad; Green, Christopher D; Han, Jing-Dong J

2013-01-01

291

Functional Annotation of Two New Carboxypeptidases from the Amidohydrolase Superfamily of Enzymes  

SciTech Connect

Two proteins from the amidohydrolase superfamily of enzymes were cloned, expressed, and purified to homogeneity. The first protein, Cc0300, was from Caulobacter crescentus CB-15 (Cc0300), while the second one (Sgx9355e) was derived from an environmental DNA sequence originally isolated from the Sargasso Sea (gi|44371129). The catalytic functions and the substrate profiles for the two enzymes were determined with the aid of combinatorial dipeptide libraries. Both enzymes were shown to catalyze the hydrolysis of L-Xaa-L-Xaa dipeptides in which the amino acid at the N-terminus was relatively unimportant. These enzymes were specific for hydrophobic amino acids at the C-terminus. With Cc0300, substrates terminating in isoleucine, leucine, phenylalanine, tyrosine, valine, methionine, and tryptophan were hydrolyzed. The same specificity was observed with Sgx9355e, but this protein was also able to hydrolyze peptides terminating in threonine. Both enzymes were able to hydrolyze N-acetyl and N-formyl derivatives of the hydrophobic amino acids and tripeptides. The best substrates identified for Cc0300 were L-Ala-L-Leu with kcat and kcat/Km values of 37 s^-1 and 1.1 x 10^5 M^-1 s^-1, respectively, and N-formyl-L-Tyr with kcat and kcat/Km values of 33 s^-1 and 3.9 x 10^5 M^-1 s^-1, respectively. The best substrate identified for Sgx9355e was L-Ala-L-Phe with kcat and kcat/Km values of 0.41 s^-1 and 5.8 x 10^3 M^-1 s^-1. The three-dimensional structure of Sgx9355e was determined to a resolution of 2.33 Angstroms with L-methionine bound in the active site. The α-carboxylate of the methionine is ion-paired to His-237 and also hydrogen bonded to the backbone amide groups of Val-201 and Leu-202. The α-amino group of the bound methionine interacts with Asp-328. The structural determinants for substrate recognition were identified and compared with other enzymes in this superfamily that hydrolyze dipeptides with different specificities.
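
The abstract reports kcat and kcat/Km but not Km itself, which follows by division: Km = kcat / (kcat/Km). The short calculation below assumes the reported efficiencies are powers of ten (for example 1.1 x 10^5 M^-1 s^-1) and that simple Michaelis-Menten kinetics apply. The resulting Km values are derived here for illustration and are not quoted from the record.

```python
def michaelis_constant(kcat, kcat_over_km):
    """Km in mol/L from kcat (s^-1) and catalytic efficiency kcat/Km (M^-1 s^-1)."""
    return kcat / kcat_over_km


# (enzyme, substrate) -> (kcat in s^-1, kcat/Km in M^-1 s^-1), as read from the abstract.
REPORTED = {
    ("Cc0300", "L-Ala-L-Leu"): (37.0, 1.1e5),
    ("Cc0300", "N-formyl-L-Tyr"): (33.0, 3.9e5),
    ("Sgx9355e", "L-Ala-L-Phe"): (0.41, 5.8e3),
}

for (enzyme, substrate), (kcat, efficiency) in REPORTED.items():
    km_millimolar = michaelis_constant(kcat, efficiency) * 1e3
    print(f"{enzyme} / {substrate}: Km = {km_millimolar:.3f} mM")
```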

Xiang, D.; Xu, C; Kumaran, D; Brown, A; Sauder, M; Burley, S; Swaminathan, S; Raushel, F

2009-01-01

292

Automatic Assignment of Protein Function with Supervised Classifiers  

E-print Network

High-throughput genome sequencing and sequence analysis technologies have created the need for automated annotation and analysis of large sets of genes. The Gene Ontology (GO) provides a common controlled vocabulary for describing gene function...

Jung, Jae

2010-01-16

293

RNA interference for wheat functional gene analysis  

Microsoft Academic Search

RNA interference (RNAi) refers to a common mechanism of RNA-based post-transcriptional gene silencing in eukaryotic cells. In model plant species such as Arabidopsis and rice, RNAi has been routinely used to characterize gene function and to engineer novel phenotypes. In polyploid species, this approach is in its early stages, but has great potential since multiple homoeologous copies can be simultaneously

Daolin Fu; Cristobal Uauy; Ann Blechl; Jorge Dubcovsky

2007-01-01

294

Comparative Omics-Driven Genome Annotation Refinement: Application across Yersiniae  

SciTech Connect

Genome sequencing continues to be a rapidly evolving technology, yet most downstream aspects of genome annotation pipelines remain relatively stable or are even being abandoned. To date, the perceived value of manual curation for genome annotations is not offset by the real cost and time associated with the process. In order to balance the large number of sequences generated, the annotation process is now performed almost exclusively in an automated fashion for most genome sequencing projects. One possible way to reduce errors inherent to automated computational annotations is to apply data from 'omics' measurements (i.e. transcriptional and proteomic) to the un-annotated genome with a proteogenomic-based approach. This approach does require additional experimental and bioinformatics methods to include omics technologies; however, the approach is readily automatable and can benefit from rapid developments occurring in those research domains as well. The annotation process can be improved by experimental validation of transcription and translation and aid in the discovery of annotation errors. Here the concept of annotation refinement has been extended to include a comparative assessment of genomes across closely related species, as is becoming common in sequencing efforts. Transcriptomic and proteomic data derived from three highly similar pathogenic Yersiniae (Y. pestis CO92, Y. pestis pestoides F, and Y. pseudotuberculosis PB1/+) was used to demonstrate a comprehensive comparative omic-based annotation methodology. Peptide and oligo measurements experimentally validated the expression of nearly 40% of each strain's predicted proteome and revealed the identification of 28 novel and 68 previously incorrect protein-coding sequences (e.g., observed frameshifts, extended start sites, and translated pseudogenes) within the three current Yersinia genome annotations. Gene loss is presumed to play a major role in Y. pestis acquiring its niche as a virulent pathogen, thus the discovery of many translated pseudogenes underscores a need for functional analyses to investigate hypotheses related to divergence. Refinements included the discovery of a seemingly essential ribosomal protein, several virulence-associated factors, and a transcriptional regulator, among other proteins, most of which are annotated as hypothetical, that were missed during annotation.
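
The core idea of applying peptide evidence to an un-annotated genome can be sketched as a six-frame translation followed by a peptide lookup: a peptide that maps to a frame or strand with no annotated gene hints at a missed or mis-framed coding sequence. This is only a toy illustration using the standard genetic code; the function names and the example sequence are hypothetical, and the actual study relies on mass-spectrometry search tools and transcript data not modelled here.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame(seq):
    """Map (strand, offset) to the protein string of each reading frame."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset)] = translate(strand_seq[offset:])
    return frames


def locate_peptide(peptide, genome):
    """Return every reading frame whose translation contains the peptide."""
    return [frame for frame, protein in six_frame(genome).items() if peptide in protein]


genome = "CATGGCCAAATAG"  # hypothetical fragment
print(locate_peptide("MAK", genome))
```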

Rutledge, Alexandra C.; Jones, Marcus B.; Chauhan, Sadhana; Purvine, Samuel O.; Sanford, James; Monroe, Matthew E.; Brewer, Heather M.; Payne, Samuel H.; Ansong, Charles; Frank, Bryan C.; Smith, Richard D.; Peterson, Scott; Motin, Vladimir L.; Adkins, Joshua N.

2012-03-27

295

RNA sequencing reveals sexually dimorphic gene expression before gonadal differentiation in chicken and allows comprehensive annotation of the W-chromosome  

PubMed Central

Background Birds have a ZZ male: ZW female sex chromosome system and while the Z-linked DMRT1 gene is necessary for testis development, the exact mechanism of sex determination in birds remains unsolved. This is partly due to the poor annotation of the W chromosome, which is speculated to carry a female determinant. Few genes have been mapped to the W and little is known of their expression. Results We used RNA-seq to produce a comprehensive profile of gene expression in chicken blastoderms and embryonic gonads prior to sexual differentiation. We found robust sexually dimorphic gene expression in both tissues pre-dating gonadogenesis, including sex-linked and autosomal genes. This supports the hypothesis that sexual differentiation at the molecular level is at least partly cell autonomous in birds. Different sets of genes were sexually dimorphic in the two tissues, indicating that molecular sexual differentiation is tissue specific. Further analyses allowed the assembly of full-length transcripts for 26 W chromosome genes, providing a view of the W transcriptome in embryonic tissues. This is the first extensive analysis of W-linked genes and their expression profiles in early avian embryos. Conclusion Sexual differentiation at the molecular level is established in chicken early in embryogenesis, before gonadal sex differentiation. We find that the W chromosome is more transcriptionally active than previously thought, expand the number of known genes to 26 and present complete coding sequences for these W genes. This includes two novel W-linked sequences and three small RNAs reassigned to the W from the Un_Random chromosome. PMID:23531366

2013-01-01

296

Ranking Biomedical Annotations with Annotator's Semantic Relevancy  

PubMed Central

Biomedical annotation is a common and effective artifact for researchers to discuss, show opinions, and share discoveries. It is becoming increasingly popular in many online research communities and carries much useful information. Ranking biomedical annotations is a critical problem for data users who need to get information efficiently. As the annotator's knowledge about the annotated entity normally determines the quality of the annotations, we evaluate that knowledge, that is, the semantic relationship between them, in two ways. The first is extracting relational information from credible websites by mining association rules between an annotator and a biomedical entity. The second is frequent pattern mining from historical annotations, which reveals common features of biomedical entities that an annotator can annotate with high quality. We propose a weighted and concept-extended RDF model to represent an annotator, a biomedical entity, and their background attributes, and we merge information from the two ways as the context of an annotator. Based on that, we present a method to rank the annotations by evaluating their correctness according to users' votes and the semantic relevancy between the annotator and the annotated entity. The experimental results show that the approach is applicable and efficient even when the data set is large. PMID:24899918

2014-01-01

297

Rice Annotation Database (RAD): a contig-oriented database for map-based rice genomics.  

PubMed

A contig-oriented database for annotation of the rice genome has been constructed to facilitate map-based rice genomics. The Rice Annotation Database has the following functional features: (i) extensive effort of manual annotations of P1-derived artificial chromosome/bacterial artificial chromosome clones can be merged at chromosome and contig-level; (ii) concise visualization of the annotation information such as the predicted genes, results of various prediction programs (RiceHMM, Genscan, Genscan+, Fgenesh, GeneMark, etc.), homology to expressed sequence tag, full-length cDNA and protein; (iii) user-friendly clone / gene query system; (iv) download functions for nucleotide, amino acid and coding sequences; (v) analysis of various features of the genome (GC-content, average value, etc.); and (vi) genome-wide homology search (BLAST) of contig- and chromosome-level genome sequence to allow comparative analysis with the genome sequence of other organisms. As of October 2004, the database contains a total of 215 Mb sequence with relevant annotation results including 30 000 manually curated genes. The database can provide the latest information on manual annotation as well as a comprehensive structural analysis of various features of the rice genome. The database can be accessed at http://rad.dna.affrc.go.jp/. PMID:15608281

Ito, Yuichi; Arikawa, Kohji; Antonio, Baltazar A; Ohta, Isamu; Naito, Shinji; Mukai, Yoshiyuki; Shimano, Atsuko; Masukawa, Masatoshi; Shibata, Michie; Yamamoto, Mayu; Ito, Yukiyo; Yokoyama, Junri; Sakai, Yasumichi; Sakata, Katsumi; Nagamura, Yoshiaki; Namiki, Nobukazu; Matsumoto, Takashi; Higo, Kenichi; Sasaki, Takuji

2005-01-01

298

WN: Gene functional classification from heterogeneous data  

E-print Network

In our attempts to understand cellular function at the molecular level, we must be able to synthesize information from disparate types of genomic data. We consider the problem of inferring gene functional classifications from a heterogeneous data set consisting of DNA microarray expression measurements and phylogenetic profiles from whole-genome sequence comparisons. We demonstrate the application of the support vector machine (SVM) learning algorithm to this functional inference task. Our results suggest the importance of exploiting prior information about the heterogeneity of the data. In particular, we propose an SVM kernel function that is explicitly heterogeneous. We also show how to use knowledge about heterogeneity to aid in feature selection.
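
One common way to make a kernel "explicitly heterogeneous" is to compute a separate kernel for each data type and add the results, so that expression features and phylogenetic-profile features never interact inside a single kernel. The sketch below illustrates that idea only; it is not taken from the paper, and the polynomial degree, the function names, and the toy genes are all assumptions.

```python
# Illustrative sketch: a heterogeneous kernel as the sum of one kernel per
# data type. Names, degree, and data are hypothetical, not from the paper.

def poly_kernel(x, y, degree=3):
    """Polynomial kernel (x . y + 1)^degree on two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + 1.0) ** degree

def heterogeneous_kernel(gene_a, gene_b, degree=3):
    """Sum of per-data-type kernels; each gene is (expression, phylo_profile)."""
    expr_a, phylo_a = gene_a
    expr_b, phylo_b = gene_b
    return (poly_kernel(expr_a, expr_b, degree)
            + poly_kernel(phylo_a, phylo_b, degree))

def gram_matrix(genes, kernel):
    """Pairwise kernel values, the matrix an SVM solver would consume."""
    return [[kernel(g, h) for h in genes] for g in genes]

# Toy genes: (expression vector, binary phylogenetic profile).
genes = [
    ([0.5, -1.0], [1, 0, 1]),
    ([0.4, -0.8], [1, 0, 0]),
    ([-1.2, 0.9], [0, 1, 0]),
]
K = gram_matrix(genes, heterogeneous_kernel)
```

Because a sum of valid kernels is itself a valid kernel, a matrix such as `K` could be handed to any SVM implementation that accepts a precomputed Gram matrix.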

Paul Pavlidis; Jinsong Cai; Jason Weston; William Noble Grundy

299

Comparative Omics-Driven Genome Annotation Refinement: Application across Yersiniae  

PubMed Central

Genome sequencing continues to be a rapidly evolving technology, yet most downstream aspects of genome annotation pipelines remain relatively stable or are even being abandoned. The annotation process is now performed almost exclusively in an automated fashion to balance the large number of sequences generated. One possible way of reducing errors inherent to automated computational annotations is to apply data from omics measurements (i.e. transcriptional and proteomic) to the un-annotated genome with a proteogenomic-based approach. Here, the concept of annotation refinement has been extended to include a comparative assessment of genomes across closely related species. Transcriptomic and proteomic data derived from highly similar pathogenic Yersiniae (Y. pestis CO92, Y. pestis Pestoides F, and Y. pseudotuberculosis PB1/+) was used to demonstrate a comprehensive comparative omic-based annotation methodology. Peptide and oligo measurements experimentally validated the expression of nearly 40% of each strain's predicted proteome and revealed the identification of 28 novel and 68 incorrect (i.e., observed frameshifts, extended start sites, and translated pseudogenes) protein-coding sequences within the three current genome annotations. Gene loss is presumed to play a major role in Y. pestis acquiring its niche as a virulent pathogen, thus the discovery of many translated pseudogenes, including the insertion-ablated argD, underscores a need for functional analyses to investigate hypotheses related to divergence. Refinements included the discovery of a seemingly essential ribosomal protein, several virulence-associated factors, a transcriptional regulator, and many hypothetical proteins that were missed during annotation. PMID:22479471

Schrimpe-Rutledge, Alexandra C.; Jones, Marcus B.; Chauhan, Sadhana; Purvine, Samuel O.; Sanford, James A.; Monroe, Matthew E.; Brewer, Heather M.; Payne, Samuel H.; Ansong, Charles; Frank, Bryan C.; Smith, Richard D.; Peterson, Scott N.; Motin, Vladimir L.; Adkins, Joshua N.

2012-01-01

300

Re-annotation of the Saccharopolyspora erythraea genome using a systems biology approach  

PubMed Central

Background Accurate bacterial genome annotations provide a framework to understanding cellular functions, behavior and pathogenicity and are essential for metabolic engineering. Annotations based only on in silico predictions are inaccurate, particularly for large, high G + C content genomes due to the lack of similarities in gene length and gene organization to model organisms. Results Here we describe a 2D systems biology driven re-annotation of the Saccharopolyspora erythraea genome using proteogenomics, a genome-scale metabolic reconstruction, RNA-sequencing and small-RNA-sequencing. We observed transcription of more than 300 intergenic regions, detected 59 peptides in intergenic regions, confirmed 164 open reading frames previously annotated as hypothetical proteins and reassigned function to open reading frames using the genome-scale metabolic reconstruction. Finally, we present a novel way of mapping ribosomal binding sites across the genome by sequencing small RNAs. Conclusions The work presented here describes a novel framework for annotation of the Saccharopolyspora erythraea genome. Based on experimental observations, the 2D annotation framework greatly reduces errors that are commonly made when annotating large, high G + C content genomes using computational prediction algorithms. PMID:24118942

2013-01-01

301

GENCODE: The reference human genome annotation for The ENCODE Project  

E-print Network

The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation ...

Lin, Michael

302

Bayesian modelling of shared gene function  

Microsoft Academic Search

Motivation Biological assays are often carried out on tissues that contain many cell lineages and active pathways. Microarray data produced using such material therefore reflect superimpositions of biological processes. Analysing such data for shared gene function by means of well matched assays may help to provide a better focus on specific cell types and processes. The identification of genes

P. Sykacek; R. Clarkson; C. Print; R. A. Furlong; Gos Micklem

2007-01-01

303

Functional classification of genes using semantic distance and fuzzy clustering approach: evaluation with reference sets and overlap analysis.  

PubMed

Functional classification aims at grouping genes according to their molecular function or the biological process they participate in. Evaluating the validity of such unsupervised gene classification remains a challenge given the variety of distance measures and classification algorithms that can be used. We evaluate here functional classification of genes with the help of reference sets: KEGG (Kyoto Encyclopaedia of Genes and Genomes) pathways and Pfam clans. These sets represent ground truth for any distance based on GO (Gene Ontology) biological process and molecular function annotations respectively. Overlaps between clusters and reference sets are estimated by the F-score method. We test our previously described IntelliGO semantic distance with hierarchical and fuzzy C-means clustering and we compare results with the state-of-the-art DAVID (Database for Annotation Visualisation and Integrated Discovery) functional classification method. Finally, study of best matching clusters to reference sets leads us to propose a set-difference method for discovering missing information. PMID:23013652
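
The overlap scoring described above can be sketched in a few lines. The example treats a cluster and a reference set as plain sets of gene identifiers, scores them with the usual F-score (harmonic mean of precision and recall), and uses a set difference to list reference genes that the best-matching cluster lacks. The gene and set names are hypothetical, and the paper's exact scoring details may differ.

```python
# Illustrative sketch of F-score overlap between clusters and reference sets.
# All identifiers below are made up for the example.

def f_score(cluster, reference):
    """Harmonic mean of precision and recall for one cluster vs one reference."""
    cluster, reference = set(cluster), set(reference)
    overlap = len(cluster & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

def best_matches(clusters, references):
    """For each reference set, return (best cluster name, its F-score)."""
    result = {}
    for ref_name, ref_genes in references.items():
        result[ref_name] = max(
            ((name, f_score(genes, ref_genes)) for name, genes in clusters.items()),
            key=lambda pair: pair[1],
        )
    return result

clusters = {"c1": {"g1", "g2", "g3"}, "c2": {"g4", "g5"}}
references = {"pathway_A": {"g1", "g2", "g6"}, "clan_B": {"g4", "g5"}}

matches = best_matches(clusters, references)
# Reference genes absent from the best-matching cluster: candidates for
# missing annotation in the spirit of the set-difference idea.
missing = references["pathway_A"] - clusters[matches["pathway_A"][0]]
```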

Devignes, Marie-Dominique; Benabderrahmane, Sidahmed; Smaïl-Tabbone, Malika; Napoli, Amedeo; Poch, Olivier

2012-01-01

304

Progress and challenges in the computational prediction of gene function using networks: 2012-2013 update  

PubMed Central

In an opinion published in 2012, we reviewed and discussed our studies of how gene network-based guilt-by-association (GBA) is impacted by confounds related to gene multifunctionality. We found such confounds account for a significant part of the GBA signal, and as a result meaningfully evaluating and applying computationally-guided GBA is more challenging than generally appreciated. We proposed that effort currently spent on incrementally improving algorithms would be better spent in identifying the features of data that do yield novel functional insights. We also suggested that part of the problem is the reliance by computational biologists on gold standard annotations such as the Gene Ontology. In the year since, there has been continued heavy activity in GBA-based research, including work that contributes to our understanding of the issues we raised. Here we review some of the most relevant recent work, including studies that point to new areas of progress and challenges. PMID:24715959

Pavlidis, Paul; Gillis, Jesse

2013-01-01

305

Neural Networks Approaches for Discovering the Learnable Correlation between Gene Function and Gene  

E-print Network

in many ways, especially in Gene Therapy [18]. Identifying gene function in prokaryotes is much easierNeural Networks Approaches for Discovering the Learnable Correlation between Gene Function and Gene University of Toronto Toronto, ON. emad@cs.toronto.edu Abstract. Identifying gene function has many useful

Bonner, Anthony

306

Assembly, Gene Annotation and Marker Development Using 454 Floral Transcriptome Sequences in Ziziphus Celata (Rhamnaceae), a Highly Endangered, Florida Endemic Plant  

PubMed Central

Large-scale DNA sequence data may enable development of genetic resources in endangered species, thereby facilitating conservation efforts. Ziziphus celata, a federally endangered, self-incompatible plant species occurring in Florida, USA, is one species for which genetic resources are necessary to facilitate new introductions and augmentations essential for recovery of the species. We used 454 pyrosequencing of a Z. celata normalized floral cDNA library to create a genomic resource for gene and marker discovery. A half-plate GS-FLX Titanium run yielded 655 337 reads averaging 250 bp. A total of 474 025 reads were assembled de novo into 84 645 contigs averaging 408 bp, while 181 312 reads remained unassembled. Forty-seven and 43% of contig consensus sequences had BLAST matches to known proteins in the Uniref50 and TAIR9 annotated protein databases, respectively; many contigs fully represented orthologous proteins in TAIR9. A total of 22 707 unique genes were sequenced, indicating substantial coverage of the Z. celata transcriptome. We detected single-nucleotide polymorphisms and simple sequence repeats (SSRs) and developed thousands of SSR primers for use in future genetic studies. As a first step towards understanding self-incompatibility in Z. celata, we identified sequences belonging to the gene family encoding self-incompatibility. This study demonstrates the efficacy of 454 transcriptome sequencing for rapid gene and marker discovery in an endangered plant. PMID:22039173

Edwards, Christine E.; Parchman, Thomas L.; Weekley, Carl W.

2012-01-01

307

Validation of a novel expressed sequence tag (EST) clustering method and development of a phylogenetic annotation pipeline for livestock gene families  

E-print Network

Prediction of functions of genes in a genome is a key step in all genome sequencing projects. Sequences that carry out important functions are likely to be conserved between evolutionarily distant species and can be identified using cross...

Venkatraman, Anand

2009-05-15

308

Proteogenomics: the needs and roles to be filled by proteomics in genome annotation  

SciTech Connect

While genome sequencing efforts reveal the basic building blocks of life, a genome sequence alone is insufficient for elucidating biological function. Genome annotation – the process of identifying genes and assigning function to each gene in a genome sequence – provides the means to elucidate biological function from sequence. Current state-of-the-art high throughput genome annotation uses a combination of comparative (sequence similarity data) and non-comparative (ab initio gene prediction algorithms) methods to identify open reading frames in genome sequences. Because approaches used to validate the presence of these open reading frames are typically based on the information derived from the annotated genomes, they cannot independently and unequivocally determine whether a predicted open reading frame is translated into a protein. With the ability to directly measure peptides arising from expressed proteins, high throughput liquid chromatography-tandem mass spectrometry-based proteomics approaches can be used to verify coding regions of a genomic sequence. Here, we highlight several ways in which high throughput tandem mass spectrometry-based proteomics can improve the quality of genome annotations and suggest that it could be efficiently applied during the initial gene calling process so that the improvements are propagated through the subsequent functional annotation process.
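
To make the idea of verifying coding regions with peptides concrete, the sketch below translates a DNA sequence in all six reading frames and reports where an observed peptide matches exactly. It is a deliberately minimal illustration: it assumes the standard genetic code and exact string matching, the function names and the toy sequence are hypothetical, and real proteogenomic pipelines score spectra against a translated database rather than searching peptide strings.

```python
# Minimal six-frame peptide mapping sketch (standard genetic code assumed).
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame_translations(genome):
    """Map (strand, frame offset) to the translated amino-acid string."""
    frames = {}
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for offset in range(3):
            frames[(strand, offset)] = translate(seq[offset:])
    return frames

def map_peptide(peptide, genome):
    """Return (strand, frame offset, amino-acid index) for each exact match."""
    hits = []
    for (strand, offset), protein in six_frame_translations(genome).items():
        start = protein.find(peptide)
        while start != -1:
            hits.append((strand, offset, start))
            start = protein.find(peptide, start + 1)
    return hits

toy_genome = "CCATGAAAGTTCTGTAAG"      # hypothetical sequence
hits = map_peptide("MKVL", toy_genome)  # peptide "observed" by MS/MS
```

A hit in a region with no annotated gene would suggest a missed coding sequence, while a hit in a different frame from the annotation would suggest a frameshift or a wrong start site.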

Ansong, Charles; Purvine, Samuel O.; Adkins, Joshua N.; Lipton, Mary S.; Smith, Richard D.

2008-01-01

309

Chado Controller: advanced annotation management with a community annotation system  

PubMed Central

Summary: We developed a controller that is compliant with the Chado database schema, GBrowse and genome annotation-editing tools such as Artemis and Apollo. It enables the management of public and private data, monitors manual annotation (with controlled vocabularies, structural and functional annotation controls) and stores versions of annotation for all modified features. The Chado controller uses PostgreSQL and Perl. Availability: The Chado Controller package is available for download at http://www.gnpannot.org/content/chado-controller and runs on any Unix-like operating system, and documentation is available at http://www.gnpannot.org/content/chado-controller-doc . The system can be tested using the GNPAnnot Sandbox at http://www.gnpannot.org/content/gnpannot-sandbox-form Contact: valentin.guignon@cirad.fr; stephanie.sidibe-bocs@cirad.fr Supplementary information: Supplementary data are available at Bioinformatics online. PMID:22285827

Guignon, Valentin; Droc, Gaetan; Alaux, Michael; Baurens, Franc-Christophe; Garsmeur, Olivier; Poiron, Claire; Carver, Tim; Rouard, Mathieu; Bocs, Stephanie

2012-01-01

310

The Structure of a Gene Co-Expression Network Reveals Biological Functions Underlying eQTLs  

PubMed Central

What are the commonalities between genes, whose expression level is partially controlled by eQTL, especially with regard to biological functions? Moreover, how are these genes related to a phenotype of interest? These issues are particularly difficult to address when the genome annotation is incomplete, as is the case for mammalian species. Moreover, the direct link between gene expression and a phenotype of interest may be weak, and thus difficult to handle. In this framework, the use of a co-expression network has proven useful: it is a robust approach for modeling a complex system of genetic regulations, and to infer knowledge for yet unknown genes. In this article, a case study was conducted with a mammalian species. It showed that the use of a co-expression network based on partial correlation, combined with a relevant clustering of nodes, leads to an enrichment of biological functions of around 83%. Moreover, the use of a spatial statistics approach allowed us to superimpose additional information related to a phenotype; this led to highlighting specific genes or gene clusters that are related to the network structure and the phenotype. Three main results are worth noting: first, key genes were highlighted as a potential focus for forthcoming biological experiments; second, a set of biological functions, which support a list of genes under partial eQTL control, was set up by an overview of the global structure of the gene expression network; third, pH was found to be correlated with gene clusters, and then with related biological functions, as a result of a spatial analysis of the network topology. PMID:23577081
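
The reason a co-expression network is often built on partial rather than plain correlation can be shown with three genes: if a shared regulator drives two genes, their plain correlation is high, but it shrinks once the regulator is conditioned on. The sketch below uses the first-order partial correlation formula for exactly three variables; a genome-scale network would condition on all other genes instead, and the gene names and expression values here are invented for illustration.

```python
# Illustrative sketch: plain vs partial correlation for three hypothetical genes.
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Toy expression profiles: regulator_z drives both gene_x and gene_y.
regulator_z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_x = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5]
gene_y = [1.5, 1.5, 2.5, 4.5, 5.5, 5.5]

plain = pearson(gene_x, gene_y)
partial = partial_correlation(gene_x, gene_y, regulator_z)
```

In this toy case the plain correlation is above 0.9 while the partial correlation is much weaker, so a partial-correlation network would be less likely to draw an edge between the two genes.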

Villa-Vialaneix, Nathalie; Liaubet, Laurence; Laurent, Thibault; Cherel, Pierre; Gamot, Adrien; SanCristobal, Magali

2013-01-01

311

Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs  

Microsoft Academic Search

Only a small proportion of the mouse genome is transcribed into mature messenger RNA transcripts. There is an international collaborative effort to identify all full-length mRNA transcripts from the mouse, and to ensure that each is represented in a physical collection of clones. Here we report the manual annotation of 60,770 full-length mouse complementary DNA sequences. These are clustered into

Y. Okazaki; M. Furuno; T. Kasukawa; J. Adachi; H. Bono; S. Kondo; I. Nikaido; N. Osato; R. Saito; H. Suzuki; I. Yamanaka; H. Kiyosawa; K. Yagi; Y. Tomaru; Y. Hasegawa; A. Nogami; C. Schönbach; T. Gojobori; R. Baldarelli; D. P. Hill; C. Bult; D. A. Hume; J. Quackenbush; L. M. Schriml; A. Kanapin; H. Matsuda; S. Batalov; K. W. Beisel; J. A. Blake; D. Bradt; V. Brusic; C. Chothia; L. E. Corbani; S. Cousins; E. Dalla; T. A. Dragani; C. F. Fletcher; A. Forrest; K. S. Frazer; T. Gaasterland; M. Gariboldi; C. Gissi; A. Godzik; J. Gough; S. Grimmond; S. Gustincich; N. Hirokawa; I. J. Jackson; E. D. Jarvis; A. Kanai; H. Kawaji; Y. Kawasawa; R. M. Kedzierski; B. L. King; A. Konagaya; I. V. Kurochkin; Y. Lee; B. Lenhard; P. A. Lyons; D. R. Maglott; L. Maltais; L. Marchionni; L. McKenzie; H. Miki; T. Nagashima; K. Numata; T. Okido; W. J. Pavan; G. Pertea; G. Pesole; N. Petrovsky; R. Pillai; J. U. Pontius; D. Qi; S. Ramachandran; T. Ravasi; J. C. Reed; D. J. Reed; J. Reid; B. Z. Ring; M. Ringwald; A. Sandelin; C. Schneider; C. A. M. Semple; M. Setou; K. Shimada; R. Sultana; Y. Takenaka; M. S. Taylor; R. D. Teasdale; M. Tomita; R. Verardo; L. Wagner; C. Wahlestedt; Y. Wang; Y. Watanabe; C. Wells; L. G. Wilming; A. Wynshaw-Boris; M. Yanagisawa; I. Yang; L. Yang; Z. Yuan; M. Zavolan; Y. Zhu; A. Zimmer; P. Carninci; N. Hayatsu; T. Hirozane-Kishikawa; H. Konno; M. Nakamura; N. Sakazume; K. Sato; T. Shiraki; K. Waki; J. Kawai; K. Aizawa; T. Arakawa; S. Fukuda; A. Hara; W. Hashizume; K. Imotani; Y. Ishii; M. Itoh; I. Kagawa; A. Miyazaki; K. Sakai; D. Sasaki; K. Shibata; A. Shinagawa; A. Yasunishi; M. Yoshino; R. Waterston; E. S. Lander; J. Rogers; E. Birney; Y. Hayashizaki

2002-01-01

312

Analysis of Antisense Expression by Whole Genome Tiling Microarrays and siRNAs Suggests Mis-Annotation of Arabidopsis Orphan Protein-Coding Genes  

PubMed Central

Background MicroRNAs (miRNAs) and trans-acting small-interfering RNAs (tasi-RNAs) are small (20–22 nt long) RNAs (smRNAs) generated from hairpin secondary structures or antisense transcripts, respectively, that regulate gene expression by Watson-Crick pairing to a target mRNA and altering expression by mechanisms related to RNA interference. The high sequence homology of plant miRNAs to their targets has been the mainstay of miRNA prediction algorithms, which are limited in their predictive power for other kingdoms because miRNA complementarity is less conserved yet transitive processes (production of antisense smRNAs) are active in eukaryotes. We hypothesize that antisense transcription and associated smRNAs are biomarkers which can be computationally modeled for gene discovery. Principal Findings We explored rice (Oryza sativa) sense and antisense gene expression in publicly available whole genome tiling array transcriptome data and sequenced smRNA libraries (as well as C. elegans) and found evidence of transitivity of MIRNA genes similar to that found in Arabidopsis. Statistical analysis of antisense transcript abundances, presence of antisense ESTs, and association with smRNAs suggests several hundred Arabidopsis ‘orphan’ hypothetical genes are non-coding RNAs. Consistent with this hypothesis, we found novel Arabidopsis homologues of some MIRNA genes on the antisense strand of previously annotated protein-coding genes. A Support Vector Machine (SVM) was applied using thermodynamic energy of binding plus novel expression features of sense/antisense transcription topology and siRNA abundances to build a prediction model of miRNA targets. The SVM when trained on targets could predict the “ancient” (deeply conserved) class of validated Arabidopsis MIRNA genes with an accuracy of 84%, and 76% for “new” rapidly-evolving MIRNA genes. 
Conclusions Antisense and smRNA expression features and computational methods may identify novel MIRNA genes and other non-coding RNAs in plants and potentially other kingdoms, which can provide insight into antisense transcription, miRNA evolution, and post-transcriptional gene regulation. PMID:20520764

Richardson, Casey R.; Luo, Qing-Jun; Gontcharova, Viktoria; Jiang, Ying-Wen; Samanta, Manoj; Youn, Eunseog; Rock, Christopher D.

2010-01-01

313

Annotation of a hybrid partial genome of the coffee rust (Hemileia vastatrix) contributes to the gene repertoire catalog of the Pucciniales  

PubMed Central

Coffee leaf rust caused by the fungus Hemileia vastatrix is the most damaging disease to coffee worldwide. The pathogen has recently appeared in multiple outbreaks in coffee producing countries resulting in significant yield losses and increases in costs related to its control. New races/isolates are constantly emerging as evidenced by the presence of the fungus in plants that were previously resistant. Genomic studies are opening new avenues for the study of the evolution of pathogens, the detailed description of plant-pathogen interactions and the development of molecular techniques for the identification of individual isolates. For this purpose we sequenced 8 different H. vastatrix isolates using NGS technologies and gathered partial genome assemblies due to the large repetitive content in the coffee rust hybrid genome; 74.4% of the assembled contigs harbor repetitive sequences. A hybrid assembly of 333 Mb was built based on the 8 isolates; this assembly was used for subsequent analyses. Analysis of the conserved gene space showed that the hybrid H. vastatrix genome, though highly fragmented, had a satisfactory level of completion with 91.94% of core protein-coding orthologous genes present. RNA-Seq from urediniospores was used to guide the de novo annotation of the H. vastatrix gene complement. In total, 14,445 genes organized in 3921 families were uncovered; a considerable proportion of the predicted proteins (73.8%) were homologous to other Pucciniales species genomes. Several gene families related to the fungal lifestyle were identified, particularly 483 predicted secreted proteins that represent candidate effector genes and will provide interesting hints to decipher virulence in the coffee rust fungus. The genome sequence of Hva will serve as a template to understand the molecular mechanisms used by this fungus to attack the coffee plant, to study the diversity of this species and for the development of molecular markers to distinguish races/isolates.

Cristancho, Marco A.; Botero-Rozo, David Octavio; Giraldo, William; Tabima, Javier; Riano-Pachon, Diego Mauricio; Escobar, Carolina; Rozo, Yomara; Rivera, Luis F.; Duran, Andres; Restrepo, Silvia; Eilam, Tamar; Anikster, Yehoshua; Gaitan, Alvaro L.

2014-01-01

314

GO-based Functional Dissimilarity of Gene Sets  

PubMed Central

Background The Gene Ontology (GO) provides a controlled vocabulary for describing the functions of genes and can be used to evaluate the functional coherence of gene sets. Many functional coherence measures consider each pair of gene functions in a set and produce an output based on all pairwise distances. A single gene can encode multiple proteins that may differ in function. For each functionality, other proteins that exhibit the same activity may also participate. Therefore, an identification of the most common function for all of the genes involved in a biological process is important in evaluating the functional similarity of groups of genes and a quantification of functional coherence can help to clarify the role of a group of genes working together. Results To implement this approach to functional assessment, we present GFD (GO-based Functional Dissimilarity), a novel dissimilarity measure for evaluating groups of genes based on the most relevant functions of the whole set. The measure assigns a numerical value to the gene set for each of the three GO sub-ontologies. Conclusions Results show that GFD performs robustly when applied to gene sets of known functionality (extracted from KEGG). It performs particularly well on randomly generated gene sets. An ROC analysis reveals that the performance of GFD in evaluating the functional dissimilarity of gene sets is very satisfactory. A comparative analysis against other functional measures, such as GS2 and those presented by Resnik and Wang, also demonstrates the robustness of GFD. PMID:21884611

2011-01-01

315

Drosophila genomic sequence annotation using the BLOCKS+ database.  

PubMed

A simple and general homology-based method for gene finding was applied to the 2.9-Mb Drosophila melanogaster Adh region, the target sequence of the Genome Annotation Assessment Project (GASP). Each strand of the entire sequence was used as query of the BLOCKS+ database of conserved regions of proteins. This led to functional assignments for more than one-third of the genes and two-thirds of the transposons. Considering the enormous size of the query, the fact that only two false-positive matches were reported emphasizes the high selectivity of protein family-based methods for gene finding. We used the search results to improve BLOCKS+ by identifying compositionally biased blocks. Our results confirm that protein family databases can be used effectively in automated sequence annotation efforts. PMID:10779495

Henikoff, J G; Henikoff, S

2000-04-01

316

Bioconductor: annotation databases Thomas Lumley  

E-print Network

Bioconductor: annotation databases Thomas Lumley Ken Rice UW Biostatistics Seattle, June 2008 chromosome, to estimate sex from DNA intensity and heterozygous X-chromosome loci, for QC. > head but tedious, so we want an automated approach Example: finding chromosomes First extract the gene names

Rice, Ken

317

Annotating nonspecific SAGE tags with microarray data.  

PubMed

SAGE (serial analysis of gene expression) detects transcripts by extracting short tags from the transcripts. Because of the limited length, many SAGE tags are shared by transcripts from different genes. Relying on sequence information in the general gene expression database has limited power to solve this problem due to the highly heterogeneous nature of the deposited sequences. Considering that the complexity of gene expression at a single tissue level should be much simpler than that in the general expression database, we reasoned that by restricting gene expression to tissue level, the accuracy of gene annotation for the nonspecific SAGE tags should be significantly improved. To test the idea, we developed a tissue-specific SAGE annotation database based on microarray data. This database contains microarray expression information represented as UniGene clusters for 73 normal human tissues and 18 cancer tissues and cell lines. The nonspecific SAGE tag is first matched to the database by the same tissue type used by both SAGE and microarray analysis; then the multiple UniGene clusters assigned to the nonspecific SAGE tag are searched in the database under the matched tissue type. The UniGene cluster presented solely or at higher expression levels in the database is annotated to represent the specific gene for the nonspecific SAGE tags. The accuracy of gene annotation by this database was largely confirmed by experimental data. Our study shows that microarray data provide a useful source for annotating the nonspecific SAGE tags. PMID:16314072
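
The lookup described above amounts to a small selection rule: restrict the candidate clusters of an ambiguous tag to the tissue in which the tag was observed, then keep the cluster that is the only one expressed or the one expressed most strongly. The sketch below encodes that rule under simplifying assumptions; the cluster names, tissues, expression values, and the choice to return `None` when no candidate is expressed are all hypothetical and not taken from the database itself.

```python
# Illustrative sketch of tissue-restricted annotation for an ambiguous SAGE tag.

def annotate_tag(candidate_clusters, tissue, expression):
    """Return the candidate cluster with the highest expression in `tissue`.

    `expression` maps tissue -> {cluster: expression level}. Returns None
    when no candidate is expressed in that tissue.
    """
    tissue_levels = expression.get(tissue, {})
    expressed = {c: tissue_levels[c] for c in candidate_clusters
                 if tissue_levels.get(c, 0.0) > 0}
    if not expressed:
        return None
    return max(expressed, key=expressed.get)

# Hypothetical microarray-derived expression table.
expression = {
    "liver": {"cluster_A": 850.0, "cluster_B": 12.0},
    "brain": {"cluster_B": 400.0},
}
candidates = ["cluster_A", "cluster_B"]  # clusters sharing one SAGE tag

liver_call = annotate_tag(candidates, "liver", expression)
brain_call = annotate_tag(candidates, "brain", expression)
kidney_call = annotate_tag(candidates, "kidney", expression)
```

A production version would probably also need a rule for near-ties between clusters, which the abstract does not specify.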

Ge, Xijin; Jung, Yong-Chul; Wu, Qingfa; Kibbe, Warren A; Wang, San Ming

2006-01-01
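
The tissue-matched lookup described in the SAGE record above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `EXPRESSION` table, the cluster identifiers, the signal values, and the function name `annotate_tag` are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical toy data: EXPRESSION[tissue][unigene_cluster] = microarray signal.
EXPRESSION: Dict[str, Dict[str, float]] = {
    "liver": {"Hs.1001": 850.0, "Hs.2002": 12.0},
    "brain": {"Hs.2002": 430.0, "Hs.3003": 95.0},
}


def annotate_tag(
    candidates: List[str],
    tissue: str,
    expression: Dict[str, Dict[str, float]] = EXPRESSION,
) -> Optional[str]:
    """Resolve a nonspecific tag to one UniGene cluster within a single tissue.

    `candidates` are the clusters that share the tag. The cluster that is
    detected alone, or at the highest level, in the matched tissue is returned;
    None is returned when no candidate is detected in that tissue.
    """
    tissue_profile = expression.get(tissue, {})
    detected = {c: tissue_profile[c] for c in candidates if c in tissue_profile}
    if not detected:
        return None
    return max(detected, key=detected.get)


if __name__ == "__main__":
    # The same ambiguous tag resolves differently depending on the tissue.
    print(annotate_tag(["Hs.1001", "Hs.2002"], "liver"))
    print(annotate_tag(["Hs.1001", "Hs.2002"], "brain"))
```

Restricting the lookup to one tissue is the point of the design: a tag that is ambiguous across the whole expression database often has only one plausible source within a given tissue.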

318

The evolution of the plastid chromosome in land plants: gene content, gene order, gene function  

E-print Network

, we will explore the functions and roles of plastid encoded genes in metabolism and their evolutionary take place in plastids, including synthesis of starch, fatty acids, pigments and amino acids (reviewed

dePamphilis, Claude

319

ACID: annotation of cassette and integron data  

E-print Network

Background: Although integrons and their associated gene cassettes are present in ~10% of bacteria and can represent up to 3% of the genome in which they are found, very few have been properly identified and annotated in ...

Boucher, Yan

320

Discovery and functional assessment of gene variants in the vascular endothelial growth factor pathway.  

PubMed

Angiogenesis is a host-mediated mechanism in disease pathophysiology. The vascular endothelial growth factor (VEGF) pathway is a major determinant of angiogenesis, and a comprehensive annotation of the functional variation in this pathway is essential to understand the genetic basis of angiogenesis-related diseases. We assessed the allelic heterogeneity of gene expression, population specificity of cis expression quantitative trait loci (eQTLs), and eQTL function in luciferase assays in CEU and Yoruba people of Ibadan, Nigeria (YRI) HapMap lymphoblastoid cell lines in 23 resequenced genes. Among 356 cis-eQTLs, 155 and 174 were unique to CEU and YRI, respectively, and 27 were shared between CEU and YRI. Two cis-eQTLs provided mechanistic evidence for two genome-wide association study findings. Five eQTLs were tested for function in luciferase assays and the effect of two KRAS variants was concordant with the eQTL effect. Two eQTLs found in each of PRKCE, PIK3C2A, and MAP2K6 could predict 44%, 37%, and 45% of the variance in gene expression, respectively. This is the first analysis focusing on the pattern of functional genetic variation of the VEGF pathway genes in CEU and YRI populations and providing mechanistic evidence for genetic association studies of diseases for which angiogenesis plays a pathophysiologic role. PMID:24186849

Paré-Brunet, Laia; Glubb, Dylan; Evans, Patrick; Berenguer-Llergo, Antoni; Etheridge, Amy S; Skol, Andrew D; Di Rienzo, Anna; Duan, Shiwei; Gamazon, Eric R; Innocenti, Federico

2014-02-01

321

Automatic Annotation of Spatial Expression Patterns via Sparse Bayesian Factor Models  

PubMed Central

Advances in reporters for gene expression have made it possible to document and quantify expression patterns in 2D–4D. In contrast to microarrays, which provide data for many genes but averaged and/or at low resolution, images reveal the high spatial dynamics of gene expression. Developing computational methods to compare, annotate, and model gene expression based on images is imperative, considering that available data are rapidly increasing. We have developed a sparse Bayesian factor analysis model in which the observed expression diversity among a large set of high-dimensional images is modeled by a small number of hidden common factors. We apply this approach on embryonic expression patterns from a Drosophila RNA in situ image database, and show that the automatically inferred factors provide for a meaningful decomposition and represent common co-regulation or biological functions. The low-dimensional set of factor mixing weights is further used as features by a classifier to annotate expression patterns with functional categories. On human-curated annotations, our sparse approach reaches similar or better classification of expression patterns at different developmental stages, when compared to other automatic image annotation methods using thousands of hard-to-interpret features. Our study therefore outlines a general framework for large microscopy data sets, in which both the generative model itself, as well as its application for analysis tasks such as automated annotation, can provide insight into biological questions. PMID:21814502

Pruteanu-Malinici, Iulian; Mace, Daniel L.; Ohler, Uwe

2011-01-01

322

Automatic annotation of spatial expression patterns via sparse Bayesian factor models.  

PubMed

Advances in reporters for gene expression have made it possible to document and quantify expression patterns in 2D-4D. In contrast to microarrays, which provide data for many genes but averaged and/or at low resolution, images reveal the high spatial dynamics of gene expression. Developing computational methods to compare, annotate, and model gene expression based on images is imperative, considering that available data are rapidly increasing. We have developed a sparse Bayesian factor analysis model in which the observed expression diversity among a large set of high-dimensional images is modeled by a small number of hidden common factors. We apply this approach on embryonic expression patterns from a Drosophila RNA in situ image database, and show that the automatically inferred factors provide for a meaningful decomposition and represent common co-regulation or biological functions. The low-dimensional set of factor mixing weights is further used as features by a classifier to annotate expression patterns with functional categories. On human-curated annotations, our sparse approach reaches similar or better classification of expression patterns at different developmental stages, when compared to other automatic image annotation methods using thousands of hard-to-interpret features. Our study therefore outlines a general framework for large microscopy data sets, in which both the generative model itself, as well as its application for analysis tasks such as automated annotation, can provide insight into biological questions. PMID:21814502

Pruteanu-Malinici, Iulian; Mace, Daniel L; Ohler, Uwe

2011-07-01

323

Towards Viral Genome Annotation Standards, Report from the 2010 NCBI Annotation Workshop  

PubMed Central

Improvements in DNA sequencing technologies portend a new era in virology and could possibly lead to a giant leap in our understanding of viral evolution and ecology. Yet, as viral genome sequences begin to fill the world’s biological databases, it is critically important to recognize that the scientific promise of this era is dependent on consistent and comprehensive genome annotation. With this in mind, the NCBI Genome Annotation Workshop recently hosted a study group tasked with developing sequence, function, and metadata annotation standards for viral genomes. This report describes the issues involved in viral genome annotation and reviews policy recommendations presented at the NCBI Annotation Workshop. PMID:21994619

Brister, James Rodney; Bao, Yiming; Kuiken, Carla; Lefkowitz, Elliot J.; Le Mercier, Philippe; Leplae, Raphael; Madupu, Ramana; Scheuermann, Richard H.; Schobel, Seth; Seto, Donald; Shrivastava, Susmita; Sterk, Peter; Zeng, Qiandong; Klimke, William; Tatusova, Tatiana

2010-01-01

324

Objective-guided image annotation.  

PubMed

Automatic image annotation, which is usually formulated as a multi-label classification problem, is one of the major tools used to enhance the semantic understanding of web images. Many multimedia applications (e.g., tag-based image retrieval) can greatly benefit from image annotation. However, the insufficient performance of image annotation methods prevents these applications from being practical. On the other hand, specific measures are usually designed to evaluate how well one annotation method performs for a specific objective or application, but most image annotation methods do not consider optimization of these measures, so they are inevitably trapped in suboptimal performance on these objective-specific measures. To address this issue, we first summarize a variety of objective-guided performance measures under a unified representation. Our analysis reveals that macro-averaging measures are very sensitive to infrequent keywords, and that the Hamming measure is easily affected by skewed distributions. We then propose a unified multi-label learning framework, which directly optimizes a variety of objective-specific measures of multi-label learning tasks. Specifically, we first present a multilayer hierarchical structure of learning hypotheses for multi-label problems, based on which a variety of loss functions with respect to objective-guided measures are defined. We then formulate these loss functions as relaxed surrogate functions and optimize them by structural SVMs. According to the analysis of various measures and the high time complexity of optimizing micro-averaging measures, in this paper, we focus on example-based measures that are tailor-made for image annotation tasks but are seldom explored in the literature. Experiments show consistency with the formal analysis on two widely used multi-label datasets, and demonstrate the superior performance of our proposed method over state-of-the-art baseline methods in terms of example-based measures on four image annotation datasets. PMID:23247859

Mao, Qi; Tsang, Ivor Wai-Hung; Gao, Shenghua

2013-04-01
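
To make the measures named in the record above concrete, the following sketch computes a Hamming loss, an example-based F1, and a macro-averaged F1 on a toy image-tagging problem. It uses the standard textbook definitions of these measures, which may differ in detail from the unified representation in the paper; the keyword sets and the convention of scoring an empty-versus-empty comparison as 1.0 are assumptions made for illustration.

```python
from typing import List, Set


def hamming_loss(true: List[Set[str]], pred: List[Set[str]], labels: Set[str]) -> float:
    """Fraction of (image, keyword) decisions that are wrong."""
    errors = sum(len(t ^ p) for t, p in zip(true, pred))
    return errors / (len(true) * len(labels))


def example_based_f1(true: List[Set[str]], pred: List[Set[str]]) -> float:
    """F1 computed per image, then averaged over images."""
    scores = []
    for t, p in zip(true, pred):
        denom = len(t) + len(p)
        scores.append(2 * len(t & p) / denom if denom else 1.0)
    return sum(scores) / len(scores)


def macro_f1(true: List[Set[str]], pred: List[Set[str]], labels: Set[str]) -> float:
    """F1 computed per keyword, then averaged over keywords."""
    scores = []
    for label in sorted(labels):
        tp = sum(1 for t, p in zip(true, pred) if label in t and label in p)
        fp = sum(1 for t, p in zip(true, pred) if label not in t and label in p)
        fn = sum(1 for t, p in zip(true, pred) if label in t and label not in p)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 1.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: "tiger" is an infrequent keyword that is missed once.
LABELS = {"sky", "sea", "tiger"}
TRUE = [{"sky", "sea"}, {"sky"}, {"sky", "tiger"}, {"sea"}]
PRED = [{"sky", "sea"}, {"sky"}, {"sky"}, {"sea"}]

if __name__ == "__main__":
    print("Hamming loss    :", hamming_loss(TRUE, PRED, LABELS))
    print("example-based F1:", example_based_f1(TRUE, PRED))
    print("macro F1        :", macro_f1(TRUE, PRED, LABELS))
```

In this toy case a single missed occurrence of the rare keyword barely moves the Hamming loss or the example-based F1 but pulls the macro F1 down to two thirds, which illustrates the sensitivity to infrequent keywords that the abstract describes.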

325

DAnCER: Disease-Annotated Chromatin Epigenetics Resource  

PubMed Central

Chromatin modification (CM) is a set of epigenetic processes that govern many aspects of DNA replication, transcription and repair. CM is carried out by groups of physically interacting proteins, and their disruption has been linked to a number of complex human diseases. CM remains largely unexplored, however, especially in higher eukaryotes such as human. Here we present the DAnCER resource, which integrates information on genes with CM function from five model organisms, including human. Currently integrated are gene functional annotations, Pfam domain architecture, protein interaction networks and associated human diseases. Additional supporting evidence includes orthology relationships across organisms, membership in protein complexes, and information on protein 3D structure. These data are available for 962 experimentally confirmed and manually curated CM genes and for over 5000 genes with predicted CM function on the basis of orthology and domain composition. DAnCER allows visual explorations of the integrated data and flexible query capabilities using a variety of data filters. In particular, disease information and functional annotations are mapped onto the protein interaction networks, enabling the user to formulate new hypotheses on the function and disease associations of a given gene based on those of its interaction partners. DAnCER is freely available at http://wodaklab.org/dancer/. PMID:20876685

Turinsky, Andrei L.; Turner, Brian; Borja, Rosanne C.; Gleeson, James A.; Heath, Michael; Pu, Shuye; Switzer, Thomas; Dong, Dong; Gong, Yunchen; On, Tuan; Xiong, Xuejian; Emili, Andrew; Greenblatt, Jack; Parkinson, John; Zhang, Zhaolei; Wodak, Shoshana J.

2011-01-01

326

Using co-expression to redefine functional gene sets for gene set enrichment analysis  

E-print Network

Manually curated gene sets related to a biological function often contain genes that are not tightly co-regulated transcriptionally, which obscures the evidence of coordinated differential expression of these gene sets in ...

Kodysh, Yuliya

2007-01-01

327

Genome-wide SNP genotyping to infer the effects on gene functions in tomato.  

PubMed

The genotype data of 7054 single nucleotide polymorphism (SNP) loci in 40 tomato lines, including inbred lines, F1 hybrids, and wild relatives, were collected using Illumina's Infinium and GoldenGate assay platforms, the latter of which was utilized in our previous study. The dendrogram based on the genotype data corresponded well to the breeding types of tomato and wild relatives. The SNPs were classified into six categories according to their positions in the genes predicted on the tomato genome sequence. The genes with SNPs were annotated by homology searches against the nucleotide and protein databases, as well as by domain searches, and they were classified into the functional categories defined by the NCBI's eukaryotic orthologous groups (KOG). To infer the SNPs' effects on the gene functions, the three-dimensional structures of the 843 proteins that were encoded by the genes with SNPs causing missense mutations were constructed by homology modelling, and 200 of these proteins were considered to carry non-synonymous amino acid substitutions in the predicted functional sites. The SNP information obtained in this study is available at the Kazusa Tomato Genomics Database (http://plant1.kazusa.or.jp/tomato/). PMID:23482505

Hirakawa, Hideki; Shirasawa, Kenta; Ohyama, Akio; Fukuoka, Hiroyuki; Aoki, Koh; Rothan, Christophe; Sato, Shusei; Isobe, Sachiko; Tabata, Satoshi

2013-06-01

328

Annotation of primate miRNAs by high throughput sequencing of small RNA libraries  

PubMed Central

Background: In addition to genome sequencing, accurate functional annotation of genomes is required in order to carry out comparative and evolutionary analyses between species. Among primates, the human genome is the most extensively annotated. Human miRNA gene annotation is based on multiple lines of evidence including evidence for expression as well as prediction of the characteristic hairpin structure. In contrast, most miRNA genes in non-human primates are annotated based on homology without any expression evidence. We have sequenced small-RNA libraries from chimpanzee, gorilla, orangutan and rhesus macaque from multiple individuals and tissues. Using patterns of miRNA expression in conjunction with a model of miRNA biogenesis we used these high-throughput sequencing data to identify novel miRNAs in non-human primates. Results: We predicted 47 new miRNAs in chimpanzee, 240 in gorilla, 55 in orangutan and 47 in rhesus macaque. The algorithm we used was able to predict 64% of the previously known miRNAs in chimpanzee, 94% in gorilla, 61% in orangutan and 71% in rhesus macaque. We therefore added evidence for expression in between one and five tissues to miRNAs that were previously annotated based only on homology to human miRNAs. We increased the number of miRNAs that are located in orthologous regions in humans and the four non-human primate species studied here from 60 to 175. Conclusions: In this study we provide expression evidence for homology-based annotated miRNAs and predict de novo miRNAs in four non-human primate species. We increased the number of annotated miRNA genes and provided evidence for their expression in four non-human primates. Similar approaches using different individuals and tissues would improve annotation in non-human primates and allow for further comparative studies in the future. PMID:22453055

2012-01-01

329

categoryCompare, an analytical tool based on feature annotations  

PubMed Central

Assessment of high-throughput -omics data initially focuses on relative or raw levels of a particular feature, such as an expression value for a transcript, protein, or metabolite. At a second level, analyses of annotations, including known or predicted functions and associations of each individual feature, attempt to distill biological context. Most currently available comparative and meta-analysis methods are dependent on the availability of identical features across data sets, and concentrate on determining features that are differentially expressed across experiments, some of which may be considered “biomarkers.” The heterogeneity of measurement platforms and inherent variability of biological systems confound the search for robust biomarkers indicative of a particular condition. In many instances, however, multiple data sets show involvement of common biological processes or signaling pathways, even though individual features are not commonly measured or differentially expressed between them. We developed a methodology, categoryCompare, for cross-platform and cross-sample comparison of high-throughput data at the annotation level. We assessed the utility of the approach using hypothetical data, as well as determining similarities and differences in the set of processes in two instances: (1) denervated skin vs. denervated muscle, and (2) colon from Crohn's disease vs. colon from ulcerative colitis (UC). The hypothetical data showed that in many cases comparing annotations gave superior results to comparing only at the gene level. Improved analytical results depended as well on the number of genes included in the annotation term, the amount of noise in relation to the number of genes expressing in unenriched annotation categories, and the specific method in which samples are combined. In the skin vs. muscle denervation comparison, the tissues demonstrated markedly different responses. The Crohn's vs. UC comparison showed gross similarities in inflammatory response in the two diseases, with particular processes specific to each disease. PMID:24808906

Flight, Robert M.; Harrison, Benjamin J.; Mohammad, Fahim; Bunge, Mary B.; Moon, Lawrence D. F.; Petruska, Jeffrey C.; Rouchka, Eric C.

2014-01-01

330

KSHV 2.0: A Comprehensive Annotation of the Kaposi's Sarcoma-Associated Herpesvirus Genome Using Next-Generation Sequencing Reveals Novel Genomic and Functional Features  

PubMed Central

Productive herpesvirus infection requires a profound, time-controlled remodeling of the viral transcriptome and proteome. To gain insights into the genomic architecture and gene expression control in Kaposi's sarcoma-associated herpesvirus (KSHV), we performed a systematic genome-wide survey of viral transcriptional and translational activity throughout the lytic cycle. Using mRNA-sequencing and ribosome profiling, we found that transcripts encoding lytic genes are promptly bound by ribosomes upon lytic reactivation, suggesting their regulation is mainly transcriptional. Our approach also uncovered new genomic features such as ribosome occupancy of viral non-coding RNAs, numerous upstream and small open reading frames (ORFs), and unusual strategies to expand the virus coding repertoire that include alternative splicing, dynamic viral mRNA editing, and the use of alternative translation initiation codons. Furthermore, we provide a refined and expanded annotation of transcription start sites, polyadenylation sites, splice junctions, and initiation/termination codons of known and new viral features in the KSHV genomic space which we have termed KSHV 2.0. Our results represent a comprehensive genome-scale image of gene regulation during lytic KSHV infection that substantially expands our understanding of the genomic architecture and coding capacity of the virus. PMID:24453964

Arias, Carolina; Weisburd, Ben; Stern-Ginossar, Noam; Mercier, Alexandre; Madrid, Alexis S.; Bellare, Priya; Holdorf, Meghan; Weissman, Jonathan S.; Ganem, Don

2014-01-01

331

Comparative genome analysis of PHB gene family reveals deep evolutionary origins and diverse gene function  

Microsoft Academic Search

BACKGROUND: PHB (Prohibitin) gene family is involved in a variety of functions important for different biological processes. PHB genes are ubiquitously present in divergent species from prokaryotes to eukaryotes. Human PHB genes have been found to be associated with various diseases. Recent studies by our group and others have shown diverse function of PHB genes in plants for development, senescence,

Chao Di; Wenying Xu; Zhen Su; Joshua S Yuan

2010-01-01

332

Chapter 14: Genome Assembly and Annotation Process Annotation of other genomes  

E-print Network

the Mouse Genome Sequencing Consortium (MGSC) by skipping the assembly steps used in the human process used to assemble and annotate genomic contigs from finished mouse clone sequences (see the Map Viewer to genetic maps and known genes, NCBI provides annotated assemblies of public genome sequence data. NCBI

Levin, Judith G.

333

Glucose - An Annotated Bibliography.  

National Technical Information Service (NTIS)

The annotated bibliography contains 905 citations. About 90 percent of the articles annotated pertain to glucose analytical methodology and the other 10 percent consists of clinical articles which pertain to the glucose tolerance test and normal values. T...

A. M. Polk, C. Lewis, N. Radin, N. M. Richardson

1976-01-01

334

ToppGene Suite for gene list enrichment analysis and candidate gene prioritization  

PubMed Central

ToppGene Suite (http://toppgene.cchmc.org; this web site is free and open to all users and does not require a login to access) is a one-stop portal for (i) gene list functional enrichment, (ii) candidate gene prioritization using either functional annotations or network analysis and (iii) identification and prioritization of novel disease candidate genes in the interactome. Functional annotation-based disease candidate gene prioritization uses a fuzzy-based similarity measure to compute the similarity between any two genes based on semantic annotations. The similarity scores from individual features are combined into an overall score using statistical meta-analysis. A P-value of each annotation of a test gene is derived by random sampling of the whole genome. The protein–protein interaction network (PPIN)-based disease candidate gene prioritization uses social and Web networks analysis algorithms (extended versions of the PageRank and HITS algorithms, and the K-Step Markov method). We demonstrate the utility of ToppGene Suite using 20 recently reported GWAS-based gene–disease associations (including novel disease genes) representing five diseases. ToppGene ranked 19 of 20 (95%) candidate genes within the top 20%, while ToppNet ranked 12 of 16 (75%) candidate genes among the top 20%. PMID:19465376

Chen, Jing; Bardes, Eric E.; Aronow, Bruce J.; Jegga, Anil G.

2009-01-01

335

ToppGene Suite for gene list enrichment analysis and candidate gene prioritization.  

PubMed

ToppGene Suite (http://toppgene.cchmc.org; this web site is free and open to all users and does not require a login to access) is a one-stop portal for (i) gene list functional enrichment, (ii) candidate gene prioritization using either functional annotations or network analysis and (iii) identification and prioritization of novel disease candidate genes in the interactome. Functional annotation-based disease candidate gene prioritization uses a fuzzy-based similarity measure to compute the similarity between any two genes based on semantic annotations. The similarity scores from individual features are combined into an overall score using statistical meta-analysis. A P-value of each annotation of a test gene is derived by random sampling of the whole genome. The protein-protein interaction network (PPIN)-based disease candidate gene prioritization uses social and Web networks analysis algorithms (extended versions of the PageRank and HITS algorithms, and the K-Step Markov method). We demonstrate the utility of ToppGene Suite using 20 recently reported GWAS-based gene-disease associations (including novel disease genes) representing five diseases. ToppGene ranked 19 of 20 (95%) candidate genes within the top 20%, while ToppNet ranked 12 of 16 (75%) candidate genes among the top 20%. PMID:19465376

Chen, Jing; Bardes, Eric E; Aronow, Bruce J; Jegga, Anil G

2009-07-01

336

Original article AnnotCompute: annotation-based  

E-print Network

, University of Pennsylvania, 3330 Walnut Street, Philadelphia, PA 19104, USA *Corresponding author: Tel: +215 is available for download at http://www.cbil.upenn.edu/downloads/AnnotCompute. Database URL: http implements search and browsing functionality, and also makes its repository available for download. Array

Pennsylvania, University of

337

Teachers Reference: Annotations  

NSDL National Science Digital Library

This collection of 171 annotations was written to enhance and explain the text of the book 'Stone Wall Secrets'. Each annotation consists of a number that refers specifically to the phrase preceding it. Each annotation number is followed by three indexing elements: subject category, one or more keywords, and one or more sample questions with answers.

338

Annotating Enzymes of Uncertain Function: The Deacylation of d-Amino Acids by Members of the Amidohydrolase Superfamily  

SciTech Connect

The catalytic activities of three members of the amidohydrolase superfamily were discovered using amino acid substrate libraries. Bb3285 from Bordetella bronchiseptica, Gox1177 from Gluconobacter oxidans, and Sco4986 from Streptomyces coelicolor are currently annotated as d-aminoacylases or N-acetyl-d-glutamate deacetylases. These three enzymes are 22-34% identical to one another in amino acid sequence. Substrate libraries containing nearly all combinations of N-formyl-d-Xaa, N-acetyl-d-Xaa, N-succinyl-d-Xaa, and l-Xaa-d-Xaa were used to establish the substrate profiles for these enzymes. It was demonstrated that Bb3285 is restricted to the hydrolysis of N-acyl-substituted derivatives of d-glutamate. The best substrates for this enzyme are N-formyl-d-glutamate ($k_{\mathrm{cat}}/K_{\mathrm{m}} = 5.8 \times 10^{6}\ \mathrm{M^{-1}\,s^{-1}}$), N-acetyl-d-glutamate ($k_{\mathrm{cat}}/K_{\mathrm{m}} = 5.2 \times 10^{6}\ \mathrm{M^{-1}\,s^{-1}}$), and l-methionine-d-glutamate ($k_{\mathrm{cat}}/K_{\mathrm{m}} = 3.4 \times 10^{5}\ \mathrm{M^{-1}\,s^{-1}}$). Gox1177 and Sco4986 preferentially hydrolyze N-acyl-substituted derivatives of hydrophobic d-amino acids. The best substrates for Gox1177 are N-acetyl-d-leucine ($k_{\mathrm{cat}}/K_{\mathrm{m}} = 3.2 \times 10^{4}\ \mathrm{M^{-1}\,s^{-1}}$), N-acetyl-d-tryptophan ($k_{\mathrm{cat}}/K_{\mathrm{m}} = 4.1 \times 10^{4}\ \mathrm{M^{-1}\,s^{-1}}$), and l-tyrosine-d-leucine ($k_{\mathrm{cat}}/K_{\mathrm{m}} = 1.5 \times 10^{4}\ \mathrm{M^{-1}\,s^{-1}}$). A fourth protein, Bb2785 from B. bronchiseptica, did not have d-aminoacylase activity. The best substrates for Sco4986 are N-acetyl-d-phenylalanine and N-acetyl-d-tryptophan. The three-dimensional structures of Bb3285 in the presence of the product acetate or a potent mimic of the tetrahedral intermediate were determined by X-ray diffraction methods. The side chain of the d-glutamate moiety of the inhibitor is ion-paired to Arg-295, while the α-carboxylate is ion-paired with Lys-250 and Arg-376. These results have revealed the chemical and structural determinants for substrate specificity in this protein. Bioinformatic analyses of an additional ~250 sequences identified as members of this group suggest that there are no simple motifs that allow prediction of substrate specificity for most of these unknowns, highlighting the challenges for computational annotation of some groups of homologous proteins.

Cummings, J.; Fedorov, A; Xu, C; Brown, S; Fedorov, E; Babbitt, P; Almo, S; Raushel, F

2009-01-01

339

Concept annotation in the CRAFT corpus  

PubMed Central

Background: Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. Results: This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement. Conclusions: As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml. PMID:22776079

2012-01-01

340

PANDORA: analysis of protein and peptide sets through the hierarchical integration of annotations.  

PubMed

Derivation of biological meaning from large sets of proteins or genes is a frequent task in genomic and proteomic studies. Such sets often arise from experimental methods including large-scale gene expression experiments and mass spectrometry (MS) proteomics. Large sets of genes or proteins are also the outcome of computational methods such as BLAST search and homology-based classifications. We have developed the PANDORA web server, which functions as a platform for the advanced biological analysis of sets of genes, proteins, or proteolytic peptides. First, the input set is mapped to a set of corresponding proteins. Then, an analysis of the protein set produces a graph-based hierarchy which highlights intrinsic relations amongst biological subsets, in light of their different annotations from multiple annotation resources. PANDORA integrates a large collection of annotation sources (GO, UniProt Keywords, InterPro, Enzyme, SCOP, CATH, Gene-3D, NCBI taxonomy and more) that comprise approximately 200,000 different annotation terms associated with approximately 3.2 million sequences from UniProtKB. Statistical enrichment based on a binomial approximation of the hypergeometric distribution and corrected for multiple hypothesis tests is calculated using several background sets, including major gene-expression DNA-chip platforms. Users can also visualize either standard or user-defined binary and quantitative properties alongside the proteins. PANDORA 4.2 is available at http://www.pandora.cs.huji.ac.il. PMID:20444873

Rappoport, Nadav; Fromer, Menachem; Schweiger, Regev; Linial, Michal

2010-07-01
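
The enrichment statistic mentioned in the PANDORA record above (a binomial approximation of the hypergeometric distribution, corrected for multiple testing) can be illustrated with a small sketch. The function names and toy numbers are assumptions, and Bonferroni is used here only as one common correction; the abstract does not say which correction PANDORA applies.

```python
from math import comb


def hypergeom_tail(k: int, n: int, K: int, N: int) -> float:
    """Exact P(X >= k): at least k annotated proteins in a set of size n,
    drawn without replacement from a background of N containing K annotated."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def binomial_tail(k: int, n: int, K: int, N: int) -> float:
    """Binomial approximation of the same tail, with success rate p = K / N."""
    p = K / N
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def bonferroni(p_value: float, num_tests: int) -> float:
    """Multiply by the number of annotation terms tested, capped at 1."""
    return min(1.0, p_value * num_tests)


if __name__ == "__main__":
    # Hypothetical numbers: 20,000 background proteins, 400 carry the term;
    # the input set has 50 proteins, 8 of which carry the term.
    exact = hypergeom_tail(8, 50, 400, 20000)
    approx = binomial_tail(8, 50, 400, 20000)
    print("exact hypergeometric:", exact)
    print("binomial approx.    :", approx)
    print("Bonferroni (1000 terms):", bonferroni(exact, 1000))
```

The binomial form is close to the exact value when the input set is small relative to the background, which is the usual situation when a protein list is compared against a whole proteome or a DNA-chip platform.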

341

Rice Annotation Project Database (RAP-DB): an integrative and interactive database for rice genomics.  

PubMed

The Rice Annotation Project Database (RAP-DB, http://rapdb.dna.affrc.go.jp/) has been providing a comprehensive set of gene annotations for the genome sequence of rice, Oryza sativa (japonica group) cv. Nipponbare. Since the first release in 2005, RAP-DB has been updated several times along with the genome assembly updates. Here, we present our newest RAP-DB based on the latest genome assembly, Os-Nipponbare-Reference-IRGSP-1.0 (IRGSP-1.0), which was released in 2011. We detected 37,869 loci by mapping transcript and protein sequences of 150 monocot species. To provide plant researchers with highly reliable and up to date rice gene annotations, we have been incorporating literature-based manually curated data, and 1,626 loci currently incorporate literature-based annotation data, including commonly used gene names or gene symbols. Transcriptional activities are shown at the nucleotide level by mapping RNA-Seq reads derived from 27 samples. We also mapped the Illumina reads of a Japanese leading japonica cultivar, Koshihikari, and a Chinese indica cultivar, Guangluai-4, to the genome and show alignments together with the single nucleotide polymorphisms (SNPs) and gene functional annotations through a newly developed browser, Short-Read Assembly Browser (S-RAB). We have developed two satellite databases, Plant Gene Family Database (PGFD) and Integrative Database of Cereal Gene Phylogeny (IDCGP), which display gene family and homologous gene relationships among diverse plant species. RAP-DB and the satellite databases offer simple and user-friendly web interfaces, enabling plant and genome researchers to access the data easily and facilitating a broad range of plant research topics. PMID:23299411

Sakai, Hiroaki; Lee, Sung Shin; Tanaka, Tsuyoshi; Numa, Hisataka; Kim, Jungsok; Kawahara, Yoshihiro; Wakimoto, Hironobu; Yang, Ching-chia; Iwamoto, Masao; Abe, Takashi; Yamada, Yuko; Muto, Akira; Inokuchi, Hachiro; Ikemura, Toshimichi; Matsumoto, Takashi; Sasaki, Takuji; Itoh, Takeshi

2013-02-01

342

Automated pipeline for atlas-based annotation of gene expression patterns: application to postnatal day 7 mouse brain  

SciTech Connect

As bio-medical images and volumes are being collected at an increasing speed, there is a growing demand for efficient means to organize spatial information for comparative analysis. In many scenarios, such as determining gene expression patterns by in situ hybridization, the images are collected from multiple subjects over a common anatomical region, such as the brain. A fundamental challenge in comparing spatial data from different images is how to account for the shape variations among subjects, which make direct image-to-image comparison meaningless. In this paper, we describe subdivision meshes as a geometric means to efficiently organize 2D images and 3D volumes collected from different subjects for comparison. The key advantages of a subdivision mesh for this purpose are its light-weight geometric structure and its explicit modeling of anatomical boundaries, which enable efficient and accurate registration. The multi-resolution structure of a subdivision mesh also allows development of fast comparison algorithms among registered images and volumes.

Carson, James P.; Ju, Tao; Bello, Musodiq; Thaller, Christina; Warren, Joe; Kakadiaris, Ioannis; Chiu, Wah; Eichele, Gregor

2010-02-01

343

High-resolution chemical dissection of a model eukaryote reveals targets, pathways and gene functions.  

PubMed

Due to evolutionary conservation of biology, experimental knowledge captured from genetic studies in eukaryotic model organisms provides insight into human cellular pathways and ultimately physiology. Yeast chemogenomic profiling is a powerful approach for annotating cellular responses to small molecules. Using an optimized platform, we provide the relative sensitivities of the heterozygous and homozygous deletion collections for nearly 1800 biologically active compounds. The data quality enables unique insights into pathways that are sensitive and resistant to a given perturbation, as demonstrated with both known and novel compounds. We present examples of novel compounds that inhibit the therapeutically relevant fatty acid synthase and desaturase (Fas1p and Ole1p), and demonstrate how the individual profiles facilitate hypothesis-driven experiments to delineate compound mechanism of action. Importantly, the scale and diversity of tested compounds yield a dataset where the number of modulated pathways approaches saturation. This resource can be used to map novel biological connections, and also to identify functions for unannotated genes. We validated hypotheses generated by global two-way hierarchical clustering of profiles for (i) novel compounds with a similar mechanism of action acting upon microtubules or vacuolar ATPases, and (ii) an unannotated ORF, YIL060w, that plays a role in respiration in the mitochondria. Finally, we identify and characterize background mutations in the widely used yeast deletion collection, which should improve the interpretation of past and future screens throughout the community. This comprehensive resource of cellular responses enables the expansion of our understanding of eukaryotic pathway biology. PMID:24360837

Hoepfner, Dominic; Helliwell, Stephen B; Sadlish, Heather; Schuierer, Sven; Filipuzzi, Ireos; Brachat, Sophie; Bhullar, Bhupinder; Plikat, Uwe; Abraham, Yann; Altorfer, Marc; Aust, Thomas; Baeriswyl, Lukas; Cerino, Raffaele; Chang, Lena; Estoppey, David; Eichenberger, Juerg; Frederiksen, Mathias; Hartmann, Nicole; Hohendahl, Annika; Knapp, Britta; Krastel, Philipp; Melin, Nicolas; Nigsch, Florian; Oakeley, Edward J; Petitjean, Virginie; Petersen, Frank; Riedl, Ralph; Schmitt, Esther K; Staedtler, Frank; Studer, Christian; Tallarico, John A; Wetzel, Stefan; Fishman, Mark C; Porter, Jeffrey A; Movva, N Rao

2014-01-01

344

Analysis and functional annotation of expressed sequence tags (ESTs) from multiple tissues of oil palm (Elaeis guineensis Jacq.)  

PubMed Central

Background: Oil palm is the second largest source of edible oil and contributes approximately 20% of the world's production of oils and fats. In order to understand the molecular biology involved in in vitro propagation, flowering, efficient utilization of nitrogen sources and root diseases, we have initiated an expressed sequence tag (EST) analysis on oil palm. Results: In this study, six cDNA libraries from oil palm zygotic embryos, suspension cells, shoot apical meristems, young flowers, mature flowers and roots were constructed. We have generated a total of 14537 expressed sequence tags (ESTs) from these libraries, from which 6464 tentative unique contigs (TUCs) and 2129 singletons were obtained. Approximately 6008 of these tentative unique genes (TUGs) have significant matches to the non-redundant protein database, from which 2361 were assigned to one or more Gene Ontology categories. Predominant transcripts and differentially expressed genes were identified in multiple oil palm tissues. Homologues of genes involved in many aspects of flower development were also identified among the EST collection, such as CONSTANS-like, AGAMOUS-like (AGL)2, AGL20, LFY-like, SQUAMOSA, SQUAMOSA binding protein (SBP) etc. The majority of them are the first representatives in oil palm, providing opportunities to explore the cause of epigenetic homeotic flowering abnormality in oil palm, given the importance of flowering in fruit production. The transcript levels of two flowering-related genes, EgSBP and EgSEP, were analysed in the flower tissues of various developmental stages. Gene homologues for enzymes involved in oil biosynthesis, utilization of nitrogen sources, and scavenging of oxygen radicals were also uncovered among the oil palm ESTs.
Conclusion: The EST sequences generated will allow comparative genomic studies between oil palm and other monocotyledonous and dicotyledonous plants, development of gene-targeted markers for the reference genetic map, and design and fabrication of DNA arrays for future studies of oil palm. The outcomes of such studies will contribute to oil palm improvement through the establishment of breeding programs using marker-assisted selection, development of diagnostic assays using gene-targeted markers, and discovery of candidate genes related to important agronomic traits of oil palm. PMID:17953740

Ho, Chai-Ling; Kwan, Yen-Yen; Choi, Mei-Chooi; Tee, Sue-Sean; Ng, Wai-Har; Lim, Kok-Ang; Lee, Yang-Ping; Ooi, Siew-Eng; Lee, Weng-Wah; Tee, Jin-Ming; Tan, Siang-Hee; Kulaveerasingam, Harikrishna; Alwee, Sharifah Shahrul Rabiah Syed; Abdullah, Meilina Ong

2007-01-01

345

Annotation of the Drosophila melanogaster euchromatic genome: a systematic review  

Microsoft Academic Search

BACKGROUND: The recent completion of the Drosophila melanogaster genomic sequence to high quality and the availability of a greatly expanded set of Drosophila cDNA sequences, aligning to 78% of the predicted euchromatic genes, afforded FlyBase the opportunity to significantly improve genomic annotations. We made the annotation process more rigorous by inspecting each gene visually, utilizing a comprehensive set of curation

Sima Misra; Madeline A Crosby; Christopher J Mungall; Beverley B Matthews; Kathryn S Campbell; Pavel Hradecky; Yanmei Huang; Joshua S Kaminker; Gillian H Millburn; Simon E Prochnik; Christopher D Smith; Jonathan L Tupy; Eleanor J Whitfield; Leyla Bayraktaroglu; Benjamin P Berman; Brian R Bettencourt; Susan E Celniker; Aubrey DNJ de Grey; Rachel A Drysdale; Nomi L Harris; John Richter; Susan Russo; Andrew J Schroeder; ShengQiang Shu; Mark Stapleton; Chihiro Yamada; Michael Ashburner; William M Gelbart; Gerald M Rubin; Suzanna E Lewis

2002-01-01

346

Comparative genome analysis of PHB gene family reveals deep evolutionary origins and diverse gene function  

PubMed Central

Background: The PHB (Prohibitin) gene family is involved in a variety of functions important for different biological processes. PHB genes are ubiquitously present in divergent species from prokaryotes to eukaryotes. Human PHB genes have been found to be associated with various diseases. Recent studies by our group and others have shown diverse functions of PHB genes in plants in development, senescence, defence, and other processes. Despite the importance of the PHB gene family, no comprehensive gene family analysis has been carried out to evaluate the relatedness of PHB genes across different species. In order to better guide the gene function analysis and understand the evolution of the PHB gene family, we therefore carried out a comparative genome analysis of the PHB genes across different kingdoms. Results: The relatedness, motif distribution, and intron/exon distribution all indicated that the PHB genes form a relatively conserved gene family. The PHB genes can be classified into 5 classes, and each class has a very deep evolutionary origin. The PHB genes within each class maintained the same motif patterns during evolution. With Arabidopsis as the model species, we found that PHB gene intron/exon structure and domains are also conserved during evolution. Although the family is conserved, various gene duplication events led to the expansion of the PHB genes. Both segmental and tandem gene duplication were involved in Arabidopsis PHB gene family expansion. However, segmental duplication is predominant in Arabidopsis. Moreover, most of the duplicated genes experienced neofunctionalization. The results highlighted that PHB genes might be involved in important functions so that the duplicated genes are under evolutionary pressure to derive new functions. Conclusion: The PHB gene family is a conserved gene family and accounts for diverse but important biological functions based on similar molecular mechanisms.
The highly diverse biological functions indicate that more research needs to be carried out to dissect PHB gene function. The conserved gene evolution indicates that studies in the model species can be translated to human and mammalian studies. PMID:20946606

2010-01-01

347

Function Annotation of Hepatic Retinoid x Receptor α Based on Genome-Wide DNA Binding and Transcriptome Profiling  

E-print Network

interests exist. * E-mail: yjywan@ucdavis.edu
Introduction: Retinoid x receptor (RXR) plays a critical role in metabolism, development, differentiation, proliferation, and cell death by regulating gene expression [1,2]. The expression profile of RXRs...
Receptor | Target Gene | Our data location (motif) | Reported data location (motif) | Reference
RARs | Prcka | 219 (NK) | 293,265 (NR) | [23]
 | Cyp26a1 | 21862 (DR5) | 22 kb (DR5) | [24]
 | RARb | 2341 (DR5) | 259 (DR5) | [25]
FXR | Nr0b2 | 2271 (IR1, DR3) | 2320,2220 (NR) | [26]
 | Abcb11 | 2220 , 250...

Zhan, Qi; Fang, Yaping; He, Yuqi; Liu, Hui-Xin; Fang, Jianwen; Wan, Yu-Jui Yvonne

2012-11-15

348

KAAS: an automatic genome annotation and pathway reconstruction server.  

PubMed

The number of complete and draft genomes has been growing rapidly in recent years, and it has become increasingly important to automate the identification of functional properties and biological roles of genes in these genomes. In the KEGG database, genes in complete genomes are annotated with KEGG orthology (KO) identifiers, or K numbers, based on best-hit information using Smith-Waterman scores as well as manual curation. Each K number represents an ortholog group of genes, and it is directly linked to an object in the KEGG pathway map or the BRITE functional hierarchy. Here, we have developed a web-based server called KAAS (KEGG Automatic Annotation Server: http://www.genome.jp/kegg/kaas/), an implementation of a rapid method to automatically assign K numbers to genes in the genome, enabling reconstruction of KEGG pathways and BRITE hierarchies. The method is based on sequence similarities, bi-directional best hit information and some heuristics, and has achieved a high degree of accuracy when compared with the manually curated KEGG GENES database. PMID:17526522

Moriya, Yuki; Itoh, Masumi; Okuda, Shujiro; Yoshizawa, Akiyasu C; Kanehisa, Minoru

2007-07-01
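
The bi-directional best hit idea named in the KAAS abstract above can be sketched as follows. This is a simplified illustration, not the KAAS implementation: the abstract also mentions heuristics that are not described there and are not modeled here. The function names, gene identifiers, scores, and K-number labels are all hypothetical placeholders.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: {(query, subject): similarity_score}
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def bidirectional_best_hits(forward, reverse):
    """Keep pairs where the query's best subject also has that query as its best hit."""
    fwd = best_hits(forward)   # new genome -> reference genes
    rev = best_hits(reverse)   # reference genes -> new genome
    return {q: s for q, s in fwd.items() if rev.get(s) == q}

def assign_k_numbers(forward, reverse, k_number_of_reference):
    """Transfer the K number of a reference gene to its bi-directional best hit."""
    pairs = bidirectional_best_hits(forward, reverse)
    return {q: k_number_of_reference[s]
            for q, s in pairs.items() if s in k_number_of_reference}

# Hypothetical similarity scores in both search directions.
forward = {("g1", "r1"): 200, ("g1", "r2"): 150,
           ("g2", "r1"): 180, ("g2", "r3"): 90}
reverse = {("r1", "g1"): 200, ("r1", "g2"): 180,
           ("r2", "g1"): 150, ("r3", "g2"): 90}
# Placeholder K-number labels for the reference genes.
k_numbers = {"r1": "K_example_1", "r3": "K_example_2"}

print(assign_k_numbers(forward, reverse, k_numbers))
```

In this toy input, g2's best reference hit is r1, but r1's best hit is g1, so only g1 receives an assignment. Requiring agreement in both directions is what keeps a one-directional best hit from transferring an annotation to a paralog.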

349