snpGeneSets: An R Package for Genome-Wide Study Annotation
Mei, Hao; Li, Lianna; Jiang, Fan; Simino, Jeannette; Griswold, Michael; Mosley, Thomas; Liu, Shijian
2016-01-01
Genome-wide studies (GWS) of SNP associations and differential gene expressions have generated abundant results; next-generation sequencing technology has further boosted the number of variants and genes identified. Effective interpretation requires massive annotation and downstream analysis of these genome-wide results, a computationally challenging task. We developed the snpGeneSets package to simplify annotation and analysis of GWS results. Our package integrates local copies of knowledge bases for SNPs, genes, and gene sets, and implements wrapper functions in the R language to enable transparent access to low-level databases for efficient annotation of large genomic data. The package contains functions that execute three types of annotations: (1) genomic mapping annotation for SNPs and genes and functional annotation for gene sets; (2) bidirectional mapping between SNPs and genes, and genes and gene sets; and (3) calculation of gene effect measures from SNP associations and performance of gene set enrichment analyses to identify functional pathways. We applied snpGeneSets to type 2 diabetes (T2D) results from the NHGRI genome-wide association study (GWAS) catalog, a Finnish GWAS, and a genome-wide expression study (GWES). These studies demonstrate the usefulness of snpGeneSets for annotating and performing enrichment analysis of GWS results. The package is open-source, free, and can be downloaded at: https://www.umc.edu/biostats_software/. PMID:27807048
LS-SNP: large-scale annotation of coding non-synonymous SNPs based on multiple information sources.
Karchin, Rachel; Diekhans, Mark; Kelly, Libusha; Thomas, Daryl J; Pieper, Ursula; Eswar, Narayanan; Haussler, David; Sali, Andrej
2005-06-15
The NCBI dbSNP database lists over 9 million single nucleotide polymorphisms (SNPs) in the human genome, but currently contains limited annotation information. SNPs that result in amino acid residue changes (nsSNPs) are of critical importance in variation between individuals, including disease and drug sensitivity. We have developed LS-SNP, a genomic scale software pipeline to annotate nsSNPs. LS-SNP comprehensively maps nsSNPs onto protein sequences, functional pathways and comparative protein structure models, and predicts positions where nsSNPs destabilize proteins, interfere with the formation of domain-domain interfaces, have an effect on protein-ligand binding or severely impact human health. It currently annotates 28,043 validated SNPs that produce amino acid residue substitutions in human proteins from the SwissProt/TrEMBL database. Annotations can be viewed via a web interface either in the context of a genomic region or by selecting sets of SNPs, genes, proteins or pathways. These results are useful for identifying candidate functional SNPs within a gene, haplotype or pathway and in probing molecular mechanisms responsible for functional impacts of nsSNPs. http://www.salilab.org/LS-SNP CONTACT: rachelk@salilab.org http://salilab.org/LS-SNP/supp-info.pdf.
SNPdbe: constructing an nsSNP functional impacts database.
Schaefer, Christian; Meier, Alice; Rost, Burkhard; Bromberg, Yana
2012-02-15
Many existing databases annotate experimentally characterized single nucleotide polymorphisms (SNPs). Each non-synonymous SNP (nsSNP) changes one amino acid in the gene product (single amino acid substitution;SAAS). This change can either affect protein function or be neutral in that respect. Most polymorphisms lack experimental annotation of their functional impact. Here, we introduce SNPdbe-SNP database of effects, with predictions of computationally annotated functional impacts of SNPs. Database entries represent nsSNPs in dbSNP and 1000 Genomes collection, as well as variants from UniProt and PMD. SAASs come from >2600 organisms; 'human' being the most prevalent. The impact of each SAAS on protein function is predicted using the SNAP and SIFT algorithms and augmented with experimentally derived function/structure information and disease associations from PMD, OMIM and UniProt. SNPdbe is consistently updated and easily augmented with new sources of information. The database is available as an MySQL dump and via a web front end that allows searches with any combination of organism names, sequences and mutation IDs. http://www.rostlab.org/services/snpdbe.
SNPit: a federated data integration system for the purpose of functional SNP annotation.
Shen, Terry H; Carlson, Christopher S; Tarczy-Hornoch, Peter
2009-08-01
Genome wide association studies can potentially identify the genetic causes behind the majority of human diseases. With the advent of more advanced genotyping techniques, there is now an explosion of data gathered on single nucleotide polymorphisms (SNPs). The need exists for an integrated system that can provide up-to-date functional annotation information on SNPs. We have developed the SNP Integration Tool (SNPit) system to address this need. Built upon a federated data integration system, SNPit provides current information on a comprehensive list of SNP data sources. Additional logical inference analysis was included through an inference engine plug in. The SNPit web servlet is available online for use. SNPit allows users to go to one source for up-to-date information on the functional annotation of SNPs. A tool that can help to integrate and analyze the potential functional significance of SNPs is important for understanding the results from genome wide association studies.
SNPit: a federated data integration system for the purpose of functional SNP annotation
Shen, Terry H; Carlson, Christopher S; Tarczy-Hornoch, Peter
2009-01-01
Genome wide association studies can potentially identify the genetic causes behind the majority of human diseases. With the advent of more advanced genotyping techniques, there is now an explosion of data gathered on single nucleotide polymorphisms (SNPs). The need exists for an integrated system that can provide up-to-date functional annotation information on SNPs. We have developed the SNP Integration Tool (SNPit) system to address this need. Built upon a federated data integration system, SNPit provides current information on a comprehensive list of SNP data sources. Additional logical inference analysis was included through an inference engine plug in. The SNPit web servlet is available online for use. SNPit allows users to go to one source for up-to-date information on the functional annotation of SNPs. A tool that can help to integrate and analyze the potential functional significance of SNPs is important for understanding the results from genome wide association studies. PMID:19327864
Guo, Liyuan; Wang, Jing
2018-01-04
Here, we present the updated rSNPBase 3.0 database (http://rsnp3.psych.ac.cn), which provides human SNP-related regulatory elements, element-gene pairs and SNP-based regulatory networks. This database is the updated version of the SNP regulatory annotation database rSNPBase and rVarBase. In comparison to the last two versions, there are both structural and data adjustments in rSNPBase 3.0: (i) The most significant new feature is the expansion of analysis scope from SNP-related regulatory elements to include regulatory element-target gene pairs (E-G pairs), therefore it can provide SNP-based gene regulatory networks. (ii) Web function was modified according to data content and a new network search module is provided in the rSNPBase 3.0 in addition to the previous regulatory SNP (rSNP) search module. The two search modules support data query for detailed information (related-elements, element-gene pairs, and other extended annotations) on specific SNPs and SNP-related graphic networks constructed by interacting transcription factors (TFs), miRNAs and genes. (3) The type of regulatory elements was modified and enriched. To our best knowledge, the updated rSNPBase 3.0 is the first data tool supports SNP functional analysis from a regulatory network prospective, it will provide both a comprehensive understanding and concrete guidance for SNP-related regulatory studies. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
2018-01-01
Abstract Here, we present the updated rSNPBase 3.0 database (http://rsnp3.psych.ac.cn), which provides human SNP-related regulatory elements, element-gene pairs and SNP-based regulatory networks. This database is the updated version of the SNP regulatory annotation database rSNPBase and rVarBase. In comparison to the last two versions, there are both structural and data adjustments in rSNPBase 3.0: (i) The most significant new feature is the expansion of analysis scope from SNP-related regulatory elements to include regulatory element–target gene pairs (E–G pairs), therefore it can provide SNP-based gene regulatory networks. (ii) Web function was modified according to data content and a new network search module is provided in the rSNPBase 3.0 in addition to the previous regulatory SNP (rSNP) search module. The two search modules support data query for detailed information (related-elements, element-gene pairs, and other extended annotations) on specific SNPs and SNP-related graphic networks constructed by interacting transcription factors (TFs), miRNAs and genes. (3) The type of regulatory elements was modified and enriched. To our best knowledge, the updated rSNPBase 3.0 is the first data tool supports SNP functional analysis from a regulatory network prospective, it will provide both a comprehensive understanding and concrete guidance for SNP-related regulatory studies. PMID:29140525
SNPMeta: SNP annotation and SNP metadata collection without a reference genome
USDA-ARS?s Scientific Manuscript database
The increase in availability of resequencing data is greatly accelerating SNP discovery and has facilitated the development of SNP genotyping assays. This, in turn, is increasing interest in annotation of individual SNPs. Currently, these data are only available through curation, or comparison to a ...
LS-SNP/PDB: annotated non-synonymous SNPs mapped to Protein Data Bank structures.
Ryan, Michael; Diekhans, Mark; Lien, Stephanie; Liu, Yun; Karchin, Rachel
2009-06-01
LS-SNP/PDB is a new WWW resource for genome-wide annotation of human non-synonymous (amino acid changing) SNPs. It serves high-quality protein graphics rendered with UCSF Chimera molecular visualization software. The system is kept up-to-date by an automated, high-throughput build pipeline that systematically maps human nsSNPs onto Protein Data Bank structures and annotates several biologically relevant features. LS-SNP/PDB is available at (http://ls-snp.icm.jhu.edu/ls-snp-pdb) and via links from protein data bank (PDB) biology and chemistry tabs, UCSC Genome Browser Gene Details and SNP Details pages and PharmGKB Gene Variants Downloads/Cross-References pages.
Zhang, RuiJie; Li, Xia; Jiang, YongShuai; Liu, GuiYou; Li, ChuanXing; Zhang, Fan; Xiao, Yun; Gong, BinSheng
2009-02-01
High-throughout single nucleotide polymorphism detection technology and the existing knowledge provide strong support for mining the disease-related haplotypes and genes. In this study, first, we apply four kinds of haplotype identification methods (Confidence Intervals, Four Gamete Tests, Solid Spine of LD and fusing method of haplotype block) into high-throughout SNP genotype data to identify blocks, then use cluster analysis to verify the effectiveness of the four methods, and select the alcoholism-related SNP haplotypes through risk analysis. Second, we establish a mapping from haplotypes to alcoholism-related genes. Third, we inquire NCBI SNP and gene databases to locate the blocks and identify the candidate genes. In the end, we make gene function annotation by KEGG, Biocarta, and GO database. We find 159 haplotype blocks, which relate to the alcoholism most possibly on chromosome 1 approximately 22, including 227 haplotypes, of which 102 SNP haplotypes may increase the risk of alcoholism. We get 121 alcoholism-related genes and verify their reliability by the functional annotation of biology. In a word, we not only can handle the SNP data easily, but also can locate the disease-related genes precisely by combining our novel strategies of mining alcoholism-related haplotypes and genes with existing knowledge framework.
DOE Office of Scientific and Technical Information (OSTI.GOV)
SacconePhD, Scott F; Chesler, Elissa J; Bierut, Laura J
Commercial SNP microarrays now provide comprehensive and affordable coverage of the human genome. However, some diseases have biologically relevant genomic regions that may require additional coverage. Addiction, for example, is thought to be influenced by complex interactions among many relevant genes and pathways. We have assembled a list of 486 biologically relevant genes nominated by a panel of experts on addiction. We then added 424 genes that showed evidence of association with addiction phenotypes through mouse QTL mappings and gene co-expression analysis. We demonstrate that there are a substantial number of SNPs in these genes that are not well representedmore » by commercial SNP platforms. We address this problem by introducing a publicly available SNP database for addiction. The database is annotated using numeric prioritization scores indicating the extent of biological relevance. The scores incorporate a number of factors such as SNP/gene functional properties (including synonymy and promoter regions), data from mouse systems genetics and measures of human/mouse evolutionary conservation. We then used HapMap genotyping data to determine if a SNP is tagged by a commercial microarray through linkage disequilibrium. This combination of biological prioritization scores and LD tagging annotation will enable addiction researchers to supplement commercial SNP microarrays to ensure comprehensive coverage of biologically relevant regions.« less
Zhang, Shujun
2018-01-01
Genome-wide association studies (GWASs) have identified many disease associated loci, the majority of which have unknown biological functions. Understanding the mechanism underlying trait associations requires identifying trait-relevant tissues and investigating associations in a trait-specific fashion. Here, we extend the widely used linear mixed model to incorporate multiple SNP functional annotations from omics studies with GWAS summary statistics to facilitate the identification of trait-relevant tissues, with which to further construct powerful association tests. Specifically, we rely on a generalized estimating equation based algorithm for parameter inference, a mixture modeling framework for trait-tissue relevance classification, and a weighted sequence kernel association test constructed based on the identified trait-relevant tissues for powerful association analysis. We refer to our analytic procedure as the Scalable Multiple Annotation integration for trait-Relevant Tissue identification and usage (SMART). With extensive simulations, we show how our method can make use of multiple complementary annotations to improve the accuracy for identifying trait-relevant tissues. In addition, our procedure allows us to make use of the inferred trait-relevant tissues, for the first time, to construct more powerful SNP set tests. We apply our method for an in-depth analysis of 43 traits from 28 GWASs using tissue-specific annotations in 105 tissues derived from ENCODE and Roadmap. Our results reveal new trait-tissue relevance, pinpoint important annotations that are informative of trait-tissue relationship, and illustrate how we can use the inferred trait-relevant tissues to construct more powerful association tests in the Wellcome trust case control consortium study. PMID:29377896
Tsai, Hsin Y; Robledo, Diego; Lowe, Natalie R; Bekaert, Michael; Taggart, John B; Bron, James E; Houston, Ross D
2016-07-07
High density linkage maps are useful tools for fine-scale mapping of quantitative trait loci, and characterization of the recombination landscape of a species' genome. Genomic resources for Atlantic salmon (Salmo salar) include a well-assembled reference genome, and high density single nucleotide polymorphism (SNP) arrays. Our aim was to create a high density linkage map, and to align it with the reference genome assembly. Over 96,000 SNPs were mapped and ordered on the 29 salmon linkage groups using a pedigreed population comprising 622 fish from 60 nuclear families, all genotyped with the 'ssalar01' high density SNP array. The number of SNPs per group showed a high positive correlation with physical chromosome length (r = 0.95). While the order of markers on the genetic and physical maps was generally consistent, areas of discrepancy were identified. Approximately 6.5% of the previously unmapped reference genome sequence was assigned to chromosomes using the linkage map. Male recombination rate was lower than females across the vast majority of the genome, but with a notable peak in subtelomeric regions. Finally, using RNA-Seq data to annotate the reference genome, the mapped SNPs were categorized according to their predicted function, including annotation of ∼2500 putative nonsynonymous variants. The highest density SNP linkage map for any salmonid species has been created, annotated, and integrated with the Atlantic salmon reference genome assembly. This map highlights the marked heterochiasmy of salmon, and provides a useful resource for salmonid genetics and genomics research. Copyright © 2016 Tsai et al.
Transcriptome-based differentiation of closely-related Miscanthus lines.
Chouvarine, Philippe; Cooksey, Amanda M; McCarthy, Fiona M; Ray, David A; Baldwin, Brian S; Burgess, Shane C; Peterson, Daniel G
2012-01-01
Distinguishing between individuals is critical to those conducting animal/plant breeding, food safety/quality research, diagnostic and clinical testing, and evolutionary biology studies. Classical genetic identification studies are based on marker polymorphisms, but polymorphism-based techniques are time and labor intensive and often cannot distinguish between closely related individuals. Illumina sequencing technologies provide the detailed sequence data required for rapid and efficient differentiation of related species, lines/cultivars, and individuals in a cost-effective manner. Here we describe the use of Illumina high-throughput exome sequencing, coupled with SNP mapping, as a rapid means of distinguishing between related cultivars of the lignocellulosic bioenergy crop giant miscanthus (Miscanthus × giganteus). We provide the first exome sequence database for Miscanthus species complete with Gene Ontology (GO) functional annotations. A SNP comparative analysis of rhizome-derived cDNA sequences was successfully utilized to distinguish three Miscanthus × giganteus cultivars from each other and from other Miscanthus species. Moreover, the resulting phylogenetic tree generated from SNP frequency data parallels the known breeding history of the plants examined. Some of the giant miscanthus plants exhibit considerable sequence divergence. Here we describe an analysis of Miscanthus in which high-throughput exome sequencing was utilized to differentiate between closely related genotypes despite the current lack of a reference genome sequence. We functionally annotated the exome sequences and provide resources to support Miscanthus systems biology. In addition, we demonstrate the use of the commercial high-performance cloud computing to do computational GO annotation.
AncestrySNPminer: A bioinformatics tool to retrieve and develop ancestry informative SNP panels
Amirisetty, Sushil; Khurana Hershey, Gurjit K.; Baye, Tesfaye M.
2012-01-01
A wealth of genomic information is available in public and private databases. However, this information is underutilized for uncovering population specific and functionally relevant markers underlying complex human traits. Given the huge amount of SNP data available from the annotation of human genetic variation, data mining is a faster and cost effective approach for investigating the number of SNPs that are informative for ancestry. In this study, we present AncestrySNPminer, the first web-based bioinformatics tool specifically designed to retrieve Ancestry Informative Markers (AIMs) from genomic data sets and link these informative markers to genes and ontological annotation classes. The tool includes an automated and simple “scripting at the click of a button” functionality that enables researchers to perform various population genomics statistical analyses methods with user friendly querying and filtering of data sets across various populations through a single web interface. AncestrySNPminer can be freely accessed at https://research.cchmc.org/mershalab/AncestrySNPminer/login.php. PMID:22584067
Pavy, Nathalie; Parsons, Lee S; Paule, Charles; MacKay, John; Bousquet, Jean
2006-01-01
Background High-throughput genotyping technologies represent a highly efficient way to accelerate genetic mapping and enable association studies. As a first step toward this goal, we aimed to develop a resource of candidate Single Nucleotide Polymorphisms (SNP) in white spruce (Picea glauca [Moench] Voss), a softwood tree of major economic importance. Results A white spruce SNP resource encompassing 12,264 SNPs was constructed from a set of 6,459 contigs derived from Expressed Sequence Tags (EST) and by using the bayesian-based statistical software PolyBayes. Several parameters influencing the SNP prediction were analysed including the a priori expected polymorphism, the probability score (PSNP), and the contig depth and length. SNP detection in 3' and 5' reads from the same clones revealed a level of inconsistency between overlapping sequences as low as 1%. A subset of 245 predicted SNPs were verified through the independent resequencing of genomic DNA of a genotype also used to prepare cDNA libraries. The validation rate reached a maximum of 85% for SNPs predicted with either PSNP ≥ 0.95 or ≥ 0.99. A total of 9,310 SNPs were detected by using PSNP ≥ 0.95 as a criterion. The SNPs were distributed among 3,590 contigs encompassing an array of broad functional categories, with an overall frequency of 1 SNP per 700 nucleotide sites. Experimental and statistical approaches were used to evaluate the proportion of paralogous SNPs, with estimates in the range of 8 to 12%. The 3,789 coding SNPs identified through coding region annotation and ORF prediction, were distributed into 39% nonsynonymous and 61% synonymous substitutions. Overall, there were 0.9 SNP per 1,000 nonsynonymous sites and 5.2 SNPs per 1,000 synonymous sites, for a genome-wide nonsynonymous to synonymous substitution rate ratio (Ka/Ks) of 0.17. Conclusion We integrated the SNP data in the ForestTreeDB database along with functional annotations to provide a tool facilitating the choice of candidate genes for mapping purposes or association studies. PMID:16824208
Network Analysis Reveals Putative Genes Affecting Meat Quality in Angus Cattle.
Mateescu, Raluca G; Garrick, Dorian J; Reecy, James M
2017-01-01
Improvements in eating satisfaction will benefit consumers and should increase beef demand which is of interest to the beef industry. Tenderness, juiciness, and flavor are major determinants of the palatability of beef and are often used to reflect eating satisfaction. Carcass qualities are used as indicator traits for meat quality, with higher quality grade carcasses expected to relate to more tender and palatable meat. However, meat quality is a complex concept determined by many component traits making interpretation of genome-wide association studies (GWAS) on any one component challenging to interpret. Recent approaches combining traditional GWAS with gene network interactions theory could be more efficient in dissecting the genetic architecture of complex traits. Phenotypic measures of 23 traits reflecting carcass characteristics, components of meat quality, along with mineral and peptide concentrations were used along with Illumina 54k bovine SNP genotypes to derive an annotated gene network associated with meat quality in 2,110 Angus beef cattle. The efficient mixed model association (EMMAX) approach in combination with a genomic relationship matrix was used to directly estimate the associations between 54k SNP genotypes and each of the 23 component traits. Genomic correlated regions were identified by partial correlations which were further used along with an information theory algorithm to derive gene network clusters. Correlated SNP across 23 component traits were subjected to network scoring and visualization software to identify significant SNP. Significant pathways implicated in the meat quality complex through GO term enrichment analysis included angiogenesis, inflammation, transmembrane transporter activity, and receptor activity. These results suggest that network analysis using partial correlations and annotation of significant SNP can reveal the genetic architecture of complex traits and provide novel information regarding biological mechanisms and genes that lead to complex phenotypes, like meat quality, and the nutritional and healthfulness value of beef. Improvements in genome annotation and knowledge of gene function will contribute to more comprehensive analyses that will advance our ability to dissect the complex architecture of complex traits.
Gardner, Shea N.; Hall, Barry G.
2013-01-01
Effective use of rapid and inexpensive whole genome sequencing for microbes requires fast, memory efficient bioinformatics tools for sequence comparison. The kSNP v2 software finds single nucleotide polymorphisms (SNPs) in whole genome data. kSNP v2 has numerous improvements over kSNP v1 including SNP gene annotation; better scaling for draft genomes available as assembled contigs or raw, unassembled reads; a tool to identify the optimal value of k; distribution of packages of executables for Linux and Mac OS X for ease of installation and user-friendly use; and a detailed User Guide. SNP discovery is based on k-mer analysis, and requires no multiple sequence alignment or the selection of a single reference genome. Most target sets with hundreds of genomes complete in minutes to hours. SNP phylogenies are built by maximum likelihood, parsimony, and distance, based on all SNPs, only core SNPs, or SNPs present in some intermediate user-specified fraction of targets. The SNP-based trees that result are consistent with known taxonomy. kSNP v2 can handle many gigabases of sequence in a single run, and if one or more annotated genomes are included in the target set, SNPs are annotated with protein coding and other information (UTRs, etc.) from Genbank file(s). We demonstrate application of kSNP v2 on sets of viral and bacterial genomes, and discuss in detail analysis of a set of 68 finished E. coli and Shigella genomes and a set of the same genomes to which have been added 47 assemblies and four “raw read” genomes of H104:H4 strains from the recent European E. coli outbreak that resulted in both bloody diarrhea and hemolytic uremic syndrome (HUS), and caused at least 50 deaths. PMID:24349125
Gardner, Shea N; Hall, Barry G
2013-01-01
Effective use of rapid and inexpensive whole genome sequencing for microbes requires fast, memory efficient bioinformatics tools for sequence comparison. The kSNP v2 software finds single nucleotide polymorphisms (SNPs) in whole genome data. kSNP v2 has numerous improvements over kSNP v1 including SNP gene annotation; better scaling for draft genomes available as assembled contigs or raw, unassembled reads; a tool to identify the optimal value of k; distribution of packages of executables for Linux and Mac OS X for ease of installation and user-friendly use; and a detailed User Guide. SNP discovery is based on k-mer analysis, and requires no multiple sequence alignment or the selection of a single reference genome. Most target sets with hundreds of genomes complete in minutes to hours. SNP phylogenies are built by maximum likelihood, parsimony, and distance, based on all SNPs, only core SNPs, or SNPs present in some intermediate user-specified fraction of targets. The SNP-based trees that result are consistent with known taxonomy. kSNP v2 can handle many gigabases of sequence in a single run, and if one or more annotated genomes are included in the target set, SNPs are annotated with protein coding and other information (UTRs, etc.) from Genbank file(s). We demonstrate application of kSNP v2 on sets of viral and bacterial genomes, and discuss in detail analysis of a set of 68 finished E. coli and Shigella genomes and a set of the same genomes to which have been added 47 assemblies and four "raw read" genomes of H104:H4 strains from the recent European E. coli outbreak that resulted in both bloody diarrhea and hemolytic uremic syndrome (HUS), and caused at least 50 deaths.
Rice SNP-seek database update: new SNPs, indels, and queries.
Mansueto, Locedie; Fuentes, Roven Rommel; Borja, Frances Nikki; Detras, Jeffery; Abriol-Santos, Juan Miguel; Chebotarov, Dmytro; Sanciangco, Millicent; Palis, Kevin; Copetti, Dario; Poliakov, Alexandre; Dubchak, Inna; Solovyev, Victor; Wing, Rod A; Hamilton, Ruaraidh Sackville; Mauleon, Ramil; McNally, Kenneth L; Alexandrov, Nickolai
2017-01-04
We describe updates to the Rice SNP-Seek Database since its first release. We ran a new SNP-calling pipeline followed by filtering that resulted in complete, base, filtered and core SNP datasets. Besides the Nipponbare reference genome, the pipeline was run on genome assemblies of IR 64, 93-11, DJ 123 and Kasalath. New genotype query and display features are added for reference assemblies, SNP datasets and indels. JBrowse now displays BAM, VCF and other annotation tracks, the additional genome assemblies and an embedded VISTA genome comparison viewer. Middleware is redesigned for improved performance by using a hybrid of HDF5 and RDMS for genotype storage. Query modules for genotypes, varieties and genes are improved to handle various constraints. An integrated list manager allows the user to pass query parameters for further analysis. The SNP Annotator adds traits, ontology terms, effects and interactions to markers in a list. Web-service calls were implemented to access most data. These features enable seamless querying of SNP-Seek across various biological entities, a step toward semi-automated gene-trait association discovery. URL: http://snp-seek.irri.org. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Dayem Ullah, Abu Z; Oscanoa, Jorge; Wang, Jun; Nagano, Ai; Lemoine, Nicholas R; Chelala, Claude
2018-05-11
Broader functional annotation of genetic variation is a valuable means for prioritising phenotypically-important variants in further disease studies and large-scale genotyping projects. We developed SNPnexus to meet this need by assessing the potential significance of known and novel SNPs on the major transcriptome, proteome, regulatory and structural variation models. Since its previous release in 2012, we have made significant improvements to the annotation categories and updated the query and data viewing systems. The most notable changes include broader functional annotation of noncoding variants and expanding annotations to the most recent human genome assembly GRCh38/hg38. SNPnexus has now integrated rich resources from ENCODE and Roadmap Epigenomics Consortium to map and annotate the noncoding variants onto different classes of regulatory regions and noncoding RNAs as well as providing their predicted functional impact from eight popular non-coding variant scoring algorithms and computational methods. A novel functionality offered now is the support for neo-epitope predictions from leading tools to facilitate its use in immunotherapeutic applications. These updates to SNPnexus are in preparation for its future expansion towards a fully comprehensive computational workflow for disease-associated variant prioritization from sequencing data, placing its users at the forefront of translational research. SNPnexus is freely available at http://www.snp-nexus.org.
Kumar, Sunil; Ambrosini, Giovanna; Bucher, Philipp
2017-01-04
SNP2TFBS is a computational resource intended to support researchers investigating the molecular mechanisms underlying regulatory variation in the human genome. The database essentially consists of a collection of text files providing specific annotations for human single nucleotide polymorphisms (SNPs), namely whether they are predicted to abolish, create or change the affinity of one or several transcription factor (TF) binding sites. A SNP's effect on TF binding is estimated based on a position weight matrix (PWM) model for the binding specificity of the corresponding factor. These data files are regenerated at regular intervals by an automatic procedure that takes as input a reference genome, a comprehensive SNP catalogue and a collection of PWMs. SNP2TFBS is also accessible over a web interface, enabling users to view the information provided for an individual SNP, to extract SNPs based on various search criteria, to annotate uploaded sets of SNPs or to display statistics about the frequencies of binding sites affected by selected SNPs. Homepage: http://ccg.vital-it.ch/snp2tfbs/. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Variation resources at UC Santa Cruz.
Thomas, Daryl J; Trumbower, Heather; Kern, Andrew D; Rhead, Brooke L; Kuhn, Robert M; Haussler, David; Kent, W James
2007-01-01
The variation resources within the University of California Santa Cruz Genome Browser include polymorphism data drawn from public collections and analyses of these data, along with their display in the context of other genomic annotations. Primary data from dbSNP is included for many organisms, with added information including genomic alleles and orthologous alleles for closely related organisms. Display filtering and coloring is available by variant type, functional class or other annotations. Annotation of potential errors is highlighted and a genomic alignment of the variant's flanking sequence is displayed. HapMap allele frequencies and linkage disequilibrium (LD) are available for each HapMap population, along with non-human primate alleles. The browsing and analysis tools, downloadable data files and links to documentation and other information can be found at http://genome.ucsc.edu/.
Genome-Wide SNP Genotyping to Infer the Effects on Gene Functions in Tomato
Hirakawa, Hideki; Shirasawa, Kenta; Ohyama, Akio; Fukuoka, Hiroyuki; Aoki, Koh; Rothan, Christophe; Sato, Shusei; Isobe, Sachiko; Tabata, Satoshi
2013-01-01
The genotype data of 7054 single nucleotide polymorphism (SNP) loci in 40 tomato lines, including inbred lines, F1 hybrids, and wild relatives, were collected using Illumina's Infinium and GoldenGate assay platforms, the latter of which was utilized in our previous study. The dendrogram based on the genotype data corresponded well to the breeding types of tomato and wild relatives. The SNPs were classified into six categories according to their positions in the genes predicted on the tomato genome sequence. The genes with SNPs were annotated by homology searches against the nucleotide and protein databases, as well as by domain searches, and they were classified into the functional categories defined by the NCBI's eukaryotic orthologous groups (KOG). To infer the SNPs' effects on the gene functions, the three-dimensional structures of the 843 proteins that were encoded by the genes with SNPs causing missense mutations were constructed by homology modelling, and 200 of these proteins were considered to carry non-synonymous amino acid substitutions in the predicted functional sites. The SNP information obtained in this study is available at the Kazusa Tomato Genomics Database (http://plant1.kazusa.or.jp/tomato/). PMID:23482505
NASA Astrophysics Data System (ADS)
Liu, Chengzhang; Wang, Xia; Xiang, Jianhai; Li, Fuhua
2012-09-01
Pacific white shrimp has become a major aquaculture and fishery species worldwide. Although a large scale EST resource has been publicly available since 2008, the data have not yet been widely used for SNP discovery or transcriptome-wide assessment of selective pressure. In this study, a set of 155 411 expressed sequence tags (ESTs) from the NCBI database were computationally analyzed and 17 225 single nucleotide polymorphisms (SNPs) were predicted, including 9 546 transitions, 5 124 transversions and 2 481 indels. Among the 7 298 SNP substitutions located in functionally annotated contigs, 58.4% (4 262) are non-synonymous SNPs capable of introducing amino acid mutations. Two hundred and fifty nonsynonymous SNPs in genes associated with economic traits have been identified as candidates for markers in selective breeding. Diversity estimates among the synonymous nucleotides were on average 3.49 times greater than those in non-synonymous, suggesting negative selection. Distribution of non-synonymous to synonymous substitutions (Ka/Ks) ratio ranges from 0 to 4.01, (average 0.42, median 0.26), suggesting that the majority of the affected genes are under purifying selection. Enrichment analysis identified multiple gene ontology categories under positive or negative selection. Categories involved in innate immune response and male gamete generation are rich in positively selected genes, which is similar to reports in Drosophila and primates. This work is the first transcriptome-wide assessment of selective pressure in a Penaeid shrimp species. The functionally annotated SNPs provide a valuable resource of potential molecular markers for selective breeding.
Breeding and Genetics Symposium: networks and pathways to guide genomic selection.
Snelling, W M; Cushman, R A; Keele, J W; Maltecca, C; Thomas, M G; Fortes, M R S; Reverter, A
2013-02-01
Many traits affecting profitability and sustainability of meat, milk, and fiber production are polygenic, with no single gene having an overwhelming influence on observed variation. No knowledge of the specific genes controlling these traits has been needed to make substantial improvement through selection. Significant gains have been made through phenotypic selection enhanced by pedigree relationships and continually improving statistical methodology. Genomic selection, recently enabled by assays for dense SNP located throughout the genome, promises to increase selection accuracy and accelerate genetic improvement by emphasizing the SNP most strongly correlated to phenotype although the genes and sequence variants affecting phenotype remain largely unknown. These genomic predictions theoretically rely on linkage disequilibrium (LD) between genotyped SNP and unknown functional variants, but familial linkage may increase effectiveness when predicting individuals related to those in the training data. Genomic selection with functional SNP genotypes should be less reliant on LD patterns shared by training and target populations, possibly allowing robust prediction across unrelated populations. Although the specific variants causing polygenic variation may never be known with certainty, a number of tools and resources can be used to identify those most likely to affect phenotype. Associations of dense SNP genotypes with phenotype provide a 1-dimensional approach for identifying genes affecting specific traits; in contrast, associations with multiple traits allow defining networks of genes interacting to affect correlated traits. Such networks are especially compelling when corroborated by existing functional annotation and established molecular pathways. The SNP occurring within network genes, obtained from public databases or derived from genome and transcriptome sequences, may be classified according to expected effects on gene products. As illustrated by functionally informed genomic predictions being more accurate than naive whole-genome predictions of beef tenderness, coupling evidence from livestock genotypes, phenotypes, gene expression, and genomic variants with existing knowledge of gene functions and interactions may provide greater insight into the genes and genomic mechanisms affecting polygenic traits and facilitate functional genomic selection for economically important traits.
2011-01-01
Background Many plants have large and complex genomes with an abundance of repeated sequences. Many plants are also polyploid. Both of these attributes typify the genome architecture in the tribe Triticeae, whose members include economically important wheat, rye and barley. Large genome sizes, an abundance of repeated sequences, and polyploidy present challenges to genome-wide SNP discovery using next-generation sequencing (NGS) of total genomic DNA by making alignment and clustering of short reads generated by the NGS platforms difficult, particularly in the absence of a reference genome sequence. Results An annotation-based, genome-wide SNP discovery pipeline is reported using NGS data for large and complex genomes without a reference genome sequence. Roche 454 shotgun reads with low genome coverage of one genotype are annotated in order to distinguish single-copy sequences and repeat junctions from repetitive sequences and sequences shared by paralogous genes. Multiple genome equivalents of shotgun reads of another genotype generated with SOLiD or Solexa are then mapped to the annotated Roche 454 reads to identify putative SNPs. A pipeline program package, AGSNP, was developed and used for genome-wide SNP discovery in Aegilops tauschii-the diploid source of the wheat D genome, and with a genome size of 4.02 Gb, of which 90% is repetitive sequences. Genomic DNA of Ae. tauschii accession AL8/78 was sequenced with the Roche 454 NGS platform. Genomic DNA and cDNA of Ae. tauschii accession AS75 was sequenced primarily with SOLiD, although some Solexa and Roche 454 genomic sequences were also generated. A total of 195,631 putative SNPs were discovered in gene sequences, 155,580 putative SNPs were discovered in uncharacterized single-copy regions, and another 145,907 putative SNPs were discovered in repeat junctions. These SNPs were dispersed across the entire Ae. tauschii genome. To assess the false positive SNP discovery rate, DNA containing putative SNPs was amplified by PCR from AL8/78 and AS75 and resequenced with the ABI 3730 xl. In a sample of 302 randomly selected putative SNPs, 84.0% in gene regions, 88.0% in repeat junctions, and 81.3% in uncharacterized regions were validated. Conclusion An annotation-based genome-wide SNP discovery pipeline for NGS platforms was developed. The pipeline is suitable for SNP discovery in genomic libraries of complex genomes and does not require a reference genome sequence. The pipeline is applicable to all current NGS platforms, provided that at least one such platform generates relatively long reads. The pipeline package, AGSNP, and the discovered 497,118 Ae. tauschii SNPs can be accessed at (http://avena.pw.usda.gov/wheatD/agsnp.shtml). PMID:21266061
An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets.
Hosseini, Parsa; Tremblay, Arianne; Matthews, Benjamin F; Alkharouf, Nadim W
2010-07-02
The data produced by an Illumina flow cell with all eight lanes occupied, produces well over a terabyte worth of images with gigabytes of reads following sequence alignment. The ability to translate such reads into meaningful annotation is therefore of great concern and importance. Very easily, one can get flooded with such a great volume of textual, unannotated data irrespective of read quality or size. CASAVA, a optional analysis tool for Illumina sequencing experiments, enables the ability to understand INDEL detection, SNP information, and allele calling. To not only extract from such analysis, a measure of gene expression in the form of tag-counts, but furthermore to annotate such reads is therefore of significant value. We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid tag-counting and annotation software tool specifically designed for Illumina CASAVA sequencing datasets. Developed in Java and deployed using jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads with the gene's presumed function, from any given CASAVA-build. Such a build is generated for both DNA and RNA sequencing. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The end result produces output containing the homology-based functional annotation and respective gene expression measure signifying how many times sequenced reads were found within the genomic ranges of functional annotations. TASE is a powerful tool to facilitate the process of annotating a given Illumina Solexa sequencing dataset. Our results indicate that both homology-based annotation and tag-count analysis are achieved in very efficient times, providing researchers to delve deep in a given CASAVA-build and maximize information extraction from a sequencing dataset. TASE is specially designed to translate sequence data in a CASAVA-build into functional annotations while producing corresponding gene expression measurements. Achieving such analysis is executed in an ultrafast and highly efficient manner, whether the analysis be a single-read or paired-end sequencing experiment. TASE is a user-friendly and freely available application, allowing rapid analysis and annotation of any given Illumina Solexa sequencing dataset with ease.
Ning, Shangwei; Yue, Ming; Wang, Peng; Liu, Yue; Zhi, Hui; Zhang, Yan; Zhang, Jizhou; Gao, Yue; Guo, Maoni; Zhou, Dianshuang; Li, Xin; Li, Xia
2017-01-04
We describe LincSNP 2.0 (http://bioinfo.hrbmu.edu.cn/LincSNP), an updated database that is used specifically to store and annotate disease-associated single nucleotide polymorphisms (SNPs) in human long non-coding RNAs (lncRNAs) and their transcription factor binding sites (TFBSs). In LincSNP 2.0, we have updated the database with more data and several new features, including (i) expanding disease-associated SNPs in human lncRNAs; (ii) identifying disease-associated SNPs in lncRNA TFBSs; (iii) updating LD-SNPs from the 1000 Genomes Project; and (iv) collecting more experimentally supported SNP-lncRNA-disease associations. Furthermore, we developed three flexible online tools to retrieve and analyze the data. Linc-Mart is a convenient way for users to customize their own data. Linc-Browse is a tool for all data visualization. Linc-Score predicts the associations between lncRNA and disease. In addition, we provided users a newly designed, user-friendly interface to search and download all the data in LincSNP 2.0 and we also provided an interface to submit novel data into the database. LincSNP 2.0 is a continually updated database and will serve as an important resource for investigating the functions and mechanisms of lncRNAs in human diseases. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Efficient Integrative Multi-SNP Association Analysis via Deterministic Approximation of Posteriors.
Wen, Xiaoquan; Lee, Yeji; Luca, Francesca; Pique-Regi, Roger
2016-06-02
With the increasing availability of functional genomic data, incorporating genomic annotations into genetic association analysis has become a standard procedure. However, the existing methods often lack rigor and/or computational efficiency and consequently do not maximize the utility of functional annotations. In this paper, we propose a rigorous inference procedure to perform integrative association analysis incorporating genomic annotations for both traditional GWASs and emerging molecular QTL mapping studies. In particular, we propose an algorithm, named deterministic approximation of posteriors (DAP), which enables highly efficient and accurate joint enrichment analysis and identification of multiple causal variants. We use a series of simulation studies to highlight the power and computational efficiency of our proposed approach and further demonstrate it by analyzing the cross-population eQTL data from the GEUVADIS project and the multi-tissue eQTL data from the GTEx project. In particular, we find that genetic variants predicted to disrupt transcription factor binding sites are enriched in cis-eQTLs across all tissues. Moreover, the enrichment estimates obtained across the tissues are correlated with the cell types for which the annotations are derived. Copyright © 2016 American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
Tollenaere, Charlotte; Susi, Hanna; Nokso-Koivisto, Jussi; Koskinen, Patrik; Tack, Ayco; Auvinen, Petri; Paulin, Lars; Frilander, Mikko J.; Lehtonen, Rainer; Laine, Anna-Liisa
2012-01-01
Background Molecular tools may greatly improve our understanding of pathogen evolution and epidemiology but technical constraints have hindered the development of genetic resources for parasites compared to free-living organisms. This study aims at developing molecular tools for Podosphaera plantaginis, an obligate fungal pathogen of Plantago lanceolata. This interaction has been intensively studied in the Åland archipelago of Finland with epidemiological data collected from over 4,000 host populations annually since year 2001. Principal Findings A cDNA library of a pooled sample of fungal conidia was sequenced on the 454 GS-FLX platform. Over 549,411 reads were obtained and annotated into 45,245 contigs. Annotation data was acquired for 65.2% of the assembled sequences. The transcriptome assembly was screened for SNP loci, as well as for functionally important genes (mating-type genes and potential effector proteins). A genotyping assay of 27 SNP loci was designed and tested on 380 infected leaf samples from 80 populations within the Åland archipelago. With this panel we identified 85 multilocus genotypes (MLG) with uneven frequencies across the pathogen metapopulation. Approximately half of the sampled populations contain polymorphism. Our genotyping protocol revealed mixed-genotype infection within a single host leaf to be common. Mixed infection has been proposed as one of the main drivers of pathogen evolution, and hence may be an important process in this pathosystem. Significance The developed SNP panel offers exciting research perspectives for future studies in this well-characterized pathosystem. Also, the transcriptome provides an invaluable novel genomic resource for powdery mildews, which cause significant yield losses on commercially important crops annually. Furthermore, the features that render genetic studies in this system a challenge are shared with the majority of obligate parasitic species, and hence our results provide methodological insights from SNP calling to field sampling protocols for a wide range of biological systems. PMID:23300684
An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets
2010-01-01
Background The data produced by an Illumina flow cell with all eight lanes occupied, produces well over a terabyte worth of images with gigabytes of reads following sequence alignment. The ability to translate such reads into meaningful annotation is therefore of great concern and importance. Very easily, one can get flooded with such a great volume of textual, unannotated data irrespective of read quality or size. CASAVA, a optional analysis tool for Illumina sequencing experiments, enables the ability to understand INDEL detection, SNP information, and allele calling. To not only extract from such analysis, a measure of gene expression in the form of tag-counts, but furthermore to annotate such reads is therefore of significant value. Findings We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid tag-counting and annotation software tool specifically designed for Illumina CASAVA sequencing datasets. Developed in Java and deployed using jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads with the gene's presumed function, from any given CASAVA-build. Such a build is generated for both DNA and RNA sequencing. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The end result produces output containing the homology-based functional annotation and respective gene expression measure signifying how many times sequenced reads were found within the genomic ranges of functional annotations. Conclusions TASE is a powerful tool to facilitate the process of annotating a given Illumina Solexa sequencing dataset. Our results indicate that both homology-based annotation and tag-count analysis are achieved in very efficient times, providing researchers to delve deep in a given CASAVA-build and maximize information extraction from a sequencing dataset. TASE is specially designed to translate sequence data in a CASAVA-build into functional annotations while producing corresponding gene expression measurements. Achieving such analysis is executed in an ultrafast and highly efficient manner, whether the analysis be a single-read or paired-end sequencing experiment. TASE is a user-friendly and freely available application, allowing rapid analysis and annotation of any given Illumina Solexa sequencing dataset with ease. PMID:20598141
Kim, Seungill; Kim, Myung-Shin; Kim, Yong-Min; Yeom, Seon-In; Cheong, Kyeongchae; Kim, Ki-Tae; Jeon, Jongbum; Kim, Sunggil; Kim, Do-Sun; Sohn, Seong-Han; Lee, Yong-Hwan; Choi, Doil
2015-01-01
The onion (Allium cepa L.) is one of the most widely cultivated and consumed vegetable crops in the world. Although a considerable amount of onion transcriptome data has been deposited into public databases, the sequences of the protein-coding genes are not accurate enough to be used, owing to non-coding sequences intermixed with the coding sequences. We generated a high-quality, annotated onion transcriptome from de novo sequence assembly and intensive structural annotation using the integrated structural gene annotation pipeline (ISGAP), which identified 54,165 protein-coding genes among 165,179 assembled transcripts totalling 203.0 Mb by eliminating the intron sequences. ISGAP performed reliable annotation, recognizing accurate gene structures based on reference proteins, and ab initio gene models of the assembled transcripts. Integrative functional annotation and gene-based SNP analysis revealed a whole biological repertoire of genes and transcriptomic variation in the onion. The method developed in this study provides a powerful tool for the construction of reference gene sets for organisms based solely on de novo transcriptome data. Furthermore, the reference genes and their variation described here for the onion represent essential tools for molecular breeding and gene cloning in Allium spp. PMID:25362073
Hill, W D; Davies, G; Harris, S E; Hagenaars, S P; Liewald, D C; Penke, L; Gale, C R; Deary, I J
2016-12-13
Differences in general cognitive function have been shown to be partly heritable and to show genetic correlations with several psychiatric and physical disease states. However, to date, few single-nucleotide polymorphisms (SNPs) have demonstrated genome-wide significance, hampering efforts aimed at determining which genetic variants are most important for cognitive function and which regions drive the genetic associations between cognitive function and disease states. Here, we combine multiple large genome-wide association study (GWAS) data sets, from the CHARGE cognitive consortium (n=53 949) and UK Biobank (n=36 035), to partition the genome into 52 functional annotations and an additional 10 annotations describing tissue-specific histone marks. Using stratified linkage disequilibrium score regression we show that, in two measures of cognitive function, SNPs associated with cognitive function cluster in regions of the genome that are under evolutionary negative selective pressure. These conserved regions contained ~2.6% of the SNPs from each GWAS but accounted for ~40% of the SNP-based heritability. The results suggest that the search for causal variants associated with cognitive function, and those variants that exert a pleiotropic effect between cognitive function and health, will be facilitated by examining these enriched regions.
Hill, W D; Davies, G; Harris, S E; Hagenaars, S P; Davies, Gail; Deary, Ian J; Debette, Stephanie; Verbaas, Carla I; Bressler, Jan; Schuur, Maaike; Smith, Albert V; Bis, Joshua C; Bennett, David A; Ikram, M Arfan; Launer, Lenore J; Fitzpatrick, Annette L; Seshadri, Sudha; van Duijn, Cornelia M; Mosley Jr, Thomas H; Liewald, D C; Penke, L; Gale, C R; Deary, I J
2016-01-01
Differences in general cognitive function have been shown to be partly heritable and to show genetic correlations with several psychiatric and physical disease states. However, to date, few single-nucleotide polymorphisms (SNPs) have demonstrated genome-wide significance, hampering efforts aimed at determining which genetic variants are most important for cognitive function and which regions drive the genetic associations between cognitive function and disease states. Here, we combine multiple large genome-wide association study (GWAS) data sets, from the CHARGE cognitive consortium (n=53 949) and UK Biobank (n=36 035), to partition the genome into 52 functional annotations and an additional 10 annotations describing tissue-specific histone marks. Using stratified linkage disequilibrium score regression we show that, in two measures of cognitive function, SNPs associated with cognitive function cluster in regions of the genome that are under evolutionary negative selective pressure. These conserved regions contained ~2.6% of the SNPs from each GWAS but accounted for ~40% of the SNP-based heritability. The results suggest that the search for causal variants associated with cognitive function, and those variants that exert a pleiotropic effect between cognitive function and health, will be facilitated by examining these enriched regions. PMID:27959336
Characterization of gonadal transcriptomes from the turbot (Scophthalmus maximus).
Hu, Yulong; Huang, Meng; Wang, Weiji; Guan, Jiantao; Kong, Jie
2016-01-01
The mechanisms underlying sexual reproduction and sex ratio determination remains unclear in turbot, a flatfish of great commercial value. And there is limited information in the turbot database regarding genes related to the reproductive system. Here, we conducted high-throughput transcriptome profiling of turbot gonad tissues to better understand their reproductive functions and to supply essential gene sequence information for marker-assisted selection programs in the turbot industry. In this study, two gonad libraries representing sex differences in Scophthalmus maximus yielded 453 818 high-quality reads that were assembled into 24 611 contigs and 33 713 singletons by using 454 pyrosequencing, 13 936 contigs and singletons (CS) of which were annotated using BLASTx. GO (Gene Ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway analyses revealed that various biological functions and processes were associated with many of the annotated CS. Expression analyses showed that 510 genes were differentially expressed in males versus females; 80% of these genes were annotated. In addition, 6484 and 6036 single nucleotide polymorphisms (SNPs) were identified in male and female libraries, respectively. This transcriptome resource will serve as the foundation for cDNA or SNP microarray construction, gene expression characterization, and sex-specific linkage mapping in turbot.
Kim, Seungill; Kim, Myung-Shin; Kim, Yong-Min; Yeom, Seon-In; Cheong, Kyeongchae; Kim, Ki-Tae; Jeon, Jongbum; Kim, Sunggil; Kim, Do-Sun; Sohn, Seong-Han; Lee, Yong-Hwan; Choi, Doil
2015-02-01
The onion (Allium cepa L.) is one of the most widely cultivated and consumed vegetable crops in the world. Although a considerable amount of onion transcriptome data has been deposited into public databases, the sequences of the protein-coding genes are not accurate enough to be used, owing to non-coding sequences intermixed with the coding sequences. We generated a high-quality, annotated onion transcriptome from de novo sequence assembly and intensive structural annotation using the integrated structural gene annotation pipeline (ISGAP), which identified 54,165 protein-coding genes among 165,179 assembled transcripts totalling 203.0 Mb by eliminating the intron sequences. ISGAP performed reliable annotation, recognizing accurate gene structures based on reference proteins, and ab initio gene models of the assembled transcripts. Integrative functional annotation and gene-based SNP analysis revealed a whole biological repertoire of genes and transcriptomic variation in the onion. The method developed in this study provides a powerful tool for the construction of reference gene sets for organisms based solely on de novo transcriptome data. Furthermore, the reference genes and their variation described here for the onion represent essential tools for molecular breeding and gene cloning in Allium spp. © The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
Wang, Yunpeng; Thompson, Wesley K.; Schork, Andrew J.; Holland, Dominic; Chen, Chi-Hua; Bettella, Francesco; Desikan, Rahul S.; Li, Wen; Witoelar, Aree; Zuber, Verena; Devor, Anna; Nöthen, Markus M.; Rietschel, Marcella; Chen, Qiang; Werge, Thomas; Cichon, Sven; Weinberger, Daniel R.; Djurovic, Srdjan; O’Donovan, Michael; Visscher, Peter M.; Andreassen, Ole A.; Dale, Anders M.
2016-01-01
Most of the genetic architecture of schizophrenia (SCZ) has not yet been identified. Here, we apply a novel statistical algorithm called Covariate-Modulated Mixture Modeling (CM3), which incorporates auxiliary information (heterozygosity, total linkage disequilibrium, genomic annotations, pleiotropy) for each single nucleotide polymorphism (SNP) to enable more accurate estimation of replication probabilities, conditional on the observed test statistic (“z-score”) of the SNP. We use a multiple logistic regression on z-scores to combine information from auxiliary information to derive a “relative enrichment score” for each SNP. For each stratum of these relative enrichment scores, we obtain nonparametric estimates of posterior expected test statistics and replication probabilities as a function of discovery z-scores, using a resampling-based approach that repeatedly and randomly partitions meta-analysis sub-studies into training and replication samples. We fit a scale mixture of two Gaussians model to each stratum, obtaining parameter estimates that minimize the sum of squared differences of the scale-mixture model with the stratified nonparametric estimates. We apply this approach to the recent genome-wide association study (GWAS) of SCZ (n = 82,315), obtaining a good fit between the model-based and observed effect sizes and replication probabilities. We observed that SNPs with low enrichment scores replicate with a lower probability than SNPs with high enrichment scores even when both they are genome-wide significant (p < 5x10-8). There were 693 and 219 independent loci with model-based replication rates ≥80% and ≥90%, respectively. Compared to analyses not incorporating relative enrichment scores, CM3 increased out-of-sample yield for SNPs that replicate at a given rate. This demonstrates that replication probabilities can be more accurately estimated using prior enrichment information with CM3. PMID:26808560
Multi-Trait GWAS and New Candidate Genes Annotation for Growth Curve Parameters in Brahman Cattle
Crispim, Aline Camporez; Kelly, Matthew John; Guimarães, Simone Eliza Facioni; e Silva, Fabyano Fonseca; Fortes, Marina Rufino Salinas; Wenceslau, Raphael Rocha; Moore, Stephen
2015-01-01
Understanding the genetic architecture of beef cattle growth cannot be limited simply to the genome-wide association study (GWAS) for body weight at any specific ages, but should be extended to a more general purpose by considering the whole growth trajectory over time using a growth curve approach. For such an approach, the parameters that are used to describe growth curves were treated as phenotypes under a GWAS model. Data from 1,255 Brahman cattle that were weighed at birth, 6, 12, 15, 18, and 24 months of age were analyzed. Parameter estimates, such as mature weight (A) and maturity rate (K) from nonlinear models are utilized as substitutes for the original body weights for the GWAS analysis. We chose the best nonlinear model to describe the weight-age data, and the estimated parameters were used as phenotypes in a multi-trait GWAS. Our aims were to identify and characterize associated SNP markers to indicate SNP-derived candidate genes and annotate their function as related to growth processes in beef cattle. The Brody model presented the best goodness of fit, and the heritability values for the parameter estimates for mature weight (A) and maturity rate (K) were 0.23 and 0.32, respectively, proving that these traits can be a feasible alternative when the objective is to change the shape of growth curves within genetic improvement programs. The genetic correlation between A and K was -0.84, indicating that animals with lower mature body weights reached that weight at younger ages. One hundred and sixty seven (167) and two hundred and sixty two (262) significant SNPs were associated with A and K, respectively. The annotated genes closest to the most significant SNPs for A had direct biological functions related to muscle development (RAB28), myogenic induction (BTG1), fetal growth (IL2), and body weights (APEX2); K genes were functionally associated with body weight, body height, average daily gain (TMEM18), and skeletal muscle development (SMN1). Candidate genes emerging from this GWAS may inform the search for causative mutations that could underpin genomic breeding for improved growth rates. PMID:26445451
Multi-Trait GWAS and New Candidate Genes Annotation for Growth Curve Parameters in Brahman Cattle.
Crispim, Aline Camporez; Kelly, Matthew John; Guimarães, Simone Eliza Facioni; Fonseca e Silva, Fabyano; Fortes, Marina Rufino Salinas; Wenceslau, Raphael Rocha; Moore, Stephen
2015-01-01
Understanding the genetic architecture of beef cattle growth cannot be limited simply to the genome-wide association study (GWAS) for body weight at any specific ages, but should be extended to a more general purpose by considering the whole growth trajectory over time using a growth curve approach. For such an approach, the parameters that are used to describe growth curves were treated as phenotypes under a GWAS model. Data from 1,255 Brahman cattle that were weighed at birth, 6, 12, 15, 18, and 24 months of age were analyzed. Parameter estimates, such as mature weight (A) and maturity rate (K) from nonlinear models are utilized as substitutes for the original body weights for the GWAS analysis. We chose the best nonlinear model to describe the weight-age data, and the estimated parameters were used as phenotypes in a multi-trait GWAS. Our aims were to identify and characterize associated SNP markers to indicate SNP-derived candidate genes and annotate their function as related to growth processes in beef cattle. The Brody model presented the best goodness of fit, and the heritability values for the parameter estimates for mature weight (A) and maturity rate (K) were 0.23 and 0.32, respectively, proving that these traits can be a feasible alternative when the objective is to change the shape of growth curves within genetic improvement programs. The genetic correlation between A and K was -0.84, indicating that animals with lower mature body weights reached that weight at younger ages. One hundred and sixty seven (167) and two hundred and sixty two (262) significant SNPs were associated with A and K, respectively. The annotated genes closest to the most significant SNPs for A had direct biological functions related to muscle development (RAB28), myogenic induction (BTG1), fetal growth (IL2), and body weights (APEX2); K genes were functionally associated with body weight, body height, average daily gain (TMEM18), and skeletal muscle development (SMN1). Candidate genes emerging from this GWAS may inform the search for causative mutations that could underpin genomic breeding for improved growth rates.
Gusev, Alexander; Shi, Huwenbo; Kichaev, Gleb; Pomerantz, Mark; Li, Fugen; Long, Henry W; Ingles, Sue A; Kittles, Rick A; Strom, Sara S; Rybicki, Benjamin A; Nemesure, Barbara; Isaacs, William B; Zheng, Wei; Pettaway, Curtis A; Yeboah, Edward D; Tettey, Yao; Biritwum, Richard B; Adjei, Andrew A; Tay, Evelyn; Truelove, Ann; Niwa, Shelley; Chokkalingam, Anand P; John, Esther M; Murphy, Adam B; Signorello, Lisa B; Carpten, John; Leske, M Cristina; Wu, Suh-Yuh; Hennis, Anslem J M; Neslund-Dudas, Christine; Hsing, Ann W; Chu, Lisa; Goodman, Phyllis J; Klein, Eric A; Witte, John S; Casey, Graham; Kaggwa, Sam; Cook, Michael B; Stram, Daniel O; Blot, William J; Eeles, Rosalind A; Easton, Douglas; Kote-Jarai, Zsofia; Al Olama, Ali Amin; Benlloch, Sara; Muir, Kenneth; Giles, Graham G; Southey, Melissa C; Fitzgerald, Liesel M; Gronberg, Henrik; Wiklund, Fredrik; Aly, Markus; Henderson, Brian E; Schleutker, Johanna; Wahlfors, Tiina; Tammela, Teuvo L J; Nordestgaard, Børge G; Key, Tim J; Travis, Ruth C; Neal, David E; Donovan, Jenny L; Hamdy, Freddie C; Pharoah, Paul; Pashayan, Nora; Khaw, Kay-Tee; Stanford, Janet L; Thibodeau, Stephen N; McDonnell, Shannon K; Schaid, Daniel J; Maier, Christiane; Vogel, Walther; Luedeke, Manuel; Herkommer, Kathleen; Kibel, Adam S; Cybulski, Cezary; Wokolorczyk, Dominika; Kluzniak, Wojciech; Cannon-Albright, Lisa; Teerlink, Craig; Brenner, Hermann; Dieffenbach, Aida K; Arndt, Volker; Park, Jong Y; Sellers, Thomas A; Lin, Hui-Yi; Slavov, Chavdar; Kaneva, Radka; Mitev, Vanio; Batra, Jyotsna; Spurdle, Amanda; Clements, Judith A; Teixeira, Manuel R; Pandha, Hardev; Michael, Agnieszka; Paulo, Paula; Maia, Sofia; Kierzek, Andrzej; Conti, David V; Albanes, Demetrius; Berg, Christine; Berndt, Sonja I; Campa, Daniele; Crawford, E David; Diver, W Ryan; Gapstur, Susan M; Gaziano, J Michael; Giovannucci, Edward; Hoover, Robert; Hunter, David J; Johansson, Mattias; Kraft, Peter; Le Marchand, Loic; Lindström, Sara; Navarro, Carmen; Overvad, Kim; Riboli, Elio; Siddiq, Afshan; Stevens, Victoria L; Trichopoulos, Dimitrios; Vineis, Paolo; Yeager, Meredith; Trynka, Gosia; Raychaudhuri, Soumya; Schumacher, Frederick R; Price, Alkes L; Freedman, Matthew L; Haiman, Christopher A; Pasaniuc, Bogdan
2016-04-07
Although genome-wide association studies have identified over 100 risk loci that explain ∼33% of familial risk for prostate cancer (PrCa), their functional effects on risk remain largely unknown. Here we use genotype data from 59,089 men of European and African American ancestries combined with cell-type-specific epigenetic data to build a genomic atlas of single-nucleotide polymorphism (SNP) heritability in PrCa. We find significant differences in heritability between variants in prostate-relevant epigenetic marks defined in normal versus tumour tissue as well as between tissue and cell lines. The majority of SNP heritability lies in regions marked by H3k27 acetylation in prostate adenoc7arcinoma cell line (LNCaP) or by DNaseI hypersensitive sites in cancer cell lines. We find a high degree of similarity between European and African American ancestries suggesting a similar genetic architecture from common variation underlying PrCa risk. Our findings showcase the power of integrating functional annotation with genetic data to understand the genetic basis of PrCa.
DOE Office of Scientific and Technical Information (OSTI.GOV)
With the flood of whole genome finished and draft microbial sequences, we need faster, more scalable bioinformatics tools for sequence comparison. An algorithm is described to find single nucleotide polymorphisms (SNPs) in whole genome data. It scales to hundreds of bacterial or viral genomes, and can be used for finished and/or draft genomes available as unassembled contigs or raw, unassembled reads. The method is fast to compute, finding SNPs and building a SNP phylogeny in minutes to hours, depending on the size and diversity of the input sequences. The SNP-based trees that result are consistent with known taxonomy and treesmore » determined in other studies. The approach we describe can handle many gigabases of sequence in a single run. The algorithm is based on k-mer analysis.« less
Burt, Andrew J; William, H Manilal; Perry, Gregory; Khanal, Raja; Pauls, K Peter; Kelly, James D; Navabi, Alireza
2015-01-01
Anthracnose, caused by Colletotrichum lindemuthianum, is an important fungal disease of common bean (Phaseolus vulgaris). Alleles at the Co-4 locus confer resistance to a number of races of C. lindemuthianum. A population of 94 F4:5 recombinant inbred lines of a cross between resistant black bean genotype B09197 and susceptible navy bean cultivar Nautica was used to identify markers associated with resistance in bean chromosome 8 (Pv08) where Co-4 is localized. Three SCAR markers with known linkage to Co-4 and a panel of single nucleotide markers were used for genotyping. A refined physical region on Pv08 with significant association with anthracnose resistance identified by markers was used in BLAST searches with the genomic sequence of common bean accession G19833. Thirty two unique annotated candidate genes were identified that spanned a physical region of 936.46 kb. A majority of the annotated genes identified had functional similarity to leucine rich repeats/receptor like kinase domains. Three annotated genes had similarity to 1, 3-β-glucanase domains. There were sequence similarities between some of the annotated genes found in the study and the genes associated with phosphoinositide-specific phosphilipases C associated with Co-x and the COK-4 loci found in previous studies. It is possible that the Co-4 locus is structured as a group of genes with functional domains dominated by protein tyrosine kinase along with leucine rich repeats/nucleotide binding site, phosphilipases C as well as β-glucanases.
Burt, Andrew J.; William, H. Manilal; Perry, Gregory; Khanal, Raja; Pauls, K. Peter; Kelly, James D.; Navabi, Alireza
2015-01-01
Anthracnose, caused by Colletotrichum lindemuthianum, is an important fungal disease of common bean (Phaseolus vulgaris). Alleles at the Co–4 locus confer resistance to a number of races of C. lindemuthianum. A population of 94 F4:5 recombinant inbred lines of a cross between resistant black bean genotype B09197 and susceptible navy bean cultivar Nautica was used to identify markers associated with resistance in bean chromosome 8 (Pv08) where Co–4 is localized. Three SCAR markers with known linkage to Co–4 and a panel of single nucleotide markers were used for genotyping. A refined physical region on Pv08 with significant association with anthracnose resistance identified by markers was used in BLAST searches with the genomic sequence of common bean accession G19833. Thirty two unique annotated candidate genes were identified that spanned a physical region of 936.46 kb. A majority of the annotated genes identified had functional similarity to leucine rich repeats/receptor like kinase domains. Three annotated genes had similarity to 1, 3-β-glucanase domains. There were sequence similarities between some of the annotated genes found in the study and the genes associated with phosphoinositide-specific phosphilipases C associated with Co-x and the COK–4 loci found in previous studies. It is possible that the Co–4 locus is structured as a group of genes with functional domains dominated by protein tyrosine kinase along with leucine rich repeats/nucleotide binding site, phosphilipases C as well as β-glucanases. PMID:26431031
Identification of novel drought-tolerant-associated SNPs in common bean (Phaseolus vulgaris)
Villordo-Pineda, Emiliano; González-Chavira, Mario M.; Giraldo-Carbajo, Patricia; Acosta-Gallegos, Jorge A.; Caballero-Pérez, Juan
2015-01-01
Common bean (Phaseolus vulgaris L.) is a leguminous in high demand for human nutrition and a very important agricultural product. Production of common bean is constrained by environmental stresses such as drought. Although conventional plant selection has been used to increase production yield and stress tolerance, drought tolerance selection based on phenotype is complicated by associated physiological, anatomical, cellular, biochemical, and molecular changes. These changes are modulated by differential gene expression. A common method to identify genes associated with phenotypes of interest is the characterization of Single Nucleotide Polymorphims (SNPs) to link them to specific functions. In this work, we selected two drought-tolerant parental lines from Mesoamerica, Pinto Villa, and Pinto Saltillo. The parental lines were used to generate a population of 282 families (F3:5) and characterized by 169 SNPs. We associated the segregation of the molecular markers in our population with phenotypes including flowering time, physiological maturity, reproductive period, plant, seed and total biomass, reuse index, seed yield, weight of 100 seeds, and harvest index in three cultivation cycles. We observed 83 SNPs with significant association (p < 0.0003 after Bonferroni correction) with our quantified phenotypes. Phenotypes most associated were days to flowering and seed biomass with 58 and 44 associated SNPs, respectively. Thirty-seven out of the 83 SNPs were annotated to a gene with a potential function related to drought tolerance or relevant molecular/biochemical functions. Some SNPs such as SNP28 and SNP128 are related to starch biosynthesis, a common osmotic protector; and SNP18 is related to proline biosynthesis, another well-known osmotic protector. PMID:26257755
Identification of novel drought-tolerant-associated SNPs in common bean (Phaseolus vulgaris).
Villordo-Pineda, Emiliano; González-Chavira, Mario M; Giraldo-Carbajo, Patricia; Acosta-Gallegos, Jorge A; Caballero-Pérez, Juan
2015-01-01
Common bean (Phaseolus vulgaris L.) is a leguminous in high demand for human nutrition and a very important agricultural product. Production of common bean is constrained by environmental stresses such as drought. Although conventional plant selection has been used to increase production yield and stress tolerance, drought tolerance selection based on phenotype is complicated by associated physiological, anatomical, cellular, biochemical, and molecular changes. These changes are modulated by differential gene expression. A common method to identify genes associated with phenotypes of interest is the characterization of Single Nucleotide Polymorphims (SNPs) to link them to specific functions. In this work, we selected two drought-tolerant parental lines from Mesoamerica, Pinto Villa, and Pinto Saltillo. The parental lines were used to generate a population of 282 families (F3:5) and characterized by 169 SNPs. We associated the segregation of the molecular markers in our population with phenotypes including flowering time, physiological maturity, reproductive period, plant, seed and total biomass, reuse index, seed yield, weight of 100 seeds, and harvest index in three cultivation cycles. We observed 83 SNPs with significant association (p < 0.0003 after Bonferroni correction) with our quantified phenotypes. Phenotypes most associated were days to flowering and seed biomass with 58 and 44 associated SNPs, respectively. Thirty-seven out of the 83 SNPs were annotated to a gene with a potential function related to drought tolerance or relevant molecular/biochemical functions. Some SNPs such as SNP28 and SNP128 are related to starch biosynthesis, a common osmotic protector; and SNP18 is related to proline biosynthesis, another well-known osmotic protector.
Zhao, Nan; Han, Jing Ginger; Shyu, Chi-Ren; Korkin, Dmitry
2014-01-01
Single nucleotide polymorphisms (SNPs) are among the most common types of genetic variation in complex genetic disorders. A growing number of studies link the functional role of SNPs with the networks and pathways mediated by the disease-associated genes. For example, many non-synonymous missense SNPs (nsSNPs) have been found near or inside the protein-protein interaction (PPI) interfaces. Determining whether such nsSNP will disrupt or preserve a PPI is a challenging task to address, both experimentally and computationally. Here, we present this task as three related classification problems, and develop a new computational method, called the SNP-IN tool (non-synonymous SNP INteraction effect predictor). Our method predicts the effects of nsSNPs on PPIs, given the interaction's structure. It leverages supervised and semi-supervised feature-based classifiers, including our new Random Forest self-learning protocol. The classifiers are trained based on a dataset of comprehensive mutagenesis studies for 151 PPI complexes, with experimentally determined binding affinities of the mutant and wild-type interactions. Three classification problems were considered: (1) a 2-class problem (strengthening/weakening PPI mutations), (2) another 2-class problem (mutations that disrupt/preserve a PPI), and (3) a 3-class classification (detrimental/neutral/beneficial mutation effects). In total, 11 different supervised and semi-supervised classifiers were trained and assessed resulting in a promising performance, with the weighted f-measure ranging from 0.87 for Problem 1 to 0.70 for the most challenging Problem 3. By integrating prediction results of the 2-class classifiers into the 3-class classifier, we further improved its performance for Problem 3. To demonstrate the utility of SNP-IN tool, it was applied to study the nsSNP-induced rewiring of two disease-centered networks. The accurate and balanced performance of SNP-IN tool makes it readily available to study the rewiring of large-scale protein-protein interaction networks, and can be useful for functional annotation of disease-associated SNPs. SNIP-IN tool is freely accessible as a web-server at http://korkinlab.org/snpintool/. PMID:24784581
Enabling a Community to Dissect an Organism: Overview of the Neurospora Functional Genomics Project
Dunlap, Jay C.; Borkovich, Katherine A.; Henn, Matthew R.; Turner, Gloria E.; Sachs, Matthew S.; Glass, N. Louise; McCluskey, Kevin; Plamann, Michael; Galagan, James E.; Birren, Bruce W.; Weiss, Richard L.; Townsend, Jeffrey P.; Loros, Jennifer J.; Nelson, Mary Anne; Lambreghts, Randy; Colot, Hildur V.; Park, Gyungsoon; Collopy, Patrick; Ringelberg, Carol; Crew, Christopher; Litvinkova, Liubov; DeCaprio, Dave; Hood, Heather M.; Curilla, Susan; Shi, Mi; Crawford, Matthew; Koerhsen, Michael; Montgomery, Phil; Larson, Lisa; Pearson, Matthew; Kasuga, Takao; Tian, Chaoguang; Baştürkmen, Meray; Altamirano, Lorena; Xu, Junhuan
2013-01-01
A consortium of investigators is engaged in a functional genomics project centered on the filamentous fungus Neurospora, with an eye to opening up the functional genomic analysis of all the filamentous fungi. The overall goal of the four interdependent projects in this effort is to acccomplish functional genomics, annotation, and expression analyses of Neurospora crassa, a filamentous fungus that is an established model for the assemblage of over 250,000 species of nonyeast fungi. Building from the completely sequenced 43-Mb Neurospora genome, Project 1 is pursuing the systematic disruption of genes through targeted gene replacements, phenotypic analysis of mutant strains, and their distribution to the scientific community at large. Project 2, through a primary focus in Annotation and Bioinformatics, has developed a platform for electronically capturing community feedback and data about the existing annotation, while building and maintaining a database to capture and display information about phenotypes. Oligonucleotide-based microarrays created in Project 3 are being used to collect baseline expression data for the nearly 11,000 distinguishable transcripts in Neurospora under various conditions of growth and development, and eventually to begin to analyze the global effects of loss of novel genes in strains created by Project 1. cDNA libraries generated in Project 4 document the overall complexity of expressed sequences in Neurospora, including alternative splicing alternative promoters and antisense transcripts. In addition, these studies have driven the assembly of an SNP map presently populated by nearly 300 markers that will greatly accelerate the positional cloning of genes. PMID:17352902
PredictSNP: Robust and Accurate Consensus Classifier for Prediction of Disease-Related Mutations
Bendl, Jaroslav; Stourac, Jan; Salanda, Ondrej; Pavelka, Antonin; Wieben, Eric D.; Zendulka, Jaroslav; Brezovsky, Jan; Damborsky, Jiri
2014-01-01
Single nucleotide variants represent a prevalent form of genetic variation. Mutations in the coding regions are frequently associated with the development of various genetic diseases. Computational tools for the prediction of the effects of mutations on protein function are very important for analysis of single nucleotide variants and their prioritization for experimental characterization. Many computational tools are already widely employed for this purpose. Unfortunately, their comparison and further improvement is hindered by large overlaps between the training datasets and benchmark datasets, which lead to biased and overly optimistic reported performances. In this study, we have constructed three independent datasets by removing all duplicities, inconsistencies and mutations previously used in the training of evaluated tools. The benchmark dataset containing over 43,000 mutations was employed for the unbiased evaluation of eight established prediction tools: MAPP, nsSNPAnalyzer, PANTHER, PhD-SNP, PolyPhen-1, PolyPhen-2, SIFT and SNAP. The six best performing tools were combined into a consensus classifier PredictSNP, resulting into significantly improved prediction performance, and at the same time returned results for all mutations, confirming that consensus prediction represents an accurate and robust alternative to the predictions delivered by individual tools. A user-friendly web interface enables easy access to all eight prediction tools, the consensus classifier PredictSNP and annotations from the Protein Mutant Database and the UniProt database. The web server and the datasets are freely available to the academic community at http://loschmidt.chemi.muni.cz/predictsnp. PMID:24453961
KinSNP software for homozygosity mapping of disease genes using SNP microarrays
2010-01-01
Consanguineous families affected with a recessive genetic disease caused by homozygotisation of a mutation offer a unique advantage for positional cloning of rare diseases. Homozygosity mapping of patient genotypes is a powerful technique for the identification of the genomic locus harbouring the causing mutation. This strategy relies on the observation that in these patients a large region spanning the disease locus is also homozygous with high probability. The high marker density in single nucleotide polymorphism (SNP) arrays is extremely advantageous for homozygosity mapping. We present KinSNP, a user-friendly software tool for homozygosity mapping using SNP arrays. The software searches for stretches of SNPs which are homozygous to the same allele in all ascertained sick individuals. User-specified parameters control the number of allowed genotyping 'errors' within homozygous blocks. Candidate disease regions are then reported in a detailed, coloured Excel file, along with genotypes of family members and healthy controls. An interactive genome browser has been included which shows homozygous blocks, individual genotypes, genes and further annotations along the chromosomes, with zooming and scrolling capabilities. The software has been used to identify the location of a mutated gene causing insensitivity to pain in a large Bedouin family. KinSNP is freely available from http://bioinfo.bgu.ac.il/bsu/software/kinSNP. PMID:20846928
Wu, Jiaxin; Wu, Mengmeng; Li, Lianshuo; Liu, Zhuo; Zeng, Wanwen; Jiang, Rui
2016-01-01
The recent advancement of the next generation sequencing technology has enabled the fast and low-cost detection of all genetic variants spreading across the entire human genome, making the application of whole-genome sequencing a tendency in the study of disease-causing genetic variants. Nevertheless, there still lacks a repository that collects predictions of functionally damaging effects of human genetic variants, though it has been well recognized that such predictions play a central role in the analysis of whole-genome sequencing data. To fill this gap, we developed a database named dbWGFP (a database and web server of human whole-genome single nucleotide variants and their functional predictions) that contains functional predictions and annotations of nearly 8.58 billion possible human whole-genome single nucleotide variants. Specifically, this database integrates 48 functional predictions calculated by 17 popular computational methods and 44 valuable annotations obtained from various data sources. Standalone software, user-friendly query services and free downloads of this database are available at http://bioinfo.au.tsinghua.edu.cn/dbwgfp. dbWGFP provides a valuable resource for the analysis of whole-genome sequencing, exome sequencing and SNP array data, thereby complementing existing data sources and computational resources in deciphering genetic bases of human inherited diseases. © The Author(s) 2016. Published by Oxford University Press.
ePIANNO: ePIgenomics ANNOtation tool.
Liu, Chia-Hsin; Ho, Bing-Ching; Chen, Chun-Ling; Chang, Ya-Hsuan; Hsu, Yi-Chiung; Li, Yu-Cheng; Yuan, Shin-Sheng; Huang, Yi-Huan; Chang, Chi-Sheng; Li, Ker-Chau; Chen, Hsuan-Yu
2016-01-01
Recently, with the development of next generation sequencing (NGS), the combination of chromatin immunoprecipitation (ChIP) and NGS, namely ChIP-seq, has become a powerful technique to capture potential genomic binding sites of regulatory factors, histone modifications and chromatin accessible regions. For most researchers, additional information including genomic variations on the TF binding site, allele frequency of variation between different populations, variation associated disease, and other neighbour TF binding sites are essential to generate a proper hypothesis or a meaningful conclusion. Many ChIP-seq datasets had been deposited on the public domain to help researchers make new discoveries. However, researches are often intimidated by the complexity of data structure and largeness of data volume. Such information would be more useful if they could be combined or downloaded with ChIP-seq data. To meet such demands, we built a webtool: ePIgenomic ANNOtation tool (ePIANNO, http://epianno.stat.sinica.edu.tw/index.html). ePIANNO is a web server that combines SNP information of populations (1000 Genomes Project) and gene-disease association information of GWAS (NHGRI) with ChIP-seq (hmChIP, ENCODE, and ROADMAP epigenomics) data. ePIANNO has a user-friendly website interface allowing researchers to explore, navigate, and extract data quickly. We use two examples to demonstrate how users could use functions of ePIANNO webserver to explore useful information about TF related genomic variants. Users could use our query functions to search target regions, transcription factors, or annotations. ePIANNO may help users to generate hypothesis or explore potential biological functions for their studies.
Knoppers, Bartha M; Isasi, Rosario; Benvenisty, Nissim; Kim, Ock-Joo; Lomax, Geoffrey; Morris, Clive; Murray, Thomas H; Lee, Eng Hin; Perry, Margery; Richardson, Genevra; Sipp, Douglas; Tanner, Klaus; Wahlström, Jan; de Wert, Guido; Zeng, Fanyi
2011-09-01
Novel methods and associated tools permitting individual identification in publicly accessible SNP databases have become a debatable issue. There is growing concern that current technical and ethical safeguards to protect the identities of donors could be insufficient. In the context of human embryonic stem cell research, there are no studies focusing on the probability that an hESC line donor could be identified by analyzing published SNP profiles and associated genotypic and phenotypic information. We present the International Stem Cell Forum (ISCF) Ethics Working Party's Policy Statement on "Publishing SNP Genotypes of Human Embryonic Stem Cell Lines (hESC)". The Statement prospectively addresses issues surrounding the publication of genotypic data and associated annotations of hESC lines in open access databases. It proposes a balanced approach between the goals of open science and data sharing with the respect for fundamental bioethical principles (autonomy, privacy, beneficence, justice and research merit and integrity).
The pig genome project has plenty to squeal about.
Fan, B; Gorbach, D M; Rothschild, M F
2011-01-01
Significant progress on pig genetics and genomics research has been witnessed in recent years due to the integration of advanced molecular biology techniques, bioinformatics and computational biology, and the collaborative efforts of researchers in the swine genomics community. Progress on expanding the linkage map has slowed down, but the efforts have created a higher-resolution physical map integrating the clone map and BAC end sequence. The number of QTL mapped is still growing and most of the updated QTL mapping results are available through PigQTLdb. Additionally, expression studies using high-throughput microarrays and other gene expression techniques have made significant advancements. The number of identified non-coding RNAs is rapidly increasing and their exact regulatory functions are being explored. A publishable draft (build 10) of the swine genome sequence was available for the pig genomics community by the end of December 2010. Build 9 of the porcine genome is currently available with Ensembl annotation; manual annotation is ongoing. These drafts provide useful tools for such endeavors as comparative genomics and SNP scans for fine QTL mapping. A recent community-wide effort to create a 60K porcine SNP chip has greatly facilitated whole-genome association analyses, haplotype block construction and linkage disequilibrium mapping, which can contribute to whole-genome selection. The future 'systems biology' that integrates and optimizes the information from all research levels can enhance the pig community's understanding of the full complexity of the porcine genome. These recent technological advances and where they may lead are reviewed. Copyright © 2011 S. Karger AG, Basel.
Prospecting for pig single nucleotide polymorphisms in the human genome: have we struck gold?
Grapes, L; Rudd, S; Fernando, R L; Megy, K; Rocha, D; Rothschild, M F
2006-06-01
Gene-to-gene variation in the frequency of single nucleotide polymorphisms (SNPs) has been observed in humans, mice, rats, primates and pigs, but a relationship across species in this variation has not been described. Here, the frequency of porcine coding SNPs (cSNPs) identified by in silico methods, and the frequency of murine cSNPs, were compared with the frequency of human cSNPs across homologous genes. From 150,000 porcine expressed sequence tag (EST) sequences, a total of 452 SNP-containing sequence clusters were found, totalling 1394 putative SNPs. All the clustered porcine EST annotations and SNP data have been made publicly available at http://sputnik.btk.fi/project?name=swine. Human and murine cSNPs were identified from dbSNP and were characterized as either validated or total number of cSNPs (validated plus non-validated) for comparison purposes. The correlation between in silico pig cSNP and validated human cSNP densities was found to be 0.77 (p < 0.00001) for a set of 25 homologous genes, while a correlation of 0.48 (p < 0.0005) was found for a primarily random sample of 50 homologous human and mouse genes. This is the first evidence of conserved gene-to-gene variability in cSNP frequency across species and indicates that site-directed screening of porcine genes that are homologous to cSNP-rich human genes may rapidly advance cSNP discovery in pigs.
KinSNP software for homozygosity mapping of disease genes using SNP microarrays.
Amir, El-Ad David; Bartal, Ofer; Morad, Efrat; Nagar, Tal; Sheynin, Jony; Parvari, Ruti; Chalifa-Caspi, Vered
2010-08-01
Consanguineous families affected with a recessive genetic disease caused by homozygotisation of a mutation offer a unique advantage for positional cloning of rare diseases. Homozygosity mapping of patient genotypes is a powerful technique for the identification of the genomic locus harbouring the causing mutation. This strategy relies on the observation that in these patients a large region spanning the disease locus is also homozygous with high probability. The high marker density in single nucleotide polymorphism (SNP) arrays is extremely advantageous for homozygosity mapping. We present KinSNP, a user-friendly software tool for homozygosity mapping using SNP arrays. The software searches for stretches of SNPs which are homozygous to the same allele in all ascertained sick individuals. User-specified parameters control the number of allowed genotyping 'errors' within homozygous blocks. Candidate disease regions are then reported in a detailed, coloured Excel file, along with genotypes of family members and healthy controls. An interactive genome browser has been included which shows homozygous blocks, individual genotypes, genes and further annotations along the chromosomes, with zooming and scrolling capabilities. The software has been used to identify the location of a mutated gene causing insensitivity to pain in a large Bedouin family. KinSNP is freely available from.
Huang, Zhen; Peng, Gary; Liu, Xunjia; Deora, Abhinandan; Falk, Kevin C.; Gossen, Bruce D.; McDonald, Mary R.; Yu, Fengqun
2017-01-01
Clubroot, caused by Plasmodiophora brassicae, is an important disease of canola (Brassica napus) in western Canada and worldwide. In this study, a clubroot resistance gene (Rcr2) was identified and fine mapped in Chinese cabbage cv. “Jazz” using single-nucleotide polymorphisms (SNP) markers identified from bulked segregant RNA sequencing (BSR-Seq) and molecular markers were developed for use in marker assisted selection. In total, 203.9 million raw reads were generated from one pooled resistant (R) and one pooled susceptible (S) sample, and >173,000 polymorphic SNP sites were identified between the R and S samples. One significant peak was observed between 22 and 26 Mb of chromosome A03, which had been predicted by BSR-Seq to contain the causal gene Rcr2. There were 490 polymorphic SNP sites identified in the region. A segregating population consisting of 675 plants was analyzed with 15 SNP sites in the region using the Kompetitive Allele Specific PCR method, and Rcr2 was fine mapped between two SNP markers, SNP_A03_32 and SNP_A03_67 with 0.1 and 0.3 cM from Rcr2, respectively. Five SNP markers co-segregated with Rcr2 in this region. Variants were identified in 14 of 36 genes annotated in the Rcr2 target region. The numbers of poly variants differed among the genes. Four genes encode TIR-NBS-LRR proteins and two of them Bra019410 and Bra019413, had high numbers of polymorphic variants and so are the most likely candidates of Rcr2. PMID:28894454
Nunes, José de Ribamar da Silva; Liu, Shikai; Pértille, Fábio; Perazza, Caio Augusto; Villela, Priscilla Marqui Schmidt; de Almeida-Val, Vera Maria Fonseca; Hilsdorf, Alexandre Wagner Silva; Liu, Zhanjiang; Coutinho, Luiz Lehmann
2017-01-01
Colossoma macropomum, or tambaqui, is the largest native Characiform species found in the Amazon and Orinoco river basins, yet few resources for genetic studies and the genetic improvement of tambaqui exist. In this study, we identified a large number of single-nucleotide polymorphisms (SNPs) for tambaqui and constructed a high-resolution genetic linkage map from a full-sib family of 124 individuals and their parents using the genotyping by sequencing method. In all, 68,584 SNPs were initially identified using minimum minor allele frequency (MAF) of 5%. Filtering parameters were used to select high-quality markers for linkage analysis. We selected 7,734 SNPs for linkage mapping, resulting in 27 linkage groups with a minimum logarithm of odds (LOD) of 8 and maximum recombination fraction of 0.35. The final genetic map contains 7,192 successfully mapped markers that span a total of 2,811 cM, with an average marker interval of 0.39 cM. Comparative genomic analysis between tambaqui and zebrafish revealed variable levels of genomic conservation across the 27 linkage groups which allowed for functional SNP annotations. The large-scale SNP discovery obtained here, allowed us to build a high-density linkage map in tambaqui, which will be useful to enhance genetic studies that can be applied in breeding programs. PMID:28387238
SNPranker 2.0: a gene-centric data mining tool for diseases associated SNP prioritization in GWAS.
Merelli, Ivan; Calabria, Andrea; Cozzi, Paolo; Viti, Federica; Mosca, Ettore; Milanesi, Luciano
2013-01-01
The capability of correlating specific genotypes with human diseases is a complex issue in spite of all advantages arisen from high-throughput technologies, such as Genome Wide Association Studies (GWAS). New tools for genetic variants interpretation and for Single Nucleotide Polymorphisms (SNPs) prioritization are actually needed. Given a list of the most relevant SNPs statistically associated to a specific pathology as result of a genotype study, a critical issue is the identification of genes that are effectively related to the disease by re-scoring the importance of the identified genetic variations. Vice versa, given a list of genes, it can be of great importance to predict which SNPs can be involved in the onset of a particular disease, in order to focus the research on their effects. We propose a new bioinformatics approach to support biological data mining in the analysis and interpretation of SNPs associated to pathologies. This system can be employed to design custom genotyping chips for disease-oriented studies and to re-score GWAS results. The proposed method relies (1) on the data integration of public resources using a gene-centric database design, (2) on the evaluation of a set of static biomolecular annotations, defined as features, and (3) on the SNP scoring function, which computes SNP scores using parameters and weights set by users. We employed a machine learning classifier to set default feature weights and an ontological annotation layer to enable the enrichment of the input gene set. We implemented our method as a web tool called SNPranker 2.0 (http://www.itb.cnr.it/snpranker), improving our first published release of this system. A user-friendly interface allows the input of a list of genes, SNPs or a biological process, and to customize the features set with relative weights. As result, SNPranker 2.0 returns a list of SNPs, localized within input and ontologically enriched genes, combined with their prioritization scores. Different databases and resources are already available for SNPs annotation, but they do not prioritize or re-score SNPs relying on a-priori biomolecular knowledge. SNPranker 2.0 attempts to fill this gap through a user-friendly integrated web resource. End users, such as researchers in medical genetics and epidemiology, may find in SNPranker 2.0 a new tool for data mining and interpretation able to support SNPs analysis. Possible scenarios are GWAS data re-scoring, SNPs selection for custom genotyping arrays and SNPs/diseases association studies.
Yang, Qing; Sun, Fanyue; Yang, Zhi; Li, Hongjun
2014-01-01
Calanus sinicus Brodsky (Copepoda, Crustacea) is a dominant zooplanktonic species widely distributed in the margin seas of the Northwest Pacific Ocean. In this study, we utilized an RNA-Seq-based approach to develop molecular resources for C. sinicus. Adult samples were sequenced using the Illumina HiSeq 2000 platform. The sequencing data generated 69,751 contigs from 58.9 million filtered reads. The assembled contigs had an average length of 928.8 bp. Gene annotation allowed the identification of 43,417 unigene hits against the NCBI database. Gene ontology (GO) and KEGG pathway mapping analysis revealed various functional genes related to diverse biological functions and processes. Transcripts potentially involved in stress response and lipid metabolism were identified among these genes. Furthermore, 4,871 microsatellites and 110,137 single nucleotide polymorphisms (SNPs) were identified in the C. sinicus transcriptome sequences. SNP validation by the melting temperature (T m)-shift method suggested that 16 primer pairs amplified target products and showed biallelic polymorphism among 30 individuals. The present work demonstrates the power of Illumina-based RNA-Seq for the rapid development of molecular resources in nonmodel species. The validated SNP set from our study is currently being utilized in an ongoing ecological analysis to support a future study of C. sinicus population genetics. PMID:24982883
2009-01-01
Background Polymerase chain reaction (PCR) is very useful in many areas of molecular biology research. It is commonly observed that PCR success is critically dependent on design of an effective primer pair. Current tools for primer design do not adequately address the problem of PCR failure due to mis-priming on target-related sequences and structural variations in the genome. Methods We have developed an integrated graphical web-based application for primer design, called RExPrimer, which was written in Python language. The software uses Primer3 as the primer designing core algorithm. Locally stored sequence information and genomic variant information were hosted on MySQLv5.0 and were incorporated into RExPrimer. Results RExPrimer provides many functionalities for improved PCR primer design. Several databases, namely annotated human SNP databases, insertion/deletion (indel) polymorphisms database, pseudogene database, and structural genomic variation databases were integrated into RExPrimer, enabling an effective without-leaving-the-website validation of the resulting primers. By incorporating these databases, the primers reported by RExPrimer avoid mis-priming to related sequences (e.g. pseudogene, segmental duplication) as well as possible PCR failure because of structural polymorphisms (SNP, indel, and copy number variation (CNV)). To prevent mismatching caused by unexpected SNPs in the designed primers, in particular the 3' end (SNP-in-Primer), several SNP databases covering the broad range of population-specific SNP information are utilized to report SNPs present in the primer sequences. Population-specific SNP information also helps customize primer design for a specific population. Furthermore, RExPrimer offers a graphical user-friendly interface through the use of scalable vector graphic image that intuitively presents resulting primers along with the corresponding gene structure. In this study, we demonstrated the program effectiveness in successfully generating primers for strong homologous sequences. Conclusion The improvements for primer design incorporated into RExPrimer were demonstrated to be effective in designing primers for challenging PCR experiments. Integration of SNP and structural variation databases allows for robust primer design for a variety of PCR applications, irrespective of the sequence complexity in the region of interest. This software is freely available at http://www4a.biotec.or.th/rexprimer. PMID:19958502
Gao, Guangtu; Nome, Torfinn; Pearse, Devon E; Moen, Thomas; Naish, Kerry A; Thorgaard, Gary H; Lien, Sigbjørn; Palti, Yniv
2018-01-01
Single-nucleotide polymorphisms (SNPs) are highly abundant markers, which are broadly distributed in animal genomes. For rainbow trout ( Oncorhynchus mykiss ), SNP discovery has been previously done through sequencing of restriction-site associated DNA (RAD) libraries, reduced representation libraries (RRL) and RNA sequencing. Recently we have performed high coverage whole genome resequencing with 61 unrelated samples, representing a wide range of rainbow trout and steelhead populations, with 49 new samples added to 12 aquaculture samples from AquaGen (Norway) that we previously used for SNP discovery. Of the 49 new samples, 11 were double-haploid lines from Washington State University (WSU) and 38 represented wild and hatchery populations from a wide range of geographic distribution and with divergent migratory phenotypes. We then mapped the sequences to the new rainbow trout reference genome assembly (GCA_002163495.1) which is based on the Swanson YY doubled haploid line. Variant calling was conducted with FreeBayes and SAMtools mpileup , followed by filtering of SNPs based on quality score, sequence complexity, read depth on the locus, and number of genotyped samples. Results from the two variant calling programs were compared and genotypes of the double haploid samples were used for detecting and filtering putative paralogous sequence variants (PSVs) and multi-sequence variants (MSVs). Overall, 30,302,087 SNPs were identified on the rainbow trout genome 29 chromosomes and 1,139,018 on unplaced scaffolds, with 4,042,723 SNPs having high minor allele frequency (MAF > 0.25). The average SNP density on the chromosomes was one SNP per 64 bp, or 15.6 SNPs per 1 kb. Results from the phylogenetic analysis that we conducted indicate that the SNP markers contain enough population-specific polymorphisms for recovering population relationships despite the small sample size used. Intra-Population polymorphism assessment revealed high level of polymorphism and heterozygosity within each population. We also provide functional annotation based on the genome position of each SNP and evaluate the use of clonal lines for filtering of PSVs and MSVs. These SNPs form a new database, which provides an important resource for a new high density SNP array design and for other SNP genotyping platforms used for genetic and genomics studies of this iconic salmonid fish species.
Yue, Ming; Zhou, Dianshuang; Zhi, Hui; Wang, Peng; Zhang, Yan; Gao, Yue; Guo, Maoni; Li, Xin; Wang, Yanxia
2018-01-01
Abstract The MiRNA SNP Disease Database (MSDD, http://www.bio-bigdata.com/msdd/) is a manually curated database that provides comprehensive experimentally supported associations among microRNAs (miRNAs), single nucleotide polymorphisms (SNPs) and human diseases. SNPs in miRNA-related functional regions such as mature miRNAs, promoter regions, pri-miRNAs, pre-miRNAs and target gene 3′-UTRs, collectively called ‘miRSNPs’, represent a novel category of functional molecules. miRSNPs can lead to miRNA and its target gene dysregulation, and resulting in susceptibility to or onset of human diseases. A curated collection and summary of miRSNP-associated diseases is essential for a thorough understanding of the mechanisms and functions of miRSNPs. Here, we describe MSDD, which currently documents 525 associations among 182 human miRNAs, 197 SNPs, 153 genes and 164 human diseases through a review of more than 2000 published papers. Each association incorporates information on the miRNAs, SNPs, miRNA target genes and disease names, SNP locations and alleles, the miRNA dysfunctional pattern, experimental techniques, a brief functional description, the original reference and additional annotation. MSDD provides a user-friendly interface to conveniently browse, retrieve, download and submit novel data. MSDD will significantly improve our understanding of miRNA dysfunction in disease, and thus, MSDD has the potential to serve as a timely and valuable resource. PMID:29106642
Yue, Ming; Zhou, Dianshuang; Zhi, Hui; Wang, Peng; Zhang, Yan; Gao, Yue; Guo, Maoni; Li, Xin; Wang, Yanxia; Zhang, Yunpeng; Ning, Shangwei; Li, Xia
2018-01-04
The MiRNA SNP Disease Database (MSDD, http://www.bio-bigdata.com/msdd/) is a manually curated database that provides comprehensive experimentally supported associations among microRNAs (miRNAs), single nucleotide polymorphisms (SNPs) and human diseases. SNPs in miRNA-related functional regions such as mature miRNAs, promoter regions, pri-miRNAs, pre-miRNAs and target gene 3'-UTRs, collectively called 'miRSNPs', represent a novel category of functional molecules. miRSNPs can lead to miRNA and its target gene dysregulation, and resulting in susceptibility to or onset of human diseases. A curated collection and summary of miRSNP-associated diseases is essential for a thorough understanding of the mechanisms and functions of miRSNPs. Here, we describe MSDD, which currently documents 525 associations among 182 human miRNAs, 197 SNPs, 153 genes and 164 human diseases through a review of more than 2000 published papers. Each association incorporates information on the miRNAs, SNPs, miRNA target genes and disease names, SNP locations and alleles, the miRNA dysfunctional pattern, experimental techniques, a brief functional description, the original reference and additional annotation. MSDD provides a user-friendly interface to conveniently browse, retrieve, download and submit novel data. MSDD will significantly improve our understanding of miRNA dysfunction in disease, and thus, MSDD has the potential to serve as a timely and valuable resource. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
DoGSD: the dog and wolf genome SNP database.
Bai, Bing; Zhao, Wen-Ming; Tang, Bi-Xia; Wang, Yan-Qing; Wang, Lu; Zhang, Zhang; Yang, He-Chuan; Liu, Yan-Hu; Zhu, Jun-Wei; Irwin, David M; Wang, Guo-Dong; Zhang, Ya-Ping
2015-01-01
The rapid advancement of next-generation sequencing technology has generated a deluge of genomic data from domesticated dogs and their wild ancestor, grey wolves, which have simultaneously broadened our understanding of domestication and diseases that are shared by humans and dogs. To address the scarcity of single nucleotide polymorphism (SNP) data provided by authorized databases and to make SNP data more easily/friendly usable and available, we propose DoGSD (http://dogsd.big.ac.cn), the first canidae-specific database which focuses on whole genome SNP data from domesticated dogs and grey wolves. The DoGSD is a web-based, open-access resource comprising ∼ 19 million high-quality whole-genome SNPs. In addition to the dbSNP data set (build 139), DoGSD incorporates a comprehensive collection of SNPs from two newly sequenced samples (1 wolf and 1 dog) and collected SNPs from three latest dog/wolf genetic studies (7 wolves and 68 dogs), which were taken together for analysis with the population genetic statistics, Fst. In addition, DoGSD integrates some closely related information including SNP annotation, summary lists of SNPs located in genes, synonymous and non-synonymous SNPs, sampling location and breed information. All these features make DoGSD a useful resource for in-depth analysis in dog-/wolf-related studies. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
Locus category based analysis of a large genome-wide association study of rheumatoid arthritis
Freudenberg, Jan; Lee, Annette T.; Siminovitch, Katherine A.; Amos, Christopher I.; Ballard, David; Li, Wentian; Gregersen, Peter K.
2010-01-01
To pinpoint true positive single-nucleotide polymorphism (SNP) associations in a genome-wide association study (GWAS) of rheumatoid arthritis (RA), we categorize genetic loci by external knowledge. We test both the ‘enrichment of associated loci’ in a locus category and the ‘combined association’ of a locus category. The former is quantified by the odds ratio for the presence of SNP associations at the loci of a category, whereas the latter is quantified by the number of loci in a category that have SNP associations. These measures are compared with their expected values as obtained from the permutation of the affection status. To account for linkage disequilibrium (LD) among SNPs, we view each LD block as a genetic locus. Positional candidates were defined as loci implicated by earlier GWAS results, whereas functional candidates were defined by annotations regarding the molecular roles of genes, such as gene ontology categories. As expected, immune-related categories show the largest enrichment signal, although it is not very strong. The intersection of positional and functional candidate information predicts novel RA loci near the genes TEC/TXK, MBL2 and PIK3R1/CD180. Notably, a combined association signal is not only produced by immune-related categories, but also by most other categories and even randomly defined categories. The unspecific quality of these signals limits the possible conclusions from combined association tests. It also reduces the magnitude of enrichment test results. These unspecific signals might result from common variants of small effect and hardly concentrated in candidate categories, or an inflated size of associated regions from weak LD with infrequent mutations. PMID:20639398
Goettel, Wolfgang; Xia, Eric; Upchurch, Robert; Wang, Ming-Li; Chen, Pengyin; An, Yong-Qiang Charles
2014-04-23
Variation in seed oil composition and content among soybean varieties is largely attributed to differences in transcript sequences and/or transcript accumulation of oil production related genes in seeds. Discovery and analysis of sequence and expression variations in these genes will accelerate soybean oil quality improvement. In an effort to identify these variations, we sequenced the transcriptomes of soybean seeds from nine lines varying in oil composition and/or total oil content. Our results showed that 69,338 distinct transcripts from 32,885 annotated genes were expressed in seeds. A total of 8,037 transcript expression polymorphisms and 50,485 transcript sequence polymorphisms (48,792 SNPs and 1,693 small Indels) were identified among the lines. Effects of the transcript polymorphisms on their encoded protein sequences and functions were predicted. The studies also provided independent evidence that the lack of FAD2-1A gene activity and a non-synonymous SNP in the coding sequence of FAB2C caused elevated oleic acid and stearic acid levels in soybean lines M23 and FAM94-41, respectively. As a proof-of-concept, we developed an integrated RNA-seq and bioinformatics approach to identify and functionally annotate transcript polymorphisms, and demonstrated its high effectiveness for discovery of genetic and transcript variations that result in altered oil quality traits. The collection of transcript polymorphisms coupled with their predicted functional effects will be a valuable asset for further discovery of genes, gene variants, and functional markers to improve soybean oil quality.
MultiBLUP: improved SNP-based prediction for complex traits.
Speed, Doug; Balding, David J
2014-09-01
BLUP (best linear unbiased prediction) is widely used to predict complex traits in plant and animal breeding, and increasingly in human genetics. The BLUP mathematical model, which consists of a single random effect term, was adequate when kinships were measured from pedigrees. However, when genome-wide SNPs are used to measure kinships, the BLUP model implicitly assumes that all SNPs have the same effect-size distribution, which is a severe and unnecessary limitation. We propose MultiBLUP, which extends the BLUP model to include multiple random effects, allowing greatly improved prediction when the random effects correspond to classes of SNPs with distinct effect-size variances. The SNP classes can be specified in advance, for example, based on SNP functional annotations, and we also provide an adaptive procedure for determining a suitable partition of SNPs. We apply MultiBLUP to genome-wide association data from the Wellcome Trust Case Control Consortium (seven diseases), and from much larger studies of celiac disease and inflammatory bowel disease, finding that it consistently provides better prediction than alternative methods. Moreover, MultiBLUP is computationally very efficient; for the largest data set, which includes 12,678 individuals and 1.5 M SNPs, the total analysis can be run on a single desktop PC in less than a day and can be parallelized to run even faster. Tools to perform MultiBLUP are freely available in our software LDAK. © 2014 Speed and Balding; Published by Cold Spring Harbor Laboratory Press.
Nicolazzi, Ezequiel L; Caprera, Andrea; Nazzicari, Nelson; Cozzi, Paolo; Strozzi, Francesco; Lawley, Cindy; Pirani, Ali; Soans, Chandrasen; Brew, Fiona; Jorjani, Hossein; Evans, Gary; Simpson, Barry; Tosser-Klopp, Gwenola; Brauning, Rudiger; Williams, John L; Stella, Alessandra
2015-04-10
In recent years, the use of genomic information in livestock species for genetic improvement, association studies and many other fields has become routine. In order to accommodate different market requirements in terms of genotyping cost, manufacturers of single nucleotide polymorphism (SNP) arrays, private companies and international consortia have developed a large number of arrays with different content and different SNP density. The number of currently available SNP arrays differs among species: ranging from one for goats to more than ten for cattle, and the number of arrays available is increasing rapidly. However, there is limited or no effort to standardize and integrate array- specific (e.g. SNP IDs, allele coding) and species-specific (i.e. past and current assemblies) SNP information. Here we present SNPchiMp v.3, a solution to these issues for the six major livestock species (cow, pig, horse, sheep, goat and chicken). Original data was collected directly from SNP array producers and specific international genome consortia, and stored in a MySQL database. The database was then linked to an open-access web tool and to public databases. SNPchiMp v.3 ensures fast access to the database (retrieving within/across SNP array data) and the possibility of annotating SNP array data in a user-friendly fashion. This platform allows easy integration and standardization, and it is aimed at both industry and research. It also enables users to easily link the information available from the array producer with data in public databases, without the need of additional bioinformatics tools or pipelines. In recognition of the open-access use of Ensembl resources, SNPchiMp v.3 was officially credited as an Ensembl E!mpowered tool. Availability at http://bioinformatics.tecnoparco.org/SNPchimp.
Dodhia, Kejal; Stoll, Thomas; Hastie, Marcus; Furuki, Eiko; Ellwood, Simon R.; Williams, Angela H.; Tan, Yew-Foon; Testa, Alison C.; Gorman, Jeffrey J.; Oliver, Richard P.
2016-01-01
Parastagonospora nodorum, the causal agent of Septoria nodorum blotch (SNB), is an economically important pathogen of wheat (Triticum spp.), and a model for the study of necrotrophic pathology and genome evolution. The reference P. nodorum strain SN15 was the first Dothideomycete with a published genome sequence, and has been used as the basis for comparison within and between species. Here we present an updated reference genome assembly with corrections of SNP and indel errors in the underlying genome assembly from deep resequencing data as well as extensive manual annotation of gene models using transcriptomic and proteomic sources of evidence (https://github.com/robsyme/Parastagonospora_nodorum_SN15). The updated assembly and annotation includes 8,366 genes with modified protein sequence and 866 new genes. This study shows the benefits of using a wide variety of experimental methods allied to expert curation to generate a reliable set of gene models. PMID:26840125
Zhang, Kunlin; Chang, Suhua; Cui, Sijia; Guo, Liyuan; Zhang, Liuyan; Wang, Jing
2011-07-01
Genome-wide association study (GWAS) is widely utilized to identify genes involved in human complex disease or some other trait. One key challenge for GWAS data interpretation is to identify causal SNPs and provide profound evidence on how they affect the trait. Currently, researches are focusing on identification of candidate causal variants from the most significant SNPs of GWAS, while there is lack of support on biological mechanisms as represented by pathways. Although pathway-based analysis (PBA) has been designed to identify disease-related pathways by analyzing the full list of SNPs from GWAS, it does not emphasize on interpreting causal SNPs. To our knowledge, so far there is no web server available to solve the challenge for GWAS data interpretation within one analytical framework. ICSNPathway is developed to identify candidate causal SNPs and their corresponding candidate causal pathways from GWAS by integrating linkage disequilibrium (LD) analysis, functional SNP annotation and PBA. ICSNPathway provides a feasible solution to bridge the gap between GWAS and disease mechanism study by generating hypothesis of SNP → gene → pathway(s). The ICSNPathway server is freely available at http://icsnpathway.psych.ac.cn/.
Li, Wan; Zhu, Lina; Huang, Hao; He, Yuehan; Lv, Junjie; Li, Weimin; Chen, Lina; He, Weiming
2017-10-01
Complex chronic diseases are caused by the effects of genetic and environmental factors. Single nucleotide polymorphisms (SNPs), one common type of genetic variations, played vital roles in diseases. We hypothesized that disease risk functional SNPs in coding regions and protein interaction network modules were more likely to contribute to the identification of disease susceptible genes for complex chronic diseases. This could help to further reveal the pathogenesis of complex chronic diseases. Disease risk SNPs were first recognized from public SNP data for coronary heart disease (CHD), hypertension (HT) and type 2 diabetes (T2D). SNPs in coding regions that were classified into nonsense and missense by integrating several SNP functional annotation databases were treated as functional SNPs. Then, regions significantly associated with each disease were screened using random permutations for disease risk functional SNPs. Corresponding to these regions, 155, 169 and 173 potential disease susceptible genes were identified for CHD, HT and T2D, respectively. A disease-related gene product interaction network in environmental context was constructed for interacting gene products of both disease genes and potential disease susceptible genes for these diseases. After functional enrichment analysis for disease associated modules, 5 CHD susceptible genes, 7 HT susceptible genes and 3 T2D susceptible genes were finally identified, some of which had pleiotropic effects. Most of these genes were verified to be related to these diseases in literature. This was similar for disease genes identified from another method proposed by Lee et al. from a different aspect. This research could provide novel perspectives for diagnosis and treatment of complex chronic diseases and susceptible genes identification for other diseases. Copyright © 2017 Elsevier Inc. All rights reserved.
Cronin, Matthew A; Rincon, Gonzalo; Meredith, Robert W; MacNeil, Michael D; Islas-Trejo, Alma; Cánovas, Angela; Medrano, Juan F
2014-01-01
We assessed the relationships of polar bears (Ursus maritimus), brown bears (U. arctos), and black bears (U. americanus) with high throughput genomic sequencing data with an average coverage of 25× for each species. A total of 1.4 billion 100-bp paired-end reads were assembled using the polar bear and annotated giant panda (Ailuropoda melanoleuca) genome sequences as references. We identified 13.8 million single nucleotide polymorphisms (SNP) in the 3 species aligned to the polar bear genome. These data indicate that polar bears and brown bears share more SNP with each other than either does with black bears. Concatenation and coalescence-based analysis of consensus sequences of approximately 1 million base pairs of ultraconserved elements in the nuclear genome resulted in a phylogeny with black bears as the sister group to brown and polar bears, and all brown bears are in a separate clade from polar bears. Genotypes for 162 SNP loci of 336 bears from Alaska and Montana showed that the species are genetically differentiated and there is geographic population structure of brown and black bears but not polar bears.
Otero, José Manuel; Vongsangnak, Wanwipa; Asadollahi, Mohammad A; Olivares-Hernandes, Roberto; Maury, Jérôme; Farinelli, Laurent; Barlocher, Loïc; Osterås, Magne; Schalk, Michel; Clark, Anthony; Nielsen, Jens
2010-12-22
The need for rapid and efficient microbial cell factory design and construction are possible through the enabling technology, metabolic engineering, which is now being facilitated by systems biology approaches. Metabolic engineering is often complimented by directed evolution, where selective pressure is applied to a partially genetically engineered strain to confer a desirable phenotype. The exact genetic modification or resulting genotype that leads to the improved phenotype is often not identified or understood to enable further metabolic engineering. In this work we performed whole genome high-throughput sequencing and annotation can be used to identify single nucleotide polymorphisms (SNPs) between Saccharomyces cerevisiae strains S288c and CEN.PK113-7D. The yeast strain S288c was the first eukaryote sequenced, serving as the reference genome for the Saccharomyces Genome Database, while CEN.PK113-7D is a preferred laboratory strain for industrial biotechnology research. A total of 13,787 high-quality SNPs were detected between both strains (reference strain: S288c). Considering only metabolic genes (782 of 5,596 annotated genes), a total of 219 metabolism specific SNPs are distributed across 158 metabolic genes, with 85 of the SNPs being nonsynonymous (e.g., encoding amino acid modifications). Amongst metabolic SNPs detected, there was pathway enrichment in the galactose uptake pathway (GAL1, GAL10) and ergosterol biosynthetic pathway (ERG8, ERG9). Physiological characterization confirmed a strong deficiency in galactose uptake and metabolism in S288c compared to CEN.PK113-7D, and similarly, ergosterol content in CEN.PK113-7D was significantly higher in both glucose and galactose supplemented cultivations compared to S288c. Furthermore, DNA microarray profiling of S288c and CEN.PK113-7D in both glucose and galactose batch cultures did not provide a clear hypothesis for major phenotypes observed, suggesting that genotype to phenotype correlations are manifested post-transcriptionally or post-translationally either through protein concentration and/or function. With an intensifying need for microbial cell factories that produce a wide array of target compounds, whole genome high-throughput sequencing and annotation for SNP detection can aid in better reducing and defining the metabolic landscape. This work demonstrates direct correlations between genotype and phenotype that provides clear and high-probability of success metabolic engineering targets. The genome sequence, annotation, and a SNP viewer of CEN.PK113-7D are deposited at http://www.sysbio.se/cenpk.
2010-01-01
Background The need for rapid and efficient microbial cell factory design and construction are possible through the enabling technology, metabolic engineering, which is now being facilitated by systems biology approaches. Metabolic engineering is often complimented by directed evolution, where selective pressure is applied to a partially genetically engineered strain to confer a desirable phenotype. The exact genetic modification or resulting genotype that leads to the improved phenotype is often not identified or understood to enable further metabolic engineering. Results In this work we performed whole genome high-throughput sequencing and annotation can be used to identify single nucleotide polymorphisms (SNPs) between Saccharomyces cerevisiae strains S288c and CEN.PK113-7D. The yeast strain S288c was the first eukaryote sequenced, serving as the reference genome for the Saccharomyces Genome Database, while CEN.PK113-7D is a preferred laboratory strain for industrial biotechnology research. A total of 13,787 high-quality SNPs were detected between both strains (reference strain: S288c). Considering only metabolic genes (782 of 5,596 annotated genes), a total of 219 metabolism specific SNPs are distributed across 158 metabolic genes, with 85 of the SNPs being nonsynonymous (e.g., encoding amino acid modifications). Amongst metabolic SNPs detected, there was pathway enrichment in the galactose uptake pathway (GAL1, GAL10) and ergosterol biosynthetic pathway (ERG8, ERG9). Physiological characterization confirmed a strong deficiency in galactose uptake and metabolism in S288c compared to CEN.PK113-7D, and similarly, ergosterol content in CEN.PK113-7D was significantly higher in both glucose and galactose supplemented cultivations compared to S288c. Furthermore, DNA microarray profiling of S288c and CEN.PK113-7D in both glucose and galactose batch cultures did not provide a clear hypothesis for major phenotypes observed, suggesting that genotype to phenotype correlations are manifested post-transcriptionally or post-translationally either through protein concentration and/or function. Conclusions With an intensifying need for microbial cell factories that produce a wide array of target compounds, whole genome high-throughput sequencing and annotation for SNP detection can aid in better reducing and defining the metabolic landscape. This work demonstrates direct correlations between genotype and phenotype that provides clear and high-probability of success metabolic engineering targets. The genome sequence, annotation, and a SNP viewer of CEN.PK113-7D are deposited at http://www.sysbio.se/cenpk. PMID:21176163
A tool for selecting SNPs for association studies based on observed linkage disequilibrium patterns.
De La Vega, Francisco M; Isaac, Hadar I; Scafe, Charles R
2006-01-01
The design of genetic association studies using single-nucleotide polymorphisms (SNPs) requires the selection of subsets of the variants providing high statistical power at a reasonable cost. SNPs must be selected to maximize the probability that a causative mutation is in linkage disequilibrium (LD) with at least one marker genotyped in the study. The HapMap project performed a genome-wide survey of genetic variation with about a million SNPs typed in four populations, providing a rich resource to inform the design of association studies. A number of strategies have been proposed for the selection of SNPs based on observed LD, including construction of metric LD maps and the selection of haplotype tagging SNPs. Power calculations are important at the study design stage to ensure successful results. Integrating these methods and annotations can be challenging: the algorithms required to implement these methods are complex to deploy, and all the necessary data and annotations are deposited in disparate databases. Here, we present the SNPbrowser Software, a freely available tool to assist in the LD-based selection of markers for association studies. This stand-alone application provides fast query capabilities and swift visualization of SNPs, gene annotations, power, haplotype blocks, and LD map coordinates. Wizards implement several common SNP selection workflows including the selection of optimal subsets of SNPs (e.g. tagging SNPs). Selected SNPs are screened for their conversion potential to either TaqMan SNP Genotyping Assays or the SNPlex Genotyping System, two commercially available genotyping platforms, expediting the set-up of genetic studies with an increased probability of success.
SNP ID-info: SNP ID searching and visualization platform.
Yang, Cheng-Hong; Chuang, Li-Yeh; Cheng, Yu-Huei; Wen, Cheng-Hao; Chang, Phei-Lang; Chang, Hsueh-Wei
2008-09-01
Many association studies provide the relationship between single nucleotide polymorphisms (SNPs), diseases and cancers, without giving a SNP ID, however. Here, we developed the SNP ID-info freeware to provide the SNP IDs within inputting genetic and physical information of genomes. The program provides an "SNP-ePCR" function to generate the full-sequence using primers and template inputs. In "SNPosition," sequence from SNP-ePCR or direct input is fed to match the SNP IDs from SNP fasta-sequence. In "SNP search" and "SNP fasta" function, information of SNPs within the cytogenetic band, contig position, and keyword input are acceptable. Finally, the SNP ID neighboring environment for inputs is completely visualized in the order of contig position and marked with SNP and flanking hits. The SNP identification problems inherent in NCBI SNP BLAST are also avoided. In conclusion, the SNP ID-info provides a visualized SNP ID environment for multiple inputs and assists systematic SNP association studies. The server and user manual are available at http://bio.kuas.edu.tw/snpid-info.
Transcriptome-Based Differentiation of Closely-Related Miscanthus Lines
Chouvarine, Philippe; Cooksey, Amanda M.; McCarthy, Fiona M.; ...
2012-01-10
Distinguishing between individuals is critical to those conducting animal/plant breeding, food safety/quality research, diagnostic and clinical testing, and evolutionary biology studies. Classical genetic identification studies are based on marker polymorphisms, but polymorphism-based techniques are time and labor intensive and often cannot distinguish between closely related individuals. Illumina sequencing technologies provide the detailed sequence data required for rapid and efficient differentiation of related species, lines/cultivars, and individuals in a cost-effective manner. Here we describe the use of Illumina high-throughput exome sequencing, coupled with SNP mapping, as a rapid means of distinguishing between related cultivars of the lignocellulosic bioenergy crop giant miscanthusmore » (Miscanthus6giganteus). We provide the first exome sequence database for Miscanthus species complete with Gene Ontology (GO) functional annotations."« less
Ahn, Yul-Kyun; Tripathi, Swati; Kim, Jeong-Ho; Cho, Young-Il; Lee, Hye-Eun; Kim, Do-Sun; Woo, Jong-Gyu; Cho, Myeong-Cheoul
2014-01-10
Next generation sequencing technologies have proven to be a rapid and cost-effective means to assemble and characterize gene content and identify molecular markers in various organisms. Pepper (Capsicum annuum L., Solanaceae) is a major staple vegetable crop, which is economically important and has worldwide distribution. High-throughput transcriptome profiling of two pepper cultivars, Mandarin and Blackcluster, using 454 GS-FLX pyrosequencing yielded 279,221 and 316,357 sequenced reads with a total 120.44 and 142.54Mb of sequence data (average read length of 431 and 450 nucleotides). These reads resulted from 17,525 and 16,341 'isogroups' and were assembled into 19,388 and 18,057 isotigs, and 22,217 and 13,153 singletons for both the cultivars, respectively. Assembled sequences were annotated functionally based on homology to genes in multiple public databases. Detailed sequence variant analysis identified a total of 9701 and 12,741 potential SNPs which eventually resulted in 1025 and 1059 genotype specific SNPs, for both the varieties, respectively, after examining SNP frequency distribution for each mapped unigenes. These markers for pepper will be highly valuable for marker-assisted breeding and other genetic studies. © 2013 Elsevier B.V. All rights reserved.
Mi, Huaiyu; Huang, Xiaosong; Muruganujan, Anushya; Tang, Haiming; Mills, Caitlin; Kang, Diane; Thomas, Paul D
2017-01-04
The PANTHER database (Protein ANalysis THrough Evolutionary Relationships, http://pantherdb.org) contains comprehensive information on the evolution and function of protein-coding genes from 104 completely sequenced genomes. PANTHER software tools allow users to classify new protein sequences, and to analyze gene lists obtained from large-scale genomics experiments. In the past year, major improvements include a large expansion of classification information available in PANTHER, as well as significant enhancements to the analysis tools. Protein subfamily functional classifications have more than doubled due to progress of the Gene Ontology Phylogenetic Annotation Project. For human genes (as well as a few other organisms), PANTHER now also supports enrichment analysis using pathway classifications from the Reactome resource. The gene list enrichment tools include a new 'hierarchical view' of results, enabling users to leverage the structure of the classifications/ontologies; the tools also allow users to upload genetic variant data directly, rather than requiring prior conversion to a gene list. The updated coding single-nucleotide polymorphisms (SNP) scoring tool uses an improved algorithm. The hidden Markov model (HMM) search tools now use HMMER3, dramatically reducing search times and improving accuracy of E-value statistics. Finally, the PANTHER Tree-Attribute Viewer has been implemented in JavaScript, with new views for exploring protein sequence evolution. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Fang, Hai; Knezevic, Bogdan; Burnham, Katie L; Knight, Julian C
2016-12-13
Biological interpretation of genomic summary data such as those resulting from genome-wide association studies (GWAS) and expression quantitative trait loci (eQTL) studies is one of the major bottlenecks in medical genomics research, calling for efficient and integrative tools to resolve this problem. We introduce eXploring Genomic Relations (XGR), an open source tool designed for enhanced interpretation of genomic summary data enabling downstream knowledge discovery. Targeting users of varying computational skills, XGR utilises prior biological knowledge and relationships in a highly integrated but easily accessible way to make user-input genomic summary datasets more interpretable. We show how by incorporating ontology, annotation, and systems biology network-driven approaches, XGR generates more informative results than conventional analyses. We apply XGR to GWAS and eQTL summary data to explore the genomic landscape of the activated innate immune response and common immunological diseases. We provide genomic evidence for a disease taxonomy supporting the concept of a disease spectrum from autoimmune to autoinflammatory disorders. We also show how XGR can define SNP-modulated gene networks and pathways that are shared and distinct between diseases, how it achieves functional, phenotypic and epigenomic annotations of genes and variants, and how it enables exploring annotation-based relationships between genetic variants. XGR provides a single integrated solution to enhance interpretation of genomic summary data for downstream biological discovery. XGR is released as both an R package and a web-app, freely available at http://galahad.well.ox.ac.uk/XGR .
2014-01-01
Background Variation in seed oil composition and content among soybean varieties is largely attributed to differences in transcript sequences and/or transcript accumulation of oil production related genes in seeds. Discovery and analysis of sequence and expression variations in these genes will accelerate soybean oil quality improvement. Results In an effort to identify these variations, we sequenced the transcriptomes of soybean seeds from nine lines varying in oil composition and/or total oil content. Our results showed that 69,338 distinct transcripts from 32,885 annotated genes were expressed in seeds. A total of 8,037 transcript expression polymorphisms and 50,485 transcript sequence polymorphisms (48,792 SNPs and 1,693 small Indels) were identified among the lines. Effects of the transcript polymorphisms on their encoded protein sequences and functions were predicted. The studies also provided independent evidence that the lack of FAD2-1A gene activity and a non-synonymous SNP in the coding sequence of FAB2C caused elevated oleic acid and stearic acid levels in soybean lines M23 and FAM94-41, respectively. Conclusions As a proof-of-concept, we developed an integrated RNA-seq and bioinformatics approach to identify and functionally annotate transcript polymorphisms, and demonstrated its high effectiveness for discovery of genetic and transcript variations that result in altered oil quality traits. The collection of transcript polymorphisms coupled with their predicted functional effects will be a valuable asset for further discovery of genes, gene variants, and functional markers to improve soybean oil quality. PMID:24755115
Transcriptome sequencing for high throughput SNP development and genetic mapping in Pea
2014-01-01
Background Pea has a complex genome of 4.3 Gb for which only limited genomic resources are available to date. Although SNP markers are now highly valuable for research and modern breeding, only a few are described and used in pea for genetic diversity and linkage analysis. Results We developed a large resource by cDNA sequencing of 8 genotypes representative of modern breeding material using the Roche 454 technology, combining both long reads (400 bp) and high coverage (3.8 million reads, reaching a total of 1,369 megabases). Sequencing data were assembled and generated a 68 K unigene set, from which 41 K were annotated from their best blast hit against the model species Medicago truncatula. Annotated contigs showed an even distribution along M. truncatula pseudochromosomes, suggesting a good representation of the pea genome. 10 K pea contigs were found to be polymorphic among the genetic material surveyed, corresponding to 35 K SNPs. We validated a subset of 1538 SNPs through the GoldenGate assay, proving their ability to structure a diversity panel of breeding germplasm. Among them, 1340 were genetically mapped and used to build a new consensus map comprising a total of 2070 markers. Based on blast analysis, we could establish 1252 bridges between our pea consensus map and the pseudochromosomes of M. truncatula, which provides new insight on synteny between the two species. Conclusions Our approach created significant new resources in pea, i.e. the most comprehensive genetic map to date tightly linked to the model species M. truncatula and a large SNP resource for both academic research and breeding. PMID:24521263
Fang, Lingzhao; Sahana, Goutam; Ma, Peipei; Su, Guosheng; Yu, Ying; Zhang, Shengli; Lund, Mogens Sandø; Sørensen, Peter
2017-05-12
A better understanding of the genetic architecture of complex traits can contribute to improve genomic prediction. We hypothesized that genomic variants associated with mastitis and milk production traits in dairy cattle are enriched in hepatic transcriptomic regions that are responsive to intra-mammary infection (IMI). Genomic markers [e.g. single nucleotide polymorphisms (SNPs)] from those regions, if included, may improve the predictive ability of a genomic model. We applied a genomic feature best linear unbiased prediction model (GFBLUP) to implement the above strategy by considering the hepatic transcriptomic regions responsive to IMI as genomic features. GFBLUP, an extension of GBLUP, includes a separate genomic effect of SNPs within a genomic feature, and allows differential weighting of the individual marker relationships in the prediction equation. Since GFBLUP is computationally intensive, we investigated whether a SNP set test could be a computationally fast way to preselect predictive genomic features. The SNP set test assesses the association between a genomic feature and a trait based on single-SNP genome-wide association studies. We applied these two approaches to mastitis and milk production traits (milk, fat and protein yield) in Holstein (HOL, n = 5056) and Jersey (JER, n = 1231) cattle. We observed that a majority of genomic features were enriched in genomic variants that were associated with mastitis and milk production traits. Compared to GBLUP, the accuracy of genomic prediction with GFBLUP was marginally improved (3.2 to 3.9%) in within-breed prediction. The highest increase (164.4%) in prediction accuracy was observed in across-breed prediction. The significance of genomic features based on the SNP set test were correlated with changes in prediction accuracy of GFBLUP (P < 0.05). GFBLUP provides a framework for integrating multiple layers of biological knowledge to provide novel insights into the biological basis of complex traits, and to improve the accuracy of genomic prediction. The SNP set test might be used as a first-step to improve GFBLUP models. Approaches like GFBLUP and SNP set test will become increasingly useful, as the functional annotations of genomes keep accumulating for a range of species and traits.
Alcina, Antonio; Fedetz, Maria; Fernández, Óscar; Saiz, Albert; Izquierdo, Guillermo; Lucas, Miguel; Leyva, Laura; García-León, Juan-Antonio; Abad-Grau, María del Mar; Alloza, Iraide; Antigüedad, Alfredo; Garcia-Barcina, María J; Vandenbroeck, Koen; Varadé, Jezabel; de la Hera, Belén; Arroyo, Rafael; Comabella, Manuel; Montalban, Xavier; Petit-Marty, Natalia; Navarro, Arcadi; Otaegui, David; Olascoaga, Javier; Blanco, Yolanda; Urcelay, Elena; Matesanz, Fuencisla
2013-01-01
Background and aim Several studies have highlighted the association of the 12q13.3–12q14.1 region with coeliac disease, type 1 diabetes, rheumatoid arthritis and multiple sclerosis (MS); however, the causal variants underlying diseases are still unclear. The authors sought to identify the functional variant of this region associated with MS. Methods Tag-single nucleotide polymorphism (SNP) analysis of the associated region encoding 15 genes was performed in 2876 MS patients and 2910 healthy Caucasian controls together with expression regulation analyses. Results rs6581155, which tagged 18 variants within a region where 9 genes map, was sufficient to model the association. This SNP was in total linkage disequilibrium (LD) with other polymorphisms that associated with the expression levels of FAM119B, AVIL, TSFM, TSPAN31 and CYP27B1 genes in different expression quantitative trait loci studies. Functional annotations from Encyclopedia of DNA Elements (ENCODE) showed that six out of these rs6581155-tagged-SNPs were located in regions with regulatory potential and only one of them, rs10877013, exhibited allele-dependent (ratio A/G=9.5-fold) and orientation-dependent (forward/reverse=2.7-fold) enhancer activity as determined by luciferase reporter assays. This enhancer is located in a region where a long-range chromatin interaction among the promoters and promoter-enhancer of several genes has been described, possibly affecting their expression simultaneously. Conclusions This study determines a functional variant which alters the enhancer activity of a regulatory element in the locus affecting the expression of several genes and explains the association of the 12q13.3–12q14.1 region with MS. PMID:23160276
Naithani, Sushma; Sullivan, Chris; Preece, Justin; Tiwari, Vijay K.; Elser, Justin; Leonard, Jeffrey M.; Sage, Abigail; Gresham, Cathy; Kerhornou, Arnaud; Bolser, Dan; McCarthy, Fiona; Kersey, Paul; Lazo, Gerard R.; Jaiswal, Pankaj
2014-01-01
Background Triticum monococcum (2n) is a close ancestor of T. urartu, the A-genome progenitor of cultivated hexaploid wheat, and is therefore a useful model for the study of components regulating photomorphogenesis in diploid wheat. In order to develop genetic and genomic resources for such a study, we constructed genome-wide transcriptomes of two Triticum monococcum subspecies, the wild winter wheat T. monococcum ssp. aegilopoides (accession G3116) and the domesticated spring wheat T. monococcum ssp. monococcum (accession DV92) by generating de novo assemblies of RNA-Seq data derived from both etiolated and green seedlings. Principal Findings The de novo transcriptome assemblies of DV92 and G3116 represent 120,911 and 117,969 transcripts, respectively. We successfully mapped ∼90% of these transcripts from each accession to barley and ∼95% of the transcripts to T. urartu genomes. However, only ∼77% transcripts mapped to the annotated barley genes and ∼85% transcripts mapped to the annotated T. urartu genes. Differential gene expression analyses revealed 22% more light up-regulated and 35% more light down-regulated transcripts in the G3116 transcriptome compared to DV92. The DV92 and G3116 mRNA sequence reads aligned against the reference barley genome led to the identification of ∼500,000 single nucleotide polymorphism (SNP) and ∼22,000 simple sequence repeat (SSR) sites. Conclusions De novo transcriptome assemblies of two accessions of the diploid wheat T. monococcum provide new empirical transcriptome references for improving Triticeae genome annotations, and insights into transcriptional programming during photomorphogenesis. The SNP and SSR sites identified in our analysis provide additional resources for the development of molecular markers. PMID:24821410
Fine-mapping identifies two additional breast cancer susceptibility loci at 9q31.2
Orr, Nick; Dudbridge, Frank; Dryden, Nicola; Maguire, Sarah; Novo, Daniela; Perrakis, Eleni; Johnson, Nichola; Ghoussaini, Maya; Hopper, John L.; Southey, Melissa C.; Apicella, Carmel; Stone, Jennifer; Schmidt, Marjanka K.; Broeks, Annegien; Van't Veer, Laura J.; Hogervorst, Frans B.; Fasching, Peter A.; Haeberle, Lothar; Ekici, Arif B.; Beckmann, Matthias W.; Gibson, Lorna; Aitken, Zoe; Warren, Helen; Sawyer, Elinor; Tomlinson, Ian; Kerin, Michael J.; Miller, Nicola; Burwinkel, Barbara; Marme, Frederik; Schneeweiss, Andreas; Sohn, Chistof; Guénel, Pascal; Truong, Thérèse; Cordina-Duverger, Emilie; Sanchez, Marie; Bojesen, Stig E.; Nordestgaard, Børge G.; Nielsen, Sune F.; Flyger, Henrik; Benitez, Javier; Zamora, Maria Pilar; Arias Perez, Jose Ignacio; Menéndez, Primitiva; Anton-Culver, Hoda; Neuhausen, Susan L.; Brenner, Hermann; Dieffenbach, Aida Karina; Arndt, Volker; Stegmaier, Christa; Hamann, Ute; Brauch, Hiltrud; Justenhoven, Christina; Brüning, Thomas; Ko, Yon-Dschun; Nevanlinna, Heli; Aittomäki, Kristiina; Blomqvist, Carl; Khan, Sofia; Bogdanova, Natalia; Dörk, Thilo; Lindblom, Annika; Margolin, Sara; Mannermaa, Arto; Kataja, Vesa; Kosma, Veli-Matti; Hartikainen, Jaana M.; Chenevix-Trench, Georgia; Beesley, Jonathan; Lambrechts, Diether; Moisse, Matthieu; Floris, Guiseppe; Beuselinck, Benoit; Chang-Claude, Jenny; Rudolph, Anja; Seibold, Petra; Flesch-Janys, Dieter; Radice, Paolo; Peterlongo, Paolo; Peissel, Bernard; Pensotti, Valeria; Couch, Fergus J.; Olson, Janet E.; Slettedahl, Seth; Vachon, Celine; Giles, Graham G.; Milne, Roger L.; McLean, Catriona; Haiman, Christopher A.; Henderson, Brian E.; Schumacher, Fredrick; Le Marchand, Loic; Simard, Jacques; Goldberg, Mark S.; Labrèche, France; Dumont, Martine; Kristensen, Vessela; Alnæs, Grethe Grenaker; Nord, Silje; Borresen-Dale, Anne-Lise; Zheng, Wei; Deming-Halverson, Sandra; Shrubsole, Martha; Long, Jirong; Winqvist, Robert; Pylkäs, Katri; Jukkola-Vuorinen, Arja; Grip, Mervi; Andrulis, Irene L.; Knight, Julia A.; Glendon, Gord; Tchatchou, Sandrine; Devilee, Peter; Tollenaar, Robertus A. E. M.; Seynaeve, Caroline M.; Van Asperen, Christi J.; Garcia-Closas, Montserrat; Figueroa, Jonine; Chanock, Stephen J.; Lissowska, Jolanta; Czene, Kamila; Darabi, Hatef; Eriksson, Mikael; Klevebring, Daniel; Hooning, Maartje J.; Hollestelle, Antoinette; van Deurzen, Carolien H. M.; Kriege, Mieke; Hall, Per; Li, Jingmei; Liu, Jianjun; Humphreys, Keith; Cox, Angela; Cross, Simon S.; Reed, Malcolm W. R.; Pharoah, Paul D. P.; Dunning, Alison M.; Shah, Mitul; Perkins, Barbara J.; Jakubowska, Anna; Lubinski, Jan; Jaworska-Bieniek, Katarzyna; Durda, Katarzyna; Ashworth, Alan; Swerdlow, Anthony; Jones, Michael; Schoemaker, Minouk J.; Meindl, Alfons; Schmutzler, Rita K.; Olswold, Curtis; Slager, Susan; Toland, Amanda E.; Yannoukakos, Drakoulis; Muir, Kenneth; Lophatananon, Artitaya; Stewart-Brown, Sarah; Siriwanarangsan, Pornthep; Matsuo, Keitaro; Ito, Hidema; Iwata, Hiroji; Ishiguro, Junko; Wu, Anna H.; Tseng, Chiu-chen; Van Den Berg, David; Stram, Daniel O.; Teo, Soo Hwang; Yip, Cheng Har; Kang, Peter; Ikram, Mohammad Kamran; Shu, Xiao-Ou; Lu, Wei; Gao, Yu-Tang; Cai, Hui; Kang, Daehee; Choi, Ji-Yeob; Park, Sue K.; Noh, Dong-Young; Hartman, Mikael; Miao, Hui; Lim, Wei Yen; Lee, Soo Chin; Sangrajrang, Suleeporn; Gaborieau, Valerie; Brennan, Paul; Mckay, James; Wu, Pei-Ei; Hou, Ming-Feng; Yu, Jyh-Cherng; Shen, Chen-Yang; Blot, William; Cai, Qiuyin; Signorello, Lisa B.; Luccarini, Craig; Bayes, Caroline; Ahmed, Shahana; Maranian, Mel; Healey, Catherine S.; González-Neira, Anna; Pita, Guillermo; Alonso, M. Rosario; Álvarez, Nuria; Herrero, Daniel; Tessier, Daniel C.; Vincent, Daniel; Bacot, Francois; Hunter, David J.; Lindstrom, Sara; Dennis, Joe; Michailidou, Kyriaki; Bolla, Manjeet K.; Easton, Douglas F.; dos Santos Silva, Isabel; Fletcher, Olivia; Peto, Julian
2015-01-01
We recently identified a novel susceptibility variant, rs865686, for estrogen-receptor positive breast cancer at 9q31.2. Here, we report a fine-mapping analysis of the 9q31.2 susceptibility locus using 43 160 cases and 42 600 controls of European ancestry ascertained from 52 studies and a further 5795 cases and 6624 controls of Asian ancestry from nine studies. Single nucleotide polymorphism (SNP) rs676256 was most strongly associated with risk in Europeans (odds ratios [OR] = 0.90 [0.88–0.92]; P-value = 1.58 × 10−25). This SNP is one of a cluster of highly correlated variants, including rs865686, that spans ∼14.5 kb. We identified two additional independent association signals demarcated by SNPs rs10816625 (OR = 1.12 [1.08–1.17]; P-value = 7.89 × 10−09) and rs13294895 (OR = 1.09 [1.06–1.12]; P-value = 2.97 × 10−11). SNP rs10816625, but not rs13294895, was also associated with risk of breast cancer in Asian individuals (OR = 1.12 [1.06–1.18]; P-value = 2.77 × 10−05). Functional genomic annotation using data derived from breast cancer cell-line models indicates that these SNPs localise to putative enhancer elements that bind known drivers of hormone-dependent breast cancer, including ER-α, FOXA1 and GATA-3. In vitro analyses indicate that rs10816625 and rs13294895 have allele-specific effects on enhancer activity and suggest chromatin interactions with the KLF4 gene locus. These results demonstrate the power of dense genotyping in large studies to identify independent susceptibility variants. Analysis of associations using subjects with different ancestry, combined with bioinformatic and genomic characterisation, can provide strong evidence for the likely causative alleles and their functional basis. PMID:25652398
Kogelman, Lisette J. A.; Pant, Sameer D.; Fredholm, Merete; Kadarmideen, Haja N.
2014-01-01
Obesity is a complex condition with world-wide exponentially rising prevalence rates, linked with severe diseases like Type 2 Diabetes. Economic and welfare consequences have led to a raised interest in a better understanding of the biological and genetic background. To date, whole genome investigations focusing on single genetic variants have achieved limited success, and the importance of including genetic interactions is becoming evident. Here, the aim was to perform an integrative genomic analysis in an F2 pig resource population that was constructed with an aim to maximize genetic variation of obesity-related phenotypes and genotyped using the 60K SNP chip. Firstly, Genome Wide Association (GWA) analysis was performed on the Obesity Index to locate candidate genomic regions that were further validated using combined Linkage Disequilibrium Linkage Analysis and investigated by evaluation of haplotype blocks. We built Weighted Interaction SNP Hub (WISH) and differentially wired (DW) networks using genotypic correlations amongst obesity-associated SNPs resulting from GWA analysis. GWA results and SNP modules detected by WISH and DW analyses were further investigated by functional enrichment analyses. The functional annotation of SNPs revealed several genes associated with obesity, e.g., NPC2 and OR4D10. Moreover, gene enrichment analyses identified several significantly associated pathways, over and above the GWA study results, that may influence obesity and obesity related diseases, e.g., metabolic processes. WISH networks based on genotypic correlations allowed further identification of various gene ontology terms and pathways related to obesity and related traits, which were not identified by the GWA study. In conclusion, this is the first study to develop a (genetic) obesity index and employ systems genetics in a porcine model to provide important insights into the complex genetic architecture associated with obesity and many biological pathways that underlie it. PMID:25071839
Fine-mapping identifies two additional breast cancer susceptibility loci at 9q31.2.
Orr, Nick; Dudbridge, Frank; Dryden, Nicola; Maguire, Sarah; Novo, Daniela; Perrakis, Eleni; Johnson, Nichola; Ghoussaini, Maya; Hopper, John L; Southey, Melissa C; Apicella, Carmel; Stone, Jennifer; Schmidt, Marjanka K; Broeks, Annegien; Van't Veer, Laura J; Hogervorst, Frans B; Fasching, Peter A; Haeberle, Lothar; Ekici, Arif B; Beckmann, Matthias W; Gibson, Lorna; Aitken, Zoe; Warren, Helen; Sawyer, Elinor; Tomlinson, Ian; Kerin, Michael J; Miller, Nicola; Burwinkel, Barbara; Marme, Frederik; Schneeweiss, Andreas; Sohn, Chistof; Guénel, Pascal; Truong, Thérèse; Cordina-Duverger, Emilie; Sanchez, Marie; Bojesen, Stig E; Nordestgaard, Børge G; Nielsen, Sune F; Flyger, Henrik; Benitez, Javier; Zamora, Maria Pilar; Arias Perez, Jose Ignacio; Menéndez, Primitiva; Anton-Culver, Hoda; Neuhausen, Susan L; Brenner, Hermann; Dieffenbach, Aida Karina; Arndt, Volker; Stegmaier, Christa; Hamann, Ute; Brauch, Hiltrud; Justenhoven, Christina; Brüning, Thomas; Ko, Yon-Dschun; Nevanlinna, Heli; Aittomäki, Kristiina; Blomqvist, Carl; Khan, Sofia; Bogdanova, Natalia; Dörk, Thilo; Lindblom, Annika; Margolin, Sara; Mannermaa, Arto; Kataja, Vesa; Kosma, Veli-Matti; Hartikainen, Jaana M; Chenevix-Trench, Georgia; Beesley, Jonathan; Lambrechts, Diether; Moisse, Matthieu; Floris, Guiseppe; Beuselinck, Benoit; Chang-Claude, Jenny; Rudolph, Anja; Seibold, Petra; Flesch-Janys, Dieter; Radice, Paolo; Peterlongo, Paolo; Peissel, Bernard; Pensotti, Valeria; Couch, Fergus J; Olson, Janet E; Slettedahl, Seth; Vachon, Celine; Giles, Graham G; Milne, Roger L; McLean, Catriona; Haiman, Christopher A; Henderson, Brian E; Schumacher, Fredrick; Le Marchand, Loic; Simard, Jacques; Goldberg, Mark S; Labrèche, France; Dumont, Martine; Kristensen, Vessela; Alnæs, Grethe Grenaker; Nord, Silje; Borresen-Dale, Anne-Lise; Zheng, Wei; Deming-Halverson, Sandra; Shrubsole, Martha; Long, Jirong; Winqvist, Robert; Pylkäs, Katri; Jukkola-Vuorinen, Arja; Grip, Mervi; Andrulis, Irene L; Knight, Julia A; Glendon, Gord; Tchatchou, Sandrine; Devilee, Peter; Tollenaar, Robertus A E M; Seynaeve, Caroline M; Van Asperen, Christi J; Garcia-Closas, Montserrat; Figueroa, Jonine; Chanock, Stephen J; Lissowska, Jolanta; Czene, Kamila; Darabi, Hatef; Eriksson, Mikael; Klevebring, Daniel; Hooning, Maartje J; Hollestelle, Antoinette; van Deurzen, Carolien H M; Kriege, Mieke; Hall, Per; Li, Jingmei; Liu, Jianjun; Humphreys, Keith; Cox, Angela; Cross, Simon S; Reed, Malcolm W R; Pharoah, Paul D P; Dunning, Alison M; Shah, Mitul; Perkins, Barbara J; Jakubowska, Anna; Lubinski, Jan; Jaworska-Bieniek, Katarzyna; Durda, Katarzyna; Ashworth, Alan; Swerdlow, Anthony; Jones, Michael; Schoemaker, Minouk J; Meindl, Alfons; Schmutzler, Rita K; Olswold, Curtis; Slager, Susan; Toland, Amanda E; Yannoukakos, Drakoulis; Muir, Kenneth; Lophatananon, Artitaya; Stewart-Brown, Sarah; Siriwanarangsan, Pornthep; Matsuo, Keitaro; Ito, Hidema; Iwata, Hiroji; Ishiguro, Junko; Wu, Anna H; Tseng, Chiu-Chen; Van Den Berg, David; Stram, Daniel O; Teo, Soo Hwang; Yip, Cheng Har; Kang, Peter; Ikram, Mohammad Kamran; Shu, Xiao-Ou; Lu, Wei; Gao, Yu-Tang; Cai, Hui; Kang, Daehee; Choi, Ji-Yeob; Park, Sue K; Noh, Dong-Young; Hartman, Mikael; Miao, Hui; Lim, Wei Yen; Lee, Soo Chin; Sangrajrang, Suleeporn; Gaborieau, Valerie; Brennan, Paul; Mckay, James; Wu, Pei-Ei; Hou, Ming-Feng; Yu, Jyh-Cherng; Shen, Chen-Yang; Blot, William; Cai, Qiuyin; Signorello, Lisa B; Luccarini, Craig; Bayes, Caroline; Ahmed, Shahana; Maranian, Mel; Healey, Catherine S; González-Neira, Anna; Pita, Guillermo; Alonso, M Rosario; Álvarez, Nuria; Herrero, Daniel; Tessier, Daniel C; Vincent, Daniel; Bacot, Francois; Hunter, David J; Lindstrom, Sara; Dennis, Joe; Michailidou, Kyriaki; Bolla, Manjeet K; Easton, Douglas F; dos Santos Silva, Isabel; Fletcher, Olivia; Peto, Julian
2015-05-15
We recently identified a novel susceptibility variant, rs865686, for estrogen-receptor positive breast cancer at 9q31.2. Here, we report a fine-mapping analysis of the 9q31.2 susceptibility locus using 43 160 cases and 42 600 controls of European ancestry ascertained from 52 studies and a further 5795 cases and 6624 controls of Asian ancestry from nine studies. Single nucleotide polymorphism (SNP) rs676256 was most strongly associated with risk in Europeans (odds ratios [OR] = 0.90 [0.88-0.92]; P-value = 1.58 × 10(-25)). This SNP is one of a cluster of highly correlated variants, including rs865686, that spans ∼14.5 kb. We identified two additional independent association signals demarcated by SNPs rs10816625 (OR = 1.12 [1.08-1.17]; P-value = 7.89 × 10(-09)) and rs13294895 (OR = 1.09 [1.06-1.12]; P-value = 2.97 × 10(-11)). SNP rs10816625, but not rs13294895, was also associated with risk of breast cancer in Asian individuals (OR = 1.12 [1.06-1.18]; P-value = 2.77 × 10(-05)). Functional genomic annotation using data derived from breast cancer cell-line models indicates that these SNPs localise to putative enhancer elements that bind known drivers of hormone-dependent breast cancer, including ER-α, FOXA1 and GATA-3. In vitro analyses indicate that rs10816625 and rs13294895 have allele-specific effects on enhancer activity and suggest chromatin interactions with the KLF4 gene locus. These results demonstrate the power of dense genotyping in large studies to identify independent susceptibility variants. Analysis of associations using subjects with different ancestry, combined with bioinformatic and genomic characterisation, can provide strong evidence for the likely causative alleles and their functional basis. © The Author 2015. Published by Oxford University Press.
Genome wide association study (GWAS) for grain yield in rice cultivated under water deficit.
Pantalião, Gabriel Feresin; Narciso, Marcelo; Guimarães, Cléber; Castro, Adriano; Colombari, José Manoel; Breseghello, Flavio; Rodrigues, Luana; Vianello, Rosana Pereira; Borba, Tereza Oliveira; Brondani, Claudio
2016-12-01
The identification of rice drought tolerant materials is crucial for the development of best performing cultivars for the upland cultivation system. This study aimed to identify markers and candidate genes associated with drought tolerance by Genome Wide Association Study analysis, in order to develop tools for use in rice breeding programs. This analysis was made with 175 upland rice accessions (Oryza sativa), evaluated in experiments with and without water restriction, and 150,325 SNPs. Thirteen SNP markers associated with yield under drought conditions were identified. Through stepwise regression analysis, eight SNP markers were selected and validated in silico, and when tested by PCR, two out of the eight SNP markers were able to identify a group of rice genotypes with higher productivity under drought. These results are encouraging for deriving markers for the routine analysis of marker assisted selection. From the drought experiment, including the genes inherited in linkage blocks, 50 genes were identified, from which 30 were annotated, and 10 were previously related to drought and/or abiotic stress tolerance, such as the transcription factors WRKY and Apetala2, and protein kinases.
Jiang, Qingling; Bao, Chenchang; Yang, Ya’nan; Liu, An; Liu, Fang; Huang, Huiyang; Ye, Haihui
2017-01-01
In crustaceans, muscle growth and development is complicated, and to date substantial knowledge gaps exist. In this study, the claw muscle, hepatopancreas and nervous tissue of the mud crab (Scylla paramamosain) were collected at three fattening stages for sequence by the Illumina sequencing. A total of 127.87 Gb clean data with no less than 3.94 Gb generated for each sample and the cycleQ30 percentages were more than 86.13% for all samples. De Bruijn assembly of these clean data produced 94,853 unigenes, thereinto, 50,059 unigenes were found in claw muscle. A total of 121 differentially expressed genes (DEGs) were revealed in claw muscle from the three fattening stages with a Padj value < 0.01, including 63 genes with annotation. Functional annotation and enrichment analysis showed that the DEGs clusters represented the predominant gene catalog with roles in biochemical processes (glycolysis, phosphorylation and regulation of transcription), molecular function (ATP binding, 6-phosphofructokinase activity, and sequence-specific DNA binding) and cellular component (6-phosphofructokinase complex, plasma membrane, and integral component of membrane). qRT-PCR was employed to further validate certain DEGs. Single nucleotide polymorphism (SNP) analysis obtained 159,322, 125,963 and 166,279 potential SNPs from the muscle transcriptome at stage B, stage C and stage D, respectively. In addition, there were sixteen neuropeptide transcripts being predicted in the claw muscle. The present study provides a comprehensive transcriptome of claw muscle of S. paramamosain during fattening, providing a basis for screening the functional genes that may affect muscle growth of S. paramamosain. PMID:29141033
Oiestad, A J; Martin, J M; Cook, J; Varella, A C; Giroux, M J
2017-07-01
The wheat stem sawfly (WSS) is an economically important pest of wheat in the Northern Great Plains. The primary means of WSS control is resistance associated with the single quantitative trait locus (QTL) , which controls most stem solidness variation. The goal of this study was to identify stem solidness candidate genes via RNA-seq. This study made use of 28 single nucleotide polymorphism (SNP) makers derived from expressed sequence tags (ESTs) linked to contained within a 5.13 cM region. Allele specific expression of EST markers was examined in stem tissue for solid and hollow-stemmed pairs of two spring wheat near isogenic lines (NILs) differing for the QTL. Of the 28 ESTs, 13 were located within annotated genes and 10 had detectable stem expression. Annotated genes corresponding to four of the ESTs were differentially expressed between solid and hollow-stemmed NILs and represent possible stem solidness gene candidates. Further examination of the 5.13 cM region containing the 28 EST markers identified 260 annotated genes. Twenty of the 260 linked genes were up-regulated in hollow NIL stems, while only seven genes were up-regulated in solid NIL stems. An -methyltransferase within the region of interest was identified as a candidate based on differential expression between solid and hollow-stemmed NILs and putative function. Further study of these candidate genes may lead to the identification of the gene(s) controlling stem solidness and an increased ability to select for wheat stem solidness and manage WSS. Copyright © 2017 Crop Science Society of America.
Common variants at the CHEK2 gene locus and risk of epithelial ovarian cancer
Lawrenson, Kate; Iversen, Edwin S.; Tyrer, Jonathan; Weber, Rachel Palmieri; Concannon, Patrick; Hazelett, Dennis J.; Li, Qiyuan; Marks, Jeffrey R.; Berchuck, Andrew; Lee, Janet M.; Aben, Katja K.H.; Anton-Culver, Hoda; Antonenkova, Natalia; Bandera, Elisa V.; Bean, Yukie; Beckmann, Matthias W.; Bisogna, Maria; Bjorge, Line; Bogdanova, Natalia; Brinton, Louise A.; Brooks-Wilson, Angela; Bruinsma, Fiona; Butzow, Ralf; Campbell, Ian G.; Carty, Karen; Chang-Claude, Jenny; Chenevix-Trench, Georgia; Chen, Ann; Chen, Zhihua; Cook, Linda S.; Cramer, Daniel W.; Cunningham, Julie M.; Cybulski, Cezary; Plisiecka-Halasa, Joanna; Dennis, Joe; Dicks, Ed; Doherty, Jennifer A.; Dörk, Thilo; du Bois, Andreas; Eccles, Diana; Easton, Douglas T.; Edwards, Robert P.; Eilber, Ursula; Ekici, Arif B.; Fasching, Peter A.; Fridley, Brooke L.; Gao, Yu-Tang; Gentry-Maharaj, Aleksandra; Giles, Graham G.; Glasspool, Rosalind; Goode, Ellen L.; Goodman, Marc T.; Gronwald, Jacek; Harter, Philipp; Hasmad, Hanis Nazihah; Hein, Alexander; Heitz, Florian; Hildebrandt, Michelle A.T.; Hillemanns, Peter; Hogdall, Estrid; Hogdall, Claus; Hosono, Satoyo; Jakubowska, Anna; Paul, James; Jensen, Allan; Karlan, Beth Y.; Kjaer, Susanne Kruger; Kelemen, Linda E.; Kellar, Melissa; Kelley, Joseph L.; Kiemeney, Lambertus A.; Krakstad, Camilla; Lambrechts, Diether; Lambrechts, Sandrina; Le, Nhu D.; Lee, Alice W.; Cannioto, Rikki; Leminen, Arto; Lester, Jenny; Levine, Douglas A.; Liang, Dong; Lissowska, Jolanta; Lu, Karen; Lubinski, Jan; Lundvall, Lene; Massuger, Leon F.A.G.; Matsuo, Keitaro; McGuire, Valerie; McLaughlin, John R.; Nevanlinna, Heli; McNeish, Iain; Menon, Usha; Modugno, Francesmary; Moysich, Kirsten B.; Narod, Steven A.; Nedergaard, Lotte; Ness, Roberta B.; Noor Azmi, Mat Adenan; Odunsi, Kunle; Olson, Sara H.; Orlow, Irene; Orsulic, Sandra; Pearce, Celeste L.; Pejovic, Tanja; Pelttari, Liisa M.; Permuth-Wey, Jennifer; Phelan, Catherine M.; Pike, Malcolm C.; Poole, Elizabeth M.; Ramus, Susan J.; Risch, Harvey A.; Rosen, Barry; Rossing, Mary Anne; Rothstein, Joseph H.; Rudolph, Anja; Runnebaum, Ingo B.; Rzepecka, Iwona K.; Salvesen, Helga B.; Budzilowska, Agnieszka; Sellers, Thomas A.; Shu, Xiao-Ou; Shvetsov, Yurii B.; Siddiqui, Nadeem; Sieh, Weiva; Song, Honglin; Southey, Melissa C.; Sucheston, Lara; Tangen, Ingvild L.; Teo, Soo-Hwang; Terry, Kathryn L.; Thompson, Pamela J.; Timorek, Agnieszka; Tworoger, Shelley S.; Nieuwenhuysen, Els Van; Vergote, Ignace; Vierkant, Robert A.; Wang-Gohrke, Shan; Walsh, Christine; Wentzensen, Nicolas; Whittemore, Alice S.; Wicklund, Kristine G.; Wilkens, Lynne R.; Woo, Yin-Ling; Wu, Xifeng; Wu, Anna H.; Yang, Hannah; Zheng, Wei; Ziogas, Argyrios; Coetzee, Gerhard A.; Freedman, Matthew L.; Monteiro, Alvaro N.A.; Moes-Sosnowska, Joanna; Kupryjanczyk, Jolanta; Pharoah, Paul D.; Gayther, Simon A.; Schildkraut, Joellen M.
2015-01-01
Genome-wide association studies have identified 20 genomic regions associated with risk of epithelial ovarian cancer (EOC), but many additional risk variants may exist. Here, we evaluated associations between common genetic variants [single nucleotide polymorphisms (SNPs) and indels] in DNA repair genes and EOC risk. We genotyped 2896 common variants at 143 gene loci in DNA samples from 15 397 patients with invasive EOC and controls. We found evidence of associations with EOC risk for variants at FANCA, EXO1, E2F4, E2F2, CREB5 and CHEK2 genes (P ≤ 0.001). The strongest risk association was for CHEK2 SNP rs17507066 with serous EOC (P = 4.74 x 10–7). Additional genotyping and imputation of genotypes from the 1000 genomes project identified a slightly more significant association for CHEK2 SNP rs6005807 (r 2 with rs17507066 = 0.84, odds ratio (OR) 1.17, 95% CI 1.11–1.24, P = 1.1×10−7). We identified 293 variants in the region with likelihood ratios of less than 1:100 for representing the causal variant. Functional annotation identified 25 candidate SNPs that alter transcription factor binding sites within regulatory elements active in EOC precursor tissues. In The Cancer Genome Atlas dataset, CHEK2 gene expression was significantly higher in primary EOCs compared to normal fallopian tube tissues (P = 3.72×10−8). We also identified an association between genotypes of the candidate causal SNP rs12166475 (r 2 = 0.99 with rs6005807) and CHEK2 expression (P = 2.70×10-8). These data suggest that common variants at 22q12.1 are associated with risk of serous EOC and CHEK2 as a plausible target susceptibility gene. PMID:26424751
Li, Ming; Luo, Xiong-jian; Landén, Mikael; Bergen, Sarah E.; Hultman, Christina M.; Li, Xiao; Zhang, Wen; Yao, Yong-Gang; Zhang, Chen; Liu, Jiewei; Mattheisen, Manuel; Cichon, Sven; Mühleisen, Thomas W.; Degenhardt, Franziska A.; Nöthen, Markus M.; Schulze, Thomas G.; Grigoroiu-Serbanescu, Maria; Li, Hao; Fuller, Chris K.; Chen, Chunhui; Dong, Qi; Chen, Chuansheng; Jamain, Stéphane; Leboyer, Marion; Bellivier, Frank; Etain, Bruno; Kahn, Jean-Pierre; Henry, Chantal; Preisig, Martin; Kutalik, Zoltán; Castelao, Enrique; Wright, Adam; Mitchell, Philip B.; Fullerton, Janice M.; Schofield, Peter R.; Montgomery, Grant W.; Medland, Sarah E.; Gordon, Scott D.; Martin, Nicholas G.; Rietschel, Marcella; Liu, Chunyu; Kleinman, Joel E.; Hyde, Thomas M.; Weinberger, Daniel R.; Su, Bing
2016-01-01
Summary Bipolar disorder (BPD) is a highly heritable polygenic disorder. Recent enrichment analyses suggest that there may be true risk variants for BPD among the expression quantitative trait loci (eQTL) in the brain. Aims We sought to assess the impact of eQTL variants on BPD risk by combining data from both BPD genome-wide association study (GWAS) and brain eQTL. Method To detect single-nucleotide polymorphisms (SNPs) that influence expression levels of genes associated with BPD, we jointly analyzed data from a BPD GWAS (7,481 cases and 9,250 controls) and a genome-wide brain (cortical) eQTL (193 healthy controls) using a Bayesian statistical method, with independent follow-up replications. The identified risk SNP was then further tested for association with hippocampal volume (N=5,775) and cognitive performance (N=342) among healthy subjects. Results Integrative analysis revealed a significant association between a brain eQTL rs6088662 in 20q11.22 and BPD (Log Bayes Factor=5.48; BPD p-val=5.85×10−5). Follow-up studies across multiple independent samples confirmed the association of the risk SNP (rs6088662) with gene expression and BPD susceptibility (p-val=3.54×10−8). Further exploratory analysis revealed that rs6088662 is also associated with hippocampal volume and cognitive performance in healthy subjects. Conclusions Our findings suggest that 20q11.22 is likely a risk region for BPD, highlighting the informativeness of integrating functional annotation of genetic variants for gene expression in advancing our understanding of the biological basis underlying complex diseases such as BPD. PMID:26338991
Li, Ming; Luo, Xiong-jian; Landén, Mikael; Bergen, Sarah E; Hultman, Christina M; Li, Xiao; Zhang, Wen; Yao, Yong-Gang; Zhang, Chen; Liu, Jiewei; Mattheisen, Manuel; Cichon, Sven; Mühleisen, Thomas W; Degenhardt, Franziska A; Nöthen, Markus M; Schulze, Thomas G; Grigoroiu-Serbanescu, Maria; Li, Hao; Fuller, Chris K; Chen, Chunhui; Dong, Qi; Chen, Chuansheng; Jamain, Stéphane; Leboyer, Marion; Bellivier, Frank; Etain, Bruno; Kahn, Jean-Pierre; Henry, Chantal; Preisig, Martin; Kutalik, Zoltán; Castelao, Enrique; Wright, Adam; Mitchell, Philip B; Fullerton, Janice M; Schofield, Peter R; Montgomery, Grant W; Medland, Sarah E; Gordon, Scott D; Martin, Nicholas G; Rietschel, Marcella; Liu, Chunyu; Kleinman, Joel E; Hyde, Thomas M; Weinberger, Daniel R; Su, Bing
2016-02-01
Bipolar disorder is a highly heritable polygenic disorder. Recent enrichment analyses suggest that there may be true risk variants for bipolar disorder in the expression quantitative trait loci (eQTL) in the brain. We sought to assess the impact of eQTL variants on bipolar disorder risk by combining data from both bipolar disorder genome-wide association studies (GWAS) and brain eQTL. To detect single nucleotide polymorphisms (SNPs) that influence expression levels of genes associated with bipolar disorder, we jointly analysed data from a bipolar disorder GWAS (7481 cases and 9250 controls) and a genome-wide brain (cortical) eQTL (193 healthy controls) using a Bayesian statistical method, with independent follow-up replications. The identified risk SNP was then further tested for association with hippocampal volume (n = 5775) and cognitive performance (n = 342) among healthy individuals. Integrative analysis revealed a significant association between a brain eQTL rs6088662 on chromosome 20q11.22 and bipolar disorder (log Bayes factor = 5.48; bipolar disorder P = 5.85 × 10(-5)). Follow-up studies across multiple independent samples confirmed the association of the risk SNP (rs6088662) with gene expression and bipolar disorder susceptibility (P = 3.54 × 10(-8)). Further exploratory analysis revealed that rs6088662 is also associated with hippocampal volume and cognitive performance in healthy individuals. Our findings suggest that 20q11.22 is likely a risk region for bipolar disorder; they also highlight the informative value of integrating functional annotation of genetic variants for gene expression in advancing our understanding of the biological basis underlying complex disorders, such as bipolar disorder. © The Royal College of Psychiatrists 2016.
Cingolani, Pablo; Patel, Viral M.; Coon, Melissa; Nguyen, Tung; Land, Susan J.; Ruden, Douglas M.; Lu, Xiangyi
2012-01-01
This paper describes a new program SnpSift for filtering differential DNA sequence variants between two or more experimental genomes after genotoxic chemical exposure. Here, we illustrate how SnpSift can be used to identify candidate phenotype-relevant variants including single nucleotide polymorphisms, multiple nucleotide polymorphisms, insertions, and deletions (InDels) in mutant strains isolated from genome-wide chemical mutagenesis of Drosophila melanogaster. First, the genomes of two independently isolated mutant fly strains that are allelic for a novel recessive male-sterile locus generated by genotoxic chemical exposure were sequenced using the Illumina next-generation DNA sequencer to obtain 20- to 29-fold coverage of the euchromatic sequences. The sequencing reads were processed and variants were called using standard bioinformatic tools. Next, SnpEff was used to annotate all sequence variants and their potential mutational effects on associated genes. Then, SnpSift was used to filter and select differential variants that potentially disrupt a common gene in the two allelic mutant strains. The potential causative DNA lesions were partially validated by capillary sequencing of polymerase chain reaction-amplified DNA in the genetic interval as defined by meiotic mapping and deletions that remove defined regions of the chromosome. Of the five candidate genes located in the genetic interval, the Pka-like gene CG12069 was found to carry a separate pre-mature stop codon mutation in each of the two allelic mutants whereas the other four candidate genes within the interval have wild-type sequences. The Pka-like gene is therefore a strong candidate gene for the male-sterile locus. These results demonstrate that combining SnpEff and SnpSift can expedite the identification of candidate phenotype-causative mutations in chemically mutagenized Drosophila strains. This technique can also be used to characterize the variety of mutations generated by genotoxic chemicals. PMID:22435069
BEACON: automated tool for Bacterial GEnome Annotation ComparisON.
Kalkatawi, Manal; Alam, Intikhab; Bajic, Vladimir B
2015-08-18
Genome annotation is one way of summarizing the existing knowledge about genomic characteristics of an organism. There has been an increased interest during the last several decades in computer-based structural and functional genome annotation. Many methods for this purpose have been developed for eukaryotes and prokaryotes. Our study focuses on comparison of functional annotations of prokaryotic genomes. To the best of our knowledge there is no fully automated system for detailed comparison of functional genome annotations generated by different annotation methods (AMs). The presence of many AMs and development of new ones introduce needs to: a/ compare different annotations for a single genome, and b/ generate annotation by combining individual ones. To address these issues we developed an Automated Tool for Bacterial GEnome Annotation ComparisON (BEACON) that benefits both AM developers and annotation analysers. BEACON provides detailed comparison of gene function annotations of prokaryotic genomes obtained by different AMs and generates extended annotations through combination of individual ones. For the illustration of BEACON's utility, we provide a comparison analysis of multiple different annotations generated for four genomes and show on these examples that the extended annotation can increase the number of genes annotated by putative functions up to 27%, while the number of genes without any function assignment is reduced. We developed BEACON, a fast tool for an automated and a systematic comparison of different annotations of single genomes. The extended annotation assigns putative functions to many genes with unknown functions. BEACON is available under GNU General Public License version 3.0 and is accessible at: http://www.cbrc.kaust.edu.sa/BEACON/ .
2011-01-01
Background Most information on genomic variations and their associations with phenotypes are covered exclusively in scientific publications rather than in structured databases. These texts commonly describe variations using natural language; database identifiers are seldom mentioned. This complicates the retrieval of variations, associated articles, as well as information extraction, e. g. the search for biological implications. To overcome these challenges, procedures to map textual mentions of variations to database identifiers need to be developed. Results This article describes a workflow for normalization of variation mentions, i.e. the association of them to unique database identifiers. Common pitfalls in the interpretation of single nucleotide polymorphism (SNP) mentions are highlighted and discussed. The developed normalization procedure achieves a precision of 98.1 % and a recall of 67.5% for unambiguous association of variation mentions with dbSNP identifiers on a text corpus based on 296 MEDLINE abstracts containing 527 mentions of SNPs. The annotated corpus is freely available at http://www.scai.fraunhofer.de/snp-normalization-corpus.html. Conclusions Comparable approaches usually focus on variations mentioned on the protein sequence and neglect problems for other SNP mentions. The results presented here indicate that normalizing SNPs described on DNA level is more difficult than the normalization of SNPs described on protein level. The challenges associated with normalization are exemplified with ambiguities and errors, which occur in this corpus. PMID:21992066
Ferchaud, Anne-Laure; Pedersen, Susanne H; Bekkevold, Dorte; Jian, Jianbo; Niu, Yongchao; Hansen, Michael M
2014-10-06
The threespine stickleback (Gasterosteus aculeatus) has become an important model species for studying both contemporary and parallel evolution. In particular, differential adaptation to freshwater and marine environments has led to high differentiation between freshwater and marine stickleback populations at the phenotypic trait of lateral plate morphology and the underlying candidate gene Ectodysplacin (EDA). Many studies have focused on this trait and candidate gene, although other genes involved in marine-freshwater adaptation may be equally important. In order to develop a resource for rapid and cost efficient analysis of genetic divergence between freshwater and marine sticklebacks, we generated a low-density SNP (Single Nucleotide Polymorphism) array encompassing markers of chromosome regions under putative directional selection, along with neutral markers for background. RAD (Restriction site Associated DNA) sequencing of sixty individuals representing two freshwater and one marine population led to the identification of 33,993 SNP markers. Ninety-six of these were chosen for the low-density SNP array, among which 70 represented SNPs under putatively directional selection in freshwater vs. marine environments, whereas 26 SNPs were assumed to be neutral. Annotation of these regions revealed several genes that are candidates for affecting stickleback phenotypic variation, some of which have been observed in previous studies whereas others are new. We have developed a cost-efficient low-density SNP array that allows for rapid screening of polymorphisms in threespine stickleback. The array provides a valuable tool for analyzing adaptive divergence between freshwater and marine stickleback populations beyond the well-established candidate gene Ectodysplacin (EDA).
Genome-wide comparative analysis of four Indian Drosophila species.
Mohanty, Sujata; Khanna, Radhika
2017-12-01
Comparative analysis of multiple genomes of closely or distantly related Drosophila species undoubtedly creates excitement among evolutionary biologists in exploring the genomic changes with an ecology and evolutionary perspective. We present herewith the de novo assembled whole genome sequences of four Drosophila species, D. bipectinata, D. takahashii, D. biarmipes and D. nasuta of Indian origin using Next Generation Sequencing technology on an Illumina platform along with their detailed assembly statistics. The comparative genomics analysis, e.g. gene predictions and annotations, functional and orthogroup analysis of coding sequences and genome wide SNP distribution were performed. The whole genome of Zaprionus indianus of Indian origin published earlier by us and the genome sequences of previously sequenced 12 Drosophila species available in the NCBI database were included in the analysis. The present work is a part of our ongoing genomics project of Indian Drosophila species.
Kaya, Hilal Betul; Cetin, Oznur; Kaya, Hulya; Sahin, Mustafa; Sefer, Filiz; Kahraman, Abdullah; Tanyolac, Bahattin
2013-01-01
Background The olive tree (Olea europaea L.) is a diploid (2n = 2x = 46) outcrossing species mainly grown in the Mediterranean area, where it is the most important oil-producing crop. Because of its economic, cultural and ecological importance, various DNA markers have been used in the olive to characterize and elucidate homonyms, synonyms and unknown accessions. However, a comprehensive characterization and a full sequence of its transcriptome are unavailable, leading to the importance of an efficient large-scale single nucleotide polymorphism (SNP) discovery in olive. The objectives of this study were (1) to discover olive SNPs using next-generation sequencing and to identify SNP primers for cultivar identification and (2) to characterize 96 olive genotypes originating from different regions of Turkey. Methodology/Principal Findings Next-generation sequencing technology was used with five distinct olive genotypes and generated cDNA, producing 126,542,413 reads using an Illumina Genome Analyzer IIx. Following quality and size trimming, the high-quality reads were assembled into 22,052 contigs with an average length of 1,321 bases and 45 singletons. The SNPs were filtered and 2,987 high-quality putative SNP primers were identified. The assembled sequences and singletons were subjected to BLAST similarity searches and annotated with a Gene Ontology identifier. To identify the 96 olive genotypes, these SNP primers were applied to the genotypes in combination with amplified fragment length polymorphism (AFLP) and simple sequence repeats (SSR) markers. Conclusions/Significance This study marks the highest number of SNP markers discovered to date from olive genotypes using transcriptome sequencing. The developed SNP markers will provide a useful source for molecular genetic studies, such as genetic diversity and characterization, high density quantitative trait locus (QTL) analysis, association mapping and map-based gene cloning in the olive. High levels of genetic variation among Turkish olive genotypes revealed by SNPs, AFLPs and SSRs allowed us to characterize the Turkish olive genotype. PMID:24058483
USDA-ARS?s Scientific Manuscript database
This study was conducted as an initial assessment of a newly available genotyping assay containing about 34,000 common SNP included on previous SNP chips, and 199,000 sequence variants predicted to affect gene function. Objectives were to identify functional variants associated with birth weight in...
Wang, Yanan; Tang, Zhonglin; Sun, Yaqi; Wang, Hongyang; Wang, Chao; Yu, Shaobo; Liu, Jing; Zhang, Yu; Fan, Bin; Li, Kui; Liu, Bang
2014-01-01
Copy number variations (CNVs) represent a substantial source of structural variants in mammals and contribute to both normal phenotypic variability and disease susceptibility. Although low-resolution CNV maps are produced in many domestic animals, and several reports have been published about the CNVs of porcine genome, the differences between Chinese and western pigs still remain to be elucidated. In this study, we used Porcine SNP60 BeadChip and PennCNV algorithm to perform a genome-wide CNV detection in 302 individuals from six Chinese indigenous breeds (Tongcheng, Laiwu, Luchuan, Bama, Wuzhishan and Ningxiang pigs), three western breeds (Yorkshire, Landrace and Duroc) and one hybrid (Tongcheng×Duroc). A total of 348 CNV Regions (CNVRs) across genome were identified, covering 150.49 Mb of the pig genome or 6.14% of the autosomal genome sequence. In these CNVRs, 213 CNVRs were found to exist only in the six Chinese indigenous breeds, and 60 CNVRs only in the three western breeds. The characters of CNVs in four Chinese normal size breeds (Luchuan, Tongcheng and Laiwu pigs) and two minipig breeds (Bama and Wuzhishan pigs) were also analyzed in this study. Functional annotation suggested that these CNVRs possess a great variety of molecular function and may play important roles in phenotypic and production traits between Chinese and western breeds. Our results are important complementary to the CNV map in pig genome, which provide new information about the diversity of Chinese and western pig breeds, and facilitate further research on porcine genome CNVs.
Sun, Yaqi; Wang, Hongyang; Wang, Chao; Yu, Shaobo; Liu, Jing; Zhang, Yu; Fan, Bin; Li, Kui; Liu, Bang
2014-01-01
Copy number variations (CNVs) represent a substantial source of structural variants in mammals and contribute to both normal phenotypic variability and disease susceptibility. Although low-resolution CNV maps are produced in many domestic animals, and several reports have been published about the CNVs of porcine genome, the differences between Chinese and western pigs still remain to be elucidated. In this study, we used Porcine SNP60 BeadChip and PennCNV algorithm to perform a genome-wide CNV detection in 302 individuals from six Chinese indigenous breeds (Tongcheng, Laiwu, Luchuan, Bama, Wuzhishan and Ningxiang pigs), three western breeds (Yorkshire, Landrace and Duroc) and one hybrid (Tongcheng×Duroc). A total of 348 CNV Regions (CNVRs) across genome were identified, covering 150.49 Mb of the pig genome or 6.14% of the autosomal genome sequence. In these CNVRs, 213 CNVRs were found to exist only in the six Chinese indigenous breeds, and 60 CNVRs only in the three western breeds. The characters of CNVs in four Chinese normal size breeds (Luchuan, Tongcheng and Laiwu pigs) and two minipig breeds (Bama and Wuzhishan pigs) were also analyzed in this study. Functional annotation suggested that these CNVRs possess a great variety of molecular function and may play important roles in phenotypic and production traits between Chinese and western breeds. Our results are important complementary to the CNV map in pig genome, which provide new information about the diversity of Chinese and western pig breeds, and facilitate further research on porcine genome CNVs. PMID:25198154
ESTree db: a Tool for Peach Functional Genomics
Lazzari, Barbara; Caprera, Andrea; Vecchietti, Alberto; Stella, Alessandra; Milanesi, Luciano; Pozzi, Carlo
2005-01-01
Background The ESTree db represents a collection of Prunus persica expressed sequenced tags (ESTs) and is intended as a resource for peach functional genomics. A total of 6,155 successful EST sequences were obtained from four in-house prepared cDNA libraries from Prunus persica mesocarps at different developmental stages. Another 12,475 peach EST sequences were downloaded from public databases and added to the ESTree db. An automated pipeline was prepared to process EST sequences using public software integrated by in-house developed Perl scripts and data were collected in a MySQL database. A php-based web interface was developed to query the database. Results The ESTree db version as of April 2005 encompasses 18,630 sequences representing eight libraries. Contig assembly was performed with CAP3. Putative single nucleotide polymorphism (SNP) detection was performed with the AutoSNP program and a search engine was implemented to retrieve results. All the sequences and all the contig consensus sequences were annotated both with blastx against the GenBank nr db and with GOblet against the viridiplantae section of the Gene Ontology db. Links to NiceZyme (Expasy) and to the KEGG metabolic pathways were provided. A local BLAST utility is available. A text search utility allows querying and browsing the database. Statistics were provided on Gene Ontology occurrences to assign sequences to Gene Ontology categories. Conclusion The resulting database is a comprehensive resource of data and links related to peach EST sequences. The Sequence Report and Contig Report pages work as the web interface core structures, giving quick access to data related to each sequence/contig. PMID:16351742
ESTree db: a tool for peach functional genomics.
Lazzari, Barbara; Caprera, Andrea; Vecchietti, Alberto; Stella, Alessandra; Milanesi, Luciano; Pozzi, Carlo
2005-12-01
The ESTree db http://www.itb.cnr.it/estree/ represents a collection of Prunus persica expressed sequenced tags (ESTs) and is intended as a resource for peach functional genomics. A total of 6,155 successful EST sequences were obtained from four in-house prepared cDNA libraries from Prunus persica mesocarps at different developmental stages. Another 12,475 peach EST sequences were downloaded from public databases and added to the ESTree db. An automated pipeline was prepared to process EST sequences using public software integrated by in-house developed Perl scripts and data were collected in a MySQL database. A php-based web interface was developed to query the database. The ESTree db version as of April 2005 encompasses 18,630 sequences representing eight libraries. Contig assembly was performed with CAP3. Putative single nucleotide polymorphism (SNP) detection was performed with the AutoSNP program and a search engine was implemented to retrieve results. All the sequences and all the contig consensus sequences were annotated both with blastx against the GenBank nr db and with GOblet against the viridiplantae section of the Gene Ontology db. Links to NiceZyme (Expasy) and to the KEGG metabolic pathways were provided. A local BLAST utility is available. A text search utility allows querying and browsing the database. Statistics were provided on Gene Ontology occurrences to assign sequences to Gene Ontology categories. The resulting database is a comprehensive resource of data and links related to peach EST sequences. The Sequence Report and Contig Report pages work as the web interface core structures, giving quick access to data related to each sequence/contig.
Kent, Jack W
2016-02-03
New technologies for acquisition of genomic data, while offering unprecedented opportunities for genetic discovery, also impose severe burdens of interpretation and penalties for multiple testing. The Pathway-based Analyses Group of the Genetic Analysis Workshop 19 (GAW19) sought reduction of multiple-testing burden through various approaches to aggregation of highdimensional data in pathways informed by prior biological knowledge. Experimental methods testedincluded the use of "synthetic pathways" (random sets of genes) to estimate power and false-positive error rate of methods applied to simulated data; data reduction via independent components analysis, single-nucleotide polymorphism (SNP)-SNP interaction, and use of gene sets to estimate genetic similarity; and general assessment of the efficacy of prior biological knowledge to reduce the dimensionality of complex genomic data. The work of this group explored several promising approaches to managing high-dimensional data, with the caveat that these methods are necessarily constrained by the quality of external bioinformatic annotation.
Identification of Candidate Transcription Factor Binding Sites in the Cattle Genome
Bickhart, Derek M.; Liu, George E.
2013-01-01
A resource that provides candidate transcription factor binding sites (TFBSs) does not currently exist for cattle. Such data is necessary, as predicted sites may serve as excellent starting locations for future omics studies to develop transcriptional regulation hypotheses. In order to generate this resource, we employed a phylogenetic footprinting approach—using sequence conservation across cattle, human and dog—and position-specific scoring matrices to identify 379,333 putative TFBSs upstream of nearly 8000 Mammalian Gene Collection (MGC) annotated genes within the cattle genome. Comparisons of our predictions to known binding site loci within the PCK1, ACTA1 and G6PC promoter regions revealed 75% sensitivity for our method of discovery. Additionally, we intersected our predictions with known cattle SNP variants in dbSNP and on the Illumina BovineHD 770k and Bos 1 SNP chips, finding 7534, 444 and 346 overlaps, respectively. Due to our stringent filtering criteria, these results represent high quality predictions of putative TFBSs within the cattle genome. All binding site predictions are freely available at http://bfgl.anri.barc.usda.gov/BovineTFBS/ or http://199.133.54.77/BovineTFBS. PMID:23433959
Thyssen, Gregory N; Fang, David D; Turley, Rickie B; Florane, Christopher; Li, Ping; Naoumkina, Marina
2015-09-01
Mapping-by-sequencing and SNP marker analysis were used to fine map the Ligon-lintless-1 ( Li 1 ) short fiber mutation in tetraploid cotton to a 255-kb region that contains 16 annotated proteins. The Ligon-lintless-1 (Li 1 ) mutant of cotton (Gossypium hirsutum L.) has been studied as a model for cotton fiber development since its identification in 1929; however, the causative mutation has not been identified yet. Here we report the fine genetic mapping of the mutation to a 255-kb region that contains only 16 annotated genes in the reference Gossypium raimondii genome. We took advantage of the incompletely dominant dwarf vegetative phenotype to identify 100 mutants (Li 1 /Li 1 ) and 100 wild-type (li 1 /li 1 ) homozygotes from a mapping population of 2567 F2 plants, which we bulked and deep sequenced. Since only homozygotes were sequenced, we were able to use a high stringency in SNP calling to rapidly narrow down the region harboring the Li 1 locus, and designed subgenome-specific SNP markers to test the population. We characterized the expression of all sixteen genes in the region by RNA sequencing of elongating fibers and by RT-qPCR at seven time points spanning fiber development. One of the most highly expressed genes found in this interval in wild-type fiber cells is 40-fold under-expressed at the day of anthesis (DOA) in the mutant fiber cells. This gene is a major facilitator superfamily protein, part of the large family of proteins that includes auxin and sugar transporters. Interestingly, nearly all genes in this region were most highly expressed at DOA and showed a high degree of co-expression. Further characterization is required to determine if transport of hormones or carbohydrates is involved in both the dwarf and lintless phenotypes of Li 1 plants.
Variation analysis and gene annotation of eight MHC haplotypes: The MHC Haplotype Project
Horton, Roger; Gibson, Richard; Coggill, Penny; Miretti, Marcos; Allcock, Richard J.; Almeida, Jeff; Forbes, Simon; Gilbert, James G. R.; Halls, Karen; Harrow, Jennifer L.; Hart, Elizabeth; Howe, Kevin; Jackson, David K.; Palmer, Sophie; Roberts, Anne N.; Sims, Sarah; Stewart, C. Andrew; Traherne, James A.; Trevanion, Steve; Wilming, Laurens; Rogers, Jane; de Jong, Pieter J.; Elliott, John F.; Sawcer, Stephen; Todd, John A.; Trowsdale, John
2008-01-01
The human major histocompatibility complex (MHC) is contained within about 4 Mb on the short arm of chromosome 6 and is recognised as the most variable region in the human genome. The primary aim of the MHC Haplotype Project was to provide a comprehensively annotated reference sequence of a single, human leukocyte antigen-homozygous MHC haplotype and to use it as a basis against which variations could be assessed from seven other similarly homozygous cell lines, representative of the most common MHC haplotypes in the European population. Comparison of the haplotype sequences, including four haplotypes not previously analysed, resulted in the identification of >44,000 variations, both substitutions and indels (insertions and deletions), which have been submitted to the dbSNP database. The gene annotation uncovered haplotype-specific differences and confirmed the presence of more than 300 loci, including over 160 protein-coding genes. Combined analysis of the variation and annotation datasets revealed 122 gene loci with coding substitutions of which 97 were non-synonymous. The haplotype (A3-B7-DR15; PGF cell line) designated as the new MHC reference sequence, has been incorporated into the human genome assembly (NCBI35 and subsequent builds), and constitutes the largest single-haplotype sequence of the human genome to date. The extensive variation and annotation data derived from the analysis of seven further haplotypes have been made publicly available and provide a framework and resource for future association studies of all MHC-associated diseases and transplant medicine. PMID:18193213
GarlicESTdb: an online database and mining tool for garlic EST sequences.
Kim, Dae-Won; Jung, Tae-Sung; Nam, Seong-Hyeuk; Kwon, Hyuk-Ryul; Kim, Aeri; Chae, Sung-Hwa; Choi, Sang-Haeng; Kim, Dong-Wook; Kim, Ryong Nam; Park, Hong-Seog
2009-05-18
Allium sativum., commonly known as garlic, is a species in the onion genus (Allium), which is a large and diverse one containing over 1,250 species. Its close relatives include chives, onion, leek and shallot. Garlic has been used throughout recorded history for culinary, medicinal use and health benefits. Currently, the interest in garlic is highly increasing due to nutritional and pharmaceutical value including high blood pressure and cholesterol, atherosclerosis and cancer. For all that, there are no comprehensive databases available for Expressed Sequence Tags(EST) of garlic for gene discovery and future efforts of genome annotation. That is why we developed a new garlic database and applications to enable comprehensive analysis of garlic gene expression. GarlicESTdb is an integrated database and mining tool for large-scale garlic (Allium sativum) EST sequencing. A total of 21,595 ESTs collected from an in-house cDNA library were used to construct the database. The analysis pipeline is an automated system written in JAVA and consists of the following components: automatic preprocessing of EST reads, assembly of raw sequences, annotation of the assembled sequences, storage of the analyzed information into MySQL databases, and graphic display of all processed data. A web application was implemented with the latest J2EE (Java 2 Platform Enterprise Edition) software technology (JSP/EJB/JavaServlet) for browsing and querying the database, for creation of dynamic web pages on the client side, and for mapping annotated enzymes to KEGG pathways, the AJAX framework was also used partially. The online resources, such as putative annotation, single nucleotide polymorphisms (SNP) and tandem repeat data sets, can be searched by text, explored on the website, searched using BLAST, and downloaded. To archive more significant BLAST results, a curation system was introduced with which biologists can easily edit best-hit annotation information for others to view. The GarlicESTdb web application is freely available at http://garlicdb.kribb.re.kr. GarlicESTdb is the first incorporated online information database of EST sequences isolated from garlic that can be freely accessed and downloaded. It has many useful features for interactive mining of EST contigs and datasets from each library, including curation of annotated information, expression profiling, information retrieval, and summary of statistics of functional annotation. Consequently, the development of GarlicESTdb will provide a crucial contribution to biologists for data-mining and more efficient experimental studies.
SNP-RFLPing 2: an updated and integrated PCR-RFLP tool for SNP genotyping.
Chang, Hsueh-Wei; Cheng, Yu-Huei; Chuang, Li-Yeh; Yang, Cheng-Hong
2010-04-08
PCR-restriction fragment length polymorphism (RFLP) assay is a cost-effective method for SNP genotyping and mutation detection, but the manual mining for restriction enzyme sites is challenging and cumbersome. Three years after we constructed SNP-RFLPing, a freely accessible database and analysis tool for restriction enzyme mining of SNPs, significant improvements over the 2006 version have been made and incorporated into the latest version, SNP-RFLPing 2. The primary aim of SNP-RFLPing 2 is to provide comprehensive PCR-RFLP information with multiple functionality about SNPs, such as SNP retrieval to multiple species, different polymorphism types (bi-allelic, tri-allelic, tetra-allelic or indels), gene-centric searching, HapMap tagSNPs, gene ontology-based searching, miRNAs, and SNP500Cancer. The RFLP restriction enzymes and the corresponding PCR primers for the natural and mutagenic types of each SNP are simultaneously analyzed. All the RFLP restriction enzyme prices are also provided to aid selection. Furthermore, the previously encountered updating problems for most SNP related databases are resolved by an on-line retrieval system. The user interfaces for functional SNP analyses have been substantially improved and integrated. SNP-RFLPing 2 offers a new and user-friendly interface for RFLP genotyping that can be used in association studies and is freely available at http://bio.kuas.edu.tw/snp-rflping2.
Huo, Dezheng; Feng, Ye; Haddad, Stephen; Zheng, Yonglan; Yao, Song; Han, Yoo-Jeong; Ogundiran, Temidayo O; Adebamowo, Clement; Ojengbede, Oladosu; Falusi, Adeyinka G; Zheng, Wei; Blot, William; Cai, Qiuyin; Signorello, Lisa; John, Esther M; Bernstein, Leslie; Hu, Jennifer J; Ziegler, Regina G; Nyante, Sarah; Bandera, Elisa V; Ingles, Sue A; Press, Michael F; Deming, Sandra L; Rodriguez-Gil, Jorge L; Nathanson, Katherine L; Domchek, Susan M; Rebbeck, Timothy R; Ruiz-Narváez, Edward A; Sucheston-Campbell, Lara E; Bensen, Jeannette T; Simon, Michael S; Hennis, Anselm; Nemesure, Barbara; Leske, M Cristina; Ambs, Stefan; Chen, Lin S; Qian, Frank; Gamazon, Eric R; Lunetta, Kathryn L; Cox, Nancy J; Chanock, Stephen J; Kolonel, Laurence N; Olshan, Andrew F; Ambrosone, Christine B; Olopade, Olufunmilayo I; Palmer, Julie R; Haiman, Christopher A
2016-11-01
Multiple breast cancer loci have been identified in previous genome-wide association studies, but they were mainly conducted in populations of European ancestry. Women of African ancestry are more likely to have young-onset and oestrogen receptor (ER) negative breast cancer for reasons that are unknown and understudied. To identify genetic risk factors for breast cancer in women of African descent, we conducted a meta-analysis of two genome-wide association studies of breast cancer; one study consists of 1,657 cases and 2,029 controls genotyped with Illumina’s HumanOmni2.5 BeadChip and the other study included 3,016 cases and 2,745 controls genotyped using Illumina Human1M-Duo BeadChip. The top 18,376 single nucleotide polymorphisms (SNP) from the meta-analysis were replicated in the third study that consists of 1,984 African Americans cases and 2,939 controls. We found that SNP rs13074711, 26.5 Kb upstream of TNFSF10 at 3q26.21, was significantly associated with risk of oestrogen receptor (ER)-negative breast cancer (odds ratio [OR]=1.29, 95% CI: 1.18-1.40; P = 1.8 × 10 − 8). Functional annotations suggest that the TNFSF10 gene may be involved in breast cancer aetiology, but further functional experiments are needed. In addition, we confirmed SNP rs10069690 was the best indicator for ER-negative breast cancer at 5p15.33 (OR = 1.30; P = 2.4 × 10 − 10) and identified rs12998806 as the best indicator for ER-positive breast cancer at 2q35 (OR = 1.34; P = 2.2 × 10 − 8) for women of African ancestry. These findings demonstrated additional susceptibility alleles for breast cancer can be revealed in diverse populations and have important public health implications in building race/ethnicity-specific risk prediction model for breast cancer.
Xu, Qing; Mei, Gui; Sun, Dongxiao; Zhang, Qin; Zhang, Yuan; Yin, Cengceng; Chen, Huiyong; Ding, Xiangdong; Liu, Jianfeng
2012-11-02
We previously localized a quantitative trait locus (QTL) on bovine chromosome 6 affecting milk production traits to a 1.5-Mb region between BMS483 and MNB-209 via genome scanning followed by fine mapping. Totally 15 genes were mapped within such linkage region through bioinformatic analysis of the cattle-human comparative map and bovine genome assembly. Of them, the UDP-glucose dehydrogenase (UGDH) was suggested as a potential positional candidate gene for milk production traits based on its corresponding physiological and biochemical functions and genetic effects. By sequencing all the coding exons and the untranslated regions in UGDH with pooled DNA of 8 sires represented the separated families detected in our previous studies, a total of ten SNPs were identified and genotyped in 1417 Holstein cows of 8 separation families. Individual SNP-based association analysis revealed 4 significant associations of SNP Ex1-1, SNP Int3-1, SNP Int5-1, and SNP Ex12-3 with milk yield (P < 0.05), and 2 significant associations of SNP Ex1-1 and SNP Ex12-3 with protein yield (P < 0.05). Furthermore, our haplotype-based association analyses indicated that haplotypes G-C-C, formed by SNP Ex12-2-SNP Int11-1-SNP Ex11-1, T-G, formed by SNP Int9-3-SNP Int9-2, and C-C, formed by SNP Int5-1-SNP Int3-1, are significantly associated with protein percentage (F=4.15; P=0.0418) and fat percentage (F=5.18~7.25; P=0.0072~0.0231). Finally, by using an in vitro expression assay, we demonstrated that the A allele of SNP Ex1-1 and T allele of SNP Ex11-1of UGDH significantly decreases the expression of UGDH by 68.0% at the RNA, and 50.1% at the protein level, suggesting that SNP Ex1-1 and Ex11-1 represent two functional polymorphisms affecting expression of UGDH and may partly contributed to the observed association of the gene with milk production traits in our samples. Taken together, our findings strongly indicate that UGDH gene could be involved in genetic variation underlying the QTL for milk production traits.
GEMINI: Integrative Exploration of Genetic Variation and Genome Annotations
Paila, Umadevi; Chapman, Brad A.; Kirchner, Rory; Quinlan, Aaron R.
2013-01-01
Modern DNA sequencing technologies enable geneticists to rapidly identify genetic variation among many human genomes. However, isolating the minority of variants underlying disease remains an important, yet formidable challenge for medical genetics. We have developed GEMINI (GEnome MINIng), a flexible software package for exploring all forms of human genetic variation. Unlike existing tools, GEMINI integrates genetic variation with a diverse and adaptable set of genome annotations (e.g., dbSNP, ENCODE, UCSC, ClinVar, KEGG) into a unified database to facilitate interpretation and data exploration. Whereas other methods provide an inflexible set of variant filters or prioritization methods, GEMINI allows researchers to compose complex queries based on sample genotypes, inheritance patterns, and both pre-installed and custom genome annotations. GEMINI also provides methods for ad hoc queries and data exploration, a simple programming interface for custom analyses that leverage the underlying database, and both command line and graphical tools for common analyses. We demonstrate GEMINI's utility for exploring variation in personal genomes and family based genetic studies, and illustrate its ability to scale to studies involving thousands of human samples. GEMINI is designed for reproducibility and flexibility and our goal is to provide researchers with a standard framework for medical genomics. PMID:23874191
Karaesmen, Ezgi; Rizvi, Abbas A.; Preus, Leah M.; McCarthy, Philip L.; Pasquini, Marcelo C.; Onel, Kenan; Zhu, Xiaochun; Spellman, Stephen; Haiman, Christopher A.; Stram, Daniel O.; Pooler, Loreall; Sheng, Xin; Zhu, Qianqian; Yan, Li; Liu, Qian; Hu, Qiang; Webb, Amy; Brock, Guy; Clay-Gilmour, Alyssa I.; Battaglia, Sebastiano; Tritchler, David; Liu, Song; Hahn, Theresa
2017-01-01
Multiple candidate gene-association studies of non-HLA single-nucleotide polymorphisms (SNPs) and outcomes after blood or marrow transplant (BMT) have been conducted. We identified 70 publications reporting 45 SNPs in 36 genes significantly associated with disease-related mortality, progression-free survival, transplant-related mortality, and/or overall survival after BMT. Replication and validation of these SNP associations were performed using DISCOVeRY-BMT (Determining the Influence of Susceptibility COnveying Variants Related to one-Year mortality after BMT), a well-powered genome-wide association study consisting of 2 cohorts, totaling 2888 BMT recipients with acute myeloid leukemia, acute lymphoblastic leukemia, or myelodysplastic syndrome, and their HLA-matched unrelated donors, reported to the Center for International Blood and Marrow Transplant Research. Gene-based tests were used to assess the aggregate effect of SNPs on outcome. None of the previously reported significant SNPs replicated at P < .05 in DISCOVeRY-BMT. Validation analyses showed association with one previously reported donor SNP at P < .05 and survival; more associations would be anticipated by chance alone. No gene-based tests were significant at P < .05. Functional annotation with publicly available data shows these candidate SNPs most likely do not have biochemical function; only 13% of candidate SNPs correlate with gene expression or are predicted to impact transcription factor binding. Of these, half do not impact the candidate gene of interest; the other half correlate with expression of multiple genes. These findings emphasize the peril of pursing candidate approaches and the importance of adequately powered tests of unbiased genome-wide associations with BMT clinical outcomes given the ultimate goal of improving patient outcomes. PMID:28811306
Feuermann, Marc; Gaudet, Pascale; Mi, Huaiyu; Lewis, Suzanna E; Thomas, Paul D
2016-01-01
We previously reported a paradigm for large-scale phylogenomic analysis of gene families that takes advantage of the large corpus of experimentally supported Gene Ontology (GO) annotations. This 'GO Phylogenetic Annotation' approach integrates GO annotations from evolutionarily related genes across ∼100 different organisms in the context of a gene family tree, in which curators build an explicit model of the evolution of gene functions. GO Phylogenetic Annotation models the gain and loss of functions in a gene family tree, which is used to infer the functions of uncharacterized (or incompletely characterized) gene products, even for human proteins that are relatively well studied. Here, we report our results from applying this paradigm to two well-characterized cellular processes, apoptosis and autophagy. This revealed several important observations with respect to GO annotations and how they can be used for function inference. Notably, we applied only a small fraction of the experimentally supported GO annotations to infer function in other family members. The majority of other annotations describe indirect effects, phenotypes or results from high throughput experiments. In addition, we show here how feedback from phylogenetic annotation leads to significant improvements in the PANTHER trees, the GO annotations and GO itself. Thus GO phylogenetic annotation both increases the quantity and improves the accuracy of the GO annotations provided to the research community. We expect these phylogenetically based annotations to be of broad use in gene enrichment analysis as well as other applications of GO annotations.Database URL: http://amigo.geneontology.org/amigo. © The Author(s) 2016. Published by Oxford University Press.
Nuñez-Acuña, Gustavo; Valenzuela-Muñoz, Valentina; Gallardo-Escárate, Cristian
2014-06-01
The salmon louse Caligus rogercresseyi is the dominant ectoparasite species affecting the salmon aquaculture industry in the Southern hemisphere, and it is currently the main cause for economic losses in Chilean aquaculture. However, despite the great concern over Caligus infestations, genomic information on this louse is still scarce, even while the need to develop high-resolution molecular markers is growing. This study provides the first deep transcriptome survey to identify thousands of SNP markers from C. rogercresseyi, with a total of 69,466 SNPs identified using the MiSeq platform (Illumina®), 30,605 (52%) of which were found in contigs successfully annotated against known protein databases. Furthermore, in silico gene expression profiles associated with SNP variants were evaluated, and the results evidenced a wide array of genes that were down- and upregulated throughout the developmental stages of C. rogercresseyi. Interestingly, putative KEGG pathways involved in resistance to antiparasitic agents were also identified, where ten pathways were associated with the nervous system and one was related to ABC transporters. Taken together, this information could be highly useful for investigating the molecular underpinnings involved in the susceptibility or resistance of salmon lice to chemical treatments. Copyright © 2014 Elsevier Inc. All rights reserved.
Development of genetic markers in abalone through construction of a SNP database.
Kang, J-H; Appleyard, S A; Elliott, N G; Jee, Y-J; Lee, J B; Kang, S W; Baek, M K; Han, Y S; Choi, T-J; Lee, Y S
2011-06-01
In the absence of a reference genome, single-nucleotide polymorphisms (SNP) discovery in a group of abalone species was undertaken by random sequence assembly. A web-based interface was constructed, and 11 932 DNA sequences from the genus Haliotis were assembled, with 1321 contigs built. Of these, 118 contigs that consisted of at least ten annotation groups were selected. The 1577 putative SNPs were identified from the 118 contigs, with SNPs in several HSP70 gene contigs confirmed by PCR amplification of an 809-bp DNA fragment. SNPs in the HSP70 gene were compared across eight abalone species. A total of 129 polymorphic sites, including heterozygote sites within and among species, were observed. Phylogenetic analysis of the partial HSP70 gene region showed separation of the tested abalone into two groups, one reflecting the southern hemisphere species and the other the northern hemisphere species. Interestingly, Haliotis iris from New Zealand showed a closer relationship to species distributed in the northern Pacific region. Although HSP genes are known to be highly conserved among taxa, the validation of polymorphic SNPs from HSP70 in this mollusc demonstrates the applicability of cross-species SNP markers in abalone and the first step towards universal nuclear markers in Haliotis. © 2010 NFRDI, Animal Genetics © 2010 Stichting International Foundation for Animal Genetics.
Considerations to improve functional annotations in biological databases.
Benítez-Páez, Alfonso
2009-12-01
Despite the great effort to design efficient systems allowing the electronic indexation of information concerning genes, proteins, structures, and interactions published daily in scientific journals, some problems are still observed in specific tasks such as functional annotation. The annotation of function is a critical issue for bioinformatic routines, such as for instance, in functional genomics and the further prediction of unknown protein function, which are highly dependent of the quality of existing annotations. Some information management systems evolve to efficiently incorporate information from large-scale projects, but often, annotation of single records from the literature is difficult and slow. In this short report, functional characterizations of a representative sample of the entire set of uncharacterized proteins from Escherichia coli K12 was compiled from Swiss-Prot, PubMed, and EcoCyc and demonstrate a functional annotation deficit in biological databases. Some issues are postulated as causes of the lack of annotation, and different solutions are evaluated and proposed to avoid them. The hope is that as a consequence of these observations, there will be new impetus to improve the speed and quality of functional annotation and ultimately provide updated, reliable information to the scientific community.
Lu, Qiongshi; Hu, Yiming; Sun, Jiehuan; Cheng, Yuwei; Cheung, Kei-Hoi; Zhao, Hongyu
2015-05-27
Identifying functional regions in the human genome is a major goal in human genetics. Great efforts have been made to functionally annotate the human genome either through computational predictions, such as genomic conservation, or high-throughput experiments, such as the ENCODE project. These efforts have resulted in a rich collection of functional annotation data of diverse types that need to be jointly analyzed for integrated interpretation and annotation. Here we present GenoCanyon, a whole-genome annotation method that performs unsupervised statistical learning using 22 computational and experimental annotations thereby inferring the functional potential of each position in the human genome. With GenoCanyon, we are able to predict many of the known functional regions. The ability of predicting functional regions as well as its generalizable statistical framework makes GenoCanyon a unique and powerful tool for whole-genome annotation. The GenoCanyon web server is available at http://genocanyon.med.yale.edu.
The Bos taurus–Bos indicus balance in fertility and milk related genes
Lehnert, Sigrid A.; Mudadu, Mauricio A.; Coutinho, Luiz; Regitano, Luciana; George, Andrew; Reverter, Antonio
2017-01-01
Numerical approaches to high-density single nucleotide polymorphism (SNP) data are often employed independently to address individual questions. We linked independent approaches in a bioinformatics pipeline for further insight. The pipeline driven by heterozygosity and Hardy-Weinberg equilibrium (HWE) analyses was applied to characterize Bos taurus and Bos indicus ancestry. We infer a gene co-heterozygosity network that regulates bovine fertility, from data on 18,363 cattle with genotypes for 729,068 SNP. Hierarchical clustering separated populations according to Bos taurus and Bos indicus ancestry. The weights of the first principal component were subjected to Normal mixture modelling allowing the estimation of a gene’s contribution to the Bos taurus-Bos indicus axis. We used deviation from HWE, contribution to Bos indicus content and association to fertility traits to select 1,284 genes. With this set, we developed a co-heterozygosity network where the group of genes annotated as fertility-related had significantly higher Bos indicus content compared to other functional classes of genes, while the group of genes associated with milk production had significantly higher Bos taurus content. The network analysis resulted in capturing novel gene associations of relevance to bovine domestication events. We report transcription factors that are likely to regulate genes associated with cattle domestication and tropical adaptation. Our pipeline can be generalized to any scenarios where population structure requires scrutiny at the molecular level, particularly in the presence of a priori set of genes known to impact a phenotype of evolutionary interest such as fertility. PMID:28763475
A false single nucleotide polymorphism generated by gene duplication compromises meat traceability.
Sanz, Arianne; Ordovás, Laura; Zaragoza, Pilar; Sanz, Albina; de Blas, Ignacio; Rodellar, Clementina
2012-07-01
Controlling meat traceability using SNPs is an effective method of ensuring food safety. We have analyzed several SNPs to create a panel for bovine genetic identification and traceability studies. One of these was the transversion g.329C>T (Genbank accession no. AJ496781) on the cytochrome P450 17A1 gene, which has been included in previously published panels. Using minisequencing reactions, we have tested 701 samples belonging to eight Spanish cattle breeds. Surprisingly, an excess of heterozygotes was detected, implying an extreme departure from Hardy-Weinberg equilibrium (P<0.001). By alignment analysis and sequencing, we detected that the g.329C>T SNP is a false positive polymorphism, which allows us to explain the inflated heterozygotic value. We recommend that this ambiguous SNP, as well as other polymorphisms located in this region, should not be used in identification, traceability or disease association studies. Annotation of these false SNPs should improve association studies and avoid misinterpretations. Copyright © 2012 Elsevier Ltd. All rights reserved.
Dumitriu, Alexandra; Latourelle, Jeanne C; Hadzi, Tiffany C; Pankratz, Nathan; Garza, Dan; Miller, John P; Vance, Jeffery M; Foroud, Tatiana; Beach, Thomas G; Myers, Richard H
2012-06-01
Parkinson disease (PD) is a complex neurodegenerative disorder with largely unknown genetic mechanisms. While the degeneration of dopaminergic neurons in PD mainly takes place in the substantia nigra pars compacta (SN) region, other brain areas, including the prefrontal cortex, develop Lewy bodies, the neuropathological hallmark of PD. We generated and analyzed expression data from the prefrontal cortex Brodmann Area 9 (BA9) of 27 PD and 26 control samples using the 44K One-Color Agilent 60-mer Whole Human Genome Microarray. All samples were male, without significant Alzheimer disease pathology and with extensive pathological annotation available. 507 of the 39,122 analyzed expression probes were different between PD and control samples at false discovery rate (FDR) of 5%. One of the genes with significantly increased expression in PD was the forkhead box O1 (FOXO1) transcription factor. Notably, genes carrying the FoxO1 binding site were significantly enriched in the FDR-significant group of genes (177 genes covered by 189 probes), suggesting a role for FoxO1 upstream of the observed expression changes. Single-nucleotide polymorphisms (SNPs) selected from a recent meta-analysis of PD genome-wide association studies (GWAS) were successfully genotyped in 50 out of the 53 microarray brains, allowing a targeted expression-SNP (eSNP) analysis for 52 SNPs associated with PD affection at genome-wide significance and the 189 probes from FoxO1 regulated genes. A significant association was observed between a SNP in the cyclin G associated kinase (GAK) gene and a probe in the spermine oxidase (SMOX) gene. Further examination of the FOXO1 region in a meta-analysis of six available GWAS showed two SNPs significantly associated with age at onset of PD. These results implicate FOXO1 as a PD-relevant gene and warrant further functional analyses of its transcriptional regulatory mechanisms.
Common variants at the CHEK2 gene locus and risk of epithelial ovarian cancer.
Lawrenson, Kate; Iversen, Edwin S; Tyrer, Jonathan; Weber, Rachel Palmieri; Concannon, Patrick; Hazelett, Dennis J; Li, Qiyuan; Marks, Jeffrey R; Berchuck, Andrew; Lee, Janet M; Aben, Katja K H; Anton-Culver, Hoda; Antonenkova, Natalia; Bandera, Elisa V; Bean, Yukie; Beckmann, Matthias W; Bisogna, Maria; Bjorge, Line; Bogdanova, Natalia; Brinton, Louise A; Brooks-Wilson, Angela; Bruinsma, Fiona; Butzow, Ralf; Campbell, Ian G; Carty, Karen; Chang-Claude, Jenny; Chenevix-Trench, Georgia; Chen, Ann; Chen, Zhihua; Cook, Linda S; Cramer, Daniel W; Cunningham, Julie M; Cybulski, Cezary; Plisiecka-Halasa, Joanna; Dennis, Joe; Dicks, Ed; Doherty, Jennifer A; Dörk, Thilo; du Bois, Andreas; Eccles, Diana; Easton, Douglas T; Edwards, Robert P; Eilber, Ursula; Ekici, Arif B; Fasching, Peter A; Fridley, Brooke L; Gao, Yu-Tang; Gentry-Maharaj, Aleksandra; Giles, Graham G; Glasspool, Rosalind; Goode, Ellen L; Goodman, Marc T; Gronwald, Jacek; Harter, Philipp; Hasmad, Hanis Nazihah; Hein, Alexander; Heitz, Florian; Hildebrandt, Michelle A T; Hillemanns, Peter; Hogdall, Estrid; Hogdall, Claus; Hosono, Satoyo; Jakubowska, Anna; Paul, James; Jensen, Allan; Karlan, Beth Y; Kjaer, Susanne Kruger; Kelemen, Linda E; Kellar, Melissa; Kelley, Joseph L; Kiemeney, Lambertus A; Krakstad, Camilla; Lambrechts, Diether; Lambrechts, Sandrina; Le, Nhu D; Lee, Alice W; Cannioto, Rikki; Leminen, Arto; Lester, Jenny; Levine, Douglas A; Liang, Dong; Lissowska, Jolanta; Lu, Karen; Lubinski, Jan; Lundvall, Lene; Massuger, Leon F A G; Matsuo, Keitaro; McGuire, Valerie; McLaughlin, John R; Nevanlinna, Heli; McNeish, Iain; Menon, Usha; Modugno, Francesmary; Moysich, Kirsten B; Narod, Steven A; Nedergaard, Lotte; Ness, Roberta B; Noor Azmi, Mat Adenan; Odunsi, Kunle; Olson, Sara H; Orlow, Irene; Orsulic, Sandra; Pearce, Celeste L; Pejovic, Tanja; Pelttari, Liisa M; Permuth-Wey, Jennifer; Phelan, Catherine M; Pike, Malcolm C; Poole, Elizabeth M; Ramus, Susan J; Risch, Harvey A; Rosen, Barry; Rossing, Mary Anne; Rothstein, Joseph H; Rudolph, Anja; Runnebaum, Ingo B; Rzepecka, Iwona K; Salvesen, Helga B; Budzilowska, Agnieszka; Sellers, Thomas A; Shu, Xiao-Ou; Shvetsov, Yurii B; Siddiqui, Nadeem; Sieh, Weiva; Song, Honglin; Southey, Melissa C; Sucheston, Lara; Tangen, Ingvild L; Teo, Soo-Hwang; Terry, Kathryn L; Thompson, Pamela J; Timorek, Agnieszka; Tworoger, Shelley S; Van Nieuwenhuysen, Els; Vergote, Ignace; Vierkant, Robert A; Wang-Gohrke, Shan; Walsh, Christine; Wentzensen, Nicolas; Whittemore, Alice S; Wicklund, Kristine G; Wilkens, Lynne R; Woo, Yin-Ling; Wu, Xifeng; Wu, Anna H; Yang, Hannah; Zheng, Wei; Ziogas, Argyrios; Coetzee, Gerhard A; Freedman, Matthew L; Monteiro, Alvaro N A; Moes-Sosnowska, Joanna; Kupryjanczyk, Jolanta; Pharoah, Paul D; Gayther, Simon A; Schildkraut, Joellen M
2015-11-01
Genome-wide association studies have identified 20 genomic regions associated with risk of epithelial ovarian cancer (EOC), but many additional risk variants may exist. Here, we evaluated associations between common genetic variants [single nucleotide polymorphisms (SNPs) and indels] in DNA repair genes and EOC risk. We genotyped 2896 common variants at 143 gene loci in DNA samples from 15 397 patients with invasive EOC and controls. We found evidence of associations with EOC risk for variants at FANCA, EXO1, E2F4, E2F2, CREB5 and CHEK2 genes (P ≤ 0.001). The strongest risk association was for CHEK2 SNP rs17507066 with serous EOC (P = 4.74 x 10(-7)). Additional genotyping and imputation of genotypes from the 1000 genomes project identified a slightly more significant association for CHEK2 SNP rs6005807 (r (2) with rs17507066 = 0.84, odds ratio (OR) 1.17, 95% CI 1.11-1.24, P = 1.1×10(-7)). We identified 293 variants in the region with likelihood ratios of less than 1:100 for representing the causal variant. Functional annotation identified 25 candidate SNPs that alter transcription factor binding sites within regulatory elements active in EOC precursor tissues. In The Cancer Genome Atlas dataset, CHEK2 gene expression was significantly higher in primary EOCs compared to normal fallopian tube tissues (P = 3.72×10(-8)). We also identified an association between genotypes of the candidate causal SNP rs12166475 (r (2) = 0.99 with rs6005807) and CHEK2 expression (P = 2.70×10(-8)). These data suggest that common variants at 22q12.1 are associated with risk of serous EOC and CHEK2 as a plausible target susceptibility gene. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
Dumitriu, Alexandra; Latourelle, Jeanne C.; Hadzi, Tiffany C.; Pankratz, Nathan; Garza, Dan; Miller, John P.; Vance, Jeffery M.; Foroud, Tatiana; Beach, Thomas G.; Myers, Richard H.
2012-01-01
Parkinson disease (PD) is a complex neurodegenerative disorder with largely unknown genetic mechanisms. While the degeneration of dopaminergic neurons in PD mainly takes place in the substantia nigra pars compacta (SN) region, other brain areas, including the prefrontal cortex, develop Lewy bodies, the neuropathological hallmark of PD. We generated and analyzed expression data from the prefrontal cortex Brodmann Area 9 (BA9) of 27 PD and 26 control samples using the 44K One-Color Agilent 60-mer Whole Human Genome Microarray. All samples were male, without significant Alzheimer disease pathology and with extensive pathological annotation available. 507 of the 39,122 analyzed expression probes were different between PD and control samples at false discovery rate (FDR) of 5%. One of the genes with significantly increased expression in PD was the forkhead box O1 (FOXO1) transcription factor. Notably, genes carrying the FoxO1 binding site were significantly enriched in the FDR–significant group of genes (177 genes covered by 189 probes), suggesting a role for FoxO1 upstream of the observed expression changes. Single-nucleotide polymorphisms (SNPs) selected from a recent meta-analysis of PD genome-wide association studies (GWAS) were successfully genotyped in 50 out of the 53 microarray brains, allowing a targeted expression–SNP (eSNP) analysis for 52 SNPs associated with PD affection at genome-wide significance and the 189 probes from FoxO1 regulated genes. A significant association was observed between a SNP in the cyclin G associated kinase (GAK) gene and a probe in the spermine oxidase (SMOX) gene. Further examination of the FOXO1 region in a meta-analysis of six available GWAS showed two SNPs significantly associated with age at onset of PD. These results implicate FOXO1 as a PD–relevant gene and warrant further functional analyses of its transcriptional regulatory mechanisms. PMID:22761592
RNA-Seq identifies SNP markers for growth traits in rainbow trout.
Salem, Mohamed; Vallejo, Roger L; Leeds, Timothy D; Palti, Yniv; Liu, Sixin; Sabbagh, Annas; Rexroad, Caird E; Yao, Jianbo
2012-01-01
Fast growth is an important and highly desired trait, which affects the profitability of food animal production, with feed costs accounting for the largest proportion of production costs. Traditional phenotype-based selection is typically used to select for growth traits; however, genetic improvement is slow over generations. Single nucleotide polymorphisms (SNPs) explain 90% of the genetic differences between individuals; therefore, they are most suitable for genetic evaluation and strategies that employ molecular genetics for selective breeding. SNPs found within or near a coding sequence are of particular interest because they are more likely to alter the biological function of a protein. We aimed to use SNPs to identify markers and genes associated with genetic variation in growth. RNA-Seq whole-transcriptome analysis of pooled cDNA samples from a population of rainbow trout selected for improved growth versus unselected genetic cohorts (10 fish from 1 full-sib family each) identified SNP markers associated with growth-rate. The allelic imbalances (the ratio between the allele frequencies of the fast growing sample and that of the slow growing sample) were considered at scores >5.0 as an amplification and <0.2 as loss of heterozygosity. A subset of SNPs (n = 54) were validated and evaluated for association with growth traits in 778 individuals of a three-generation parent/offspring panel representing 40 families. Twenty-two SNP markers and one mitochondrial haplotype were significantly associated with growth traits. Polymorphism of 48 of the markers was confirmed in other commercially important aquaculture stocks. Many markers were clustered into genes of metabolic energy production pathways and are suitable candidates for genetic selection. The study demonstrates that RNA-Seq at low sequence coverage of divergent populations is a fast and effective means of identifying SNPs, with allelic imbalances between phenotypes. This technique is suitable for marker development in non-model species lacking complete and well-annotated genome reference sequences.
Reverter, A; Porto-Neto, L R; Fortes, M R S; McCulloch, R; Lyons, R E; Moore, S; Nicol, D; Henshall, J; Lehnert, S A
2016-10-01
We introduce an innovative approach to lowering the overall cost of obtaining genomic EBV (GEBV) and encourage their use in commercial extensive herds of Brahman beef cattle. In our approach, the DNA genotyping of cow herds from 2 independent properties was performed using a high-density bovine SNP chip on DNA from pooled blood samples, grouped according to the result of a pregnancy test following their first and second joining opportunities. For the DNA pooling strategy, 15 to 28 blood samples from the same phenotype and contemporary group were allocated to pools. Across the 2 properties, a total of 183 pools were created representing 4,164 cows. In addition, blood samples from 309 bulls from the same properties were also taken. After genotyping and quality control, 74,584 remaining SNP were used for analyses. Pools and individual DNA samples were related by means of a "hybrid" genomic relationship matrix. The pooled genotyping analysis of 2 large and independent commercial populations of tropical beef cattle was able to recover significant and plausible associations between SNP and pregnancy test outcome. We discuss 24 SNP with significant association ( < 1.0 × 10) and mapped within 40 kb of an annotated gene. We have established a method to estimate the GEBV in young herd bulls for a trait that is currently unable to be predicted at all. In summary, our novel approach allowed us to conduct genomic analyses of fertility in 2 large commercial Brahman herds managed under extensive pastoral conditions.
Yang, Xue-Dong; Tan, Hua-Wei; Zhu, Wei-Min
2016-01-01
Spinach (Spinacia oleracea L.), which originated in central and western Asia, belongs to the family Amaranthaceae. Spinach is one of most important leafy vegetables with a high nutritional value as well as being a perfect research material for plant sex chromosome models. As the completion of genome assembly and gene prediction of spinach, we developed SpinachDB (http://222.73.98.124/spinachdb) to store, annotate, mine and analyze genomics and genetics datasets efficiently. In this study, all of 21702 spinach genes were annotated. A total of 15741 spinach genes were catalogued into 4351 families, including identification of a substantial number of transcription factors. To construct a high-density genetic map, a total of 131592 SSRs and 1125743 potential SNPs located in 548801 loci of spinach genome were identified in 11 cultivated and wild spinach cultivars. The expression profiles were also performed with RNA-seq data using the FPKM method, which could be used to compare the genes. Paralogs in spinach and the orthologous genes in Arabidopsis, grape, sugar beet and rice were identified for comparative genome analysis. Finally, the SpinachDB website contains seven main sections, including the homepage; the GBrowse map that integrates genome, genes, SSR and SNP marker information; the Blast alignment service; the gene family classification search tool; the orthologous and paralogous gene pairs search tool; and the download and useful contact information. SpinachDB will be continually expanded to include newly generated robust genomics and genetics data sets along with the associated data mining and analysis tools.
Zheng, Zequn; Zhang, Qisen; Zhou, Gaofeng; Sweetingham, Mark W.; Howieson, John G.; Li, Chengdao
2013-01-01
Lupin (Lupinus angustifolius L.) is the most recently domesticated crop in major agricultural cultivation. Its seeds are high in protein and dietary fibre, but low in oil and starch. Medical and dietetic studies have shown that consuming lupin-enriched food has significant health benefits. We report the draft assembly from a whole genome shotgun sequencing dataset for this legume species with 26.9x coverage of the genome, which is predicted to contain 57,807 genes. Analysis of the annotated genes with metabolic pathways provided a partial understanding of some key features of lupin, such as the amino acid profile of storage proteins in seeds. Furthermore, we applied the NGS-based RAD-sequencing technology to obtain 8,244 sequence-defined markers for anchoring the genomic sequences. A total of 4,214 scaffolds from the genome sequence assembly were aligned into the genetic map. The combination of the draft assembly and a sequence-defined genetic map made it possible to locate and study functional genes of agronomic interest. The identification of co-segregating SNP markers, scaffold sequences and gene annotation facilitated the identification of a candidate R gene associated with resistance to the major lupin disease anthracnose. We demonstrated that the combination of medium-depth genome sequencing and a high-density genetic linkage map by application of NGS technology is a cost-effective approach to generating genome sequence data and a large number of molecular markers to study the genomics, genetics and functional genes of lupin, and to apply them to molecular plant breeding. This strategy does not require prior genome knowledge, which potentiates its application to a wide range of non-model species. PMID:23734219
Yang, Huaan; Tao, Ye; Zheng, Zequn; Zhang, Qisen; Zhou, Gaofeng; Sweetingham, Mark W; Howieson, John G; Li, Chengdao
2013-01-01
Lupin (Lupinus angustifolius L.) is the most recently domesticated crop in major agricultural cultivation. Its seeds are high in protein and dietary fibre, but low in oil and starch. Medical and dietetic studies have shown that consuming lupin-enriched food has significant health benefits. We report the draft assembly from a whole genome shotgun sequencing dataset for this legume species with 26.9x coverage of the genome, which is predicted to contain 57,807 genes. Analysis of the annotated genes with metabolic pathways provided a partial understanding of some key features of lupin, such as the amino acid profile of storage proteins in seeds. Furthermore, we applied the NGS-based RAD-sequencing technology to obtain 8,244 sequence-defined markers for anchoring the genomic sequences. A total of 4,214 scaffolds from the genome sequence assembly were aligned into the genetic map. The combination of the draft assembly and a sequence-defined genetic map made it possible to locate and study functional genes of agronomic interest. The identification of co-segregating SNP markers, scaffold sequences and gene annotation facilitated the identification of a candidate R gene associated with resistance to the major lupin disease anthracnose. We demonstrated that the combination of medium-depth genome sequencing and a high-density genetic linkage map by application of NGS technology is a cost-effective approach to generating genome sequence data and a large number of molecular markers to study the genomics, genetics and functional genes of lupin, and to apply them to molecular plant breeding. This strategy does not require prior genome knowledge, which potentiates its application to a wide range of non-model species.
Transcriptome-enabled marker discovery and mapping of plastochron-related genes in Petunia spp.
Guo, Yufang; Wiegert-Rininger, Krystle E; Vallejo, Veronica A; Barry, Cornelius S; Warner, Ryan M
2015-09-24
Petunia (Petunia × hybrida), derived from a hybrid between P. axillaris and P. integrifolia, is one of the most economically important bedding plant crops and Petunia spp. serve as model systems for investigating the mechanisms underlying diverse mating systems and pollination syndromes. In addition, we have previously described genetic variation and quantitative trait loci (QTL) related to petunia development rate and morphology, which represent important breeding targets for the floriculture industry to improve crop production and performance. Despite the importance of petunia as a crop, the floriculture industry has been slow to adopt marker assisted selection to facilitate breeding strategies and there remains a limited availability of sequences and molecular markers from the genus compared to other economically important members of the Solanaceae family such as tomato, potato and pepper. Here we report the de novo assembly, annotation and characterization of transcriptomes from P. axillaris, P. exserta and P. integrifolia. Each transcriptome assembly was derived from five tissue libraries (callus, 3-week old seedlings, shoot apices, flowers of mixed developmental stages, and trichomes). A total of 74,573, 54,913, and 104,739 assembled transcripts were recovered from P. axillaris, P. exserta and P. integrifolia, respectively and following removal of multiple isoforms, 32,994 P. axillaris, 30,225 P. exserta, and 33,540 P. integrifolia high quality representative transcripts were extracted for annotation and expression analysis. The transcriptome data was mined for single nucleotide polymorphisms (SNP) and simple sequence repeat (SSR) markers, yielding 89,007 high quality SNPs and 2949 SSRs, respectively. 15,701 SNPs were computationally converted into user-friendly cleaved amplified polymorphic sequence (CAPS) markers and a subset of SNP and CAPS markers were experimentally verified. CAPS markers developed from plastochron-related homologous transcripts from P. axillaris were mapped in an interspecific Petunia population and evaluated for co-localization with QTL for development rate. The high quality of the three Petunia spp. transcriptomes coupled with the utility of the SNP data will serve as a resource for further exploration of genetic diversity within the genus and will facilitate efforts to develop genetic and physical maps to aid the identification of QTL associated with traits of interest.
DMET-analyzer: automatic analysis of Affymetrix DMET data.
Guzzi, Pietro Hiram; Agapito, Giuseppe; Di Martino, Maria Teresa; Arbitrio, Mariamena; Tassone, Pierfrancesco; Tagliaferri, Pierosandro; Cannataro, Mario
2012-10-05
Clinical Bioinformatics is currently growing and is based on the integration of clinical and omics data aiming at the development of personalized medicine. Thus the introduction of novel technologies able to investigate the relationship among clinical states and biological machineries may help the development of this field. For instance the Affymetrix DMET platform (drug metabolism enzymes and transporters) is able to study the relationship among the variation of the genome of patients and drug metabolism, detecting SNPs (Single Nucleotide Polymorphism) on genes related to drug metabolism. This may allow for instance to find genetic variants in patients which present different drug responses, in pharmacogenomics and clinical studies. Despite this, there is currently a lack in the development of open-source algorithms and tools for the analysis of DMET data. Existing software tools for DMET data generally allow only the preprocessing of binary data (e.g. the DMET-Console provided by Affymetrix) and simple data analysis operations, but do not allow to test the association of the presence of SNPs with the response to drugs. We developed DMET-Analyzer a tool for the automatic association analysis among the variation of the patient genomes and the clinical conditions of patients, i.e. the different response to drugs. The proposed system allows: (i) to automatize the workflow of analysis of DMET-SNP data avoiding the use of multiple tools; (ii) the automatic annotation of DMET-SNP data and the search in existing databases of SNPs (e.g. dbSNP), (iii) the association of SNP with pathway through the search in PharmaGKB, a major knowledge base for pharmacogenomic studies. DMET-Analyzer has a simple graphical user interface that allows users (doctors/biologists) to upload and analyse DMET files produced by Affymetrix DMET-Console in an interactive way. The effectiveness and easy use of DMET Analyzer is demonstrated through different case studies regarding the analysis of clinical datasets produced in the University Hospital of Catanzaro, Italy. DMET Analyzer is a novel tool able to automatically analyse data produced by the DMET-platform in case-control association studies. Using such tool user may avoid wasting time in the manual execution of multiple statistical tests avoiding possible errors and reducing the amount of time needed for a whole experiment. Moreover annotations and the direct link to external databases may increase the biological knowledge extracted. The system is freely available for academic purposes at: https://sourceforge.net/projects/dmetanalyzer/files/
Explaining the disease phenotype of intergenic SNP through predicted long range regulation
Chen, Jingqi; Tian, Weidong
2016-01-01
Thousands of disease-associated SNPs (daSNPs) are located in intergenic regions (IGR), making it difficult to understand their association with disease phenotypes. Recent analysis found that non-coding daSNPs were frequently located in or approximate to regulatory elements, inspiring us to try to explain the disease phenotypes of IGR daSNPs through nearby regulatory sequences. Hence, after locating the nearest distal regulatory element (DRE) to a given IGR daSNP, we applied a computational method named INTREPID to predict the target genes regulated by the DRE, and then investigated their functional relevance to the IGR daSNP's disease phenotypes. 36.8% of all IGR daSNP-disease phenotype associations investigated were possibly explainable through the predicted target genes, which were enriched with, were functionally relevant to, or consisted of the corresponding disease genes. This proportion could be further increased to 60.5% if the LD SNPs of daSNPs were also considered. Furthermore, the predicted SNP-target gene pairs were enriched with known eQTL/mQTL SNP-gene relationships. Overall, it's likely that IGR daSNPs may contribute to disease phenotypes by interfering with the regulatory function of their nearby DREs and causing abnormal expression of disease genes. PMID:27280978
Generation and analysis of expressed sequence tags in the extreme large genomes Lilium and Tulipa.
Shahin, Arwa; van Kaauwen, Martijn; Esselink, Danny; Bargsten, Joachim W; van Tuyl, Jaap M; Visser, Richard G F; Arens, Paul
2012-11-20
Bulbous flowers such as lily and tulip (Liliaceae family) are monocot perennial herbs that are economically very important ornamental plants worldwide. However, there are hardly any genetic studies performed and genomic resources are lacking. To build genomic resources and develop tools to speed up the breeding in both crops, next generation sequencing was implemented. We sequenced and assembled transcriptomes of four lily and five tulip genotypes using 454 pyro-sequencing technology. Successfully, we developed the first set of 81,791 contigs with an average length of 514 bp for tulip, and enriched the very limited number of 3,329 available ESTs (Expressed Sequence Tags) for lily with 52,172 contigs with an average length of 555 bp. The contigs together with singletons covered on average 37% of lily and 39% of tulip estimated transcriptome. Mining lily and tulip sequence data for SSRs (Simple Sequence Repeats) showed that di-nucleotide repeats were twice more abundant in UTRs (UnTranslated Regions) compared to coding regions, while tri-nucleotide repeats were equally spread over coding and UTR regions. Two sets of single nucleotide polymorphism (SNP) markers suitable for high throughput genotyping were developed. In the first set, no SNPs flanking the target SNP (50 bp on either side) were allowed. In the second set, one SNP in the flanking regions was allowed, which resulted in a 2 to 3 fold increase in SNP marker numbers compared with the first set. Orthologous groups between the two flower bulbs: lily and tulip (12,017 groups) and among the three monocot species: lily, tulip, and rice (6,900 groups) were determined using OrthoMCL. Orthologous groups were screened for common SNP markers and EST-SSRs to study synteny between lily and tulip, which resulted in 113 common SNP markers and 292 common EST-SSR. Lily and tulip contigs generated were annotated and described according to Gene Ontology terminology. Two transcriptome sets were built that are valuable resources for marker development, comparative genomic studies and candidate gene approaches. Next generation sequencing of leaf transcriptome is very effective; however, deeper sequencing and using more tissues and stages is advisable for extended comparative studies.
Relax with CouchDB - Into the non-relational DBMS era of Bioinformatics
Manyam, Ganiraju; Payton, Michelle A.; Roth, Jack A.; Abruzzo, Lynne V.; Coombes, Kevin R.
2012-01-01
With the proliferation of high-throughput technologies, genome-level data analysis has become common in molecular biology. Bioinformaticians are developing extensive resources to annotate and mine biological features from high-throughput data. The underlying database management systems for most bioinformatics software are based on a relational model. Modern non-relational databases offer an alternative that has flexibility, scalability, and a non-rigid design schema. Moreover, with an accelerated development pace, non-relational databases like CouchDB can be ideal tools to construct bioinformatics utilities. We describe CouchDB by presenting three new bioinformatics resources: (a) geneSmash, which collates data from bioinformatics resources and provides automated gene-centric annotations, (b) drugBase, a database of drug-target interactions with a web interface powered by geneSmash, and (c) HapMap-CN, which provides a web interface to query copy number variations from three SNP-chip HapMap datasets. In addition to the web sites, all three systems can be accessed programmatically via web services. PMID:22609849
A survey of application: genomics and genetic programming, a new frontier.
Khan, Mohammad Wahab; Alam, Mansaf
2012-08-01
The aim of this paper is to provide an introduction to the rapidly developing field of genetic programming (GP). Particular emphasis is placed on the application of GP to genomics. First, the basic methodology of GP is introduced. This is followed by a review of applications in the areas of gene network inference, gene expression data analysis, SNP analysis, epistasis analysis and gene annotation. Finally this paper concluded by suggesting potential avenues of possible future research on genetic programming, opportunities to extend the technique, and areas for possible practical applications. Copyright © 2012 Elsevier Inc. All rights reserved.
pyGeno: A Python package for precision medicine and proteogenomics.
Daouda, Tariq; Perreault, Claude; Lemieux, Sébastien
2016-01-01
pyGeno is a Python package mainly intended for precision medicine applications that revolve around genomics and proteomics. It integrates reference sequences and annotations from Ensembl, genomic polymorphisms from the dbSNP database and data from next-gen sequencing into an easy to use, memory-efficient and fast framework, therefore allowing the user to easily explore subject-specific genomes and proteomes. Compared to a standalone program, pyGeno gives the user access to the complete expressivity of Python, a general programming language. Its range of application therefore encompasses both short scripts and large scale genome-wide studies.
pyGeno: A Python package for precision medicine and proteogenomics
Daouda, Tariq; Perreault, Claude; Lemieux, Sébastien
2016-01-01
pyGeno is a Python package mainly intended for precision medicine applications that revolve around genomics and proteomics. It integrates reference sequences and annotations from Ensembl, genomic polymorphisms from the dbSNP database and data from next-gen sequencing into an easy to use, memory-efficient and fast framework, therefore allowing the user to easily explore subject-specific genomes and proteomes. Compared to a standalone program, pyGeno gives the user access to the complete expressivity of Python, a general programming language. Its range of application therefore encompasses both short scripts and large scale genome-wide studies. PMID:27785359
A SNP resource for Douglas-fir: de novo transcriptome assembly and SNP detection and validation.
Howe, Glenn T; Yu, Jianbin; Knaus, Brian; Cronn, Richard; Kolpak, Scott; Dolan, Peter; Lorenz, W Walter; Dean, Jeffrey F D
2013-02-28
Douglas-fir (Pseudotsuga menziesii), one of the most economically and ecologically important tree species in the world, also has one of the largest tree breeding programs. Although the coastal and interior varieties of Douglas-fir (vars. menziesii and glauca) are native to North America, the coastal variety is also widely planted for timber production in Europe, New Zealand, Australia, and Chile. Our main goal was to develop a SNP resource large enough to facilitate genomic selection in Douglas-fir breeding programs. To accomplish this, we developed a 454-based reference transcriptome for coastal Douglas-fir, annotated and evaluated the quality of the reference, identified putative SNPs, and then validated a sample of those SNPs using the Illumina Infinium genotyping platform. We assembled a reference transcriptome consisting of 25,002 isogroups (unique gene models) and 102,623 singletons from 2.76 million 454 and Sanger cDNA sequences from coastal Douglas-fir. We identified 278,979 unique SNPs by mapping the 454 and Sanger sequences to the reference, and by mapping four datasets of Illumina cDNA sequences from multiple seed sources, genotypes, and tissues. The Illumina datasets represented coastal Douglas-fir (64.00 and 13.41 million reads), interior Douglas-fir (80.45 million reads), and a Yakima population similar to interior Douglas-fir (8.99 million reads). We assayed 8067 SNPs on 260 trees using an Illumina Infinium SNP genotyping array. Of these SNPs, 5847 (72.5%) were called successfully and were polymorphic. Based on our validation efficiency, our SNP database may contain as many as ~200,000 true SNPs, and as many as ~69,000 SNPs that could be genotyped at ~20,000 gene loci using an Infinium II array-more SNPs than are needed to use genomic selection in tree breeding programs. Ultimately, these genomic resources will enhance Douglas-fir breeding and allow us to better understand landscape-scale patterns of genetic variation and potential responses to climate change.
Mourad, Amira M I; Sallam, Ahmed; Belamkar, Vikas; Wegulo, Stephen; Bowden, Robert; Jin, Yue; Mahdy, Ezzat; Bakheit, Bahy; El-Wafaa, Atif A; Poland, Jesse; Baenziger, Peter S
2018-01-01
Stem rust (caused by Puccinia graminis f. sp. tritici Erikss. & E. Henn.), is a major disease in wheat ( Triticum aestivium L.). However, in recent years it occurs rarely in Nebraska due to weather and the effective selection and gene pyramiding of resistance genes. To understand the genetic basis of stem rust resistance in Nebraska winter wheat, we applied genome-wide association study (GWAS) on a set of 270 winter wheat genotypes (A-set). Genotyping was carried out using genotyping-by-sequencing and ∼35,000 high-quality SNPs were identified. The tested genotypes were evaluated for their resistance to the common stem rust race in Nebraska (QFCSC) in two replications. Marker-trait association identified 32 SNP markers, which were significantly (Bonferroni corrected P < 0.05) associated with the resistance on chromosome 2D. The chromosomal location of the significant SNPs (chromosome 2D) matched the location of Sr6 gene which was expected in these genotypes based on pedigree information. A highly significant linkage disequilibrium (LD, r 2 ) was found between the significant SNPs and the specific SSR marker for the Sr6 gene ( Xcfd43 ). This suggests the significant SNP markers are tagging Sr6 gene. Out of the 32 significant SNPs, eight SNPs were in six genes that are annotated as being linked to disease resistance in the IWGSC RefSeq v1.0. The 32 significant SNP markers were located in nine haplotype blocks. All the 32 significant SNPs were validated in a set of 60 different genotypes (V-set) using single marker analysis. SNP markers identified in this study can be used in marker-assisted selection, genomic selection, and to develop KASP (Kompetitive Allele Specific PCR) marker for the Sr6 gene. Novel SNPs for Sr6 gene, an important stem rust resistant gene, were identified and validated in this study. These SNPs can be used to improve stem rust resistance in wheat.
A SNP resource for Douglas-fir: de novo transcriptome assembly and SNP detection and validation
2013-01-01
Background Douglas-fir (Pseudotsuga menziesii), one of the most economically and ecologically important tree species in the world, also has one of the largest tree breeding programs. Although the coastal and interior varieties of Douglas-fir (vars. menziesii and glauca) are native to North America, the coastal variety is also widely planted for timber production in Europe, New Zealand, Australia, and Chile. Our main goal was to develop a SNP resource large enough to facilitate genomic selection in Douglas-fir breeding programs. To accomplish this, we developed a 454-based reference transcriptome for coastal Douglas-fir, annotated and evaluated the quality of the reference, identified putative SNPs, and then validated a sample of those SNPs using the Illumina Infinium genotyping platform. Results We assembled a reference transcriptome consisting of 25,002 isogroups (unique gene models) and 102,623 singletons from 2.76 million 454 and Sanger cDNA sequences from coastal Douglas-fir. We identified 278,979 unique SNPs by mapping the 454 and Sanger sequences to the reference, and by mapping four datasets of Illumina cDNA sequences from multiple seed sources, genotypes, and tissues. The Illumina datasets represented coastal Douglas-fir (64.00 and 13.41 million reads), interior Douglas-fir (80.45 million reads), and a Yakima population similar to interior Douglas-fir (8.99 million reads). We assayed 8067 SNPs on 260 trees using an Illumina Infinium SNP genotyping array. Of these SNPs, 5847 (72.5%) were called successfully and were polymorphic. Conclusions Based on our validation efficiency, our SNP database may contain as many as ~200,000 true SNPs, and as many as ~69,000 SNPs that could be genotyped at ~20,000 gene loci using an Infinium II array—more SNPs than are needed to use genomic selection in tree breeding programs. Ultimately, these genomic resources will enhance Douglas-fir breeding and allow us to better understand landscape-scale patterns of genetic variation and potential responses to climate change. PMID:23445355
AutoFACT: An Automatic Functional Annotation and Classification Tool
Koski, Liisa B; Gray, Michael W; Lang, B Franz; Burger, Gertraud
2005-01-01
Background Assignment of function to new molecular sequence data is an essential step in genomics projects. The usual process involves similarity searches of a given sequence against one or more databases, an arduous process for large datasets. Results We present AutoFACT, a fully automated and customizable annotation tool that assigns biologically informative functions to a sequence. Key features of this tool are that it (1) analyzes nucleotide and protein sequence data; (2) determines the most informative functional description by combining multiple BLAST reports from several user-selected databases; (3) assigns putative metabolic pathways, functional classes, enzyme classes, GeneOntology terms and locus names; and (4) generates output in HTML, text and GFF formats for the user's convenience. We have compared AutoFACT to four well-established annotation pipelines. The error rate of functional annotation is estimated to be only between 1–2%. Comparison of AutoFACT to the traditional top-BLAST-hit annotation method shows that our procedure increases the number of functionally informative annotations by approximately 50%. Conclusion AutoFACT will serve as a useful annotation tool for smaller sequencing groups lacking dedicated bioinformatics staff. It is implemented in PERL and runs on LINUX/UNIX platforms. AutoFACT is available at . PMID:15960857
EuCAP, a Eukaryotic Community Annotation Package, and its application to the rice genome
Thibaud-Nissen, Françoise; Campbell, Matthew; Hamilton, John P; Zhu, Wei; Buell, C Robin
2007-01-01
Background Despite the improvements of tools for automated annotation of genome sequences, manual curation at the structural and functional level can provide an increased level of refinement to genome annotation. The Institute for Genomic Research Rice Genome Annotation (hereafter named the Osa1 Genome Annotation) is the product of an automated pipeline and, for this reason, will benefit from the input of biologists with expertise in rice and/or particular gene families. Leveraging knowledge from a dispersed community of scientists is a demonstrated way of improving a genome annotation. This requires tools that facilitate 1) the submission of gene annotation to an annotation project, 2) the review of the submitted models by project annotators, and 3) the incorporation of the submitted models in the ongoing annotation effort. Results We have developed the Eukaryotic Community Annotation Package (EuCAP), an annotation tool, and have applied it to the rice genome. The primary level of curation by community annotators (CA) has been the annotation of gene families. Annotation can be submitted by email or through the EuCAP Web Tool. The CA models are aligned to the rice pseudomolecules and the coordinates of these alignments, along with functional annotation, are stored in the MySQL EuCAP Gene Model database. Web pages displaying the alignments of the CA models to the Osa1 Genome models are automatically generated from the EuCAP Gene Model database. The alignments are reviewed by the project annotators (PAs) in the context of experimental evidence. Upon approval by the PAs, the CA models, along with the corresponding functional annotations, are integrated into the Osa1 Genome Annotation. The CA annotations, grouped by family, are displayed on the Community Annotation pages of the project website , as well as in the Community Annotation track of the Genome Browser. Conclusion We have applied EuCAP to rice. As of July 2007, the structural and/or functional annotation of 1,094 genes representing 57 families have been deposited and integrated into the current gene set. All of the EuCAP components are open-source, thereby allowing the implementation of EuCAP for the annotation of other genomes. EuCAP is available at . PMID:17961238
Yang, Jing; Zhou, Haixia; Liang, Binmiao; Xiao, Jun; Su, Zhiguang; Chen, Hong; Ma, Chunlan; Li, Dengxue; Feng, Yulin; Ou, Xuemei
2014-02-01
Recent genome-wide association studies have shown associations between variants at five loci (TNS1, GSTCD, HTR4, AGER and THSD4) and chronic obstructive pulmonary disease (COPD) or lung function. However, their association with COPD has not been proven in Chinese Han population, nor have COPD-related phenotypes been studied. The objective of this study was to look for associations between five single nucleotide polymorphisms (SNP) in these novel candidate genes and COPD susceptibility or lung function in a Chinese Han population. Allele and genotype data on 680 COPD patients and 687 healthy controls for sentinel SNP in these five loci were investigated. Allele frequencies and genotype distributions were compared between cases and controls, and odds ratios were calculated. Potential relationships between these SNP and COPD-related lung function were assessed. No significant associations were found between any of the SNP and COPD in cases and controls. The SNP (rs3995090) in HTR4 was associated with COPD (adjusted P = 0.022) in never-smokers, and the SNP (rs2070600) in AGER was associated with forced expiratory volume in 1 s (FEV1 %) predicted (β = -0.066, adjusted P = 0.016) and FEV1 /forced vital capacity (β = -0.071, adjusted P = 0.009) in all subjects. The variant at HTR4 was associated with COPD in never-smokers, and the SNP in AGER was associated with pulmonary function in a Chinese Han population. © 2013 The Authors. Respirology © 2013 Asian Pacific Society of Respirology.
Aubry, Marc; Monnier, Annabelle; Chicault, Celine; de Tayrac, Marie; Galibert, Marie-Dominique; Burgun, Anita; Mosser, Jean
2006-01-01
Background Large-scale genomic studies based on transcriptome technologies provide clusters of genes that need to be functionally annotated. The Gene Ontology (GO) implements a controlled vocabulary organised into three hierarchies: cellular components, molecular functions and biological processes. This terminology allows a coherent and consistent description of the knowledge about gene functions. The GO terms related to genes come primarily from semi-automatic annotations made by trained biologists (annotation based on evidence) or text-mining of the published scientific literature (literature profiling). Results We report an original functional annotation method based on a combination of evidence and literature that overcomes the weaknesses and the limitations of each approach. It relies on the Gene Ontology Annotation database (GOA Human) and the PubGene biomedical literature index. We support these annotations with statistically associated GO terms and retrieve associative relations across the three GO hierarchies to emphasise the major pathways involved by a gene cluster. Both annotation methods and associative relations were quantitatively evaluated with a reference set of 7397 genes and a multi-cluster study of 14 clusters. We also validated the biological appropriateness of our hybrid method with the annotation of a single gene (cdc2) and that of a down-regulated cluster of 37 genes identified by a transcriptome study of an in vitro enterocyte differentiation model (CaCo-2 cells). Conclusion The combination of both approaches is more informative than either separate approach: literature mining can enrich an annotation based only on evidence. Text-mining of the literature can also find valuable associated MEDLINE references that confirm the relevance of the annotation. Eventually, GO terms networks can be built with associative relations in order to highlight cooperative and competitive pathways and their connected molecular functions. PMID:16674810
MIPS bacterial genomes functional annotation benchmark dataset.
Tetko, Igor V; Brauner, Barbara; Dunger-Kaltenbach, Irmtraud; Frishman, Goar; Montrone, Corinna; Fobo, Gisela; Ruepp, Andreas; Antonov, Alexey V; Surmeli, Dimitrij; Mewes, Hans-Wernen
2005-05-15
Any development of new methods for automatic functional annotation of proteins according to their sequences requires high-quality data (as benchmark) as well as tedious preparatory work to generate sequence parameters required as input data for the machine learning methods. Different program settings and incompatible protocols make a comparison of the analyzed methods difficult. The MIPS Bacterial Functional Annotation Benchmark dataset (MIPS-BFAB) is a new, high-quality resource comprising four bacterial genomes manually annotated according to the MIPS functional catalogue (FunCat). These resources include precalculated sequence parameters, such as sequence similarity scores, InterPro domain composition and other parameters that could be used to develop and benchmark methods for functional annotation of bacterial protein sequences. These data are provided in XML format and can be used by scientists who are not necessarily experts in genome annotation. BFAB is available at http://mips.gsf.de/proj/bfab
Huang, Dandan; Yi, Xianfu; Zhang, Shijie; Zheng, Zhanye; Wang, Panwen; Xuan, Chenghao; Sham, Pak Chung; Wang, Junwen; Li, Mulin Jun
2018-05-16
Genome-wide association studies have generated over thousands of susceptibility loci for many human complex traits, and yet for most of these associations the true causal variants remain unknown. Tissue/cell type-specific prediction and prioritization of non-coding regulatory variants will facilitate the identification of causal variants and underlying pathogenic mechanisms for particular complex diseases and traits. By leveraging recent large-scale functional genomics/epigenomics data, we develop an intuitive web server, GWAS4D (http://mulinlab.tmu.edu.cn/gwas4d or http://mulinlab.org/gwas4d), that systematically evaluates GWAS signals and identifies context-specific regulatory variants. The updated web server includes six major features: (i) updates the regulatory variant prioritization method with our new algorithm; (ii) incorporates 127 tissue/cell type-specific epigenomes data; (iii) integrates motifs of 1480 transcriptional regulators from 13 public resources; (iv) uniformly processes Hi-C data and generates significant interactions at 5 kb resolution across 60 tissues/cell types; (v) adds comprehensive non-coding variant functional annotations; (vi) equips a highly interactive visualization function for SNP-target interaction. Using a GWAS fine-mapped set for 161 coronary artery disease risk loci, we demonstrate that GWAS4D is able to efficiently prioritize disease-causal regulatory variants.
SNP-array lesions in core binding factor acute myeloid leukemia
Duployez, Nicolas; Boudry-Labis, Elise; Roumier, Christophe; Boissel, Nicolas; Petit, Arnaud; Geffroy, Sandrine; Helevaut, Nathalie; Celli-Lebras, Karine; Terré, Christine; Fenneteau, Odile; Cuccuini, Wendy; Luquet, Isabelle; Lapillonne, Hélène; Lacombe, Catherine; Cornillet, Pascale; Ifrah, Norbert; Dombret, Hervé; Leverger, Guy; Jourdan, Eric; Preudhomme, Claude
2018-01-01
Acute myeloid leukemia (AML) with t(8;21) and inv(16), together referred as core binding factor (CBF)-AML, are recognized as unique entities. Both rearrangements share a common pathophysiology, the disruption of the CBF, and a relatively good prognosis. Experiments have demonstrated that CBF rearrangements were insufficient to induce leukemia, implying the existence of cooperating events. To explore these aberrations, we performed single nucleotide polymorphism (SNP)-array in a well-annotated cohort of 198 patients with CBF-AML. Excluding breakpoint-associated lesions, the most frequent events included loss of a sex chromosome (53%), deletions at 9q21 (12%) and 7q36 (9%) in patients with t(8;21) compared with trisomy 22 (13%), trisomy 8 (10%) and 7q36 deletions (12%) in patients with inv(16). SNP-array revealed novel recurrent genetic alterations likely to be involved in CBF-AML leukemogenesis. ZBTB7A mutations (20% of t(8;21)-AML) were shown to be a target of copy-neutral losses of heterozygosity (CN-LOH) at chromosome 19p. FOXP1 focal deletions were identified in 5% of inv(16)-AML while sequence analysis revealed that 2% carried FOXP1 truncating mutations. Finally, CCDC26 disruption was found in both subtypes (4.5% of the whole cohort) and possibly highlighted a new lesion associated with aberrant tyrosine kinase signaling in this particular subtype of leukemia. PMID:29464086
Mahato, Ajay Kumar; Sharma, Nimisha; Singh, Akshay; Srivastav, Manish; Jaiprakash; Singh, Sanjay Kumar; Singh, Anand Kumar; Sharma, Tilak Raj; Singh, Nagendra Kumar
2016-01-01
Mango (Mangifera indica L.) is called "king of fruits" due to its sweetness, richness of taste, diversity, large production volume and a variety of end usage. Despite its huge economic importance genomic resources in mango are scarce and genetics of useful horticultural traits are poorly understood. Here we generated deep coverage leaf RNA sequence data for mango parental varieties 'Neelam', 'Dashehari' and their hybrid 'Amrapali' using next generation sequencing technologies. De-novo sequence assembly generated 27,528, 20,771 and 35,182 transcripts for the three genotypes, respectively. The transcripts were further assembled into a non-redundant set of 70,057 unigenes that were used for SSR and SNP identification and annotation. Total 5,465 SSR loci were identified in 4,912 unigenes with 288 type I SSR (n ≥ 20 bp). One hundred type I SSR markers were randomly selected of which 43 yielded PCR amplicons of expected size in the first round of validation and were designated as validated genic-SSR markers. Further, 22,306 SNPs were identified by aligning high quality sequence reads of the three mango varieties to the reference unigene set, revealing significantly enhanced SNP heterozygosity in the hybrid Amrapali. The present study on leaf RNA sequencing of mango varieties and their hybrid provides useful genomic resource for genetic improvement of mango.
Mahato, Ajay Kumar; Sharma, Nimisha; Singh, Akshay; Srivastav, Manish; Jaiprakash; Singh, Sanjay Kumar; Singh, Anand Kumar; Sharma, Tilak Raj; Singh, Nagendra Kumar
2016-01-01
Mango (Mangifera indica L.) is called “king of fruits” due to its sweetness, richness of taste, diversity, large production volume and a variety of end usage. Despite its huge economic importance genomic resources in mango are scarce and genetics of useful horticultural traits are poorly understood. Here we generated deep coverage leaf RNA sequence data for mango parental varieties ‘Neelam’, ‘Dashehari’ and their hybrid ‘Amrapali’ using next generation sequencing technologies. De-novo sequence assembly generated 27,528, 20,771 and 35,182 transcripts for the three genotypes, respectively. The transcripts were further assembled into a non-redundant set of 70,057 unigenes that were used for SSR and SNP identification and annotation. Total 5,465 SSR loci were identified in 4,912 unigenes with 288 type I SSR (n ≥ 20 bp). One hundred type I SSR markers were randomly selected of which 43 yielded PCR amplicons of expected size in the first round of validation and were designated as validated genic-SSR markers. Further, 22,306 SNPs were identified by aligning high quality sequence reads of the three mango varieties to the reference unigene set, revealing significantly enhanced SNP heterozygosity in the hybrid Amrapali. The present study on leaf RNA sequencing of mango varieties and their hybrid provides useful genomic resource for genetic improvement of mango. PMID:27736892
SNP-array lesions in core binding factor acute myeloid leukemia.
Duployez, Nicolas; Boudry-Labis, Elise; Roumier, Christophe; Boissel, Nicolas; Petit, Arnaud; Geffroy, Sandrine; Helevaut, Nathalie; Celli-Lebras, Karine; Terré, Christine; Fenneteau, Odile; Cuccuini, Wendy; Luquet, Isabelle; Lapillonne, Hélène; Lacombe, Catherine; Cornillet, Pascale; Ifrah, Norbert; Dombret, Hervé; Leverger, Guy; Jourdan, Eric; Preudhomme, Claude
2018-01-19
Acute myeloid leukemia (AML) with t(8;21) and inv(16), together referred as core binding factor (CBF)-AML, are recognized as unique entities. Both rearrangements share a common pathophysiology, the disruption of the CBF, and a relatively good prognosis. Experiments have demonstrated that CBF rearrangements were insufficient to induce leukemia, implying the existence of cooperating events. To explore these aberrations, we performed single nucleotide polymorphism (SNP)-array in a well-annotated cohort of 198 patients with CBF-AML. Excluding breakpoint-associated lesions, the most frequent events included loss of a sex chromosome (53%), deletions at 9q21 (12%) and 7q36 (9%) in patients with t(8;21) compared with trisomy 22 (13%), trisomy 8 (10%) and 7q36 deletions (12%) in patients with inv(16). SNP-array revealed novel recurrent genetic alterations likely to be involved in CBF-AML leukemogenesis. ZBTB7A mutations (20% of t(8;21)-AML) were shown to be a target of copy-neutral losses of heterozygosity (CN-LOH) at chromosome 19p. FOXP1 focal deletions were identified in 5% of inv(16)-AML while sequence analysis revealed that 2% carried FOXP1 truncating mutations. Finally, CCDC26 disruption was found in both subtypes (4.5% of the whole cohort) and possibly highlighted a new lesion associated with aberrant tyrosine kinase signaling in this particular subtype of leukemia.
Cross-organism learning method to discover new gene functionalities.
Domeniconi, Giacomo; Masseroli, Marco; Moro, Gianluca; Pinoli, Pietro
2016-04-01
Knowledge of gene and protein functions is paramount for the understanding of physiological and pathological biological processes, as well as in the development of new drugs and therapies. Analyses for biomedical knowledge discovery greatly benefit from the availability of gene and protein functional feature descriptions expressed through controlled terminologies and ontologies, i.e., of gene and protein biomedical controlled annotations. In the last years, several databases of such annotations have become available; yet, these valuable annotations are incomplete, include errors and only some of them represent highly reliable human curated information. Computational techniques able to reliably predict new gene or protein annotations with an associated likelihood value are thus paramount. Here, we propose a novel cross-organisms learning approach to reliably predict new functionalities for the genes of an organism based on the known controlled annotations of the genes of another, evolutionarily related and better studied, organism. We leverage a new representation of the annotation discovery problem and a random perturbation of the available controlled annotations to allow the application of supervised algorithms to predict with good accuracy unknown gene annotations. Taking advantage of the numerous gene annotations available for a well-studied organism, our cross-organisms learning method creates and trains better prediction models, which can then be applied to predict new gene annotations of a target organism. We tested and compared our method with the equivalent single organism approach on different gene annotation datasets of five evolutionarily related organisms (Homo sapiens, Mus musculus, Bos taurus, Gallus gallus and Dictyostelium discoideum). Results show both the usefulness of the perturbation method of available annotations for better prediction model training and a great improvement of the cross-organism models with respect to the single-organism ones, without influence of the evolutionary distance between the considered organisms. The generated ranked lists of reliably predicted annotations, which describe novel gene functionalities and have an associated likelihood value, are very valuable both to complement available annotations, for better coverage in biomedical knowledge discovery analyses, and to quicken the annotation curation process, by focusing it on the prioritized novel annotations predicted. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
Evaluating Functional Annotations of Enzymes Using the Gene Ontology.
Holliday, Gemma L; Davidson, Rebecca; Akiva, Eyal; Babbitt, Patricia C
2017-01-01
The Gene Ontology (GO) (Ashburner et al., Nat Genet 25(1):25-29, 2000) is a powerful tool in the informatics arsenal of methods for evaluating annotations in a protein dataset. From identifying the nearest well annotated homologue of a protein of interest to predicting where misannotation has occurred to knowing how confident you can be in the annotations assigned to those proteins is critical. In this chapter we explore what makes an enzyme unique and how we can use GO to infer aspects of protein function based on sequence similarity. These can range from identification of misannotation or other errors in a predicted function to accurate function prediction for an enzyme of entirely unknown function. Although GO annotation applies to any gene products, we focus here a describing our approach for hierarchical classification of enzymes in the Structure-Function Linkage Database (SFLD) (Akiva et al., Nucleic Acids Res 42(Database issue):D521-530, 2014) as a guide for informed utilisation of annotation transfer based on GO terms.
Li, Guoxi; Zhao, Yinli; Liu, Zhonghu; Gao, Chunsheng; Yan, Fengbin; Liu, Bianzhi; Feng, Jianxin
2015-06-01
Common carp (Cyprinus carpio) is one of the most important aquacultured species of the family Cyprinidae, and breeding this species for disease resistance is becoming more and more important. However, at the genome or transcriptome levels, study of the immunogenetics of disease resistance in the common carp is lacking. In this study, 60,316,906 and 75,200,328 paired-end clean reads were obtained from two cDNA libraries of the common carp spleen by Illumina paired-end sequencing technology. Totally, 130,293 unique transcript fragments (unigenes) were assembled, with an average length of 1400.57 bp. Approximately 105,612 (81.06%) unigenes could be annotated according to their homology with matches in the Nr, Nt, Swiss-Prot, COG, GO, or KEGG databases, and they were found to represent 46,747 non-redundant genes. Comparative analysis showed that 59.82% of the unigenes have significant similarity to zebrafish Refseq proteins. Gene expression comparison revealed that 10,432 and 6889 annotated unigenes were, respectively, up- and down-regulated with at least twofold changes between two developmental stages of the common carp spleen. Gene ontology and KEGG analysis were performed to classify all unigenes into functional categories for understanding gene functions and regulation pathways. In addition, 46,847 simple sequence repeats (SSRs) were detected from 35,618 unigenes, and a large number of single nucleotide polymorphism (SNP) and insertion/deletion (INDEL) sites were identified in the spleen transcriptome of common carp. This study has characterized the spleen transcriptome of the common carp for the first time, providing a valuable resource for a better understanding of the common carp immune system and defense mechanisms. This knowledge will also facilitate future functional studies on common carp immunogenetics that may eventually be applied in breeding programs. Copyright © 2015 Elsevier Ltd. All rights reserved.
A domain-centric solution to functional genomics via dcGO Predictor
2013-01-01
Background Computational/manual annotations of protein functions are one of the first routes to making sense of a newly sequenced genome. Protein domain predictions form an essential part of this annotation process. This is due to the natural modularity of proteins with domains as structural, evolutionary and functional units. Sometimes two, three, or more adjacent domains (called supra-domains) are the operational unit responsible for a function, e.g. via a binding site at the interface. These supra-domains have contributed to functional diversification in higher organisms. Traditionally functional ontologies have been applied to individual proteins, rather than families of related domains and supra-domains. We expect, however, to some extent functional signals can be carried by protein domains and supra-domains, and consequently used in function prediction and functional genomics. Results Here we present a domain-centric Gene Ontology (dcGO) perspective. We generalize a framework for automatically inferring ontological terms associated with domains and supra-domains from full-length sequence annotations. This general framework has been applied specifically to primary protein-level annotations from UniProtKB-GOA, generating GO term associations with SCOP domains and supra-domains. The resulting 'dcGO Predictor', can be used to provide functional annotation to protein sequences. The functional annotation of sequences in the Critical Assessment of Function Annotation (CAFA) has been used as a valuable opportunity to validate our method and to be assessed by the community. The functional annotation of all completely sequenced genomes has demonstrated the potential for domain-centric GO enrichment analysis to yield functional insights into newly sequenced or yet-to-be-annotated genomes. This generalized framework we have presented has also been applied to other domain classifications such as InterPro and Pfam, and other ontologies such as mammalian phenotype and disease ontology. The dcGO and its predictor are available at http://supfam.org/SUPERFAMILY/dcGO including an enrichment analysis tool. Conclusions As functional units, domains offer a unique perspective on function prediction regardless of whether proteins are multi-domain or single-domain. The 'dcGO Predictor' holds great promise for contributing to a domain-centric functional understanding of genomes in the next generation sequencing era. PMID:23514627
PANNZER2: a rapid functional annotation web server.
Törönen, Petri; Medlar, Alan; Holm, Liisa
2018-05-08
The unprecedented growth of high-throughput sequencing has led to an ever-widening annotation gap in protein databases. While computational prediction methods are available to make up the shortfall, a majority of public web servers are hindered by practical limitations and poor performance. Here, we introduce PANNZER2 (Protein ANNotation with Z-scoRE), a fast functional annotation web server that provides both Gene Ontology (GO) annotations and free text description predictions. PANNZER2 uses SANSparallel to perform high-performance homology searches, making bulk annotation based on sequence similarity practical. PANNZER2 can output GO annotations from multiple scoring functions, enabling users to see which predictions are robust across predictors. Finally, PANNZER2 predictions scored within the top 10 methods for molecular function and biological process in the CAFA2 NK-full benchmark. The PANNZER2 web server is updated on a monthly schedule and is accessible at http://ekhidna2.biocenter.helsinki.fi/sanspanz/. The source code is available under the GNU Public Licence v3.
Schnoes, Alexandra M.; Ream, David C.; Thorman, Alexander W.; Babbitt, Patricia C.; Friedberg, Iddo
2013-01-01
The ongoing functional annotation of proteins relies upon the work of curators to capture experimental findings from scientific literature and apply them to protein sequence and structure data. However, with the increasing use of high-throughput experimental assays, a small number of experimental studies dominate the functional protein annotations collected in databases. Here, we investigate just how prevalent is the “few articles - many proteins” phenomenon. We examine the experimentally validated annotation of proteins provided by several groups in the GO Consortium, and show that the distribution of proteins per published study is exponential, with 0.14% of articles providing the source of annotations for 25% of the proteins in the UniProt-GOA compilation. Since each of the dominant articles describes the use of an assay that can find only one function or a small group of functions, this leads to substantial biases in what we know about the function of many proteins. Mass-spectrometry, microscopy and RNAi experiments dominate high throughput experiments. Consequently, the functional information derived from these experiments is mostly of the subcellular location of proteins, and of the participation of proteins in embryonic developmental pathways. For some organisms, the information provided by different studies overlap by a large amount. We also show that the information provided by high throughput experiments is less specific than those provided by low throughput experiments. Given the experimental techniques available, certain biases in protein function annotation due to high-throughput experiments are unavoidable. Knowing that these biases exist and understanding their characteristics and extent is important for database curators, developers of function annotation programs, and anyone who uses protein function annotation data to plan experiments. PMID:23737737
Explaining the disease phenotype of intergenic SNP through predicted long range regulation.
Chen, Jingqi; Tian, Weidong
2016-10-14
Thousands of disease-associated SNPs (daSNPs) are located in intergenic regions (IGR), making it difficult to understand their association with disease phenotypes. Recent analysis found that non-coding daSNPs were frequently located in or approximate to regulatory elements, inspiring us to try to explain the disease phenotypes of IGR daSNPs through nearby regulatory sequences. Hence, after locating the nearest distal regulatory element (DRE) to a given IGR daSNP, we applied a computational method named INTREPID to predict the target genes regulated by the DRE, and then investigated their functional relevance to the IGR daSNP's disease phenotypes. 36.8% of all IGR daSNP-disease phenotype associations investigated were possibly explainable through the predicted target genes, which were enriched with, were functionally relevant to, or consisted of the corresponding disease genes. This proportion could be further increased to 60.5% if the LD SNPs of daSNPs were also considered. Furthermore, the predicted SNP-target gene pairs were enriched with known eQTL/mQTL SNP-gene relationships. Overall, it's likely that IGR daSNPs may contribute to disease phenotypes by interfering with the regulatory function of their nearby DREs and causing abnormal expression of disease genes. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Cappola, Thomas P; Matkovich, Scot J; Wang, Wei; van Booven, Derek; Li, Mingyao; Wang, Xuexia; Qu, Liming; Sweitzer, Nancy K; Fang, James C; Reilly, Muredach P; Hakonarson, Hakon; Nerbonne, Jeanne M; Dorn, Gerald W
2011-02-08
Common heart failure has a strong undefined heritable component. Two recent independent cardiovascular SNP array studies identified a common SNP at 1p36 in intron 2 of the HSPB7 gene as being associated with heart failure. HSPB7 resequencing identified other risk alleles but no functional gene variants. Here, we further show no effect of the HSPB7 SNP on cardiac HSPB7 mRNA levels or splicing, suggesting that the SNP marks the position of a functional variant in another gene. Accordingly, we used massively parallel platforms to resequence all coding exons of the adjacent CLCNKA gene, which encodes the K(a) renal chloride channel (ClC-K(a)). Of 51 exonic CLCNKA variants identified, one SNP (rs10927887, encoding Arg83Gly) was common, in linkage disequilibrium with the heart failure risk SNP in HSPB7, and associated with heart failure in two independent Caucasian referral populations (n = 2,606 and 1,168; combined P = 2.25 × 10(-6)). Individual genotyping of rs10927887 in the two study populations and a third independent heart failure cohort (combined n = 5,489) revealed an additive allele effect on heart failure risk that is independent of age, sex, and prior hypertension (odds ratio = 1.27 per allele copy; P = 8.3 × 10(-7)). Functional characterization of recombinant wild-type Arg83 and variant Gly83 ClC-K(a) chloride channel currents revealed ≈ 50% loss-of-function of the variant channel. These findings identify a common, functionally significant genetic risk factor for Caucasian heart failure. The variant CLCNKA risk allele, telegraphed by linked variants in the adjacent HSPB7 gene, uncovers a previously overlooked genetic mechanism affecting the cardio-renal axis.
Enhanced functionalities for annotating and indexing clinical text with the NCBO Annotator.
Tchechmedjiev, Andon; Abdaoui, Amine; Emonet, Vincent; Melzi, Soumia; Jonnagaddala, Jitendra; Jonquet, Clement
2018-06-01
Second use of clinical data commonly involves annotating biomedical text with terminologies and ontologies. The National Center for Biomedical Ontology Annotator is a frequently used annotation service, originally designed for biomedical data, but not very suitable for clinical text annotation. In order to add new functionalities to the NCBO Annotator without hosting or modifying the original Web service, we have designed a proxy architecture that enables seamless extensions by pre-processing of the input text and parameters, and post processing of the annotations. We have then implemented enhanced functionalities for annotating and indexing free text such as: scoring, detection of context (negation, experiencer, temporality), new output formats and coarse-grained concept recognition (with UMLS Semantic Groups). In this paper, we present the NCBO Annotator+, a Web service which incorporates these new functionalities as well as a small set of evaluation results for concept recognition and clinical context detection on two standard evaluation tasks (Clef eHealth 2017, SemEval 2014). The Annotator+ has been successfully integrated into the SIFR BioPortal platform-an implementation of NCBO BioPortal for French biomedical terminologies and ontologies-to annotate English text. A Web user interface is available for testing and ontology selection (http://bioportal.lirmm.fr/ncbo_annotatorplus); however the Annotator+ is meant to be used through the Web service application programming interface (http://services.bioportal.lirmm.fr/ncbo_annotatorplus). The code is openly available, and we also provide a Docker packaging to enable easy local deployment to process sensitive (e.g. clinical) data in-house (https://github.com/sifrproject). andon.tchechmedjiev@lirmm.fr. Supplementary data are available at Bioinformatics online.
dbWFA: a web-based database for functional annotation of Triticum aestivum transcripts
Vincent, Jonathan; Dai, Zhanwu; Ravel, Catherine; Choulet, Frédéric; Mouzeyar, Said; Bouzidi, M. Fouad; Agier, Marie; Martre, Pierre
2013-01-01
The functional annotation of genes based on sequence homology with genes from model species genomes is time-consuming because it is necessary to mine several unrelated databases. The aim of the present work was to develop a functional annotation database for common wheat Triticum aestivum (L.). The database, named dbWFA, is based on the reference NCBI UniGene set, an expressed gene catalogue built by expressed sequence tag clustering, and on full-length coding sequences retrieved from the TriFLDB database. Information from good-quality heterogeneous sources, including annotations for model plant species Arabidopsis thaliana (L.) Heynh. and Oryza sativa L., was gathered and linked to T. aestivum sequences through BLAST-based homology searches. Even though the complexity of the transcriptome cannot yet be fully appreciated, we developed a tool to easily and promptly obtain information from multiple functional annotation systems (Gene Ontology, MapMan bin codes, MIPS Functional Categories, PlantCyc pathway reactions and TAIR gene families). The use of dbWFA is illustrated here with several query examples. We were able to assign a putative function to 45% of the UniGenes and 81% of the full-length coding sequences from TriFLDB. Moreover, comparison of the annotation of the whole T. aestivum UniGene set along with curated annotations of the two model species assessed the accuracy of the annotation provided by dbWFA. To further illustrate the use of dbWFA, genes specifically expressed during the early cell division or late storage polymer accumulation phases of T. aestivum grain development were identified using a clustering analysis and then annotated using dbWFA. The annotation of these two sets of genes was consistent with previous analyses of T. aestivum grain transcriptomes and proteomes. Database URL: urgi.versailles.inra.fr/dbWFA/ PMID:23660284
He, Zihuai; Xu, Bin; Lee, Seunggeun; Ionita-Laza, Iuliana
2017-09-07
Substantial progress has been made in the functional annotation of genetic variation in the human genome. Integrative analysis that incorporates such functional annotations into sequencing studies can aid the discovery of disease-associated genetic variants, especially those with unknown function and located outside protein-coding regions. Direct incorporation of one functional annotation as weight in existing dispersion and burden tests can suffer substantial loss of power when the functional annotation is not predictive of the risk status of a variant. Here, we have developed unified tests that can utilize multiple functional annotations simultaneously for integrative association analysis with efficient computational techniques. We show that the proposed tests significantly improve power when variant risk status can be predicted by functional annotations. Importantly, when functional annotations are not predictive of risk status, the proposed tests incur only minimal loss of power in relation to existing dispersion and burden tests, and under certain circumstances they can even have improved power by learning a weight that better approximates the underlying disease model in a data-adaptive manner. The tests can be constructed with summary statistics of existing dispersion and burden tests for sequencing data, therefore allowing meta-analysis of multiple studies without sharing individual-level data. We applied the proposed tests to a meta-analysis of noncoding rare variants in Metabochip data on 12,281 individuals from eight studies for lipid traits. By incorporating the Eigen functional score, we detected significant associations between noncoding rare variants in SLC22A3 and low-density lipoprotein and total cholesterol, associations that are missed by standard dispersion and burden tests. Copyright © 2017 American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.
Díaz-Villaseñor, Andrea; Cruz, Laura; Cebrián, Arturo; Hernández-Ramírez, Raúl U.; Hiriart, Marcia; García-Vargas, Gonzálo; Bassol, Susana; Sordo, Monserrat; Gandolfi, A. Jay; Klimecki, Walter T.; López-Carillo, Lizbeth; Cebrián, Mariano E.; Ostrosky-Wegman, Patricia
2013-01-01
The incidence of type 2 diabetes mellitus (T2DM) is increasing worldwide and diverse environmental and genetic risk factors are well recognized. Single nucleotide polymorphisms (SNPs) in the calpain-10 gene (CAPN-10), which encodes a protein involved in the secretion and action of insulin, and chronic exposure to inorganic arsenic (iAs) through drinking water have been independently associated with an increase in the risk for T2DM. In the present work we evaluated if CAPN-10 SNPs and iAs exposure jointly contribute to the outcome of T2DM. Insulin secretion (beta-cell function) and insulin sensitivity were evaluated indirectly through validated indexes (HOMA2) in subjects with and without T2DM who have been exposed to a gradient of iAs in their drinking water in northern Mexico. The results were analyzed taking into account the presence of the risk factor SNPs SNP-43 and -44 in CAPN-10. Subjects with T2DM had significantly lower beta-cell function and insulin sensitivity. An inverse association was found between beta-cell function and iAs exposure, the association being more pronounced in subjects with T2DM. Subjects without T2DM who were carriers of the at-risk genotype SNP-43 or -44, also had significantly lower beta-cell function. The association of SNP-43 with beta-cell function was dependent on iAs exposure, age, gender and BMI, whereas the association with SNP-44 was independent of all of these factors. Chronic exposure to iAs seems to be a risk factor for T2DM in humans through the reduction of beta-cell function, with an enhanced effect seen in the presence of the at-risk genotype of SNP-43 in CAPN-10. Carriers of CAPN-10 SNP-44 have also shown reduced beta-cell function. PMID:23349674
Bromberg, Yana; Yachdav, Guy; Ofran, Yanay; Schneider, Reinhard; Rost, Burkhard
2009-05-01
The rapidly increasing quantity of protein sequence data continues to widen the gap between available sequences and annotations. Comparative modeling suggests some aspects of the 3D structures of approximately half of all known proteins; homology- and network-based inferences annotate some aspect of function for a similar fraction of the proteome. For most known protein sequences, however, there is detailed knowledge about neither their function nor their structure. Comprehensive efforts towards the expert curation of sequence annotations have failed to meet the demand of the rapidly increasing number of available sequences. Only the automated prediction of protein function in the absence of homology can close the gap between available sequences and annotations in the foreseeable future. This review focuses on two novel methods for automated annotation, and briefly presents an outlook on how modern web software may revolutionize the field of protein sequence annotation. First, predictions of protein binding sites and functional hotspots, and the evolution of these into the most successful type of prediction of protein function from sequence will be discussed. Second, a new tool, comprehensive in silico mutagenesis, which contributes important novel predictions of function and at the same time prepares for the onset of the next sequencing revolution, will be described. While these two new sub-fields of protein prediction represent the breakthroughs that have been achieved methodologically, it will then be argued that a different development might further change the way biomedical researchers benefit from annotations: modern web software can connect the worldwide web in any browser with the 'Deep Web' (ie, proprietary data resources). The availability of this direct connection, and the resulting access to a wealth of data, may impact drug discovery and development more than any existing method that contributes to protein annotation.
NoGOA: predicting noisy GO annotations using evidences and sparse representation.
Yu, Guoxian; Lu, Chang; Wang, Jun
2017-07-21
Gene Ontology (GO) is a community effort to represent functional features of gene products. GO annotations (GOA) provide functional associations between GO terms and gene products. Due to resources limitation, only a small portion of annotations are manually checked by curators, and the others are electronically inferred. Although quality control techniques have been applied to ensure the quality of annotations, the community consistently report that there are still considerable noisy (or incorrect) annotations. Given the wide application of annotations, however, how to identify noisy annotations is an important but yet seldom studied open problem. We introduce a novel approach called NoGOA to predict noisy annotations. NoGOA applies sparse representation on the gene-term association matrix to reduce the impact of noisy annotations, and takes advantage of sparse representation coefficients to measure the semantic similarity between genes. Secondly, it preliminarily predicts noisy annotations of a gene based on aggregated votes from semantic neighborhood genes of that gene. Next, NoGOA estimates the ratio of noisy annotations for each evidence code based on direct annotations in GOA files archived on different periods, and then weights entries of the association matrix via estimated ratios and propagates weights to ancestors of direct annotations using GO hierarchy. Finally, it integrates evidence-weighted association matrix and aggregated votes to predict noisy annotations. Experiments on archived GOA files of six model species (H. sapiens, A. thaliana, S. cerevisiae, G. gallus, B. Taurus and M. musculus) demonstrate that NoGOA achieves significantly better results than other related methods and removing noisy annotations improves the performance of gene function prediction. The comparative study justifies the effectiveness of integrating evidence codes with sparse representation for predicting noisy GO annotations. Codes and datasets are available at http://mlda.swu.edu.cn/codes.php?name=NoGOA .
2013-01-01
Background Modern banana cultivars are primarily interspecific triploid hybrids of two species, Musa acuminata and Musa balbisiana, which respectively contribute the A- and B-genomes. The M. balbisiana genome has been associated with improved vigour and tolerance to biotic and abiotic stresses and is thus a target for Musa breeding programs. However, while a reference M. acuminata genome has recently been released (Nature 488:213–217, 2012), little sequence data is available for the corresponding B-genome. To address these problems we carried out Next Generation gDNA sequencing of the wild diploid M. balbisiana variety ‘Pisang Klutuk Wulung’ (PKW). Our strategy was to align PKW gDNA reads against the published A-genome and to extract the mapped consensus sequences for subsequent rounds of evaluation and gene annotation. Results The resulting B-genome is 79% the size of the A-genome, and contains 36,638 predicted functional gene sequences which is nearly identical to the 36,542 of the A-genome. There is substantial sequence divergence from the A-genome at a frequency of 1 homozygous SNP per 23.1 bp, and a high degree of heterozygosity corresponding to one heterozygous SNP per 55.9 bp. Using expressed small RNA data, a similar number of microRNA sequences were predicted in both A- and B-genomes, but additional novel miRNAs were detected, including some that are unique to each genome. The usefulness of this B-genome sequence was evaluated by mapping RNA-seq data from a set of triploid AAA and AAB hybrids simultaneously to both genomes. Results for the plantains demonstrated the expected 2:1 distribution of reads across the A- and B-genomes, but for the AAA genomes, results show they contain regions of significant homology to the B-genome supporting proposals that there has been a history of interspecific recombination between homeologous A and B chromosomes in Musa hybrids. Conclusions We have generated and annotated a draft reference Musa B-genome and demonstrate that this can be used for molecular genetic mapping of gene transcripts and small RNA expression data from several allopolyploid banana cultivars. This draft therefore represents a valuable resource to support the study of metabolism in inter- and intraspecific triploid Musa hybrids and to help direct breeding programs. PMID:24094114
Du, Yushen; Wu, Nicholas C.; Jiang, Lin; Zhang, Tianhao; Gong, Danyang; Shu, Sara; Wu, Ting-Ting
2016-01-01
ABSTRACT Identification and annotation of functional residues are fundamental questions in protein sequence analysis. Sequence and structure conservation provides valuable information to tackle these questions. It is, however, limited by the incomplete sampling of sequence space in natural evolution. Moreover, proteins often have multiple functions, with overlapping sequences that present challenges to accurate annotation of the exact functions of individual residues by conservation-based methods. Using the influenza A virus PB1 protein as an example, we developed a method to systematically identify and annotate functional residues. We used saturation mutagenesis and high-throughput sequencing to measure the replication capacity of single nucleotide mutations across the entire PB1 protein. After predicting protein stability upon mutations, we identified functional PB1 residues that are essential for viral replication. To further annotate the functional residues important to the canonical or noncanonical functions of viral RNA-dependent RNA polymerase (vRdRp), we performed a homologous-structure analysis with 16 different vRdRp structures. We achieved high sensitivity in annotating the known canonical polymerase functional residues. Moreover, we identified a cluster of noncanonical functional residues located in the loop region of the PB1 β-ribbon. We further demonstrated that these residues were important for PB1 protein nuclear import through the interaction with Ran-binding protein 5. In summary, we developed a systematic and sensitive method to identify and annotate functional residues that are not restrained by sequence conservation. Importantly, this method is generally applicable to other proteins about which homologous-structure information is available. PMID:27803181
A novel approach to analyzing fMRI and SNP data via parallel independent component analysis
NASA Astrophysics Data System (ADS)
Liu, Jingyu; Pearlson, Godfrey; Calhoun, Vince; Windemuth, Andreas
2007-03-01
There is current interest in understanding genetic influences on brain function in both the healthy and the disordered brain. Parallel independent component analysis, a new method for analyzing multimodal data, is proposed in this paper and applied to functional magnetic resonance imaging (fMRI) and a single nucleotide polymorphism (SNP) array. The method aims to identify the independent components of each modality and the relationship between the two modalities. We analyzed 92 participants, including 29 schizophrenia (SZ) patients, 13 unaffected SZ relatives, and 50 healthy controls. We found a correlation of 0.79 between one fMRI component and one SNP component. The fMRI component consists of activations in cingulate gyrus, multiple frontal gyri, and superior temporal gyrus. The related SNP component is contributed to significantly by 9 SNPs located in sets of genes, including those coding for apolipoprotein A-I, and C-III, malate dehydrogenase 1 and the gamma-aminobutyric acid alpha-2 receptor. A significant difference in the presences of this SNP component is found between the SZ group (SZ patients and their relatives) and the control group. In summary, we constructed a framework to identify the interactions between brain functional and genetic information; our findings provide new insight into understanding genetic influences on brain function in a common mental disorder.
Sallman, David A.; Basiorka, Ashley A.; Irvine, Brittany A.; Zhang, Ling; Epling-Burnette, P.K.; Rollison, Dana E.; Mallo, Mar; Sokol, Lubomir; Solé, Francesc; Maciejewski, Jaroslaw; List, Alan F.
2015-01-01
P53 is a key regulator of many cellular processes and is negatively regulated by the human homolog of murine double minute-2 (MDM2) E3 ubiquitin ligase. Single nucleotide polymorphisms (SNPs) of either gene alone, and in combination, are linked to cancer susceptibility, disease progression, and therapy response. We analyzed the interaction of TP53 R72P and MDM2 SNP309 SNPs in relationship to outcome in patients with myelodysplastic syndromes (MDS). Sanger sequencing was performed on DNA isolated from 208 MDS cases. Utilizing a novel functional SNP scoring system ranging from +2 to −2 based on predicted p53 activity, we found statistically significant differences in overall survival (OS) (p = 0.02) and progression-free survival (PFS) (p = 0.02) in non-del(5q) MDS patients with low functional scores. In univariate analysis, only IPSS and the functional SNP score predicted OS and PFS in non-del(5q) patients. In multivariate analysis, the functional SNP score was independent of IPSS for OS and PFS. These data underscore the importance of TP53 R72P and MDM2 SNP309 SNPs in MDS, and provide a novel scoring system independent of IPSS that is predictive for disease outcome. PMID:26416416
Year 2 Report: Protein Function Prediction Platform
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhou, C E
2012-04-27
Upon completion of our second year of development in a 3-year development cycle, we have completed a prototype protein structure-function annotation and function prediction system: Protein Function Prediction (PFP) platform (v.0.5). We have met our milestones for Years 1 and 2 and are positioned to continue development in completion of our original statement of work, or a reasonable modification thereof, in service to DTRA Programs involved in diagnostics and medical countermeasures research and development. The PFP platform is a multi-scale computational modeling system for protein structure-function annotation and function prediction. As of this writing, PFP is the only existing fullymore » automated, high-throughput, multi-scale modeling, whole-proteome annotation platform, and represents a significant advance in the field of genome annotation (Fig. 1). PFP modules perform protein functional annotations at the sequence, systems biology, protein structure, and atomistic levels of biological complexity (Fig. 2). Because these approaches provide orthogonal means of characterizing proteins and suggesting protein function, PFP processing maximizes the protein functional information that can currently be gained by computational means. Comprehensive annotation of pathogen genomes is essential for bio-defense applications in pathogen characterization, threat assessment, and medical countermeasure design and development in that it can short-cut the time and effort required to select and characterize protein biomarkers.« less
Measuring semantic similarities by combining gene ontology annotations and gene co-function networks
Peng, Jiajie; Uygun, Sahra; Kim, Taehyong; ...
2015-02-14
Background: Gene Ontology (GO) has been used widely to study functional relationships between genes. The current semantic similarity measures rely only on GO annotations and GO structure. This limits the power of GO-based similarity because of the limited proportion of genes that are annotated to GO in most organisms. Results: We introduce a novel approach called NETSIM (network-based similarity measure) that incorporates information from gene co-function networks in addition to using the GO structure and annotations. Using metabolic reaction maps of yeast, Arabidopsis, and human, we demonstrate that NETSIM can improve the accuracy of GO term similarities. We also demonstratemore » that NETSIM works well even for genomes with sparser gene annotation data. We applied NETSIM on large Arabidopsis gene families such as cytochrome P450 monooxygenases to group the members functionally and show that this grouping could facilitate functional characterization of genes in these families. Conclusions: Using NETSIM as an example, we demonstrated that the performance of a semantic similarity measure could be significantly improved after incorporating genome-specific information. NETSIM incorporates both GO annotations and gene co-function network data as a priori knowledge in the model. Therefore, functional similarities of GO terms that are not explicitly encoded in GO but are relevant in a taxon-specific manner become measurable when GO annotations are limited.« less
Functional Annotations of Paralogs: A Blessing and a Curse
Zallot, Rémi; Harrison, Katherine J.; Kolaczkowski, Bryan; de Crécy-Lagard, Valérie
2016-01-01
Gene duplication followed by mutation is a classic mechanism of neofunctionalization, producing gene families with functional diversity. In some cases, a single point mutation is sufficient to change the substrate specificity and/or the chemistry performed by an enzyme, making it difficult to accurately separate enzymes with identical functions from homologs with different functions. Because sequence similarity is often used as a basis for assigning functional annotations to genes, non-isofunctional gene families pose a great challenge for genome annotation pipelines. Here we describe how integrating evolutionary and functional information such as genome context, phylogeny, metabolic reconstruction and signature motifs may be required to correctly annotate multifunctional families. These integrative analyses can also lead to the discovery of novel gene functions, as hints from specific subgroups can guide the functional characterization of other members of the family. We demonstrate how careful manual curation processes using comparative genomics can disambiguate subgroups within large multifunctional families and discover their functions. We present the COG0720 protein family as a case study. We also discuss strategies to automate this process to improve the accuracy of genome functional annotation pipelines. PMID:27618105
GAVIN: Gene-Aware Variant INterpretation for medical sequencing.
van der Velde, K Joeri; de Boer, Eddy N; van Diemen, Cleo C; Sikkema-Raddatz, Birgit; Abbott, Kristin M; Knopperts, Alain; Franke, Lude; Sijmons, Rolf H; de Koning, Tom J; Wijmenga, Cisca; Sinke, Richard J; Swertz, Morris A
2017-01-16
We present Gene-Aware Variant INterpretation (GAVIN), a new method that accurately classifies variants for clinical diagnostic purposes. Classifications are based on gene-specific calibrations of allele frequencies from the ExAC database, likely variant impact using SnpEff, and estimated deleteriousness based on CADD scores for >3000 genes. In a benchmark on 18 clinical gene sets, we achieve a sensitivity of 91.4% and a specificity of 76.9%. This accuracy is unmatched by 12 other tools. We provide GAVIN as an online MOLGENIS service to annotate VCF files and as an open source executable for use in bioinformatic pipelines. It can be found at http://molgenis.org/gavin .
The variant call format and VCFtools.
Danecek, Petr; Auton, Adam; Abecasis, Goncalo; Albers, Cornelis A; Banks, Eric; DePristo, Mark A; Handsaker, Robert E; Lunter, Gerton; Marth, Gabor T; Sherry, Stephen T; McVean, Gilean; Durbin, Richard
2011-08-01
The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API. http://vcftools.sourceforge.net
Amar, David; Frades, Itziar; Danek, Agnieszka; Goldberg, Tatyana; Sharma, Sanjeev K; Hedley, Pete E; Proux-Wera, Estelle; Andreasson, Erik; Shamir, Ron; Tzfadia, Oren; Alexandersson, Erik
2014-12-05
For most organisms, even if their genome sequence is available, little functional information about individual genes or proteins exists. Several annotation pipelines have been developed for functional analysis based on sequence, 'omics', and literature data. However, researchers encounter little guidance on how well they perform. Here, we used the recently sequenced potato genome as a case study. The potato genome was selected since its genome is newly sequenced and it is a non-model plant even if there is relatively ample information on individual potato genes, and multiple gene expression profiles are available. We show that the automatic gene annotations of potato have low accuracy when compared to a "gold standard" based on experimentally validated potato genes. Furthermore, we evaluate six state-of-the-art annotation pipelines and show that their predictions are markedly dissimilar (Jaccard similarity coefficient of 0.27 between pipelines on average). To overcome this discrepancy, we introduce a simple GO structure-based algorithm that reconciles the predictions of the different pipelines. We show that the integrated annotation covers more genes, increases by over 50% the number of highly co-expressed GO processes, and obtains much higher agreement with the gold standard. We find that different annotation pipelines produce different results, and show how to integrate them into a unified annotation that is of higher quality than each single pipeline. We offer an improved functional annotation of both PGSC and ITAG potato gene models, as well as tools that can be applied to additional pipelines and improve annotation in other organisms. This will greatly aid future functional analysis of '-omics' datasets from potato and other organisms with newly sequenced genomes. The new potato annotations are available with this paper.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Leung, Elo; Huang, Amy; Cadag, Eithon
In this study, we introduce the Protein Sequence Annotation Tool (PSAT), a web-based, sequence annotation meta-server for performing integrated, high-throughput, genome-wide sequence analyses. Our goals in building PSAT were to (1) create an extensible platform for integration of multiple sequence-based bioinformatics tools, (2) enable functional annotations and enzyme predictions over large input protein fasta data sets, and (3) provide a web interface for convenient execution of the tools. In this paper, we demonstrate the utility of PSAT by annotating the predicted peptide gene products of Herbaspirillum sp. strain RV1423, importing the results of PSAT into EC2KEGG, and using the resultingmore » functional comparisons to identify a putative catabolic pathway, thereby distinguishing RV1423 from a well annotated Herbaspirillum species. This analysis demonstrates that high-throughput enzyme predictions, provided by PSAT processing, can be used to identify metabolic potential in an otherwise poorly annotated genome. Lastly, PSAT is a meta server that combines the results from several sequence-based annotation and function prediction codes, and is available at http://psat.llnl.gov/psat/. PSAT stands apart from other sequencebased genome annotation systems in providing a high-throughput platform for rapid de novo enzyme predictions and sequence annotations over large input protein sequence data sets in FASTA. PSAT is most appropriately applied in annotation of large protein FASTA sets that may or may not be associated with a single genome.« less
Leung, Elo; Huang, Amy; Cadag, Eithon; ...
2016-01-20
In this study, we introduce the Protein Sequence Annotation Tool (PSAT), a web-based, sequence annotation meta-server for performing integrated, high-throughput, genome-wide sequence analyses. Our goals in building PSAT were to (1) create an extensible platform for integration of multiple sequence-based bioinformatics tools, (2) enable functional annotations and enzyme predictions over large input protein fasta data sets, and (3) provide a web interface for convenient execution of the tools. In this paper, we demonstrate the utility of PSAT by annotating the predicted peptide gene products of Herbaspirillum sp. strain RV1423, importing the results of PSAT into EC2KEGG, and using the resultingmore » functional comparisons to identify a putative catabolic pathway, thereby distinguishing RV1423 from a well annotated Herbaspirillum species. This analysis demonstrates that high-throughput enzyme predictions, provided by PSAT processing, can be used to identify metabolic potential in an otherwise poorly annotated genome. Lastly, PSAT is a meta server that combines the results from several sequence-based annotation and function prediction codes, and is available at http://psat.llnl.gov/psat/. PSAT stands apart from other sequencebased genome annotation systems in providing a high-throughput platform for rapid de novo enzyme predictions and sequence annotations over large input protein sequence data sets in FASTA. PSAT is most appropriately applied in annotation of large protein FASTA sets that may or may not be associated with a single genome.« less
Stafuzza, Nedenia Bonvino; Zerlotini, Adhemar; Lobo, Francisco Pereira; Yamagishi, Michel Eduardo Beleza; Chud, Tatiane Cristina Seleguim; Caetano, Alexandre Rodrigues; Munari, Danísio Prado; Garrick, Dorian J; Machado, Marco Antonio; Martins, Marta Fonseca; Carvalho, Maria Raquel; Cole, John Bruce; Barbosa da Silva, Marcos Vinicius Gualberto
2017-01-01
Whole-genome re-sequencing, alignment and annotation analyses were undertaken for 12 sires representing four important cattle breeds in Brazil: Guzerat (multi-purpose), Gyr, Girolando and Holstein (dairy production). A total of approximately 4.3 billion reads from an Illumina HiSeq 2000 sequencer generated for each animal 10.7 to 16.4-fold genome coverage. A total of 27,441,279 single nucleotide variations (SNVs) and 3,828,041 insertions/deletions (InDels) were detected in the samples, of which 2,557,670 SNVs and 883,219 InDels were novel. The submission of these genetic variants to the dbSNP database significantly increased the number of known variants, particularly for the indicine genome. The concordance rate between genotypes obtained using the Bovine HD BeadChip array and the same variants identified by sequencing was about 99.05%. The annotation of variants identified numerous non-synonymous SNVs and frameshift InDels which could affect phenotypic variation. Functional enrichment analysis was performed and revealed that variants in the olfactory transduction pathway was over represented in all four cattle breeds, while the ECM-receptor interaction pathway was over represented in Girolando and Guzerat breeds, the ABC transporters pathway was over represented only in Holstein breed, and the metabolic pathways was over represented only in Gyr breed. The genetic variants discovered here provide a rich resource to help identify potential genomic markers and their associated molecular mechanisms that impact economically important traits for Gyr, Girolando, Guzerat and Holstein breeding programs.
Lobo, Francisco Pereira; Yamagishi, Michel Eduardo Beleza; Chud, Tatiane Cristina Seleguim; Caetano, Alexandre Rodrigues; Munari, Danísio Prado; Garrick, Dorian J.; Machado, Marco Antonio; Martins, Marta Fonseca; Carvalho, Maria Raquel; Cole, John Bruce; Barbosa da Silva, Marcos Vinicius Gualberto
2017-01-01
Whole-genome re-sequencing, alignment and annotation analyses were undertaken for 12 sires representing four important cattle breeds in Brazil: Guzerat (multi-purpose), Gyr, Girolando and Holstein (dairy production). A total of approximately 4.3 billion reads from an Illumina HiSeq 2000 sequencer generated for each animal 10.7 to 16.4-fold genome coverage. A total of 27,441,279 single nucleotide variations (SNVs) and 3,828,041 insertions/deletions (InDels) were detected in the samples, of which 2,557,670 SNVs and 883,219 InDels were novel. The submission of these genetic variants to the dbSNP database significantly increased the number of known variants, particularly for the indicine genome. The concordance rate between genotypes obtained using the Bovine HD BeadChip array and the same variants identified by sequencing was about 99.05%. The annotation of variants identified numerous non-synonymous SNVs and frameshift InDels which could affect phenotypic variation. Functional enrichment analysis was performed and revealed that variants in the olfactory transduction pathway was over represented in all four cattle breeds, while the ECM-receptor interaction pathway was over represented in Girolando and Guzerat breeds, the ABC transporters pathway was over represented only in Holstein breed, and the metabolic pathways was over represented only in Gyr breed. The genetic variants discovered here provide a rich resource to help identify potential genomic markers and their associated molecular mechanisms that impact economically important traits for Gyr, Girolando, Guzerat and Holstein breeding programs. PMID:28323836
AgBase: supporting functional modeling in agricultural organisms
McCarthy, Fiona M.; Gresham, Cathy R.; Buza, Teresia J.; Chouvarine, Philippe; Pillai, Lakshmi R.; Kumar, Ranjit; Ozkan, Seval; Wang, Hui; Manda, Prashanti; Arick, Tony; Bridges, Susan M.; Burgess, Shane C.
2011-01-01
AgBase (http://www.agbase.msstate.edu/) provides resources to facilitate modeling of functional genomics data and structural and functional annotation of agriculturally important animal, plant, microbe and parasite genomes. The website is redesigned to improve accessibility and ease of use, including improved search capabilities. Expanded capabilities include new dedicated pages for horse, cat, dog, cotton, rice and soybean. We currently provide 590 240 Gene Ontology (GO) annotations to 105 454 gene products in 64 different species, including GO annotations linked to transcripts represented on agricultural microarrays. For many of these arrays, this provides the only functional annotation available. GO annotations are available for download and we provide comprehensive, species-specific GO annotation files for 18 different organisms. The tools available at AgBase have been expanded and several existing tools improved based upon user feedback. One of seven new tools available at AgBase, GOModeler, supports hypothesis testing from functional genomics data. We host several associated databases and provide genome browsers for three agricultural pathogens. Moreover, we provide comprehensive training resources (including worked examples and tutorials) via links to Educational Resources at the AgBase website. PMID:21075795
APPRIS: annotation of principal and alternative splice isoforms
Rodriguez, Jose Manuel; Maietta, Paolo; Ezkurdia, Iakes; Pietrelli, Alessandro; Wesselink, Jan-Jaap; Lopez, Gonzalo; Valencia, Alfonso; Tress, Michael L.
2013-01-01
Here, we present APPRIS (http://appris.bioinfo.cnio.es), a database that houses annotations of human splice isoforms. APPRIS has been designed to provide value to manual annotations of the human genome by adding reliable protein structural and functional data and information from cross-species conservation. The visual representation of the annotations provided by APPRIS for each gene allows annotators and researchers alike to easily identify functional changes brought about by splicing events. In addition to collecting, integrating and analyzing reliable predictions of the effect of splicing events, APPRIS also selects a single reference sequence for each gene, here termed the principal isoform, based on the annotations of structure, function and conservation for each transcript. APPRIS identifies a principal isoform for 85% of the protein-coding genes in the GENCODE 7 release for ENSEMBL. Analysis of the APPRIS data shows that at least 70% of the alternative (non-principal) variants would lose important functional or structural information relative to the principal isoform. PMID:23161672
Cowper-Sal·lari, Richard; Zhang, Xiaoyang; Wright, Jason B.; Bailey, Swneke D.; Cole, Michael D.; Eeckhoute, Jerome; Moore, Jason H.; Lupien, Mathieu
2012-01-01
Genome-wide association studies (GWASs) have identified thousands of single nucleotide polymorphisms (SNPs) associated with human traits and diseases. But because the vast majority of these SNPs are located in the noncoding regions of the genome their risk promoting mechanisms are elusive. Employing a new methodology combining cistromics, epigenomics and genotype imputation we annotate the noncoding regions of the genome in breast cancer cells and systematically identify the functional nature of SNPs associated with breast cancer risk. Our results demonstrate that breast cancer risk-associated SNPs are enriched in the cistromes of FOXA1 and ESR1 and the epigenome of H3K4me1 in a cancer and cell-type-specific manner. Furthermore, the majority of these risk-associated SNPs modulate the affinity of chromatin for FOXA1 at distal regulatory elements, which results in allele-specific gene expression, exemplified by the effect of the rs4784227 SNP on the TOX3 gene found within the 16q12.1 risk locus. PMID:23001124
Li, Zhihua; Fan, Jingyi; Li, Ni; Zhu, Meng; Zhang, Jiahui; Wang, Yuzhuo; Geng, Liguo; Cheng, Yang; Ma, Hongxia; Jin, Guangfu; Dai, Juncheng; Hu, Zhibin; Shen, Hongbing
2018-05-29
Genome-wide association studies (GWAS) and fine mapping studies have identified multiple lung cancer susceptibility variants in TERT-CLPTM1L region. However, it is still unclear about the relationship between these risk variants and the independent lung cancer risk signals in this region. Therefore, we evaluated the independent susceptibility signals for lung cancer and explored the potential functional variants in this region. Sequential conditional analysis was used to detect the independent susceptibility loci based on four lung cancer GWAS datasets with 12,843 lung cases and 12,639 controls. Comprehensively functional annotations were performed for each independent signal. Three independent susceptibility signals were identified in multi-ethnic population. For the first signal, rs2736100 showed the most significant association with lung cancer risk (C > A, OR = 0.82, 95%CI: 0.79-0.85, P = 1.98 × 10 -25 ). Rs36019446 was the top-ranked site (A > G, OR = 0.88, 95%CI: 0.84-0.92, P = 1.74 × 10 -9 ) in the second signal. For the third signal, rs326048 was the leading SNP (A > G, OR = 0.91, 95%CI: 0.87-0.95, P = 1.38 × 10 -5 ). The following subgroup analysis found the same three loci among Asian population. Further, we compared the difference between various subgroup populations. Functional annotations revealed that rs2736100, rs27996 (r 2 = 0.85 with rs36019446) and rs326049 (r 2 = 0.73 with rs326048) could be potential functional variants in these three risk signals, respectively. In conclusion, although multiple variants have been found associated with lung cancer risk in TERT-CLPTM1L region, our findings indicated that there are three independent lung cancer susceptibility signals in this region. This article is protected by copyright. All rights reserved. This article is protected by copyright. All rights reserved.
PFAAT version 2.0: a tool for editing, annotating, and analyzing multiple sequence alignments.
Caffrey, Daniel R; Dana, Paul H; Mathur, Vidhya; Ocano, Marco; Hong, Eun-Jong; Wang, Yaoyu E; Somaroo, Shyamal; Caffrey, Brian E; Potluri, Shobha; Huang, Enoch S
2007-10-11
By virtue of their shared ancestry, homologous sequences are similar in their structure and function. Consequently, multiple sequence alignments are routinely used to identify trends that relate to function. This type of analysis is particularly productive when it is combined with structural and phylogenetic analysis. Here we describe the release of PFAAT version 2.0, a tool for editing, analyzing, and annotating multiple sequence alignments. Support for multiple annotations is a key component of this release as it provides a framework for most of the new functionalities. The sequence annotations are accessible from the alignment and tree, where they are typically used to label sequences or hyperlink them to related databases. Sequence annotations can be created manually or extracted automatically from UniProt entries. Once a multiple sequence alignment is populated with sequence annotations, sequences can be easily selected and sorted through a sophisticated search dialog. The selected sequences can be further analyzed using statistical methods that explicitly model relationships between the sequence annotations and residue properties. Residue annotations are accessible from the alignment viewer and are typically used to designate binding sites or properties for a particular residue. Residue annotations are also searchable, and allow one to quickly select alignment columns for further sequence analysis, e.g. computing percent identities. Other features include: novel algorithms to compute sequence conservation, mapping conservation scores to a 3D structure in Jmol, displaying secondary structure elements, and sorting sequences by residue composition. PFAAT provides a framework whereby end-users can specify knowledge for a protein family in the form of annotation. The annotations can be combined with sophisticated analysis to test hypothesis that relate to sequence, structure and function.
Automated and Accurate Estimation of Gene Family Abundance from Shotgun Metagenomes
Nayfach, Stephen; Bradley, Patrick H.; Wyman, Stacia K.; Laurent, Timothy J.; Williams, Alex; Eisen, Jonathan A.; Pollard, Katherine S.; Sharpton, Thomas J.
2015-01-01
Shotgun metagenomic DNA sequencing is a widely applicable tool for characterizing the functions that are encoded by microbial communities. Several bioinformatic tools can be used to functionally annotate metagenomes, allowing researchers to draw inferences about the functional potential of the community and to identify putative functional biomarkers. However, little is known about how decisions made during annotation affect the reliability of the results. Here, we use statistical simulations to rigorously assess how to optimize annotation accuracy and speed, given parameters of the input data like read length and library size. We identify best practices in metagenome annotation and use them to guide the development of the Shotgun Metagenome Annotation Pipeline (ShotMAP). ShotMAP is an analytically flexible, end-to-end annotation pipeline that can be implemented either on a local computer or a cloud compute cluster. We use ShotMAP to assess how different annotation databases impact the interpretation of how marine metagenome and metatranscriptome functional capacity changes across seasons. We also apply ShotMAP to data obtained from a clinical microbiome investigation of inflammatory bowel disease. This analysis finds that gut microbiota collected from Crohn’s disease patients are functionally distinct from gut microbiota collected from either ulcerative colitis patients or healthy controls, with differential abundance of metabolic pathways related to host-microbiome interactions that may serve as putative biomarkers of disease. PMID:26565399
Generation and analysis of expressed sequence tags in the extreme large genomes Lilium and Tulipa
2012-01-01
Background Bulbous flowers such as lily and tulip (Liliaceae family) are monocot perennial herbs that are economically very important ornamental plants worldwide. However, there are hardly any genetic studies performed and genomic resources are lacking. To build genomic resources and develop tools to speed up the breeding in both crops, next generation sequencing was implemented. We sequenced and assembled transcriptomes of four lily and five tulip genotypes using 454 pyro-sequencing technology. Results Successfully, we developed the first set of 81,791 contigs with an average length of 514 bp for tulip, and enriched the very limited number of 3,329 available ESTs (Expressed Sequence Tags) for lily with 52,172 contigs with an average length of 555 bp. The contigs together with singletons covered on average 37% of lily and 39% of tulip estimated transcriptome. Mining lily and tulip sequence data for SSRs (Simple Sequence Repeats) showed that di-nucleotide repeats were twice more abundant in UTRs (UnTranslated Regions) compared to coding regions, while tri-nucleotide repeats were equally spread over coding and UTR regions. Two sets of single nucleotide polymorphism (SNP) markers suitable for high throughput genotyping were developed. In the first set, no SNPs flanking the target SNP (50 bp on either side) were allowed. In the second set, one SNP in the flanking regions was allowed, which resulted in a 2 to 3 fold increase in SNP marker numbers compared with the first set. Orthologous groups between the two flower bulbs: lily and tulip (12,017 groups) and among the three monocot species: lily, tulip, and rice (6,900 groups) were determined using OrthoMCL. Orthologous groups were screened for common SNP markers and EST-SSRs to study synteny between lily and tulip, which resulted in 113 common SNP markers and 292 common EST-SSR. Lily and tulip contigs generated were annotated and described according to Gene Ontology terminology. Conclusions Two transcriptome sets were built that are valuable resources for marker development, comparative genomic studies and candidate gene approaches. Next generation sequencing of leaf transcriptome is very effective; however, deeper sequencing and using more tissues and stages is advisable for extended comparative studies. PMID:23167289
Du, Yushen; Wu, Nicholas C; Jiang, Lin; Zhang, Tianhao; Gong, Danyang; Shu, Sara; Wu, Ting-Ting; Sun, Ren
2016-11-01
Identification and annotation of functional residues are fundamental questions in protein sequence analysis. Sequence and structure conservation provides valuable information to tackle these questions. It is, however, limited by the incomplete sampling of sequence space in natural evolution. Moreover, proteins often have multiple functions, with overlapping sequences that present challenges to accurate annotation of the exact functions of individual residues by conservation-based methods. Using the influenza A virus PB1 protein as an example, we developed a method to systematically identify and annotate functional residues. We used saturation mutagenesis and high-throughput sequencing to measure the replication capacity of single nucleotide mutations across the entire PB1 protein. After predicting protein stability upon mutations, we identified functional PB1 residues that are essential for viral replication. To further annotate the functional residues important to the canonical or noncanonical functions of viral RNA-dependent RNA polymerase (vRdRp), we performed a homologous-structure analysis with 16 different vRdRp structures. We achieved high sensitivity in annotating the known canonical polymerase functional residues. Moreover, we identified a cluster of noncanonical functional residues located in the loop region of the PB1 β-ribbon. We further demonstrated that these residues were important for PB1 protein nuclear import through the interaction with Ran-binding protein 5. In summary, we developed a systematic and sensitive method to identify and annotate functional residues that are not restrained by sequence conservation. Importantly, this method is generally applicable to other proteins about which homologous-structure information is available. To fully comprehend the diverse functions of a protein, it is essential to understand the functionality of individual residues. Current methods are highly dependent on evolutionary sequence conservation, which is usually limited by sampling size. Sequence conservation-based methods are further confounded by structural constraints and multifunctionality of proteins. Here we present a method that can systematically identify and annotate functional residues of a given protein. We used a high-throughput functional profiling platform to identify essential residues. Coupling it with homologous-structure comparison, we were able to annotate multiple functions of proteins. We demonstrated the method with the PB1 protein of influenza A virus and identified novel functional residues in addition to its canonical function as an RNA-dependent RNA polymerase. Not limited to virology, this method is generally applicable to other proteins that can be functionally selected and about which homologous-structure information is available. Copyright © 2016 Du et al.
Relax with CouchDB--into the non-relational DBMS era of bioinformatics.
Manyam, Ganiraju; Payton, Michelle A; Roth, Jack A; Abruzzo, Lynne V; Coombes, Kevin R
2012-07-01
With the proliferation of high-throughput technologies, genome-level data analysis has become common in molecular biology. Bioinformaticians are developing extensive resources to annotate and mine biological features from high-throughput data. The underlying database management systems for most bioinformatics software are based on a relational model. Modern non-relational databases offer an alternative that has flexibility, scalability, and a non-rigid design schema. Moreover, with an accelerated development pace, non-relational databases like CouchDB can be ideal tools to construct bioinformatics utilities. We describe CouchDB by presenting three new bioinformatics resources: (a) geneSmash, which collates data from bioinformatics resources and provides automated gene-centric annotations, (b) drugBase, a database of drug-target interactions with a web interface powered by geneSmash, and (c) HapMap-CN, which provides a web interface to query copy number variations from three SNP-chip HapMap datasets. In addition to the web sites, all three systems can be accessed programmatically via web services. Copyright © 2012 Elsevier Inc. All rights reserved.
Evaluation of relational and NoSQL database architectures to manage genomic annotations.
Schulz, Wade L; Nelson, Brent G; Felker, Donn K; Durant, Thomas J S; Torres, Richard
2016-12-01
While the adoption of next generation sequencing has rapidly expanded, the informatics infrastructure used to manage the data generated by this technology has not kept pace. Historically, relational databases have provided much of the framework for data storage and retrieval. Newer technologies based on NoSQL architectures may provide significant advantages in storage and query efficiency, thereby reducing the cost of data management. But their relative advantage when applied to biomedical data sets, such as genetic data, has not been characterized. To this end, we compared the storage, indexing, and query efficiency of a common relational database (MySQL), a document-oriented NoSQL database (MongoDB), and a relational database with NoSQL support (PostgreSQL). When used to store genomic annotations from the dbSNP database, we found the NoSQL architectures to outperform traditional, relational models for speed of data storage, indexing, and query retrieval in nearly every operation. These findings strongly support the use of novel database technologies to improve the efficiency of data management within the biological sciences. Copyright © 2016 Elsevier Inc. All rights reserved.
PHYLOViZ: phylogenetic inference and data visualization for sequence based typing methods
2012-01-01
Background With the decrease of DNA sequencing costs, sequence-based typing methods are rapidly becoming the gold standard for epidemiological surveillance. These methods provide reproducible and comparable results needed for a global scale bacterial population analysis, while retaining their usefulness for local epidemiological surveys. Online databases that collect the generated allelic profiles and associated epidemiological data are available but this wealth of data remains underused and are frequently poorly annotated since no user-friendly tool exists to analyze and explore it. Results PHYLOViZ is platform independent Java software that allows the integrated analysis of sequence-based typing methods, including SNP data generated from whole genome sequence approaches, and associated epidemiological data. goeBURST and its Minimum Spanning Tree expansion are used for visualizing the possible evolutionary relationships between isolates. The results can be displayed as an annotated graph overlaying the query results of any other epidemiological data available. Conclusions PHYLOViZ is a user-friendly software that allows the combined analysis of multiple data sources for microbial epidemiological and population studies. It is freely available at http://www.phyloviz.net. PMID:22568821
Improving Microbial Genome Annotations in an Integrated Database Context
Chen, I-Min A.; Markowitz, Victor M.; Chu, Ken; Anderson, Iain; Mavromatis, Konstantinos; Kyrpides, Nikos C.; Ivanova, Natalia N.
2013-01-01
Effective comparative analysis of microbial genomes requires a consistent and complete view of biological data. Consistency regards the biological coherence of annotations, while completeness regards the extent and coverage of functional characterization for genomes. We have developed tools that allow scientists to assess and improve the consistency and completeness of microbial genome annotations in the context of the Integrated Microbial Genomes (IMG) family of systems. All publicly available microbial genomes are characterized in IMG using different functional annotation and pathway resources, thus providing a comprehensive framework for identifying and resolving annotation discrepancies. A rule based system for predicting phenotypes in IMG provides a powerful mechanism for validating functional annotations, whereby the phenotypic traits of an organism are inferred based on the presence of certain metabolic reactions and pathways and compared to experimentally observed phenotypes. The IMG family of systems are available at http://img.jgi.doe.gov/. PMID:23424620
Protein function prediction using neighbor relativity in protein-protein interaction network.
Moosavi, Sobhan; Rahgozar, Masoud; Rahimi, Amir
2013-04-01
There is a large gap between the number of discovered proteins and the number of functionally annotated ones. Due to the high cost of determining protein function by wet-lab research, function prediction has become a major task for computational biology and bioinformatics. Some researches utilize the proteins interaction information to predict function for un-annotated proteins. In this paper, we propose a novel approach called "Neighbor Relativity Coefficient" (NRC) based on interaction network topology which estimates the functional similarity between two proteins. NRC is calculated for each pair of proteins based on their graph-based features including distance, common neighbors and the number of paths between them. In order to ascribe function to an un-annotated protein, NRC estimates a weight for each neighbor to transfer its annotation to the unknown protein. Finally, the unknown protein will be annotated by the top score transferred functions. We also investigate the effect of using different coefficients for various types of functions. The proposed method has been evaluated on Saccharomyces cerevisiae and Homo sapiens interaction networks. The performance analysis demonstrates that NRC yields better results in comparison with previous protein function prediction approaches that utilize interaction network. Copyright © 2012 Elsevier Ltd. All rights reserved.
GO-FAANG meeting: A gathering on functional annotation of animal genomes
USDA-ARS?s Scientific Manuscript database
The FAANG (Functional Annotation of Animal Genomes) Consortium recently held a Gathering On FAANG (GO-FAANG) Workshop in Washington, DC on October 7-8, 2015. This consortium is a grass-roots organization formed to advance the annotation of newly assembled genomes of non-model organisms (www.faang.or...
Siew, Joyce Phui Yee; Khan, Asif M; Tan, Paul T J; Koh, Judice L Y; Seah, Seng Hong; Koo, Chuay Yeng; Chai, Siaw Ching; Armugam, Arunmozhiarasi; Brusic, Vladimir; Jeyaseelan, Kandiah
2004-12-12
Sequence annotations, functional and structural data on snake venom neurotoxins (svNTXs) are scattered across multiple databases and literature sources. Sequence annotations and structural data are available in the public molecular databases, while functional data are almost exclusively available in the published articles. There is a need for a specialized svNTXs database that contains NTX entries, which are organized, well annotated and classified in a systematic manner. We have systematically analyzed svNTXs and classified them using structure-function groups based on their structural, functional and phylogenetic properties. Using conserved motifs in each phylogenetic group, we built an intelligent module for the prediction of structural and functional properties of unknown NTXs. We also developed an annotation tool to aid the functional prediction of newly identified NTXs as an additional resource for the venom research community. We created a searchable online database of NTX proteins sequences (http://research.i2r.a-star.edu.sg/Templar/DB/snake_neurotoxin). This database can also be found under Swiss-Prot Toxin Annotation Project website (http://www.expasy.org/sprot/).
Dereeper, Alexis; Nicolas, Stéphane; Le Cunff, Loïc; Bacilieri, Roberto; Doligez, Agnès; Peros, Jean-Pierre; Ruiz, Manuel; This, Patrice
2011-05-05
High-throughput re-sequencing, new genotyping technologies and the availability of reference genomes allow the extensive characterization of Single Nucleotide Polymorphisms (SNPs) and insertion/deletion events (indels) in many plant species. The rapidly increasing amount of re-sequencing and genotyping data generated by large-scale genetic diversity projects requires the development of integrated bioinformatics tools able to efficiently manage, analyze, and combine these genetic data with genome structure and external data. In this context, we developed SNiPlay, a flexible, user-friendly and integrative web-based tool dedicated to polymorphism discovery and analysis. It integrates:1) a pipeline, freely accessible through the internet, combining existing softwares with new tools to detect SNPs and to compute different types of statistical indices and graphical layouts for SNP data. From standard sequence alignments, genotyping data or Sanger sequencing traces given as input, SNiPlay detects SNPs and indels events and outputs submission files for the design of Illumina's SNP chips. Subsequently, it sends sequences and genotyping data into a series of modules in charge of various processes: physical mapping to a reference genome, annotation (genomic position, intron/exon location, synonymous/non-synonymous substitutions), SNP frequency determination in user-defined groups, haplotype reconstruction and network, linkage disequilibrium evaluation, and diversity analysis (Pi, Watterson's Theta, Tajima's D).Furthermore, the pipeline allows the use of external data (such as phenotype, geographic origin, taxa, stratification) to define groups and compare statistical indices.2) a database storing polymorphisms, genotyping data and grapevine sequences released by public and private projects. It allows the user to retrieve SNPs using various filters (such as genomic position, missing data, polymorphism type, allele frequency), to compare SNP patterns between populations, and to export genotyping data or sequences in various formats. Our experiments on grapevine genetic projects showed that SNiPlay allows geneticists to rapidly obtain advanced results in several key research areas of plant genetic diversity. Both the management and treatment of large amounts of SNP data are rendered considerably easier for end-users through automation and integration. Current developments are taking into account new advances in high-throughput technologies.SNiPlay is available at: http://sniplay.cirad.fr/.
Gene Ontology annotation of the rice blast fungus, Magnaporthe oryzae
Meng, Shaowu; Brown, Douglas E; Ebbole, Daniel J; Torto-Alalibo, Trudy; Oh, Yeon Yee; Deng, Jixin; Mitchell, Thomas K; Dean, Ralph A
2009-01-01
Background Magnaporthe oryzae, the causal agent of blast disease of rice, is the most destructive disease of rice worldwide. The genome of this fungal pathogen has been sequenced and an automated annotation has recently been updated to Version 6 . However, a comprehensive manual curation remains to be performed. Gene Ontology (GO) annotation is a valuable means of assigning functional information using standardized vocabulary. We report an overview of the GO annotation for Version 5 of M. oryzae genome assembly. Methods A similarity-based (i.e., computational) GO annotation with manual review was conducted, which was then integrated with a literature-based GO annotation with computational assistance. For similarity-based GO annotation a stringent reciprocal best hits method was used to identify similarity between predicted proteins of M. oryzae and GO proteins from multiple organisms with published associations to GO terms. Significant alignment pairs were manually reviewed. Functional assignments were further cross-validated with manually reviewed data, conserved domains, or data determined by wet lab experiments. Additionally, biological appropriateness of the functional assignments was manually checked. Results In total, 6,286 proteins received GO term assignment via the homology-based annotation, including 2,870 hypothetical proteins. Literature-based experimental evidence, such as microarray, MPSS, T-DNA insertion mutation, or gene knockout mutation, resulted in 2,810 proteins being annotated with GO terms. Of these, 1,673 proteins were annotated with new terms developed for Plant-Associated Microbe Gene Ontology (PAMGO). In addition, 67 experiment-determined secreted proteins were annotated with PAMGO terms. Integration of the two data sets resulted in 7,412 proteins (57%) being annotated with 1,957 distinct and specific GO terms. Unannotated proteins were assigned to the 3 root terms. The Version 5 GO annotation is publically queryable via the GO site . Additionally, the genome of M. oryzae is constantly being refined and updated as new information is incorporated. For the latest GO annotation of Version 6 genome, please visit our website . The preliminary GO annotation of Version 6 genome is placed at a local MySql database that is publically queryable via a user-friendly interface Adhoc Query System. Conclusion Our analysis provides comprehensive and robust GO annotations of the M. oryzae genome assemblies that will be solid foundations for further functional interrogation of M. oryzae. PMID:19278556
The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4).
Huntemann, Marcel; Ivanova, Natalia N; Mavromatis, Konstantinos; Tripp, H James; Paez-Espino, David; Palaniappan, Krishnaveni; Szeto, Ernest; Pillay, Manoj; Chen, I-Min A; Pati, Amrita; Nielsen, Torben; Markowitz, Victor M; Kyrpides, Nikos C
2015-01-01
The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. Structural annotation is followed by assignment of protein product names and functions.
Identification of widespread adenosine nucleotide binding in Mycobacterium tuberculosis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ansong, Charles; Ortega, Corrie; Payne, Samuel H.
The annotation of protein function is almost completely performed by in silico approaches. However, computational prediction of protein function is frequently incomplete and error prone. In Mycobacterium tuberculosis (Mtb), ~25% of all genes have no predicted function and are annotated as hypothetical proteins. This lack of functional information severely limits our understanding of Mtb pathogenicity. Current tools for experimental functional annotation are limited and often do not scale to entire protein families. Here, we report a generally applicable chemical biology platform to functionally annotate bacterial proteins by combining activity-based protein profiling (ABPP) and quantitative LC-MS-based proteomics. As an example ofmore » this approach for high-throughput protein functional validation and discovery, we experimentally annotate the families of ATP-binding proteins in Mtb. Our data experimentally validate prior in silico predictions of >250 ATPases and adenosine nucleotide-binding proteins, and reveal 73 hypothetical proteins as novel ATP-binding proteins. We identify adenosine cofactor interactions with many hypothetical proteins containing a diversity of unrelated sequences, providing a new and expanded view of adenosine nucleotide binding in Mtb. Furthermore, many of these hypothetical proteins are both unique to Mycobacteria and essential for infection, suggesting specialized functions in mycobacterial physiology and pathogenicity. Thus, we provide a generally applicable approach for high throughput protein function discovery and validation, and highlight several ways in which application of activity-based proteomics data can improve the quality of functional annotations to facilitate novel biological insights.« less
The Proteome Folding Project: Proteome-scale prediction of structure and function
Drew, Kevin; Winters, Patrick; Butterfoss, Glenn L.; Berstis, Viktors; Uplinger, Keith; Armstrong, Jonathan; Riffle, Michael; Schweighofer, Erik; Bovermann, Bill; Goodlett, David R.; Davis, Trisha N.; Shasha, Dennis; Malmström, Lars; Bonneau, Richard
2011-01-01
The incompleteness of proteome structure and function annotation is a critical problem for biologists and, in particular, severely limits interpretation of high-throughput and next-generation experiments. We have developed a proteome annotation pipeline based on structure prediction, where function and structure annotations are generated using an integration of sequence comparison, fold recognition, and grid-computing-enabled de novo structure prediction. We predict protein domain boundaries and three-dimensional (3D) structures for protein domains from 94 genomes (including human, Arabidopsis, rice, mouse, fly, yeast, Escherichia coli, and worm). De novo structure predictions were distributed on a grid of more than 1.5 million CPUs worldwide (World Community Grid). We generated significant numbers of new confident fold annotations (9% of domains that are otherwise unannotated in these genomes). We demonstrate that predicted structures can be combined with annotations from the Gene Ontology database to predict new and more specific molecular functions. PMID:21824995
Amin Al Olama, Ali; Dadaev, Tokhir; Hazelett, Dennis J; Li, Qiuyan; Leongamornlert, Daniel; Saunders, Edward J; Stephens, Sarah; Cieza-Borrella, Clara; Whitmore, Ian; Benlloch Garcia, Sara; Giles, Graham G; Southey, Melissa C; Fitzgerald, Liesel; Gronberg, Henrik; Wiklund, Fredrik; Aly, Markus; Henderson, Brian E; Schumacher, Fredrick; Haiman, Christopher A; Schleutker, Johanna; Wahlfors, Tiina; Tammela, Teuvo L; Nordestgaard, Børge G; Key, Tim J; Travis, Ruth C; Neal, David E; Donovan, Jenny L; Hamdy, Freddie C; Pharoah, Paul; Pashayan, Nora; Khaw, Kay-Tee; Stanford, Janet L; Thibodeau, Stephen N; Mcdonnell, Shannon K; Schaid, Daniel J; Maier, Christiane; Vogel, Walther; Luedeke, Manuel; Herkommer, Kathleen; Kibel, Adam S; Cybulski, Cezary; Wokołorczyk, Dominika; Kluzniak, Wojciech; Cannon-Albright, Lisa; Brenner, Hermann; Butterbach, Katja; Arndt, Volker; Park, Jong Y; Sellers, Thomas; Lin, Hui-Yi; Slavov, Chavdar; Kaneva, Radka; Mitev, Vanio; Batra, Jyotsna; Clements, Judith A; Spurdle, Amanda; Teixeira, Manuel R; Paulo, Paula; Maia, Sofia; Pandha, Hardev; Michael, Agnieszka; Kierzek, Andrzej; Govindasami, Koveela; Guy, Michelle; Lophatonanon, Artitaya; Muir, Kenneth; Viñuela, Ana; Brown, Andrew A; Freedman, Mathew; Conti, David V; Easton, Douglas; Coetzee, Gerhard A; Eeles, Rosalind A; Kote-Jarai, Zsofia
2015-10-01
Genome-wide association studies (GWAS) have identified numerous common prostate cancer (PrCa) susceptibility loci. We have fine-mapped 64 GWAS regions known at the conclusion of the iCOGS study using large-scale genotyping and imputation in 25 723 PrCa cases and 26 274 controls of European ancestry. We detected evidence for multiple independent signals at 16 regions, 12 of which contained additional newly identified significant associations. A single signal comprising a spectrum of correlated variation was observed at 39 regions; 35 of which are now described by a novel more significantly associated lead SNP, while the originally reported variant remained as the lead SNP only in 4 regions. We also confirmed two association signals in Europeans that had been previously reported only in East-Asian GWAS. Based on statistical evidence and linkage disequilibrium (LD) structure, we have curated and narrowed down the list of the most likely candidate causal variants for each region. Functional annotation using data from ENCODE filtered for PrCa cell lines and eQTL analysis demonstrated significant enrichment for overlap with bio-features within this set. By incorporating the novel risk variants identified here alongside the refined data for existing association signals, we estimate that these loci now explain ∼38.9% of the familial relative risk of PrCa, an 8.9% improvement over the previously reported GWAS tag SNPs. This suggests that a significant fraction of the heritability of PrCa may have been hidden during the discovery phase of GWAS, in particular due to the presence of multiple independent signals within the same region. © The Author 2015. Published by Oxford University Press.
Genetic variant rs17225178 in the ARNT2 gene is associated with Asperger Syndrome.
Di Napoli, Agnese; Warrier, Varun; Baron-Cohen, Simon; Chakrabarti, Bhismadev
2015-01-01
Autism Spectrum Conditions (ASC) are neurodevelopmental conditions characterized by difficulties in communication and social interaction, alongside unusually repetitive behaviours and narrow interests. Asperger Syndrome (AS) is one subgroup of ASC and differs from classic autism in that in AS there is no language or general cognitive delay. Genetic, epigenetic and environmental factors are implicated in ASC and genes involved in neural connectivity and neurodevelopment are good candidates for studying the susceptibility to ASC. The aryl-hydrocarbon receptor nuclear translocator 2 (ARNT2) gene encodes a transcription factor involved in neurodevelopmental processes, neuronal connectivity and cellular responses to hypoxia. A mutation in this gene has been identified in individuals with ASC and single nucleotide polymorphisms (SNPs) have been nominally associated with AS and autistic traits in previous studies. In this study, we tested 34 SNPs in ARNT2 for association with AS in 118 cases and 412 controls of Caucasian origin. P values were adjusted for multiple comparisons, and linkage disequilibrium (LD) among the SNPs analysed was calculated in our sample. Finally, SNP annotation allowed functional and structural analyses of the genetic variants in ARNT2. We tested the replicability of our result using the genome-wide association studies (GWAS) database of the Psychiatric Genomics Consortium (PGC). We report statistically significant association of rs17225178 with AS. This SNP modifies transcription factor binding sites and regions that regulate the chromatin state in neural cell lines. It is also included in a LD block in our sample, alongside other genetic variants that alter chromatin regulatory regions in neural cells. These findings demonstrate that rs17225178 in the ARNT2 gene is associated with AS and support previous studies that pointed out an involvement of this gene in the predisposition to ASC.
Ain, Qurat-ul; Rasheed, Awais; Anwar, Alia; Mahmood, Tariq; Imtiaz, Muhammad; Mahmood, Tariq; Xia, Xianchun; He, Zhonghu; Quraishi, Umar M.
2015-01-01
Genome-wide association studies (GWAS) were undertaken to identify SNP markers associated with yield and yield-related traits in 123 Pakistani historical wheat cultivars evaluated during 2011–2014 seasons under rainfed field conditions. The population was genotyped by using high-density Illumina iSelect 90K single nucleotide polymorphism (SNP) assay, and finally 14,960 high quality SNPs were used in GWAS. Population structure examined using 1000 unlinked markers identified seven subpopulations (K = 7) that were representative of different breeding programs in Pakistan, in addition to local landraces. Forty four stable marker-trait associations (MTAs) with -log p > 4 were identified for nine yield-related traits. Nine multi-trait MTAs were found on chromosomes 1AL, 1BS, 2AL, 2BS, 2BL, 4BL, 5BL, 6AL, and 6BL, and those on 5BL and 6AL were stable across two seasons. Gene annotation and syntey identified that 14 trait-associated SNPs were linked to genes having significant importance in plant development. Favorable alleles for days to heading (DH), plant height (PH), thousand grain weight (TGW), and grain yield (GY) showed minor additive effects and their frequencies were slightly higher in cultivars released after 2000. However, no selection pressure on any favorable allele was identified. These genomic regions identified have historically contributed to achieve yield gains from 2.63 million tons in 1947 to 25.7 million tons in 2015. Future breeding strategies can be devised to initiate marker assisted breeding to accumulate these favorable alleles of SNPs associated with yield-related traits to increase grain yield. Additionally, in silico identification of 454-contigs corresponding to MTAs will facilitate fine mapping and subsequent cloning of candidate genes and functional marker development. PMID:26442056
Liu, Jun-Jun; Sniezko, Richard; Murray, Michael; Wang, Ning; Chen, Hao; Zamany, Arezoo; Sturrock, Rona N.; Savin, Douglas; Kegley, Angelia
2016-01-01
Whitebark pine (WBP, Pinus albicaulis Engelm.) is an endangered conifer species due to heavy mortality from white pine blister rust (WPBR, caused by Cronartium ribicola) and mountain pine beetle (Dendroctonus ponderosae). Information about genetic diversity and population structure is of fundamental importance for its conservation and restoration. However, current knowledge on the genetic constitution and genomic variation is still limited for WBP. In this study, an integrated genomics approach was applied to characterize seed collections from WBP breeding programs in western North America. RNA-seq analysis was used for de novo assembly of the WBP needle transcriptome, which contains 97,447 protein-coding transcripts. Within the transcriptome, single nucleotide polymorphisms (SNPs) were discovered, and more than 22,000 of them were non-synonymous SNPs (ns-SNPs). Following the annotation of genes with ns-SNPs, 216 ns-SNPs within candidate genes with putative functions in disease resistance and plant defense were selected to design SNP arrays for high-throughput genotyping. Among these SNP loci, 71 were highly polymorphic, with sufficient variation to identify a unique genotype for each of the 371 individuals originating from British Columbia (Canada), Oregon and Washington (USA). A clear genetic differentiation was evident among seed families. Analyses of genetic spatial patterns revealed varying degrees of diversity and the existence of several genetic subgroups in the WBP breeding populations. Genetic components were associated with geographic variables and phenotypic rating of WPBR disease severity across landscapes, which may facilitate further identification of WBP genotypes and gene alleles contributing to local adaptation and quantitative resistance to WPBR. The WBP genomic resources developed here provide an invaluable tool for further studies and for exploitation and utilization of the genetic diversity preserved within this endangered conifer and other five-needle pines. PMID:27992468
Finno, Carrie J; Gianino, Giuliana; Perumbakkam, Sudeep; Williams, Zoë J; Bordbari, Matthew H; Gardner, Keri L; Burns, Erin; Peng, Sichong; Durward-Akhurst, Sian A; Valberg, Stephanie J
2018-03-06
The cause of immune-mediated myositis (IMM), characterized by recurrent, rapid-onset muscle atrophy in Quarter Horses (QH), is unknown. The histopathologic hallmark of IMM is lymphocytic infiltration of myofibers. The purpose of this study was to identify putative functional variants associated with equine IMM. A genome-wide association (GWA) study was performed on 36 IMM QHs and 54 breed matched unaffected QHs from the same environment using the Equine SNP50 and SNP70 genotyping arrays. A mixed model analysis identified nine SNPs within a ~ 2.87 Mb region on chr11 that were significantly (P unadjusted < 1.4 × 10 - 6 ) associated with the IMM phenotype. Associated haplotypes within this region encompassed 38 annotated genes, including four myosin genes (MYH1, MYH2, MYH3, and MYH13). Whole genome sequencing of four IMM and four unaffected QHs identified a single segregating nonsynonymous E321G mutation in MYH1 encoding myosin heavy chain 2X. Genotyping of additional 35 IMM and 22 unaffected QHs confirmed an association (P = 2.9 × 10 - 5 ), and the putative mutation was absent in 175 horses from 21 non-QH breeds. Lymphocytic infiltrates occurred in type 2X myofibers and the proportion of 2X fibers was decreased in the presence of inflammation. Protein modeling and contact/stability analysis identified 14 residues affected by the mutation which significantly decreased stability. We conclude that a mutation in MYH1 is highly associated with susceptibility to the IMM phenotype in QH-related breeds. This is the first report of a mutation in MYH1 and the first link between a skeletal muscle myosin mutation and autoimmune disease.
A guide to best practices for Gene Ontology (GO) manual annotation
Balakrishnan, Rama; Harris, Midori A.; Huntley, Rachael; Van Auken, Kimberly; Cherry, J. Michael
2013-01-01
The Gene Ontology Consortium (GOC) is a community-based bioinformatics project that classifies gene product function through the use of structured controlled vocabularies. A fundamental application of the Gene Ontology (GO) is in the creation of gene product annotations, evidence-based associations between GO definitions and experimental or sequence-based analysis. Currently, the GOC disseminates 126 million annotations covering >374 000 species including all the kingdoms of life. This number includes two classes of GO annotations: those created manually by experienced biocurators reviewing the literature or by examination of biological data (1.1 million annotations covering 2226 species) and those generated computationally via automated methods. As manual annotations are often used to propagate functional predictions between related proteins within and between genomes, it is critical to provide accurate consistent manual annotations. Toward this goal, we present here the conventions defined by the GOC for the creation of manual annotation. This guide represents the best practices for manual annotation as established by the GOC project over the past 12 years. We hope this guide will encourage research communities to annotate gene products of their interest to enhance the corpus of GO annotations available to all. Database URL: http://www.geneontology.org PMID:23842463
Functional annotation of regulatory pathways.
Pandey, Jayesh; Koyutürk, Mehmet; Kim, Yohan; Szpankowski, Wojciech; Subramaniam, Shankar; Grama, Ananth
2007-07-01
Standardized annotations of biomolecules in interaction networks (e.g. Gene Ontology) provide comprehensive understanding of the function of individual molecules. Extending such annotations to pathways is a critical component of functional characterization of cellular signaling at the systems level. We propose a framework for projecting gene regulatory networks onto the space of functional attributes using multigraph models, with the objective of deriving statistically significant pathway annotations. We first demonstrate that annotations of pairwise interactions do not generalize to indirect relationships between processes. Motivated by this result, we formalize the problem of identifying statistically overrepresented pathways of functional attributes. We establish the hardness of this problem by demonstrating the non-monotonicity of common statistical significance measures. We propose a statistical model that emphasizes the modularity of a pathway, evaluating its significance based on the coupling of its building blocks. We complement the statistical model by an efficient algorithm and software, Narada, for computing significant pathways in large regulatory networks. Comprehensive results from our methods applied to the Escherichia coli transcription network demonstrate that our approach is effective in identifying known, as well as novel biological pathway annotations. Narada is implemented in Java and is available at http://www.cs.purdue.edu/homes/jpandey/narada/.
Mutant phenotypes for thousands of bacterial genes of unknown function
Price, Morgan N.; Wetmore, Kelly M.; Waters, R. Jordan; ...
2018-05-16
One-third of all protein-coding genes from bacterial genomes cannot be annotated with a function. Here, to investigate the functions of these genes, we present genome-wide mutant fitness data from 32 diverse bacteria across dozens of growth conditions. We identified mutant phenotypes for 11,779 protein-coding genes that had not been annotated with a specific function. Many genes could be associated with a specific condition because the gene affected fitness only in that condition, or with another gene in the same bacterium because they had similar mutant phenotypes. Of the poorly annotated genes, 2,316 had associations that have high confidence because theymore » are conserved in other bacteria. By combining these conserved associations with comparative genomics, we identified putative DNA repair proteins; in addition, we propose specific functions for poorly annotated enzymes and transporters and for uncharacterized protein families. Lastly, our study demonstrates the scalability of microbial genetics and its utility for improving gene annotations.« less
Mutant phenotypes for thousands of bacterial genes of unknown function
DOE Office of Scientific and Technical Information (OSTI.GOV)
Price, Morgan N.; Wetmore, Kelly M.; Waters, R. Jordan
One-third of all protein-coding genes from bacterial genomes cannot be annotated with a function. Here, to investigate the functions of these genes, we present genome-wide mutant fitness data from 32 diverse bacteria across dozens of growth conditions. We identified mutant phenotypes for 11,779 protein-coding genes that had not been annotated with a specific function. Many genes could be associated with a specific condition because the gene affected fitness only in that condition, or with another gene in the same bacterium because they had similar mutant phenotypes. Of the poorly annotated genes, 2,316 had associations that have high confidence because theymore » are conserved in other bacteria. By combining these conserved associations with comparative genomics, we identified putative DNA repair proteins; in addition, we propose specific functions for poorly annotated enzymes and transporters and for uncharacterized protein families. Lastly, our study demonstrates the scalability of microbial genetics and its utility for improving gene annotations.« less
Jiao, Y; Chen, R; Ke, X; Cheng, L; Chu, K; Lu, Z; Herskovits, E H
2011-01-01
Autism spectrum disorder (ASD) is a neurodevelopmental disorder, of which Asperger syndrome and high-functioning autism are subtypes. Our goal is: 1) to determine whether a diagnostic model based on single-nucleotide polymorphisms (SNPs), brain regional thickness measurements, or brain regional volume measurements can distinguish Asperger syndrome from high-functioning autism; and 2) to compare the SNP, thickness, and volume-based diagnostic models. Our study included 18 children with ASD: 13 subjects with high-functioning autism and 5 subjects with Asperger syndrome. For each child, we obtained 25 SNPs for 8 ASD-related genes; we also computed regional cortical thicknesses and volumes for 66 brain structures, based on structural magnetic resonance (MR) examination. To generate diagnostic models, we employed five machine-learning techniques: decision stump, alternating decision trees, multi-class alternating decision trees, logistic model trees, and support vector machines. For SNP-based classification, three decision-tree-based models performed better than the other two machine-learning models. The performance metrics for three decision-tree-based models were similar: decision stump was modestly better than the other two methods, with accuracy = 90%, sensitivity = 0.95 and specificity = 0.75. All thickness and volume-based diagnostic models performed poorly. The SNP-based diagnostic models were superior to those based on thickness and volume. For SNP-based classification, rs878960 in GABRB3 (gamma-aminobutyric acid A receptor, beta 3) was selected by all tree-based models. Our analysis demonstrated that SNP-based classification was more accurate than morphometry-based classification in ASD subtype classification. Also, we found that one SNP--rs878960 in GABRB3--distinguishes Asperger syndrome from high-functioning autism.
Buzinari, Tereza Cristina; Oishi, Jorge Camargo; De Moraes, Thiago Francisco; Vatanabe, Izabela Pereira; Selistre-de-Araújo, Heloisa Sobreiro; Pestana, Cezar Rangel; Rodrigues, Gerson Jhonatan
2017-07-15
Verify if sodium nitroprusside (SNP) is able to improve endothelial function and if this effect is independent of nitric oxide (NO) release of the compound. Normotensive (2K) and hypertensive (2K-1C) wistar rats were used. Intact endothelium aortas were placed in a myograph and incubated with SNP: 0.1nM; 1nM or 10nM during 30min. Cumulative concentration-effect curves for acetylcholine (Ach) were realized to measure the relaxing capacity. Intracellular NO were measured (by DAF-2DA probe) in HUVEC treated with SNP 0.1nM or DETA/NO 0.1μM. The detection of intracellular superoxide radical (O 2 •- ) was obtained by using DHE probe. Treatment of 2K-1C aortic rings with SNP (0.1; 1.0 and 10nM) improved endothelium dependent relaxation induced by acetylcholine. This improvement induced by SNP was verified at the concentration of 0.1nM, which does not release NO, suggesting that this effect was not induced due to NO release by SNP compound. Besides, we show that the cell treatment with 0.1nM of SNP decreased the fluorescence intensity to DHE in cells stimulated with angiotensin II. These results indicate that SNP decreases the concentration of O 2 •- in HUVEC cells. The SNP at a concentration that does not release NO inside the cells is able to attenuate endothelial dysfunction. Acetylcholine (Ach) (PubChem CID:6060); angiotensin II human (Ang II) (PubChem CID: 16211177); diethylenetriamine/nitric oxide (DETA-NO) (PubChem CID 4518); dihydroethidium (DHE) (PubChem CID: 128682); phenylephrine (Phe) (PubChem CID: 5284443); sodium nitroprusside (SNP) (PubChem CID: 11963579); Thiazolyl Blue Tetrazolium Bromide (MTT) (PubChem CID: 64965); 4,5-diaminofluorescein diacetate (DAF-2DA); 4-hidroxy-Tempo (Tempol) (PubChem CID: 137994), were purchased from Sigma-Aldrich (St. Louis, MO, USA). Copyright © 2017 Elsevier B.V. All rights reserved.
Comparative in vivo assessment of the subacute toxicity of gold and silver nanoparticles
NASA Astrophysics Data System (ADS)
Rathore, Mansee; Mohanty, Ipseeta Ray; Maheswari, Ujjwala; Dayal, Navami; Suman, Rajesh; Joshi, D. S.
2014-04-01
In spite of the projected therapeutic potentials of gold nanoparticles (GNP) and silver nanoparticles (SNP), very limited data are available on the interaction of nanoparticles with the biological systems. The present investigation was designed to evaluate as well as compare the subacute toxicity of GNP and SNP. Stable suspensions of GNP and SNP with mean particle diameter 10 and 25 nm, respectively, were prepared. Wistar rats were orally fed SNP (3 mg/kg) or GNP (20 μg/kg), once a day for 21 days. Biochemical indices (creatinine phosphokinase-MB, urea, blood urea nitrogen, aspartate transaminase, alkaline alanine transferase) and histopathological features of the liver, heart, brain, lungs, and kidney were evaluated for signs of toxicity. A significant decline in hepatic and renal function in the GNP treated group was observed as compared to SNP. GNP was found to be relatively more toxic on the lungs and SNP on the myocardial tissue as compared to SNP and GNP treatments, respectively. Interestingly, neither SNP nor GNP adversely affected the basal architecture of the brain as compared to sham. The present study demonstrated that GNP was significantly more noxious on the liver and kidney as compared with SNP.
Accurate and reproducible functional maps in 127 human cell types via 2D genome segmentation
Hardison, Ross C.
2017-01-01
Abstract The Roadmap Epigenomics Consortium has published whole-genome functional annotation maps in 127 human cell types by integrating data from studies of multiple epigenetic marks. These maps have been widely used for studying gene regulation in cell type-specific contexts and predicting the functional impact of DNA mutations on disease. Here, we present a new map of functional elements produced by applying a method called IDEAS on the same data. The method has several unique advantages and outperforms existing methods, including that used by the Roadmap Epigenomics Consortium. Using five categories of independent experimental datasets, we compared the IDEAS and Roadmap Epigenomics maps. While the overall concordance between the two maps is high, the maps differ substantially in the prediction details and in their consistency of annotation of a given genomic position across cell types. The annotation from IDEAS is uniformly more accurate than the Roadmap Epigenomics annotation and the improvement is substantial based on several criteria. We further introduce a pipeline that improves the reproducibility of functional annotation maps. Thus, we provide a high-quality map of candidate functional regions across 127 human cell types and compare the quality of different annotation methods in order to facilitate biomedical research in epigenomics. PMID:28973456
CommWalker: correctly evaluating modules in molecular networks in light of annotation bias.
Luecken, M D; Page, M J T; Crosby, A J; Mason, S; Reinert, G; Deane, C M
2018-03-15
Detecting novel functional modules in molecular networks is an important step in biological research. In the absence of gold standard functional modules, functional annotations are often used to verify whether detected modules/communities have biological meaning. However, as we show, the uneven distribution of functional annotations means that such evaluation methods favor communities of well-studied proteins. We propose a novel framework for the evaluation of communities as functional modules. Our proposed framework, CommWalker, takes communities as inputs and evaluates them in their local network environment by performing short random walks. We test CommWalker's ability to overcome annotation bias using input communities from four community detection methods on two protein interaction networks. We find that modules accepted by CommWalker are similarly co-expressed as those accepted by current methods. Crucially, CommWalker performs well not only in well-annotated regions, but also in regions otherwise obscured by poor annotation. CommWalker community prioritization both faithfully captures well-validated communities and identifies functional modules that may correspond to more novel biology. The CommWalker algorithm is freely available at opig.stats.ox.ac.uk/resources or as a docker image on the Docker Hub at hub.docker.com/r/lueckenmd/commwalker/. deane@stats.ox.ac.uk. Supplementary data are available at Bioinformatics online.
Plant genome and transcriptome annotations: from misconceptions to simple solutions
Bolger, Marie E; Arsova, Borjana; Usadel, Björn
2018-01-01
Abstract Next-generation sequencing has triggered an explosion of available genomic and transcriptomic resources in the plant sciences. Although genome and transcriptome sequencing has become orders of magnitudes cheaper and more efficient, often the functional annotation process is lagging behind. This might be hampered by the lack of a comprehensive enumeration of simple-to-use tools available to the plant researcher. In this comprehensive review, we present (i) typical ontologies to be used in the plant sciences, (ii) useful databases and resources used for functional annotation, (iii) what to expect from an annotated plant genome, (iv) an automated annotation pipeline and (v) a recipe and reference chart outlining typical steps used to annotate plant genomes/transcriptomes using publicly available resources. PMID:28062412
Functional Annotation of the Arabidopsis Genome Using Controlled Vocabularies1
Berardini, Tanya Z.; Mundodi, Suparna; Reiser, Leonore; Huala, Eva; Garcia-Hernandez, Margarita; Zhang, Peifen; Mueller, Lukas A.; Yoon, Jungwoon; Doyle, Aisling; Lander, Gabriel; Moseyko, Nick; Yoo, Danny; Xu, Iris; Zoeckler, Brandon; Montoya, Mary; Miller, Neil; Weems, Dan; Rhee, Seung Y.
2004-01-01
Controlled vocabularies are increasingly used by databases to describe genes and gene products because they facilitate identification of similar genes within an organism or among different organisms. One of The Arabidopsis Information Resource's goals is to associate all Arabidopsis genes with terms developed by the Gene Ontology Consortium that describe the molecular function, biological process, and subcellular location of a gene product. We have also developed terms describing Arabidopsis anatomy and developmental stages and use these to annotate published gene expression data. As of March 2004, we used computational and manual annotation methods to make 85,666 annotations representing 26,624 unique loci. We focus on associating genes to controlled vocabulary terms based on experimental data from the literature and use The Arabidopsis Information Resource-developed PubSearch software to facilitate this process. Each annotation is tagged with a combination of evidence codes, evidence descriptions, and references that provide a robust means to assess data quality. Annotation of all Arabidopsis genes will allow quantitative comparisons between sets of genes derived from sources such as microarray experiments. The Arabidopsis annotation data will also facilitate annotation of newly sequenced plant genomes by using sequence similarity to transfer annotations to homologous genes. In addition, complete and up-to-date annotations will make unknown genes easy to identify and target for experimentation. Here, we describe the process of Arabidopsis functional annotation using a variety of data sources and illustrate several ways in which this information can be accessed and used to infer knowledge about Arabidopsis and other plant species. PMID:15173566
Chen, Wenan; McDonnell, Shannon K; Thibodeau, Stephen N; Tillmans, Lori S; Schaid, Daniel J
2016-11-01
Functional annotations have been shown to improve both the discovery power and fine-mapping accuracy in genome-wide association studies. However, the optimal strategy to incorporate the large number of existing annotations is still not clear. In this study, we propose a Bayesian framework to incorporate functional annotations in a systematic manner. We compute the maximum a posteriori solution and use cross validation to find the optimal penalty parameters. By extending our previous fine-mapping method CAVIARBF into this framework, we require only summary statistics as input. We also derived an exact calculation of Bayes factors using summary statistics for quantitative traits, which is necessary when a large proportion of trait variance is explained by the variants of interest, such as in fine mapping expression quantitative trait loci (eQTL). We compared the proposed method with PAINTOR using different strategies to combine annotations. Simulation results show that the proposed method achieves the best accuracy in identifying causal variants among the different strategies and methods compared. We also find that for annotations with moderate effects from a large annotation pool, screening annotations individually and then combining the top annotations can produce overly optimistic results. We applied these methods on two real data sets: a meta-analysis result of lipid traits and a cis-eQTL study of normal prostate tissues. For the eQTL data, incorporating annotations significantly increased the number of potential causal variants with high probabilities. Copyright © 2016 by the Genetics Society of America.
PExFInS: An Integrative Post-GWAS Explorer for Functional Indels and SNPs
Cheng, Zhongshan; Chu, Hin; Fan, Yanhui; Li, Cun; Song, You-Qiang; Zhou, Jie; Yuen, Kwok-Yung
2015-01-01
Expression quantitative trait loci (eQTLs) mapping and linkage disequilibrium (LD) analysis have been widely employed to interpret findings of genome-wide association studies (GWAS). With the availability of deep sequencing data of 423 lymphoblastoid cell lines (LCLs) from six global populations and the microarray expression data, we performed eQTL analysis, identified more than 228 K SNP cis-eQTLs and 21 K indel cis-eQTLs and generated a LCL cis-eQTL database. We demonstrate that the percentages of population-shared and population-specific cis-eQTLs are comparable; while indel cis-eQTLs in the population-specific subsection make more contribution to gene expression variations than those in the population-shared subsection. We found cis-eQTLs, especially the population-shared cis-eQTLs are significantly enriched toward transcription start site. Moreover, the National Human Genome Research Institute cataloged GWAS SNPs are enriched for LCL cis-eQTLs. Specifically, 32.8% GWAS SNPs are LCL cis-eQTLs, among which 12.5% can be tagged by indel cis-eQTLs, suggesting the fundamental contribution of indel cis-eQTLs to GWAS association signals. To search for functional indels and SNPs tagging GWAS SNPs, a pipeline Post-GWAS Explorer for Functional Indels and SNPs (PExFInS) has been developed, integrating LD analysis, functional annotation from public databases, cis-eQTL mapping with our LCL cis-eQTL database and other published cis-eQTL datasets. PMID:26612672
The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4)
Huntemann, Marcel; Ivanova, Natalia N.; Mavromatis, Konstantinos; ...
2015-10-26
The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. In conclusion, structural annotation is followed by assignment of protein product names and functions.
The standard operating procedure of the DOE-JGI Microbial Genome Annotation Pipeline (MGAP v.4)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Huntemann, Marcel; Ivanova, Natalia N.; Mavromatis, Konstantinos
The DOE-JGI Microbial Genome Annotation Pipeline performs structural and functional annotation of microbial genomes that are further included into the Integrated Microbial Genome comparative analysis system. MGAP is applied to assembled nucleotide sequence datasets that are provided via the IMG submission site. Dataset submission for annotation first requires project and associated metadata description in GOLD. The MGAP sequence data processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNA features, as well as CRISPR elements. In conclusion, structural annotation is followed by assignment of protein product names and functions.
AnnotCompute: annotation-based exploration and meta-analysis of genomics experiments
Zheng, Jie; Stoyanovich, Julia; Manduchi, Elisabetta; Liu, Junmin; Stoeckert, Christian J.
2011-01-01
The ever-increasing scale of biological data sets, particularly those arising in the context of high-throughput technologies, requires the development of rich data exploration tools. In this article, we present AnnotCompute, an information discovery platform for repositories of functional genomics experiments such as ArrayExpress. Our system leverages semantic annotations of functional genomics experiments with controlled vocabulary and ontology terms, such as those from the MGED Ontology, to compute conceptual dissimilarities between pairs of experiments. These dissimilarities are then used to support two types of exploratory analysis—clustering and query-by-example. We show that our proposed dissimilarity measures correspond to a user's intuition about conceptual dissimilarity, and can be used to support effective query-by-example. We also evaluate the quality of clustering based on these measures. While AnnotCompute can support a richer data exploration experience, its effectiveness is limited in some cases, due to the quality of available annotations. Nonetheless, tools such as AnnotCompute may provide an incentive for richer annotations of experiments. Code is available for download at http://www.cbil.upenn.edu/downloads/AnnotCompute. Database URL: http://www.cbil.upenn.edu/annotCompute/ PMID:22190598
The language of gene ontology: a Zipf's law analysis.
Kalankesh, Leila Ranandeh; Stevens, Robert; Brass, Andy
2012-06-07
Most major genome projects and sequence databases provide a GO annotation of their data, either automatically or through human annotators, creating a large corpus of data written in the language of GO. Texts written in natural language show a statistical power law behaviour, Zipf's law, the exponent of which can provide useful information on the nature of the language being used. We have therefore explored the hypothesis that collections of GO annotations will show similar statistical behaviours to natural language. Annotations from the Gene Ontology Annotation project were found to follow Zipf's law. Surprisingly, the measured power law exponents were consistently different between annotation captured using the three GO sub-ontologies in the corpora (function, process and component). On filtering the corpora using GO evidence codes we found that the value of the measured power law exponent responded in a predictable way as a function of the evidence codes used to support the annotation. Techniques from computational linguistics can provide new insights into the annotation process. GO annotations show similar statistical behaviours to those seen in natural language with measured exponents that provide a signal which correlates with the nature of the evidence codes used to support the annotations, suggesting that the measured exponent might provide a signal regarding the information content of the annotation.
Busk, P K; Pilgaard, B; Lezyk, M J; Meyer, A S; Lange, L
2017-04-12
Carbohydrate-active enzymes are found in all organisms and participate in key biological processes. These enzymes are classified in 274 families in the CAZy database but the sequence diversity within each family makes it a major task to identify new family members and to provide basis for prediction of enzyme function. A fast and reliable method for de novo annotation of genes encoding carbohydrate-active enzymes is to identify conserved peptides in the curated enzyme families followed by matching of the conserved peptides to the sequence of interest as demonstrated for the glycosyl hydrolase and the lytic polysaccharide monooxygenase families. This approach not only assigns the enzymes to families but also provides functional prediction of the enzymes with high accuracy. We identified conserved peptides for all enzyme families in the CAZy database with Peptide Pattern Recognition. The conserved peptides were matched to protein sequence for de novo annotation and functional prediction of carbohydrate-active enzymes with the Hotpep method. Annotation of protein sequences from 12 bacterial and 16 fungal genomes to families with Hotpep had an accuracy of 0.84 (measured as F1-score) compared to semiautomatic annotation by the CAZy database whereas the dbCAN HMM-based method had an accuracy of 0.77 with optimized parameters. Furthermore, Hotpep provided a functional prediction with 86% accuracy for the annotated genes. Hotpep is available as a stand-alone application for MS Windows. Hotpep is a state-of-the-art method for automatic annotation and functional prediction of carbohydrate-active enzymes.
A functional polymorphism of the TNF-{alpha} gene that is associated with type 2 DM
DOE Office of Scientific and Technical Information (OSTI.GOV)
Susa, Shinji; Daimon, Makoto; Sakabe, Jun-Ichi
2008-05-09
To examine the association of the tumor necrosis factor-{alpha} (TNF-{alpha}) gene region with type 2 diabetes (DM), 11 single-nucleotide polymorphisms (SNPs) of the region were analyzed. The initial study using a sample set (148 cases vs. 227 controls) showed a significant association of the SNP IVS1G + 123A of the TNF-{alpha} gene with DM (p = 0.0056). Multiple logistic regression analysis using an enlarged sample set (225 vs. 716) revealed the significant association of the SNP with DM independently of any clinical traits examined (OR: 1.49, p = 0.014). The functional relevance of the SNP were examined by the electrophoreticmore » mobility shift assays using nuclear extracts from the U937 and NIH3T3 cells and luciferase assays in these cells with Simian virus 40 promoter- and TNF-{alpha} promoter-reporter gene constructs. The functional analyses showed that YY1 transcription factor bound allele-specifically to the SNP region and, the IVS1 + 123A allele had an increase in luciferase expression compared with the G allele.« less
The NO donor sodium nitroprusside: evaluation of skeletal muscle vascular and metabolic dysfunction
Hirai, Daniel M.; Copp, Steven W.; Ferguson, Scott K.; Holdsworth, Clark T.; Musch, Timothy I.; Poole, David C.
2012-01-01
The nitric oxide (NO) donor sodium nitroprusside (SNP) may promote cyanide-induced toxicity and systemic and/or local responses approaching maximal vasodilation. The hypotheses were tested that SNP superfusion of the rat spinotrapezius muscle exerts 1) residual impairments in resting and contracting blood flow, oxygen utilization (V̇O2) and microvascular O2 pressure (PO2mv); and 2) marked hypotension and elevation in resting PO2mv. Two superfusion protocols were performed: 1) Krebs-Henseleit (control 1), SNP (300 µM; a dose used commonly in superfusion studies) and Krebs-Henseleit (control 2), in this order; 2) 300 and 1200 µM SNP in random order. Spinotrapezius muscle blood flow (radiolabeled microspheres), V̇O2 (Fick calculation) and PO2mv (phosphorescence quenching) were determined at rest and during electrically-induced (1 Hz) contractions. There were no differences in spinotrapezius blood flow, V̇O2 or PO2mv at rest and during contractions pre- and post-SNP condition (control 1 and control 2; p>0.05 for all). With regard to dosing, SNP produced a graded elevation in resting PO2mv (p<0.05) with a reduction in mean arterial pressure only at the higher concentration (p<0.05). Contrary to our hypothesis, skeletal muscle superfusion with the NO donor SNP (300 µM) improved microvascular oxygenation during the transition from rest to contractions (PO2mv kinetics) without precipitating residual impairment of muscle hemodynamic or metabolic control or compromising systemic hemodynamics. These data suggest that SNP superfusion (300 µM) constitutes a valid and important tool for assessing the functional roles of NO in resting and contracting skeletal muscle function without incurring residual alterations consistent with cyanide accumulation and poisoning. PMID:23174313
Cho, Young-Il; Ahn, Yul-Kyun; Tripathi, Swati; Kim, Jeong-Ho; Lee, Hye-Eun; Kim, Do-Sun
2015-01-01
Numerous studies using single nucleotide polymorphisms (SNPs) have been conducted in humans, and other animals, and in major crops, including rice, soybean, and Chinese cabbage. However, the number of SNP studies in cabbage is limited. In this present study, we evaluated whether 7,645 SNPs previously identified as molecular markers linked to disease resistance in the Brassica rapa genome could be applied to B. oleracea. In a BLAST analysis using the SNP sequences of B. rapa and B. oleracea genomic sequence data registered in the NCBI database, 256 genes for which SNPs had been identified in B. rapa were found in B. oleracea. These genes were classified into three functional groups: molecular function (64 genes), biological process (96 genes), and cellular component (96 genes). A total of 693 SNP markers, including 145 SNP markers [BRH—developed from the B. rapa genome for high-resolution melt (HRM) analysis], 425 SNP markers (BRP—based on the B. rapa genome that could be applied to B. oleracea), and 123 new SNP markers (BRS—derived from BRP and designed for HRM analysis), were investigated for their ability to amplify sequences from cabbage genomic DNA. In total, 425 of the SNP markers (BRP-based on B. rapa genome), selected from 7,645 SNPs, were successfully applied to B. oleracea. Using PCR, 108 of 145 BRH (74.5%), 415 of 425 BRP (97.6%), and 118 of 123 BRS (95.9%) showed amplification, suggesting that it is possible to apply SNP markers developed based on the B. rapa genome to B. oleracea. These results provide valuable information that can be utilized in cabbage genetics and breeding programs using molecular markers derived from other Brassica species. PMID:25790283
proGenomes: a resource for consistent functional and taxonomic annotations of prokaryotic genomes.
Mende, Daniel R; Letunic, Ivica; Huerta-Cepas, Jaime; Li, Simone S; Forslund, Kristoffer; Sunagawa, Shinichi; Bork, Peer
2017-01-04
The availability of microbial genomes has opened many new avenues of research within microbiology. This has been driven primarily by comparative genomics approaches, which rely on accurate and consistent characterization of genomic sequences. It is nevertheless difficult to obtain consistent taxonomic and integrated functional annotations for defined prokaryotic clades. Thus, we developed proGenomes, a resource that provides user-friendly access to currently 25 038 high-quality genomes whose sequences and consistent annotations can be retrieved individually or by taxonomic clade. These genomes are assigned to 5306 consistent and accurate taxonomic species clusters based on previously established methodology. proGenomes also contains functional information for almost 80 million protein-coding genes, including a comprehensive set of general annotations and more focused annotations for carbohydrate-active enzymes and antibiotic resistance genes. Additionally, broad habitat information is provided for many genomes. All genomes and associated information can be downloaded by user-selected clade or multiple habitat-specific sets of representative genomes. We expect that the availability of high-quality genomes with comprehensive functional annotations will promote advances in clinical microbial genomics, functional evolution and other subfields of microbiology. proGenomes is available at http://progenomes.embl.de. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Text Mining Improves Prediction of Protein Functional Sites
Cohn, Judith D.; Ravikumar, Komandur E.
2012-01-01
We present an approach that integrates protein structure analysis and text mining for protein functional site prediction, called LEAP-FS (Literature Enhanced Automated Prediction of Functional Sites). The structure analysis was carried out using Dynamics Perturbation Analysis (DPA), which predicts functional sites at control points where interactions greatly perturb protein vibrations. The text mining extracts mentions of residues in the literature, and predicts that residues mentioned are functionally important. We assessed the significance of each of these methods by analyzing their performance in finding known functional sites (specifically, small-molecule binding sites and catalytic sites) in about 100,000 publicly available protein structures. The DPA predictions recapitulated many of the functional site annotations and preferentially recovered binding sites annotated as biologically relevant vs. those annotated as potentially spurious. The text-based predictions were also substantially supported by the functional site annotations: compared to other residues, residues mentioned in text were roughly six times more likely to be found in a functional site. The overlap of predictions with annotations improved when the text-based and structure-based methods agreed. Our analysis also yielded new high-quality predictions of many functional site residues that were not catalogued in the curated data sources we inspected. We conclude that both DPA and text mining independently provide valuable high-throughput protein functional site predictions, and that integrating the two methods using LEAP-FS further improves the quality of these predictions. PMID:22393388
Mantello, Camila Campos; Cardoso-Silva, Claudio Benicio; da Silva, Carla Cristina; de Souza, Livia Moura; Scaloppi Junior, Erivaldo José; de Souza Gonçalves, Paulo; Vicentini, Renato; de Souza, Anete Pereira
2014-01-01
Hevea brasiliensis (Willd. Ex Adr. Juss.) Muell.-Arg. is the primary source of natural rubber that is native to the Amazon rainforest. The singular properties of natural rubber make it superior to and competitive with synthetic rubber for use in several applications. Here, we performed RNA sequencing (RNA-seq) of H. brasiliensis bark on the Illumina GAIIx platform, which generated 179,326,804 raw reads on the Illumina GAIIx platform. A total of 50,384 contigs that were over 400 bp in size were obtained and subjected to further analyses. A similarity search against the non-redundant (nr) protein database returned 32,018 (63%) positive BLASTx hits. The transcriptome analysis was annotated using the clusters of orthologous groups (COG), gene ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and Pfam databases. A search for putative molecular marker was performed to identify simple sequence repeats (SSRs) and single nucleotide polymorphisms (SNPs). In total, 17,927 SSRs and 404,114 SNPs were detected. Finally, we selected sequences that were identified as belonging to the mevalonate (MVA) and 2-C-methyl-D-erythritol 4-phosphate (MEP) pathways, which are involved in rubber biosynthesis, to validate the SNP markers. A total of 78 SNPs were validated in 36 genotypes of H. brasiliensis. This new dataset represents a powerful information source for rubber tree bark genes and will be an important tool for the development of microsatellites and SNP markers for use in future genetic analyses such as genetic linkage mapping, quantitative trait loci identification, investigations of linkage disequilibrium and marker-assisted selection.
Okada, D; Endo, S; Matsuda, H; Ogawa, S; Taniguchi, Y; Katsuta, T; Watanabe, T; Iwaisaki, H
2018-05-12
Genome-wide association studies (GWAS) of quantitative traits have detected numerous genetic associations, but they encounter difficulties in pinpointing prominent candidate genes and inferring gene networks. The present study used a systems genetics approach integrating GWAS results with external RNA-expression data to detect candidate gene networks in feed utilization and growth traits of Japanese Black cattle, which are matters of concern. A SNP co-association network was derived from significant correlations between SNPs with effects estimated by GWAS across seven phenotypic traits. The resulting network genes contained significant numbers of annotations related to the traits. Using bovine transcriptome data from a public database, an RNA co-expression network was inferred based on the similarity of expression patterns across different tissues. An intersection network was then generated by superimposing the SNP and RNA networks and extracting shared interactions. This intersection network contained four tissue-specific modules: nervous system, reproductive system, muscular system, and glands. To characterize the structure (topographical properties) of the three networks, their scale-free properties were evaluated, which revealed that the intersection network was the most scale-free. In the sub-network containing the most connected transcription factors (URI1, ROCK2 and ETV6), most genes were widely expressed across tissues, and genes previously shown to be involved in the traits were found. Results indicated that the current approach might be used to construct a gene network that better reflects biological information, providing encouragement for the genetic dissection of economically important quantitative traits.
ERIC Educational Resources Information Center
Wallen, Erik; Plass, Jan L.; Brunken, Roland
2005-01-01
Students participated in a study (n = 98) investigating the effectiveness of three types of annotations on three learning outcome measures. The annotations were designed to support the cognitive processes in the comprehension of scientific texts, with a function to aid either the process of selecting relevant information, organizing the…
A computational platform to maintain and migrate manual functional annotations for BioCyc databases.
Walsh, Jesse R; Sen, Taner Z; Dickerson, Julie A
2014-10-12
BioCyc databases are an important resource for information on biological pathways and genomic data. Such databases represent the accumulation of biological data, some of which has been manually curated from literature. An essential feature of these databases is the continuing data integration as new knowledge is discovered. As functional annotations are improved, scalable methods are needed for curators to manage annotations without detailed knowledge of the specific design of the BioCyc database. We have developed CycTools, a software tool which allows curators to maintain functional annotations in a model organism database. This tool builds on existing software to improve and simplify annotation data imports of user provided data into BioCyc databases. Additionally, CycTools automatically resolves synonyms and alternate identifiers contained within the database into the appropriate internal identifiers. Automating steps in the manual data entry process can improve curation efforts for major biological databases. The functionality of CycTools is demonstrated by transferring GO term annotations from MaizeCyc to matching proteins in CornCyc, both maize metabolic pathway databases available at MaizeGDB, and by creating strain specific databases for metabolic engineering.
MetaStorm: A Public Resource for Customizable Metagenomics Annotation
Arango-Argoty, Gustavo; Singh, Gargi; Heath, Lenwood S.; Pruden, Amy; Xiao, Weidong; Zhang, Liqing
2016-01-01
Metagenomics is a trending research area, calling for the need to analyze large quantities of data generated from next generation DNA sequencing technologies. The need to store, retrieve, analyze, share, and visualize such data challenges current online computational systems. Interpretation and annotation of specific information is especially a challenge for metagenomic data sets derived from environmental samples, because current annotation systems only offer broad classification of microbial diversity and function. Moreover, existing resources are not configured to readily address common questions relevant to environmental systems. Here we developed a new online user-friendly metagenomic analysis server called MetaStorm (http://bench.cs.vt.edu/MetaStorm/), which facilitates customization of computational analysis for metagenomic data sets. Users can upload their own reference databases to tailor the metagenomics annotation to focus on various taxonomic and functional gene markers of interest. MetaStorm offers two major analysis pipelines: an assembly-based annotation pipeline and the standard read annotation pipeline used by existing web servers. These pipelines can be selected individually or together. Overall, MetaStorm provides enhanced interactive visualization to allow researchers to explore and manipulate taxonomy and functional annotation at various levels of resolution. PMID:27632579
MetaStorm: A Public Resource for Customizable Metagenomics Annotation.
Arango-Argoty, Gustavo; Singh, Gargi; Heath, Lenwood S; Pruden, Amy; Xiao, Weidong; Zhang, Liqing
2016-01-01
Metagenomics is a trending research area, calling for the need to analyze large quantities of data generated from next generation DNA sequencing technologies. The need to store, retrieve, analyze, share, and visualize such data challenges current online computational systems. Interpretation and annotation of specific information is especially a challenge for metagenomic data sets derived from environmental samples, because current annotation systems only offer broad classification of microbial diversity and function. Moreover, existing resources are not configured to readily address common questions relevant to environmental systems. Here we developed a new online user-friendly metagenomic analysis server called MetaStorm (http://bench.cs.vt.edu/MetaStorm/), which facilitates customization of computational analysis for metagenomic data sets. Users can upload their own reference databases to tailor the metagenomics annotation to focus on various taxonomic and functional gene markers of interest. MetaStorm offers two major analysis pipelines: an assembly-based annotation pipeline and the standard read annotation pipeline used by existing web servers. These pipelines can be selected individually or together. Overall, MetaStorm provides enhanced interactive visualization to allow researchers to explore and manipulate taxonomy and functional annotation at various levels of resolution.
Wang, Ying; Yang, Liandong; Wu, Bo; Song, Zhaobin; He, Shunping
2015-07-10
Triplophysa dalaica, endemic species of Qinghai-Tibetan Plateau, is informative for understanding the genetic basis of adaptation to hypoxic conditions of high altitude habitats. Here, a comprehensive gene repertoire for this plateau fish was generated using the Illumina deep paired-end high-throughput sequencing technology. De novo assembly yielded 145, 256 unigenes with an average length of 1632 bp. Blast searches against GenBank non-redundant database annotated 74,594 (51.4%) unigenes encoding for 30,047 gene descriptions in T. dalaica. Functional annotation and classification of assembled sequences were performed using Gene Ontology (GO), clusters of euKaryotic Orthologous Groups (KOG) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analysis. After comparison with other fish transcriptomes, including silver carp (Hypophthalmichthys molitrix) and mud loach (Misgurnus anguillicaudatus), 2621 high-quality orthologous gene alignments were constructed among these species. 61 (2.3%) of the genes were identified as having undergone positive selection in the T. dalaica lineage. Within the positively selected genes, 13 genes were involved in hypoxia response, of which 11 were listed in HypoxiaDB. Furthermore, duplicated hif-α (hif-1αA/B and hif-2αA/B), EGLN1 and PPARA candidate genes involved in adaptation to hypoxia were identified in T. dalaica transcriptome. Branch-site model in PAML validated that hif-1αB and hif-2αA genes have undergone positive selection in T.dalaica. Finally, 37,501 simple sequence repeats (SSRs) and 19,497 high-quality single nucleotide polymorphisms (SNPs) were identified in T. dalaica. The identified SSR and SNP markers will facilitate the genetic structure, population geography and ecological studies of Triplophysa fishes. Copyright © 2015 Elsevier B.V. All rights reserved.
Wang, Lin; Liu, Simin; Niu, Tianhua; Xu, Xin
2005-03-18
Single nucleotide polymorphisms (SNPs) provide an important tool in pinpointing susceptibility genes for complex diseases and in unveiling human molecular evolution. Selection and retrieval of an optimal SNP set from publicly available databases have emerged as the foremost bottlenecks in designing large-scale linkage disequilibrium studies, particularly in case-control settings. We describe the architectural structure and implementations of a novel software program, SNPHunter, which allows for both ad hoc-mode and batch-mode SNP search, automatic SNP filtering, and retrieval of SNP data, including physical position, function class, flanking sequences at user-defined lengths, and heterozygosity from NCBI dbSNP. The SNP data extracted from dbSNP via SNPHunter can be exported and saved in plain text format for further down-stream analyses. As an illustration, we applied SNPHunter for selecting SNPs for 10 major candidate genes for type 2 diabetes, including CAPN10, FABP4, IL6, NOS3, PPARG, TNF, UCP2, CRP, ESR1, and AR. SNPHunter constitutes an efficient and user-friendly tool for SNP screening, selection, and acquisition. The executable and user's manual are available at http://www.hsph.harvard.edu/ppg/software.htm
Evaluating Computational Gene Ontology Annotations.
Škunca, Nives; Roberts, Richard J; Steffen, Martin
2017-01-01
Two avenues to understanding gene function are complementary and often overlapping: experimental work and computational prediction. While experimental annotation generally produces high-quality annotations, it is low throughput. Conversely, computational annotations have broad coverage, but the quality of annotations may be variable, and therefore evaluating the quality of computational annotations is a critical concern.In this chapter, we provide an overview of strategies to evaluate the quality of computational annotations. First, we discuss why evaluating quality in this setting is not trivial. We highlight the various issues that threaten to bias the evaluation of computational annotations, most of which stem from the incompleteness of biological databases. Second, we discuss solutions that address these issues, for example, targeted selection of new experimental annotations and leveraging the existing experimental annotations.
Seaver, Samuel M. D.; Gerdes, Svetlana; Frelin, Océane; Lerma-Ortiz, Claudia; Bradbury, Louis M. T.; Zallot, Rémi; Hasnain, Ghulam; Niehaus, Thomas D.; El Yacoubi, Basma; Pasternak, Shiran; Olson, Robert; Pusch, Gordon; Overbeek, Ross; Stevens, Rick; de Crécy-Lagard, Valérie; Ware, Doreen; Hanson, Andrew D.; Henry, Christopher S.
2014-01-01
The increasing number of sequenced plant genomes is placing new demands on the methods applied to analyze, annotate, and model these genomes. Today’s annotation pipelines result in inconsistent gene assignments that complicate comparative analyses and prevent efficient construction of metabolic models. To overcome these problems, we have developed the PlantSEED, an integrated, metabolism-centric database to support subsystems-based annotation and metabolic model reconstruction for plant genomes. PlantSEED combines SEED subsystems technology, first developed for microbial genomes, with refined protein families and biochemical data to assign fully consistent functional annotations to orthologous genes, particularly those encoding primary metabolic pathways. Seamless integration with its parent, the prokaryotic SEED database, makes PlantSEED a unique environment for cross-kingdom comparative analysis of plant and bacterial genomes. The consistent annotations imposed by PlantSEED permit rapid reconstruction and modeling of primary metabolism for all plant genomes in the database. This feature opens the unique possibility of model-based assessment of the completeness and accuracy of gene annotation and thus allows computational identification of genes and pathways that are restricted to certain genomes or need better curation. We demonstrate the PlantSEED system by producing consistent annotations for 10 reference genomes. We also produce a functioning metabolic model for each genome, gapfilling to identify missing annotations and proposing gene candidates for missing annotations. Models are built around an extended biomass composition representing the most comprehensive published to date. To our knowledge, our models are the first to be published for seven of the genomes analyzed. PMID:24927599
Seaver, Samuel M D; Gerdes, Svetlana; Frelin, Océane; Lerma-Ortiz, Claudia; Bradbury, Louis M T; Zallot, Rémi; Hasnain, Ghulam; Niehaus, Thomas D; El Yacoubi, Basma; Pasternak, Shiran; Olson, Robert; Pusch, Gordon; Overbeek, Ross; Stevens, Rick; de Crécy-Lagard, Valérie; Ware, Doreen; Hanson, Andrew D; Henry, Christopher S
2014-07-01
The increasing number of sequenced plant genomes is placing new demands on the methods applied to analyze, annotate, and model these genomes. Today's annotation pipelines result in inconsistent gene assignments that complicate comparative analyses and prevent efficient construction of metabolic models. To overcome these problems, we have developed the PlantSEED, an integrated, metabolism-centric database to support subsystems-based annotation and metabolic model reconstruction for plant genomes. PlantSEED combines SEED subsystems technology, first developed for microbial genomes, with refined protein families and biochemical data to assign fully consistent functional annotations to orthologous genes, particularly those encoding primary metabolic pathways. Seamless integration with its parent, the prokaryotic SEED database, makes PlantSEED a unique environment for cross-kingdom comparative analysis of plant and bacterial genomes. The consistent annotations imposed by PlantSEED permit rapid reconstruction and modeling of primary metabolism for all plant genomes in the database. This feature opens the unique possibility of model-based assessment of the completeness and accuracy of gene annotation and thus allows computational identification of genes and pathways that are restricted to certain genomes or need better curation. We demonstrate the PlantSEED system by producing consistent annotations for 10 reference genomes. We also produce a functioning metabolic model for each genome, gapfilling to identify missing annotations and proposing gene candidates for missing annotations. Models are built around an extended biomass composition representing the most comprehensive published to date. To our knowledge, our models are the first to be published for seven of the genomes analyzed.
Thomas, Paul D.; Wood, Valerie; Mungall, Christopher J.; Lewis, Suzanna E.; Blake, Judith A.
2012-01-01
A recent paper (Nehrt et al., PLoS Comput. Biol. 7:e1002073, 2011) has proposed a metric for the “functional similarity” between two genes that uses only the Gene Ontology (GO) annotations directly derived from published experimental results. Applying this metric, the authors concluded that paralogous genes within the mouse genome or the human genome are more functionally similar on average than orthologous genes between these genomes, an unexpected result with broad implications if true. We suggest, based on both theoretical and empirical considerations, that this proposed metric should not be interpreted as a functional similarity, and therefore cannot be used to support any conclusions about the “ortholog conjecture” (or, more properly, the “ortholog functional conservation hypothesis”). First, we reexamine the case studies presented by Nehrt et al. as examples of orthologs with divergent functions, and come to a very different conclusion: they actually exemplify how GO annotations for orthologous genes provide complementary information about conserved biological functions. We then show that there is a global ascertainment bias in the experiment-based GO annotations for human and mouse genes: particular types of experiments tend to be performed in different model organisms. We conclude that the reported statistical differences in annotations between pairs of orthologous genes do not reflect differences in biological function, but rather complementarity in experimental approaches. Our results underscore two general considerations for researchers proposing novel types of analysis based on the GO: 1) that GO annotations are often incomplete, potentially in a biased manner, and subject to an “open world assumption” (absence of an annotation does not imply absence of a function), and 2) that conclusions drawn from a novel, large-scale GO analysis should whenever possible be supported by careful, in-depth examination of examples, to help ensure the conclusions have a justifiable biological basis. PMID:22359495
A Factor Graph Approach to Automated GO Annotation
Spetale, Flavio E.; Tapia, Elizabeth; Krsticevic, Flavia; Roda, Fernando; Bulacio, Pilar
2016-01-01
As volume of genomic data grows, computational methods become essential for providing a first glimpse onto gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with main focus on annotation precision, and heuristic alternatives, with main focus on scalability issues, have been described in literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum. PMID:26771463
A Factor Graph Approach to Automated GO Annotation.
Spetale, Flavio E; Tapia, Elizabeth; Krsticevic, Flavia; Roda, Fernando; Bulacio, Pilar
2016-01-01
As volume of genomic data grows, computational methods become essential for providing a first glimpse onto gene annotations. Automated Gene Ontology (GO) annotation methods based on hierarchical ensemble classification techniques are particularly interesting when interpretability of annotation results is a main concern. In these methods, raw GO-term predictions computed by base binary classifiers are leveraged by checking the consistency of predefined GO relationships. Both formal leveraging strategies, with main focus on annotation precision, and heuristic alternatives, with main focus on scalability issues, have been described in literature. In this contribution, a factor graph approach to the hierarchical ensemble formulation of the automated GO annotation problem is presented. In this formal framework, a core factor graph is first built based on the GO structure and then enriched to take into account the noisy nature of GO-term predictions. Hence, starting from raw GO-term predictions, an iterative message passing algorithm between nodes of the factor graph is used to compute marginal probabilities of target GO-terms. Evaluations on Saccharomyces cerevisiae, Arabidopsis thaliana and Drosophila melanogaster protein sequences from the GO Molecular Function domain showed significant improvements over competing approaches, even when protein sequences were naively characterized by their physicochemical and secondary structure properties or when loose noisy annotation datasets were considered. Based on these promising results and using Arabidopsis thaliana annotation data, we extend our approach to the identification of most promising molecular function annotations for a set of proteins of unknown function in Solanum lycopersicum.
Chowdhary, Nupoor; Selvaraj, Ashok; KrishnaKumaar, Lakshmi; Kumar, Gopal Ramesh
2015-01-01
Caldicellulosiruptor saccharolyticus has proven itself to be an excellent candidate for biological hydrogen (H2) production, but still it has major drawbacks like sensitivity to high osmotic pressure and low volumetric H2 productivity, which should be considered before it can be used industrially. A whole genome re-annotation work has been carried out as an attempt to update the incomplete genome information that causes gap in the knowledge especially in the area of metabolic engineering, to improve the H2 producing capabilities of C. saccharolyticus. Whole genome re-annotation was performed through manual means for 2,682 Coding Sequences (CDSs). Bioinformatics tools based on sequence similarity, motif search, phylogenetic analysis and fold recognition were employed for re-annotation. Our methodology could successfully add functions for 409 hypothetical proteins (HPs), 46 proteins previously annotated as putative and assigned more accurate functions for the known protein sequences. Homology based gene annotation has been used as a standard method for assigning function to novel proteins, but over the past few years many non-homology based methods such as genomic context approaches for protein function prediction have been developed. Using non-homology based functional prediction methods, we were able to assign cellular processes or physical complexes for 249 hypothetical sequences. Our re-annotation pipeline highlights the addition of 231 new CDSs generated from MicroScope Platform, to the original genome with functional prediction for 49 of them. The re-annotation of HPs and new CDSs is stored in the relational database that is available on the MicroScope web-based platform. In parallel, a comparative genome analyses were performed among the members of genus Caldicellulosiruptor to understand the function and evolutionary processes. Further, with results from integrated re-annotation studies (homology and genomic context approach), we strongly suggest that Csac_0437 and Csac_0424 encode for glycoside hydrolases (GH) and are proposed to be involved in the decomposition of recalcitrant plant polysaccharides. Similarly, HPs: Csac_0732, Csac_1862, Csac_1294 and Csac_0668 are suggested to play a significant role in biohydrogen production. Function prediction of these HPs by using our integrated approach will considerably enhance the interpretation of large-scale experiments targeting this industrially important organism. PMID:26196387
Chowdhary, Nupoor; Selvaraj, Ashok; KrishnaKumaar, Lakshmi; Kumar, Gopal Ramesh
2015-01-01
Caldicellulosiruptor saccharolyticus has proven itself to be an excellent candidate for biological hydrogen (H2) production, but still it has major drawbacks like sensitivity to high osmotic pressure and low volumetric H2 productivity, which should be considered before it can be used industrially. A whole genome re-annotation work has been carried out as an attempt to update the incomplete genome information that causes gap in the knowledge especially in the area of metabolic engineering, to improve the H2 producing capabilities of C. saccharolyticus. Whole genome re-annotation was performed through manual means for 2,682 Coding Sequences (CDSs). Bioinformatics tools based on sequence similarity, motif search, phylogenetic analysis and fold recognition were employed for re-annotation. Our methodology could successfully add functions for 409 hypothetical proteins (HPs), 46 proteins previously annotated as putative and assigned more accurate functions for the known protein sequences. Homology based gene annotation has been used as a standard method for assigning function to novel proteins, but over the past few years many non-homology based methods such as genomic context approaches for protein function prediction have been developed. Using non-homology based functional prediction methods, we were able to assign cellular processes or physical complexes for 249 hypothetical sequences. Our re-annotation pipeline highlights the addition of 231 new CDSs generated from MicroScope Platform, to the original genome with functional prediction for 49 of them. The re-annotation of HPs and new CDSs is stored in the relational database that is available on the MicroScope web-based platform. In parallel, a comparative genome analyses were performed among the members of genus Caldicellulosiruptor to understand the function and evolutionary processes. Further, with results from integrated re-annotation studies (homology and genomic context approach), we strongly suggest that Csac_0437 and Csac_0424 encode for glycoside hydrolases (GH) and are proposed to be involved in the decomposition of recalcitrant plant polysaccharides. Similarly, HPs: Csac_0732, Csac_1862, Csac_1294 and Csac_0668 are suggested to play a significant role in biohydrogen production. Function prediction of these HPs by using our integrated approach will considerably enhance the interpretation of large-scale experiments targeting this industrially important organism.
Solving the Problem: Genome Annotation Standards before the Data Deluge.
Klimke, William; O'Donovan, Claire; White, Owen; Brister, J Rodney; Clark, Karen; Fedorov, Boris; Mizrachi, Ilene; Pruitt, Kim D; Tatusova, Tatiana
2011-10-15
The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries.
Solving the Problem: Genome Annotation Standards before the Data Deluge
Klimke, William; O'Donovan, Claire; White, Owen; Brister, J. Rodney; Clark, Karen; Fedorov, Boris; Mizrachi, Ilene; Pruitt, Kim D.; Tatusova, Tatiana
2011-01-01
The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries. PMID:22180819
IL6R Variation Asp358Ala Is a Potential Modifier of Lung Function in Asthma
Hawkins, Gregory A; Robinson, Mac B; Hastie, Annette T; Li, Xingnan; Li, Huashi; Moore, Wendy C; Howard, Timothy D; Busse, William W.; Erzurum, Serpil C.; Wenzel, Sally E.; Peters, Stephen P; Meyers, Deborah A; Bleecker, Eugene R
2012-01-01
Background The IL6R SNP rs4129267 has recently been identified as an asthma susceptibility locus in subjects of European ancestry but has not been characterized with respect to asthma severity. The SNP rs4129267 is in linkage disequilibrium (r2=1) with the IL6R coding SNP rs2228145 (Asp358Ala). This IL6R coding change increases IL6 receptor shedding and promotes IL6 transsignaling. Objectives To evaluate the IL6R SNP rs2228145 with respect to asthma severity phenotypes. Methods The IL6R SNP rs2228145 was evaluated in subjects of European ancestry with asthma from the Severe Asthma Research Program (SARP). Lung function associations were replicated in the Collaborative Study on the Genetics of Asthma (CSGA) cohort. Serum soluble IL6 receptor (sIL6R) levels were measured in subjects from SARP. Immunohistochemistry was used to qualitatively evaluate IL6R protein expression in BAL cells and endobronchial biopsies. Results The minor C allele of IL6R SNP rs2228145 was associated with lower ppFEV1 in the SARP cohort (p=0.005), the CSGA cohort (0.008), and in combined cohort analysis (p=0.003). Additional associations with ppFVC, FEV1/FVC, and PC20 were observed. The rs2228145 C allele (Ala358) was more frequent in severe asthma phenotypic clusters. Elevated serum sIL6R was associated with lower ppFEV1 (p=0.02) and lower ppFVC (p=0.008) (N=146). IL6R protein expression was observed in BAL macrophages, airway epithelium, vascular endothelium, and airway smooth muscle. Conclusions The IL6R coding SNP rs2228145 (Asp358Ala) is a potential modifier of lung function in asthma and may identify subjects at risk for more severe asthma. IL6 transsignaling may have a pathogenic role in the lung. PMID:22554704
A curated catalog of canine and equine keratin genes
Pujar, Shashikant; McGarvey, Kelly M.; Welle, Monika; Galichet, Arnaud; Müller, Eliane J.; Pruitt, Kim D.; Leeb, Tosso
2017-01-01
Keratins represent a large protein family with essential structural and functional roles in epithelial cells of skin, hair follicles, and other organs. During evolution the genes encoding keratins have undergone multiple rounds of duplication and humans have two clusters with a total of 55 functional keratin genes in their genomes. Due to the high similarity between different keratin paralogs and species-specific differences in gene content, the currently available keratin gene annotation in species with draft genome assemblies such as dog and horse is still imperfect. We compared the National Center for Biotechnology Information (NCBI) (dog annotation release 103, horse annotation release 101) and Ensembl (release 87) gene predictions for the canine and equine keratin gene clusters to RNA-seq data that were generated from adult skin of five dogs and two horses and from adult hair follicle tissue of one dog. Taking into consideration the knowledge on the conserved exon/intron structure of keratin genes, we annotated 61 putatively functional keratin genes in both the dog and horse, respectively. Subsequently, curators in the RefSeq group at NCBI reviewed their annotation of keratin genes in the dog and horse genomes (Annotation Release 104 and Annotation Release 102, respectively) and updated annotation and gene nomenclature of several keratin genes. The updates are now available in the NCBI Gene database (https://www.ncbi.nlm.nih.gov/gene). PMID:28846680
Andersson, Leif; Archibald, Alan L; Bottema, Cynthia D; Brauning, Rudiger; Burgess, Shane C; Burt, Dave W; Casas, Eduardo; Cheng, Hans H; Clarke, Laura; Couldrey, Christine; Dalrymple, Brian P; Elsik, Christine G; Foissac, Sylvain; Giuffra, Elisabetta; Groenen, Martien A; Hayes, Ben J; Huang, LuSheng S; Khatib, Hassan; Kijas, James W; Kim, Heebal; Lunney, Joan K; McCarthy, Fiona M; McEwan, John C; Moore, Stephen; Nanduri, Bindu; Notredame, Cedric; Palti, Yniv; Plastow, Graham S; Reecy, James M; Rohrer, Gary A; Sarropoulou, Elena; Schmidt, Carl J; Silverstein, Jeffrey; Tellam, Ross L; Tixier-Boichard, Michele; Tosser-Klopp, Gwenola; Tuggle, Christopher K; Vilkki, Johanna; White, Stephen N; Zhao, Shuhong; Zhou, Huaijun
2015-03-25
We describe the organization of a nascent international effort, the Functional Annotation of Animal Genomes (FAANG) project, whose aim is to produce comprehensive maps of functional elements in the genomes of domesticated animal species.
USDA-ARS?s Scientific Manuscript database
We describe the organization of a nascent international effort - the "Functional Annotation of ANimal Genomes" project - whose aim is to produce comprehensive maps of functional elements in the genomes of domesticated animal species....
Optimal Design of Low-Density SNP Arrays for Genomic Prediction: Algorithm and Applications.
Wu, Xiao-Lin; Xu, Jiaqi; Feng, Guofei; Wiggans, George R; Taylor, Jeremy F; He, Jun; Qian, Changsong; Qiu, Jiansheng; Simpson, Barry; Walker, Jeremy; Bauck, Stewart
2016-01-01
Low-density (LD) single nucleotide polymorphism (SNP) arrays provide a cost-effective solution for genomic prediction and selection, but algorithms and computational tools are needed for the optimal design of LD SNP chips. A multiple-objective, local optimization (MOLO) algorithm was developed for design of optimal LD SNP chips that can be imputed accurately to medium-density (MD) or high-density (HD) SNP genotypes for genomic prediction. The objective function facilitates maximization of non-gap map length and system information for the SNP chip, and the latter is computed either as locus-averaged (LASE) or haplotype-averaged Shannon entropy (HASE) and adjusted for uniformity of the SNP distribution. HASE performed better than LASE with ≤1,000 SNPs, but required considerably more computing time. Nevertheless, the differences diminished when >5,000 SNPs were selected. Optimization was accomplished conditionally on the presence of SNPs that were obligated to each chromosome. The frame location of SNPs on a chip can be either uniform (evenly spaced) or non-uniform. For the latter design, a tunable empirical Beta distribution was used to guide location distribution of frame SNPs such that both ends of each chromosome were enriched with SNPs. The SNP distribution on each chromosome was finalized through the objective function that was locally and empirically maximized. This MOLO algorithm was capable of selecting a set of approximately evenly-spaced and highly-informative SNPs, which in turn led to increased imputation accuracy compared with selection solely of evenly-spaced SNPs. Imputation accuracy increased with LD chip size, and imputation error rate was extremely low for chips with ≥3,000 SNPs. Assuming that genotyping or imputation error occurs at random, imputation error rate can be viewed as the upper limit for genomic prediction error. Our results show that about 25% of imputation error rate was propagated to genomic prediction in an Angus population. The utility of this MOLO algorithm was also demonstrated in a real application, in which a 6K SNP panel was optimized conditional on 5,260 obligatory SNP selected based on SNP-trait association in U.S. Holstein animals. With this MOLO algorithm, both imputation error rate and genomic prediction error rate were minimal.
Optimal Design of Low-Density SNP Arrays for Genomic Prediction: Algorithm and Applications
Wu, Xiao-Lin; Xu, Jiaqi; Feng, Guofei; Wiggans, George R.; Taylor, Jeremy F.; He, Jun; Qian, Changsong; Qiu, Jiansheng; Simpson, Barry; Walker, Jeremy; Bauck, Stewart
2016-01-01
Low-density (LD) single nucleotide polymorphism (SNP) arrays provide a cost-effective solution for genomic prediction and selection, but algorithms and computational tools are needed for the optimal design of LD SNP chips. A multiple-objective, local optimization (MOLO) algorithm was developed for design of optimal LD SNP chips that can be imputed accurately to medium-density (MD) or high-density (HD) SNP genotypes for genomic prediction. The objective function facilitates maximization of non-gap map length and system information for the SNP chip, and the latter is computed either as locus-averaged (LASE) or haplotype-averaged Shannon entropy (HASE) and adjusted for uniformity of the SNP distribution. HASE performed better than LASE with ≤1,000 SNPs, but required considerably more computing time. Nevertheless, the differences diminished when >5,000 SNPs were selected. Optimization was accomplished conditionally on the presence of SNPs that were obligated to each chromosome. The frame location of SNPs on a chip can be either uniform (evenly spaced) or non-uniform. For the latter design, a tunable empirical Beta distribution was used to guide location distribution of frame SNPs such that both ends of each chromosome were enriched with SNPs. The SNP distribution on each chromosome was finalized through the objective function that was locally and empirically maximized. This MOLO algorithm was capable of selecting a set of approximately evenly-spaced and highly-informative SNPs, which in turn led to increased imputation accuracy compared with selection solely of evenly-spaced SNPs. Imputation accuracy increased with LD chip size, and imputation error rate was extremely low for chips with ≥3,000 SNPs. Assuming that genotyping or imputation error occurs at random, imputation error rate can be viewed as the upper limit for genomic prediction error. Our results show that about 25% of imputation error rate was propagated to genomic prediction in an Angus population. The utility of this MOLO algorithm was also demonstrated in a real application, in which a 6K SNP panel was optimized conditional on 5,260 obligatory SNP selected based on SNP-trait association in U.S. Holstein animals. With this MOLO algorithm, both imputation error rate and genomic prediction error rate were minimal. PMID:27583971
MimoSA: a system for minimotif annotation
2010-01-01
Background Minimotifs are short peptide sequences within one protein, which are recognized by other proteins or molecules. While there are now several minimotif databases, they are incomplete. There are reports of many minimotifs in the primary literature, which have yet to be annotated, while entirely novel minimotifs continue to be published on a weekly basis. Our recently proposed function and sequence syntax for minimotifs enables us to build a general tool that will facilitate structured annotation and management of minimotif data from the biomedical literature. Results We have built the MimoSA application for minimotif annotation. The application supports management of the Minimotif Miner database, literature tracking, and annotation of new minimotifs. MimoSA enables the visualization, organization, selection and editing functions of minimotifs and their attributes in the MnM database. For the literature components, Mimosa provides paper status tracking and scoring of papers for annotation through a freely available machine learning approach, which is based on word correlation. The paper scoring algorithm is also available as a separate program, TextMine. Form-driven annotation of minimotif attributes enables entry of new minimotifs into the MnM database. Several supporting features increase the efficiency of annotation. The layered architecture of MimoSA allows for extensibility by separating the functions of paper scoring, minimotif visualization, and database management. MimoSA is readily adaptable to other annotation efforts that manually curate literature into a MySQL database. Conclusions MimoSA is an extensible application that facilitates minimotif annotation and integrates with the Minimotif Miner database. We have built MimoSA as an application that integrates dynamic abstract scoring with a high performance relational model of minimotif syntax. MimoSA's TextMine, an efficient paper-scoring algorithm, can be used to dynamically rank papers with respect to context. PMID:20565705
Structural and functional annotation of the porcine immunome
2013-01-01
Background The domestic pig is known as an excellent model for human immunology and the two species share many pathogens. Susceptibility to infectious disease is one of the major constraints on swine performance, yet the structure and function of genes comprising the pig immunome are not well-characterized. The completion of the pig genome provides the opportunity to annotate the pig immunome, and compare and contrast pig and human immune systems. Results The Immune Response Annotation Group (IRAG) used computational curation and manual annotation of the swine genome assembly 10.2 (Sscrofa10.2) to refine the currently available automated annotation of 1,369 immunity-related genes through sequence-based comparison to genes in other species. Within these genes, we annotated 3,472 transcripts. Annotation provided evidence for gene expansions in several immune response families, and identified artiodactyl-specific expansions in the cathelicidin and type 1 Interferon families. We found gene duplications for 18 genes, including 13 immune response genes and five non-immune response genes discovered in the annotation process. Manual annotation provided evidence for many new alternative splice variants and 8 gene duplications. Over 1,100 transcripts without porcine sequence evidence were detected using cross-species annotation. We used a functional approach to discover and accurately annotate porcine immune response genes. A co-expression clustering analysis of transcriptomic data from selected experimental infections or immune stimulations of blood, macrophages or lymph nodes identified a large cluster of genes that exhibited a correlated positive response upon infection across multiple pathogens or immune stimuli. Interestingly, this gene cluster (cluster 4) is enriched for known general human immune response genes, yet contains many un-annotated porcine genes. A phylogenetic analysis of the encoded proteins of cluster 4 genes showed that 15% exhibited an accelerated evolution as compared to 4.1% across the entire genome. Conclusions This extensive annotation dramatically extends the genome-based knowledge of the molecular genetics and structure of a major portion of the porcine immunome. Our complementary functional approach using co-expression during immune response has provided new putative immune response annotation for over 500 porcine genes. Our phylogenetic analysis of this core immunome cluster confirms rapid evolutionary change in this set of genes, and that, as in other species, such genes are important components of the pig’s adaptation to pathogen challenge over evolutionary time. These comprehensive and integrated analyses increase the value of the porcine genome sequence and provide important tools for global analyses and data-mining of the porcine immune response. PMID:23676093
Functional Annotation of Ion Channel Structures by Molecular Simulation.
Trick, Jemma L; Chelvaniththilan, Sivapalan; Klesse, Gianni; Aryal, Prafulla; Wallace, E Jayne; Tucker, Stephen J; Sansom, Mark S P
2016-12-06
Ion channels play key roles in cell membranes, and recent advances are yielding an increasing number of structures. However, their functional relevance is often unclear and better tools are required for their functional annotation. In sub-nanometer pores such as ion channels, hydrophobic gating has been shown to promote dewetting to produce a functionally closed (i.e., non-conductive) state. Using the serotonin receptor (5-HT 3 R) structure as an example, we demonstrate the use of molecular dynamics to aid the functional annotation of channel structures via simulation of the behavior of water within the pore. Three increasingly complex simulation analyses are described: water equilibrium densities; single-ion free-energy profiles; and computational electrophysiology. All three approaches correctly predict the 5-HT 3 R crystal structure to represent a functionally closed (i.e., non-conductive) state. We also illustrate the application of water equilibrium density simulations to annotate different conformational states of a glycine receptor. Copyright © 2016 The Authors. Published by Elsevier Ltd.. All rights reserved.
Expanded microbial genome coverage and improved protein family annotation in the COG database
Galperin, Michael Y.; Makarova, Kira S.; Wolf, Yuri I.; Koonin, Eugene V.
2015-01-01
Microbial genome sequencing projects produce numerous sequences of deduced proteins, only a small fraction of which have been or will ever be studied experimentally. This leaves sequence analysis as the only feasible way to annotate these proteins and assign to them tentative functions. The Clusters of Orthologous Groups of proteins (COGs) database (http://www.ncbi.nlm.nih.gov/COG/), first created in 1997, has been a popular tool for functional annotation. Its success was largely based on (i) its reliance on complete microbial genomes, which allowed reliable assignment of orthologs and paralogs for most genes; (ii) orthology-based approach, which used the function(s) of the characterized member(s) of the protein family (COG) to assign function(s) to the entire set of carefully identified orthologs and describe the range of potential functions when there were more than one; and (iii) careful manual curation of the annotation of the COGs, aimed at detailed prediction of the biological function(s) for each COG while avoiding annotation errors and overprediction. Here we present an update of the COGs, the first since 2003, and a comprehensive revision of the COG annotations and expansion of the genome coverage to include representative complete genomes from all bacterial and archaeal lineages down to the genus level. This re-analysis of the COGs shows that the original COG assignments had an error rate below 0.5% and allows an assessment of the progress in functional genomics in the past 12 years. During this time, functions of many previously uncharacterized COGs have been elucidated and tentative functional assignments of many COGs have been validated, either by targeted experiments or through the use of high-throughput methods. A particularly important development is the assignment of functions to several widespread, conserved proteins many of which turned out to participate in translation, in particular rRNA maturation and tRNA modification. The new version of the COGs is expected to become an important tool for microbial genomics. PMID:25428365
GFam: a platform for automatic annotation of gene families.
Sasidharan, Rajkumar; Nepusz, Tamás; Swarbreck, David; Huala, Eva; Paccanaro, Alberto
2012-10-01
We have developed GFam, a platform for automatic annotation of gene/protein families. GFam provides a framework for genome initiatives and model organism resources to build domain-based families, derive meaningful functional labels and offers a seamless approach to propagate functional annotation across periodic genome updates. GFam is a hybrid approach that uses a greedy algorithm to chain component domains from InterPro annotation provided by its 12 member resources followed by a sequence-based connected component analysis of un-annotated sequence regions to derive consensus domain architecture for each sequence and subsequently generate families based on common architectures. Our integrated approach increases sequence coverage by 7.2 percentage points and residue coverage by 14.6 percentage points higher than the coverage relative to the best single-constituent database within InterPro for the proteome of Arabidopsis. The true power of GFam lies in maximizing annotation provided by the different InterPro data sources that offer resource-specific coverage for different regions of a sequence. GFam's capability to capture higher sequence and residue coverage can be useful for genome annotation, comparative genomics and functional studies. GFam is a general-purpose software and can be used for any collection of protein sequences. The software is open source and can be obtained from http://www.paccanarolab.org/software/gfam/.
Margam, Venu M.; Coates, Brad S.; Bayles, Darrell O.; Hellmich, Richard L.; Agunbiade, Tolulope; Seufferheld, Manfredo J.; Sun, Weilin; Kroemer, Jeremy A.; Ba, Malick N.; Binso-Dabire, Clementine L.; Baoua, Ibrahim; Ishiyaku, Mohammad F.; Covas, Fernando G.; Srinivasan, Ramasamy; Armstrong, Joel; Murdock, Larry L.; Pittendrigh, Barry R.
2011-01-01
The legume pod borer, Maruca vitrata (Lepidoptera: Crambidae), is an insect pest species of crops grown by subsistence farmers in tropical regions of Africa. We present the de novo assembly of 3729 contigs from 454- and Sanger-derived sequencing reads for midgut, salivary, and whole adult tissues of this non-model species. Functional annotation predicted that 1320 M. vitrata protein coding genes are present, of which 631 have orthologs within the Bombyx mori gene model. A homology-based analysis assigned M. vitrata genes into a group of paralogs, but these were subsequently partitioned into putative orthologs following phylogenetic analyses. Following sequence quality filtering, a total of 1542 putative single nucleotide polymorphisms (SNPs) were predicted within M. vitrata contig assemblies. Seventy one of 1078 designed molecular genetic markers were used to screen M. vitrata samples from five collection sites in West Africa. Population substructure may be present with significant implications in the insect resistance management recommendations pertaining to the release of biological control agents or transgenic cowpea that express Bacillus thuringiensis crystal toxins. Mutation data derived from transcriptome sequencing is an expeditious and economical source for genetic markers that allow evaluation of ecological differentiation. PMID:21754987
Cattle genome-wide analysis reveals genetic signatures in trypanotolerant N'Dama.
Kim, Soo-Jin; Ka, Sojeong; Ha, Jung-Woo; Kim, Jaemin; Yoo, DongAhn; Kim, Kwondo; Lee, Hak-Kyo; Lim, Dajeong; Cho, Seoae; Hanotte, Olivier; Mwai, Okeyo Ally; Dessie, Tadelle; Kemp, Stephen; Oh, Sung Jong; Kim, Heebal
2017-05-12
Indigenous cattle in Africa have adapted to various local environments to acquire superior phenotypes that enhance their survival under harsh conditions. While many studies investigated the adaptation of overall African cattle, genetic characteristics of each breed have been poorly studied. We performed the comparative genome-wide analysis to assess evidence for subspeciation within species at the genetic level in trypanotolerant N'Dama cattle. We analysed genetic variation patterns in N'Dama from the genomes of 101 cattle breeds including 48 samples of five indigenous African cattle breeds and 53 samples of various commercial breeds. Analysis of SNP variances between cattle breeds using wMI, XP-CLR, and XP-EHH detected genes containing N'Dama-specific genetic variants and their potential associations. Functional annotation analysis revealed that these genes are associated with ossification, neurological and immune system. Particularly, the genes involved in bone formation indicate that local adaptation of N'Dama may engage in skeletal growth as well as immune systems. Our results imply that N'Dama might have acquired distinct genotypes associated with growth and regulation of regional diseases including trypanosomiasis. Moreover, this study offers significant insights into identifying genetic signatures for natural and artificial selection of diverse African cattle breeds.
Guidelines for the functional annotation of microRNAs using the Gene Ontology
D'Eustachio, Peter; Smith, Jennifer R.; Zampetaki, Anna
2016-01-01
MicroRNA regulation of developmental and cellular processes is a relatively new field of study, and the available research data have not been organized to enable its inclusion in pathway and network analysis tools. The association of gene products with terms from the Gene Ontology is an effective method to analyze functional data, but until recently there has been no substantial effort dedicated to applying Gene Ontology terms to microRNAs. Consequently, when performing functional analysis of microRNA data sets, researchers have had to rely instead on the functional annotations associated with the genes encoding microRNA targets. In consultation with experts in the field of microRNA research, we have created comprehensive recommendations for the Gene Ontology curation of microRNAs. This curation manual will enable provision of a high-quality, reliable set of functional annotations for the advancement of microRNA research. Here we describe the key aspects of the work, including development of the Gene Ontology to represent this data, standards for describing the data, and guidelines to support curators making these annotations. The full microRNA curation guidelines are available on the GO Consortium wiki (http://wiki.geneontology.org/index.php/MicroRNA_GO_annotation_manual). PMID:26917558
Activity study of biogenic spherical silver nanoparticles towards microbes and oxidants
NASA Astrophysics Data System (ADS)
Hoskote Anand, Kiran Kumar; Mandal, Badal Kumar
2015-01-01
The eco-friendly approach for the green synthesis of silver nanoparticles (SNP) using Terminalia bellirica (T. bellirica) fruit extract is reported herein. Initially formation of SNP was noticed through visual color change from yellow to reddish brown and further analyzed by surface plasmonic resonance (SPR) band at 429 nm using UV-Vis spectroscopy. Identification of different polyphenols present in T. bellirica extract was done using High Pressure Liquid Chromatography (HPLC). Aqueous T. bellirica extract contains high amount of gallic acid which is major secondary metabolite responsible for the reduction and stabilization process. It was established by analyses of extracts before and after reduction using HPLC. Formation of spherical SNP was characterized by Transmission Electron Microscopy (TEM) analysis. X-ray Diffraction (XRD) study revealed crystalline nature of SNP. Presence of different functional groups on the surface of SNP was evidenced by Fourier Transform Infrared Spectroscopy (FTIR) study. A plausible mechanism of reduction and stabilization processes involved in the synthesis of stable SNP was also explained based on HPLC and FTIR data. In addition, the synthesized SNP was tested for antibacterial and antioxidant activities. SNP showed good antimicrobial activity against both gram positive (S. aureus) and gram negative (E. coli) bacteria. It also showed good antioxidant activity compared to ascorbic acid as standard antioxidant by using standard DPPH method.
Concomitant prediction of function and fold at the domain level with GO-based profiles.
Lopez, Daniel; Pazos, Florencio
2013-01-01
Predicting the function of newly sequenced proteins is crucial due to the pace at which these raw sequences are being obtained. Almost all resources for predicting protein function assign functional terms to whole chains, and do not distinguish which particular domain is responsible for the allocated function. This is not a limitation of the methodologies themselves but it is due to the fact that in the databases of functional annotations these methods use for transferring functional terms to new proteins, these annotations are done on a whole-chain basis. Nevertheless, domains are the basic evolutionary and often functional units of proteins. In many cases, the domains of a protein chain have distinct molecular functions, independent from each other. For that reason resources with functional annotations at the domain level, as well as methodologies for predicting function for individual domains adapted to these resources are required.We present a methodology for predicting the molecular function of individual domains, based on a previously developed database of functional annotations at the domain level. The approach, which we show outperforms a standard method based on sequence searches in assigning function, concomitantly predicts the structural fold of the domains and can give hints on the functionally important residues associated to the predicted function.
Aubourg, Sébastien; Brunaud, Véronique; Bruyère, Clémence; Cock, Mark; Cooke, Richard; Cottet, Annick; Couloux, Arnaud; Déhais, Patrice; Deléage, Gilbert; Duclert, Aymeric; Echeverria, Manuel; Eschbach, Aimée; Falconet, Denis; Filippi, Ghislain; Gaspin, Christine; Geourjon, Christophe; Grienenberger, Jean-Michel; Houlné, Guy; Jamet, Elisabeth; Lechauve, Frédéric; Leleu, Olivier; Leroy, Philippe; Mache, Régis; Meyer, Christian; Nedjari, Hafed; Negrutiu, Ioan; Orsini, Valérie; Peyretaillade, Eric; Pommier, Cyril; Raes, Jeroen; Risler, Jean-Loup; Rivière, Stéphane; Rombauts, Stéphane; Rouzé, Pierre; Schneider, Michel; Schwob, Philippe; Small, Ian; Soumayet-Kampetenga, Ghislain; Stankovski, Darko; Toffano, Claire; Tognolli, Michael; Caboche, Michel; Lecharny, Alain
2005-01-01
Genomic projects heavily depend on genome annotations and are limited by the current deficiencies in the published predictions of gene structure and function. It follows that, improved annotation will allow better data mining of genomes, and more secure planning and design of experiments. The purpose of the GeneFarm project is to obtain homogeneous, reliable, documented and traceable annotations for Arabidopsis nuclear genes and gene products, and to enter them into an added-value database. This re-annotation project is being performed exhaustively on every member of each gene family. Performing a family-wide annotation makes the task easier and more efficient than a gene-by-gene approach since many features obtained for one gene can be extrapolated to some or all the other genes of a family. A complete annotation procedure based on the most efficient prediction tools available is being used by 16 partner laboratories, each contributing annotated families from its field of expertise. A database, named GeneFarm, and an associated user-friendly interface to query the annotations have been developed. More than 3000 genes distributed over 300 families have been annotated and are available at http://genoplante-info.infobiogen.fr/Genefarm/. Furthermore, collaboration with the Swiss Institute of Bioinformatics is underway to integrate the GeneFarm data into the protein knowledgebase Swiss-Prot. PMID:15608279
Sma3s: a three-step modular annotator for large sequence datasets.
Muñoz-Mérida, Antonio; Viguera, Enrique; Claros, M Gonzalo; Trelles, Oswaldo; Pérez-Pulido, Antonio J
2014-08-01
Automatic sequence annotation is an essential component of modern 'omics' studies, which aim to extract information from large collections of sequence data. Most existing tools use sequence homology to establish evolutionary relationships and assign putative functions to sequences. However, it can be difficult to define a similarity threshold that achieves sufficient coverage without sacrificing annotation quality. Defining the correct configuration is critical and can be challenging for non-specialist users. Thus, the development of robust automatic annotation techniques that generate high-quality annotations without needing expert knowledge would be very valuable for the research community. We present Sma3s, a tool for automatically annotating very large collections of biological sequences from any kind of gene library or genome. Sma3s is composed of three modules that progressively annotate query sequences using either: (i) very similar homologues, (ii) orthologous sequences or (iii) terms enriched in groups of homologous sequences. We trained the system using several random sets of known sequences, demonstrating average sensitivity and specificity values of ~85%. In conclusion, Sma3s is a versatile tool for high-throughput annotation of a wide variety of sequence datasets that outperforms the accuracy of other well-established annotation algorithms, and it can enrich existing database annotations and uncover previously hidden features. Importantly, Sma3s has already been used in the functional annotation of two published transcriptomes. © The Author 2014. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.
The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST)
Overbeek, Ross; Olson, Robert; Pusch, Gordon D.; Olsen, Gary J.; Davis, James J.; Disz, Terry; Edwards, Robert A.; Gerdes, Svetlana; Parrello, Bruce; Shukla, Maulik; Vonstein, Veronika; Wattam, Alice R.; Xia, Fangfang; Stevens, Rick
2014-01-01
In 2004, the SEED (http://pubseed.theseed.org/) was created to provide consistent and accurate genome annotations across thousands of genomes and as a platform for discovering and developing de novo annotations. The SEED is a constantly updated integration of genomic data with a genome database, web front end, API and server scripts. It is used by many scientists for predicting gene functions and discovering new pathways. In addition to being a powerful database for bioinformatics research, the SEED also houses subsystems (collections of functionally related protein families) and their derived FIGfams (protein families), which represent the core of the RAST annotation engine (http://rast.nmpdr.org/). When a new genome is submitted to RAST, genes are called and their annotations are made by comparison to the FIGfam collection. If the genome is made public, it is then housed within the SEED and its proteins populate the FIGfam collection. This annotation cycle has proven to be a robust and scalable solution to the problem of annotating the exponentially increasing number of genomes. To date, >12 000 users worldwide have annotated >60 000 distinct genomes using RAST. Here we describe the interconnectedness of the SEED database and RAST, the RAST annotation pipeline and updates to both resources. PMID:24293654
The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST).
Overbeek, Ross; Olson, Robert; Pusch, Gordon D; Olsen, Gary J; Davis, James J; Disz, Terry; Edwards, Robert A; Gerdes, Svetlana; Parrello, Bruce; Shukla, Maulik; Vonstein, Veronika; Wattam, Alice R; Xia, Fangfang; Stevens, Rick
2014-01-01
In 2004, the SEED (http://pubseed.theseed.org/) was created to provide consistent and accurate genome annotations across thousands of genomes and as a platform for discovering and developing de novo annotations. The SEED is a constantly updated integration of genomic data with a genome database, web front end, API and server scripts. It is used by many scientists for predicting gene functions and discovering new pathways. In addition to being a powerful database for bioinformatics research, the SEED also houses subsystems (collections of functionally related protein families) and their derived FIGfams (protein families), which represent the core of the RAST annotation engine (http://rast.nmpdr.org/). When a new genome is submitted to RAST, genes are called and their annotations are made by comparison to the FIGfam collection. If the genome is made public, it is then housed within the SEED and its proteins populate the FIGfam collection. This annotation cycle has proven to be a robust and scalable solution to the problem of annotating the exponentially increasing number of genomes. To date, >12 000 users worldwide have annotated >60 000 distinct genomes using RAST. Here we describe the interconnectedness of the SEED database and RAST, the RAST annotation pipeline and updates to both resources.
Ameur, Adam; Bunikis, Ignas; Enroth, Stefan; Gyllensten, Ulf
2014-01-01
CanvasDB is an infrastructure for management and analysis of genetic variants from massively parallel sequencing (MPS) projects. The system stores SNP and indel calls in a local database, designed to handle very large datasets, to allow for rapid analysis using simple commands in R. Functional annotations are included in the system, making it suitable for direct identification of disease-causing mutations in human exome- (WES) or whole-genome sequencing (WGS) projects. The system has a built-in filtering function implemented to simultaneously take into account variant calls from all individual samples. This enables advanced comparative analysis of variant distribution between groups of samples, including detection of candidate causative mutations within family structures and genome-wide association by sequencing. In most cases, these analyses are executed within just a matter of seconds, even when there are several hundreds of samples and millions of variants in the database. We demonstrate the scalability of canvasDB by importing the individual variant calls from all 1092 individuals present in the 1000 Genomes Project into the system, over 4.4 billion SNPs and indels in total. Our results show that canvasDB makes it possible to perform advanced analyses of large-scale WGS projects on a local server. Database URL: https://github.com/UppsalaGenomeCenter/CanvasDB PMID:25281234
Ameur, Adam; Bunikis, Ignas; Enroth, Stefan; Gyllensten, Ulf
2014-01-01
CanvasDB is an infrastructure for management and analysis of genetic variants from massively parallel sequencing (MPS) projects. The system stores SNP and indel calls in a local database, designed to handle very large datasets, to allow for rapid analysis using simple commands in R. Functional annotations are included in the system, making it suitable for direct identification of disease-causing mutations in human exome- (WES) or whole-genome sequencing (WGS) projects. The system has a built-in filtering function implemented to simultaneously take into account variant calls from all individual samples. This enables advanced comparative analysis of variant distribution between groups of samples, including detection of candidate causative mutations within family structures and genome-wide association by sequencing. In most cases, these analyses are executed within just a matter of seconds, even when there are several hundreds of samples and millions of variants in the database. We demonstrate the scalability of canvasDB by importing the individual variant calls from all 1092 individuals present in the 1000 Genomes Project into the system, over 4.4 billion SNPs and indels in total. Our results show that canvasDB makes it possible to perform advanced analyses of large-scale WGS projects on a local server. Database URL: https://github.com/UppsalaGenomeCenter/CanvasDB. © The Author(s) 2014. Published by Oxford University Press.
Discovery and Characterization of Chromatin States for Systematic Annotation of the Human Genome
NASA Astrophysics Data System (ADS)
Ernst, Jason; Kellis, Manolis
A plethora of epigenetic modifications have been described in the human genome and shown to play diverse roles in gene regulation, cellular differentiation and the onset of disease. Although individual modifications have been linked to the activity levels of various genetic functional elements, their combinatorial patterns are still unresolved and their potential for systematic de novo genome annotation remains untapped. Here, we use a multivariate Hidden Markov Model to reveal chromatin states in human T cells, based on recurrent and spatially coherent combinations of chromatin marks.We define 51 distinct chromatin states, including promoter-associated, transcription-associated, active intergenic, largescale repressed and repeat-associated states. Each chromatin state shows specific enrichments in functional annotations, sequence motifs and specific experimentally observed characteristics, suggesting distinct biological roles. This approach provides a complementary functional annotation of the human genome that reveals the genome-wide locations of diverse classes of epigenetic function.
A Resource of Quantitative Functional Annotation for Homo sapiens Genes.
Taşan, Murat; Drabkin, Harold J; Beaver, John E; Chua, Hon Nian; Dunham, Julie; Tian, Weidong; Blake, Judith A; Roth, Frederick P
2012-02-01
The body of human genomic and proteomic evidence continues to grow at ever-increasing rates, while annotation efforts struggle to keep pace. A surprisingly small fraction of human genes have clear, documented associations with specific functions, and new functions continue to be found for characterized genes. Here we assembled an integrated collection of diverse genomic and proteomic data for 21,341 human genes and make quantitative associations of each to 4333 Gene Ontology terms. We combined guilt-by-profiling and guilt-by-association approaches to exploit features unique to the data types. Performance was evaluated by cross-validation, prospective validation, and by manual evaluation with the biological literature. Functional-linkage networks were also constructed, and their utility was demonstrated by identifying candidate genes related to a glioma FLN using a seed network from genome-wide association studies. Our annotations are presented-alongside existing validated annotations-in a publicly accessible and searchable web interface.
The standard operating procedure of the DOE-JGI Metagenome Annotation Pipeline (MAP v.4)
Huntemann, Marcel; Ivanova, Natalia N.; Mavromatis, Konstantinos; ...
2016-02-24
The DOE-JGI Metagenome Annotation Pipeline (MAP v.4) performs structural and functional annotation for metagenomic sequences that are submitted to the Integrated Microbial Genomes with Microbiomes (IMG/M) system for comparative analysis. The pipeline runs on nucleotide sequences provide d via the IMG submission site. Users must first define their analysis projects in GOLD and then submit the associated sequence datasets consisting of scaffolds/contigs with optional coverage information and/or unassembled reads in fasta and fastq file formats. The MAP processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNAs, as well as CRISPR elements. Structural annotation ismore » followed by functional annotation including assignment of protein product names and connection to various protein family databases.« less
The standard operating procedure of the DOE-JGI Metagenome Annotation Pipeline (MAP v.4)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Huntemann, Marcel; Ivanova, Natalia N.; Mavromatis, Konstantinos
The DOE-JGI Metagenome Annotation Pipeline (MAP v.4) performs structural and functional annotation for metagenomic sequences that are submitted to the Integrated Microbial Genomes with Microbiomes (IMG/M) system for comparative analysis. The pipeline runs on nucleotide sequences provide d via the IMG submission site. Users must first define their analysis projects in GOLD and then submit the associated sequence datasets consisting of scaffolds/contigs with optional coverage information and/or unassembled reads in fasta and fastq file formats. The MAP processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNAs, as well as CRISPR elements. Structural annotation ismore » followed by functional annotation including assignment of protein product names and connection to various protein family databases.« less
e-GRASP: an integrated evolutionary and GRASP resource for exploring disease associations.
Karim, Sajjad; NourEldin, Hend Fakhri; Abusamra, Heba; Salem, Nada; Alhathli, Elham; Dudley, Joel; Sanderford, Max; Scheinfeldt, Laura B; Chaudhary, Adeel G; Al-Qahtani, Mohammed H; Kumar, Sudhir
2016-10-17
Genome-wide association studies (GWAS) have become a mainstay of biological research concerned with discovering genetic variation linked to phenotypic traits and diseases. Both discrete and continuous traits can be analyzed in GWAS to discover associations between single nucleotide polymorphisms (SNPs) and traits of interest. Associations are typically determined by estimating the significance of the statistical relationship between genetic loci and the given trait. However, the prioritization of bona fide, reproducible genetic associations from GWAS results remains a central challenge in identifying genomic loci underlying common complex diseases. Evolutionary-aware meta-analysis of the growing GWAS literature is one way to address this challenge and to advance from association to causation in the discovery of genotype-phenotype relationships. We have created an evolutionary GWAS resource to enable in-depth query and exploration of published GWAS results. This resource uses the publically available GWAS results annotated in the GRASP2 database. The GRASP2 database includes results from 2082 studies, 177 broad phenotype categories, and ~8.87 million SNP-phenotype associations. For each SNP in e-GRASP, we present information from the GRASP2 database for convenience as well as evolutionary information (e.g., rate and timespan). Users can, therefore, identify not only SNPs with highly significant phenotype-association P-values, but also SNPs that are highly replicated and/or occur at evolutionarily conserved sites that are likely to be functionally important. Additionally, we provide an evolutionary-adjusted SNP association ranking (E-rank) that uses cross-species evolutionary conservation scores and population allele frequencies to transform P-values in an effort to enhance the discovery of SNPs with a greater probability of biologically meaningful disease associations. By adding an evolutionary dimension to the GWAS results available in the GRASP2 database, our e-GRASP resource will enable a more effective exploration of SNPs not only by the statistical significance of trait associations, but also by the number of studies in which associations have been replicated, and the evolutionary context of the associated mutations. Therefore, e-GRASP will be a valuable resource for aiding researchers in the identification of bona fide, reproducible genetic associations from GWAS results. This resource is freely available at http://www.mypeg.info/egrasp .
Gene-environment interaction in the etiology of mathematical ability using SNP sets.
Docherty, Sophia J; Kovas, Yulia; Plomin, Robert
2011-01-01
Mathematics ability and disability is as heritable as other cognitive abilities and disabilities, however its genetic etiology has received relatively little attention. In our recent genome-wide association study of mathematical ability in 10-year-old children, 10 SNP associations were nominated from scans of pooled DNA and validated in an individually genotyped sample. In this paper, we use a 'SNP set' composite of these 10 SNPs to investigate gene-environment (GE) interaction, examining whether the association between the 10-SNP set and mathematical ability differs as a function of ten environmental measures in the home and school in a sample of 1888 children with complete data. We found two significant GE interactions for environmental measures in the home and the school both in the direction of the diathesis-stress type of GE interaction: The 10-SNP set was more strongly associated with mathematical ability in chaotic homes and when parents are negative.
Novel Thrombotic Function of a Human SNP in STXBP5 Revealed by CRISPR/Cas9 Gene Editing in Mice.
Zhu, Qiuyu Martin; Ko, Kyung Ae; Ture, Sara; Mastrangelo, Michael A; Chen, Ming-Huei; Johnson, Andrew D; O'Donnell, Christopher J; Morrell, Craig N; Miano, Joseph M; Lowenstein, Charles J
2017-02-01
To identify and characterize the effect of a SNP (single-nucleotide polymorphism) in the STXBP5 locus that is associated with altered thrombosis in humans. GWAS (genome-wide association studies) have identified numerous SNPs associated with human thrombotic phenotypes, but determining the functional significance of an individual candidate SNP can be challenging, particularly when in vivo modeling is required. Recent GWAS led to the discovery of STXBP5 as a regulator of platelet secretion in humans. Further clinical studies have identified genetic variants of STXBP5 that are linked to altered plasma von Willebrand factor levels and thrombosis in humans, but the functional significance of these variants in STXBP5 is not understood. We used CRISPR/Cas9 (clustered regularly interspaced short palindromic repeats/CRISPR-associated 9) techniques to produce a precise mouse model carrying a human coding SNP rs1039084 (encoding human p. N436S) in the STXBP5 locus associated with decreased thrombosis. Mice carrying the orthologous human mutation (encoding p. N437S in mouse STXBP5) have lower plasma von Willebrand factor levels, decreased thrombosis, and decreased platelet secretion compared with wild-type mice. This thrombosis phenotype recapitulates the phenotype of humans carrying the minor allele of rs1039084. Decreased plasma von Willebrand factor and platelet activation may partially explain the decreased thrombotic phenotype in mutant mice. Using precise mammalian genome editing, we have identified a human nonsynonymous SNP rs1039084 in the STXBP5 locus as a causal variant for a decreased thrombotic phenotype. CRISPR/Cas9 genetic editing facilitates the rapid and efficient generation of animals to study the function of human genetic variation in vascular diseases. © 2016 American Heart Association, Inc.
Singh, Kanhaiya; Goyal, Prabhjot; Singh, Manju; Deshmukh, Sujit; Upadhyay, Divyesh; Kant, Sri; Agrawal, Neeraj K; Gupta, Sanjeev K; Singh, Kiran
2017-12-01
Retinal angiogenesis is a hallmark of diabetic retinopathy. Matrix Metalloproteinases (MMPs) are involved in degradation of extracellular matrix (ECM). Functional SNP-1562C>T in the promoter of the MMP-9 gene results increase in transcriptional activity. The present work was designed to evaluate the contribution of functional SNP-1562C>T of MMP-9 gene to the risk of proliferative diabetic retinopathy (PDR) in type 2 diabetes mellitus (T2DM) patients in north Indian Population. This Case control study comprised of a total of 645 individuals in which 320 were T2DM patients out of which 73 had PDR, 98 had non- proliferative diabetic retinopathy (NPDR), 149 T2DM cases without any eye related disease (DM) and 325 non diabetic healthy individuals as controls (non DM controls). Genotyping for SNP-1562C>T of MMP-9 was done by polymerase chain reactions followed by restriction analyses with specific endonucleases (PCR-RFLP). DNA sequencing was used to ascertain PCR-RFLP results. T allele frequency in PDR patients was 32.1%, 20.4% in NPDR, 15.4% in DM and 13.7% in controls. Statistically significant difference was observed in both allele and genotype distribution between the PDR versus non-DM control group (p<0.0001 by T allele; p=0.002 by TT and p<0.0001 by CT genotype). The present study suggests that the functional SNP-1562C>T in the promoter of the MMP-9 gene could be regarded as a major risk factor for PDR as increased MMP-9 production from high expressing T allele may promote retinal angiogenesis. Copyright © 2017 Elsevier Inc. All rights reserved.
Islam, Mohammad T; Garg, Gagan; Hancock, William S; Risk, Brian A; Baker, Mark S; Ranganathan, Shoba
2014-01-03
The chromosome-centric human proteome project (C-HPP) aims to define the complete set of proteins encoded in each human chromosome. The neXtProt database (September 2013) lists 20,128 proteins for the human proteome, of which 3831 human proteins (∼19%) are considered "missing" according to the standard metrics table (released September 27, 2013). In support of the C-HPP initiative, we have extended the annotation strategy developed for human chromosome 7 "missing" proteins into a semiautomated pipeline to functionally annotate the "missing" human proteome. This pipeline integrates a suite of bioinformatics analysis and annotation software tools to identify homologues and map putative functional signatures, gene ontology, and biochemical pathways. From sequential BLAST searches, we have primarily identified homologues from reviewed nonhuman mammalian proteins with protein evidence for 1271 (33.2%) "missing" proteins, followed by 703 (18.4%) homologues from reviewed nonhuman mammalian proteins and subsequently 564 (14.7%) homologues from reviewed human proteins. Functional annotations for 1945 (50.8%) "missing" proteins were also determined. To accelerate the identification of "missing" proteins from proteomics studies, we generated proteotypic peptides in silico. Matching these proteotypic peptides to ENCODE proteogenomic data resulted in proteomic evidence for 107 (2.8%) of the 3831 "missing proteins, while evidence from a recent membrane proteomic study supported the existence for another 15 "missing" proteins. The chromosome-wise functional annotation of all "missing" proteins is freely available to the scientific community through our web server (http://biolinfo.org/protannotator).
Neerincx, Pieter BT; Casel, Pierrot; Prickett, Dennis; Nie, Haisheng; Watson, Michael; Leunissen, Jack AM; Groenen, Martien AM; Klopp, Christophe
2009-01-01
Background Reliable annotation linking oligonucleotide probes to target genes is essential for functional biological analysis of microarray experiments. We used the IMAD, OligoRAP and sigReannot pipelines to update the annotation for the ARK-Genomics Chicken 20 K array as part of a joined EADGENE/SABRE workshop. In this manuscript we compare their annotation strategies and results. Furthermore, we analyse the effect of differences in updated annotation on functional analysis for an experiment involving Eimeria infected chickens and finally we propose guidelines for optimal annotation strategies. Results IMAD, OligoRAP and sigReannot update both annotation and estimated target specificity. The 3 pipelines can assign oligos to target specificity categories although with varying degrees of resolution. Target specificity is judged based on the amount and type of oligo versus target-gene alignments (hits), which are determined by filter thresholds that users can adjust based on their experimental conditions. Linking oligos to annotation on the other hand is based on rigid rules, which differ between pipelines. For 52.7% of the oligos from a subset selected for in depth comparison all pipelines linked to one or more Ensembl genes with consensus on 44.0%. In 31.0% of the cases none of the pipelines could assign an Ensembl gene to an oligo and for the remaining 16.3% the coverage differed between pipelines. Differences in updated annotation were mainly due to different thresholds for hybridisation potential filtering of oligo versus target-gene alignments and different policies for expanding annotation using indirect links. The differences in updated annotation packages had a significant effect on GO term enrichment analysis with consensus on only 67.2% of the enriched terms. Conclusion In addition to flexible thresholds to determine target specificity, annotation tools should provide metadata describing the relationships between oligos and the annotation assigned to them. These relationships can then be used to judge the varying degrees of reliability allowing users to fine-tune the balance between reliability and coverage. This is important as it can have a significant effect on functional microarray analysis as exemplified by the lack of consensus on almost one third of the terms found with GO term enrichment analysis based on updated IMAD, OligoRAP or sigReannot annotation. PMID:19615109
Neerincx, Pieter Bt; Casel, Pierrot; Prickett, Dennis; Nie, Haisheng; Watson, Michael; Leunissen, Jack Am; Groenen, Martien Am; Klopp, Christophe
2009-07-16
Reliable annotation linking oligonucleotide probes to target genes is essential for functional biological analysis of microarray experiments. We used the IMAD, OligoRAP and sigReannot pipelines to update the annotation for the ARK-Genomics Chicken 20 K array as part of a joined EADGENE/SABRE workshop. In this manuscript we compare their annotation strategies and results. Furthermore, we analyse the effect of differences in updated annotation on functional analysis for an experiment involving Eimeria infected chickens and finally we propose guidelines for optimal annotation strategies. IMAD, OligoRAP and sigReannot update both annotation and estimated target specificity. The 3 pipelines can assign oligos to target specificity categories although with varying degrees of resolution. Target specificity is judged based on the amount and type of oligo versus target-gene alignments (hits), which are determined by filter thresholds that users can adjust based on their experimental conditions. Linking oligos to annotation on the other hand is based on rigid rules, which differ between pipelines.For 52.7% of the oligos from a subset selected for in depth comparison all pipelines linked to one or more Ensembl genes with consensus on 44.0%. In 31.0% of the cases none of the pipelines could assign an Ensembl gene to an oligo and for the remaining 16.3% the coverage differed between pipelines. Differences in updated annotation were mainly due to different thresholds for hybridisation potential filtering of oligo versus target-gene alignments and different policies for expanding annotation using indirect links. The differences in updated annotation packages had a significant effect on GO term enrichment analysis with consensus on only 67.2% of the enriched terms. In addition to flexible thresholds to determine target specificity, annotation tools should provide metadata describing the relationships between oligos and the annotation assigned to them. These relationships can then be used to judge the varying degrees of reliability allowing users to fine-tune the balance between reliability and coverage. This is important as it can have a significant effect on functional microarray analysis as exemplified by the lack of consensus on almost one third of the terms found with GO term enrichment analysis based on updated IMAD, OligoRAP or sigReannot annotation.
2012-01-01
Background The first draft assembly and gene prediction of the grapevine genome (8X base coverage) was made available to the scientific community in 2007, and functional annotation was developed on this gene prediction. Since then additional Sanger sequences were added to the 8X sequences pool and a new version of the genomic sequence with superior base coverage (12X) was produced. Results In order to more efficiently annotate the function of the genes predicted in the new assembly, it is important to build on as much of the previous work as possible, by transferring 8X annotation of the genome to the 12X version. The 8X and 12X assemblies and gene predictions of the grapevine genome were compared to answer the question, “Can we uniquely map 8X predicted genes to 12X predicted genes?” The results show that while the assemblies and gene structure predictions are too different to make a complete mapping between them, most genes (18,725) showed a one-to-one relationship between 8X predicted genes and the last version of 12X predicted genes. In addition, reshuffled genomic sequence structures appeared. These highlight regions of the genome where the gene predictions need to be taken with caution. Based on the new grapevine gene functional annotation and in-depth functional categorization, twenty eight new molecular networks have been created for VitisNet while the existing networks were updated. Conclusions The outcomes of this study provide a functional annotation of the 12X genes, an update of VitisNet, the system of the grapevine molecular networks, and a new functional categorization of genes. Data are available at the VitisNet website (http://www.sdstate.edu/ps/research/vitis/pathways.cfm). PMID:22554261
NegGOA: negative GO annotations selection using ontology structure.
Fu, Guangyuan; Wang, Jun; Yang, Bo; Yu, Guoxian
2016-10-01
Predicting the biological functions of proteins is one of the key challenges in the post-genomic era. Computational models have demonstrated the utility of applying machine learning methods to predict protein function. Most prediction methods explicitly require a set of negative examples-proteins that are known not carrying out a particular function. However, Gene Ontology (GO) almost always only provides the knowledge that proteins carry out a particular function, and functional annotations of proteins are incomplete. GO structurally organizes more than tens of thousands GO terms and a protein is annotated with several (or dozens) of these terms. For these reasons, the negative examples of a protein can greatly help distinguishing true positive examples of the protein from such a large candidate GO space. In this paper, we present a novel approach (called NegGOA) to select negative examples. Specifically, NegGOA takes advantage of the ontology structure, available annotations and potentiality of additional annotations of a protein to choose negative examples of the protein. We compare NegGOA with other negative examples selection algorithms and find that NegGOA produces much fewer false negatives than them. We incorporate the selected negative examples into an efficient function prediction model to predict the functions of proteins in Yeast, Human, Mouse and Fly. NegGOA also demonstrates improved accuracy than these comparing algorithms across various evaluation metrics. In addition, NegGOA is less suffered from incomplete annotations of proteins than these comparing methods. The Matlab and R codes are available at https://sites.google.com/site/guoxian85/neggoa gxyu@swu.edu.cn Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Mantello, Camila Campos; Cardoso-Silva, Claudio Benicio; da Silva, Carla Cristina; de Souza, Livia Moura; Scaloppi Junior, Erivaldo José; de Souza Gonçalves, Paulo; Vicentini, Renato; de Souza, Anete Pereira
2014-01-01
Hevea brasiliensis (Willd. Ex Adr. Juss.) Muell.-Arg. is the primary source of natural rubber that is native to the Amazon rainforest. The singular properties of natural rubber make it superior to and competitive with synthetic rubber for use in several applications. Here, we performed RNA sequencing (RNA-seq) of H. brasiliensis bark on the Illumina GAIIx platform, which generated 179,326,804 raw reads on the Illumina GAIIx platform. A total of 50,384 contigs that were over 400 bp in size were obtained and subjected to further analyses. A similarity search against the non-redundant (nr) protein database returned 32,018 (63%) positive BLASTx hits. The transcriptome analysis was annotated using the clusters of orthologous groups (COG), gene ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and Pfam databases. A search for putative molecular marker was performed to identify simple sequence repeats (SSRs) and single nucleotide polymorphisms (SNPs). In total, 17,927 SSRs and 404,114 SNPs were detected. Finally, we selected sequences that were identified as belonging to the mevalonate (MVA) and 2-C-methyl-D-erythritol 4-phosphate (MEP) pathways, which are involved in rubber biosynthesis, to validate the SNP markers. A total of 78 SNPs were validated in 36 genotypes of H. brasiliensis. This new dataset represents a powerful information source for rubber tree bark genes and will be an important tool for the development of microsatellites and SNP markers for use in future genetic analyses such as genetic linkage mapping, quantitative trait loci identification, investigations of linkage disequilibrium and marker-assisted selection. PMID:25048025
Farmer, Andrew D.; Huang, Wei; Ambachew, Daniel; Penmetsa, R. Varma; Carrasquilla-Garcia, Noelia; Assefa, Teshale; Cannon, Steven B.
2018-01-01
Recombination (R) rate and linkage disequilibrium (LD) analyses are the basis for plant breeding. These vary by breeding system, by generation of inbreeding or outcrossing and by region in the chromosome. Common bean (Phaseolus vulgaris L.) is a favored food legume with a small sequenced genome (514 Mb) and n = 11 chromosomes. The goal of this study was to describe R and LD in the common bean genome using a 768-marker array of single nucleotide polymorphisms (SNP) based on Trans-legume Orthologous Group (TOG) genes along with an advanced-generation Recombinant Inbred Line reference mapping population (BAT93 x Jalo EEP558) and an internationally available diversity panel. A whole genome genetic map was created that covered all eleven linkage groups (LG). The LGs were linked to the physical map by sequence data of the TOGs compared to each chromosome sequence of common bean. The genetic map length in total was smaller than for previous maps reflecting the precision of allele calling and mapping with SNP technology as well as the use of gene-based markers. A total of 91.4% of TOG markers had singleton hits with annotated Pv genes and all mapped outside of regions of resistance gene clusters. LD levels were found to be stronger within the Mesoamerican genepool and decay more rapidly within the Andean genepool. The recombination rate across the genome was 2.13 cM / Mb but R was found to be highly repressed around centromeres and frequent outside peri-centromeric regions. These results have important implications for association and genetic mapping or crop improvement in common bean. PMID:29522524
Genome-wide analysis of epistasis in body mass index using multiple human populations.
Wei, Wen-Hua; Hemani, Gib; Gyenesei, Attila; Vitart, Veronique; Navarro, Pau; Hayward, Caroline; Cabrera, Claudia P; Huffman, Jennifer E; Knott, Sara A; Hicks, Andrew A; Rudan, Igor; Pramstaller, Peter P; Wild, Sarah H; Wilson, James F; Campbell, Harry; Hastie, Nicholas D; Wright, Alan F; Haley, Chris S
2012-08-01
We surveyed gene-gene interactions (epistasis) in human body mass index (BMI) in four European populations (n<1200) via exhaustive pair-wise genome scans where interactions were computed as F ratios by testing a linear regression model fitting two single-nucleotide polymorphisms (SNPs) with interactions against the one without. Before the association tests, BMI was corrected for sex and age, normalised and adjusted for relatedness. Neither single SNPs nor SNP interactions were genome-wide significant in either cohort based on the consensus threshold (P=5.0E-08) and a Bonferroni corrected threshold (P=1.1E-12), respectively. Next we compared sub genome-wide significant SNP interactions (P<5.0E-08) across cohorts to identify common epistatic signals, where SNPs were annotated to genes to test for gene ontology (GO) enrichment. Among the epistatic genes contributing to the commonly enriched GO terms, 19 were shared across study cohorts of which 15 are previously published genome-wide association loci, including CDH13 (cadherin 13) associated with height and SORCS2 (sortilin-related VPS10 domain containing receptor 2) associated with circulating insulin-like growth factor 1 and binding protein 3. Interactions between the 19 shared epistatic genes and those involving BMI candidate loci (P<5.0E-08) were tested across cohorts and found eight replicated at the SNP level (P<0.05) in at least one cohort, which were further tested and showed limited replication in a separate European population (n>5000). We conclude that genome-wide analysis of epistasis in multiple populations is an effective approach to provide new insights into the genetic regulation of BMI but requires additional efforts to confirm the findings.
Revisiting Criteria for Plant MicroRNA Annotation in the Era of Big Data[OPEN
2018-01-01
MicroRNAs (miRNAs) are ∼21-nucleotide-long regulatory RNAs that arise from endonucleolytic processing of hairpin precursors. Many function as essential posttranscriptional regulators of target mRNAs and long noncoding RNAs. Alongside miRNAs, plants also produce large numbers of short interfering RNAs (siRNAs), which are distinguished from miRNAs primarily by their biogenesis (typically processed from long double-stranded RNA instead of single-stranded hairpins) and functions (typically via roles in transcriptional regulation instead of posttranscriptional regulation). Next-generation DNA sequencing methods have yielded extensive data sets of plant small RNAs, resulting in many miRNA annotations. However, it has become clear that many miRNA annotations are questionable. The sheer number of endogenous siRNAs compared with miRNAs has been a major factor in the erroneous annotation of siRNAs as miRNAs. Here, we provide updated criteria for the confident annotation of plant miRNAs, suitable for the era of “big data” from DNA sequencing. The updated criteria emphasize replication and the minimization of false positives, and they require next-generation sequencing of small RNAs. We argue that improved annotation systems are needed for miRNAs and all other classes of plant small RNAs. Finally, to illustrate the complexities of miRNA and siRNA annotation, we review the evolution and functions of miRNAs and siRNAs in plants. PMID:29343505
Dhanyalakshmi, K H; Naika, Mahantesha B N; Sajeevan, R S; Mathew, Oommen K; Shafi, K Mohamed; Sowdhamini, Ramanathan; N Nataraja, Karaba
2016-01-01
The modern sequencing technologies are generating large volumes of information at the transcriptome and genome level. Translation of this information into a biological meaning is far behind the race due to which a significant portion of proteins discovered remain as proteins of unknown function (PUFs). Attempts to uncover the functional significance of PUFs are limited due to lack of easy and high throughput functional annotation tools. Here, we report an approach to assign putative functions to PUFs, identified in the transcriptome of mulberry, a perennial tree commonly cultivated as host of silkworm. We utilized the mulberry PUFs generated from leaf tissues exposed to drought stress at whole plant level. A sequence and structure based computational analysis predicted the probable function of the PUFs. For rapid and easy annotation of PUFs, we developed an automated pipeline by integrating diverse bioinformatics tools, designated as PUFs Annotation Server (PUFAS), which also provides a web service API (Application Programming Interface) for a large-scale analysis up to a genome. The expression analysis of three selected PUFs annotated by the pipeline revealed abiotic stress responsiveness of the genes, and hence their potential role in stress acclimation pathways. The automated pipeline developed here could be extended to assign functions to PUFs from any organism in general. PUFAS web server is available at http://caps.ncbs.res.in/pufas/ and the web service is accessible at http://capservices.ncbs.res.in/help/pufas.
Gouret, Philippe; Vitiello, Vérane; Balandraud, Nathalie; Gilles, André; Pontarotti, Pierre; Danchin, Etienne GJ
2005-01-01
Background Two of the main objectives of the genomic and post-genomic era are to structurally and functionally annotate genomes which consists of detecting genes' position and structure, and inferring their function (as well as of other features of genomes). Structural and functional annotation both require the complex chaining of numerous different software, algorithms and methods under the supervision of a biologist. The automation of these pipelines is necessary to manage huge amounts of data released by sequencing projects. Several pipelines already automate some of these complex chaining but still necessitate an important contribution of biologists for supervising and controlling the results at various steps. Results Here we propose an innovative automated platform, FIGENIX, which includes an expert system capable to substitute to human expertise at several key steps. FIGENIX currently automates complex pipelines of structural and functional annotation under the supervision of the expert system (which allows for example to make key decisions, check intermediate results or refine the dataset). The quality of the results produced by FIGENIX is comparable to those obtained by expert biologists with a drastic gain in terms of time costs and avoidance of errors due to the human manipulation of data. Conclusion The core engine and expert system of the FIGENIX platform currently handle complex annotation processes of broad interest for the genomic community. They could be easily adapted to new, or more specialized pipelines, such as for example the annotation of miRNAs, the classification of complex multigenic families, annotation of regulatory elements and other genomic features of interest. PMID:16083500
Maize - GO annotation methods, evaluation, and review (Maize-GAMER)
USDA-ARS?s Scientific Manuscript database
Making a genome sequence accessible and useful involves three basic steps: genome assembly, structural annotation, and functional annotation. The quality of data generated at each step influences the accuracy of inferences that can be made, with high-quality analyses produce better datasets resultin...
Quality of Computationally Inferred Gene Ontology Annotations
Škunca, Nives; Altenhoff, Adrian; Dessimoz, Christophe
2012-01-01
Gene Ontology (GO) has established itself as the undisputed standard for protein function annotation. Most annotations are inferred electronically, i.e. without individual curator supervision, but they are widely considered unreliable. At the same time, we crucially depend on those automated annotations, as most newly sequenced genomes are non-model organisms. Here, we introduce a methodology to systematically and quantitatively evaluate electronic annotations. By exploiting changes in successive releases of the UniProt Gene Ontology Annotation database, we assessed the quality of electronic annotations in terms of specificity, reliability, and coverage. Overall, we not only found that electronic annotations have significantly improved in recent years, but also that their reliability now rivals that of annotations inferred by curators when they use evidence other than experiments from primary literature. This work provides the means to identify the subset of electronic annotations that can be relied upon—an important outcome given that >98% of all annotations are inferred without direct curation. PMID:22693439
High Precision Prediction of Functional Sites in Protein Structures
Buturovic, Ljubomir; Wong, Mike; Tang, Grace W.; Altman, Russ B.; Petkovic, Dragutin
2014-01-01
We address the problem of assigning biological function to solved protein structures. Computational tools play a critical role in identifying potential active sites and informing screening decisions for further lab analysis. A critical parameter in the practical application of computational methods is the precision, or positive predictive value. Precision measures the level of confidence the user should have in a particular computed functional assignment. Low precision annotations lead to futile laboratory investigations and waste scarce research resources. In this paper we describe an advanced version of the protein function annotation system FEATURE, which achieved 99% precision and average recall of 95% across 20 representative functional sites. The system uses a Support Vector Machine classifier operating on the microenvironment of physicochemical features around an amino acid. We also compared performance of our method with state-of-the-art sequence-level annotator Pfam in terms of precision, recall and localization. To our knowledge, no other functional site annotator has been rigorously evaluated against these key criteria. The software and predictive models are incorporated into the WebFEATURE service at http://feature.stanford.edu/wf4.0-beta. PMID:24632601
Chen, Xin; Wu, Qiong; Sun, Ruimin; Zhang, Louxin
2012-01-01
The discovery of single-nucleotide polymorphisms (SNPs) has important implications in a variety of genetic studies on human diseases and biological functions. One valuable approach proposed for SNP discovery is based on base-specific cleavage and mass spectrometry. However, it is still very challenging to achieve the full potential of this SNP discovery approach. In this study, we formulate two new combinatorial optimization problems. While both problems are aimed at reconstructing the sample sequence that would attain the minimum number of SNPs, they search over different candidate sequence spaces. The first problem, denoted as SNP - MSP, limits its search to sequences whose in silico predicted mass spectra have all their signals contained in the measured mass spectra. In contrast, the second problem, denoted as SNP - MSQ, limits its search to sequences whose in silico predicted mass spectra instead contain all the signals of the measured mass spectra. We present an exact dynamic programming algorithm for solving the SNP - MSP problem and also show that the SNP - MSQ problem is NP-hard by a reduction from a restricted variation of the 3-partition problem. We believe that an efficient solution to either problem above could offer a seamless integration of information in four complementary base-specific cleavage reactions, thereby improving the capability of the underlying biotechnology for sensitive and accurate SNP discovery.
MEGANTE: A Web-Based System for Integrated Plant Genome Annotation
Numa, Hisataka; Itoh, Takeshi
2014-01-01
The recent advancement of high-throughput genome sequencing technologies has resulted in a considerable increase in demands for large-scale genome annotation. While annotation is a crucial step for downstream data analyses and experimental studies, this process requires substantial expertise and knowledge of bioinformatics. Here we present MEGANTE, a web-based annotation system that makes plant genome annotation easy for researchers unfamiliar with bioinformatics. Without any complicated configuration, users can perform genomic sequence annotations simply by uploading a sequence and selecting the species to query. MEGANTE automatically runs several analysis programs and integrates the results to select the appropriate consensus exon–intron structures and to predict open reading frames (ORFs) at each locus. Functional annotation, including a similarity search against known proteins and a functional domain search, are also performed for the predicted ORFs. The resultant annotation information is visualized with a widely used genome browser, GBrowse. For ease of analysis, the results can be downloaded in Microsoft Excel format. All of the query sequences and annotation results are stored on the server side so that users can access their own data from virtually anywhere on the web. The current release of MEGANTE targets 24 plant species from the Brassicaceae, Fabaceae, Musaceae, Poaceae, Salicaceae, Solanaceae, Rosaceae and Vitaceae families, and it allows users to submit a sequence up to 10 Mb in length and to save up to 100 sequences with the annotation information on the server. The MEGANTE web service is available at https://megante.dna.affrc.go.jp/. PMID:24253915
Molecular Dynamics Information Improves cis-Peptide-Based Function Annotation of Proteins.
Das, Sreetama; Bhadra, Pratiti; Ramakumar, Suryanarayanarao; Pal, Debnath
2017-08-04
cis-Peptide bonds, whose occurrence in proteins is rare but evolutionarily conserved, are implicated to play an important role in protein function. This has led to their previous use in a homology-independent, fragment-match-based protein function annotation method. However, proteins are not static molecules; dynamics is integral to their activity. This is nicely epitomized by the geometric isomerization of cis-peptide to trans form for molecular activity. Hence we have incorporated both static (cis-peptide) and dynamics information to improve the prediction of protein molecular function. Our results show that cis-peptide information alone cannot detect functional matches in cases where cis-trans isomerization exists but 3D coordinates have been obtained for only the trans isomer or when the cis-peptide bond is incorrectly assigned as trans. On the contrary, use of dynamics information alone includes false-positive matches for cases where fragments with similar secondary structure show similar dynamics, but the proteins do not share a common function. Combining the two methods reduces errors while detecting the true matches, thereby enhancing the utility of our method in function annotation. A combined approach, therefore, opens up new avenues of improving existing automated function annotation methodologies.
Haplotype analysis of sucrose synthase gene family in three Saccharum species
2013-01-01
Background Sugarcane is an economically important crop contributing about 80% and 40% to the world sugar and ethanol production, respectively. The complicated genetics consequential to its complex polyploid genome, however, have impeded efforts to improve sugar yield and related important agronomic traits. Modern sugarcane cultivars are complex hybrids derived mainly from crosses among its progenitor species, S. officinarum and S. spontanuem, and to a lesser degree, S. robustom. Atypical of higher plants, sugarcane stores its photoassimilates as sucrose rather than as starch in its parenchymous stalk cells. In the sugar biosynthesis pathway, sucrose synthase (SuSy, UDP-glucose: D-fructose 2-a-D-glucosyltransferase, EC 2.4.1.13) is a key enzyme in the regulation of sucrose accumulation and partitioning by catalyzing the reversible conversion of sucrose and UDP into UDP-glucose and fructose. However, little is known about the sugarcane SuSy gene family members and hence no definitive studies have been reported regarding allelic diversity of SuSy gene families in Saccharum species. Results We identified and characterized a total of five sucrose synthase genes in the three sugarcane progenitor species through gene annotation and PCR haplotype analysis by analyzing 70 to 119 PCR fragments amplified from intron-containing target regions. We detected all but one (i.e. ScSuSy5) of ScSuSy transcripts in five tissue types of three Saccharum species. The average SNP frequency was one SNP per 108 bp, 81 bp, and 72 bp in S. officinarum, S. robustom, and S. spontanuem respectively. The average shared SNP is 15 between S. officinarum and S. robustom, 7 between S. officinarum and S. spontanuem , and 11 between S. robustom and S. spontanuem. We identified 27, 35, and 32 haplotypes from the five ScSuSy genes in S. officinarum, S. robustom, and S. spontanuem respectively. Also, 12, 11, and 9 protein sequences were translated from the haplotypes in S. officinarum, S. robustom, S. spontanuem, respectively. Phylogenetic analysis showed three separate clusters composed of SbSuSy1 and SbSuSy2, SbSuSy3 and SbSuSy5, and SbSuSy4. Conclusions The five members of the SuSy gene family evolved before the divergence of the genera in the tribe Andropogoneae at least 12 MYA. Each ScSuSy gene showed at least one non-synonymous substitution in SNP haplotypes. The SNP frequency is the lowest in S. officinarum, intermediate in S. robustum, and the highest in S. spontaneum, which may reflect the timing of the two rounds of whole genome duplication in these octoploids. The higher rate of shared SNP frequency between S. officinarum and S. robustum than between S. officinarum and in S. spontaneum confirmed that the speciation event separating S. officinarum and S. robustum occurred after their common ancestor diverged from S. spontaneum. The SNP and haplotype frequencies in three Saccharum species provide fundamental information for designing strategies to sequence these autopolyploid genomes. PMID:23663250
Haplotype analysis of sucrose synthase gene family in three Saccharum species.
Zhang, Jisen; Arro, Jie; Chen, Youqiang; Ming, Ray
2013-05-10
Sugarcane is an economically important crop contributing about 80% and 40% to the world sugar and ethanol production, respectively. The complicated genetics consequential to its complex polyploid genome, however, have impeded efforts to improve sugar yield and related important agronomic traits. Modern sugarcane cultivars are complex hybrids derived mainly from crosses among its progenitor species, S. officinarum and S. spontanuem, and to a lesser degree, S. robustom. Atypical of higher plants, sugarcane stores its photoassimilates as sucrose rather than as starch in its parenchymous stalk cells. In the sugar biosynthesis pathway, sucrose synthase (SuSy, UDP-glucose: D-fructose 2-a-D-glucosyltransferase, EC 2.4.1.13) is a key enzyme in the regulation of sucrose accumulation and partitioning by catalyzing the reversible conversion of sucrose and UDP into UDP-glucose and fructose. However, little is known about the sugarcane SuSy gene family members and hence no definitive studies have been reported regarding allelic diversity of SuSy gene families in Saccharum species. We identified and characterized a total of five sucrose synthase genes in the three sugarcane progenitor species through gene annotation and PCR haplotype analysis by analyzing 70 to 119 PCR fragments amplified from intron-containing target regions. We detected all but one (i.e. ScSuSy5) of ScSuSy transcripts in five tissue types of three Saccharum species. The average SNP frequency was one SNP per 108 bp, 81 bp, and 72 bp in S. officinarum, S. robustom, and S. spontanuem respectively. The average shared SNP is 15 between S. officinarum and S. robustom, 7 between S. officinarum and S. spontanuem , and 11 between S. robustom and S. spontanuem. We identified 27, 35, and 32 haplotypes from the five ScSuSy genes in S. officinarum, S. robustom, and S. spontanuem respectively. Also, 12, 11, and 9 protein sequences were translated from the haplotypes in S. officinarum, S. robustom, S. spontanuem, respectively. Phylogenetic analysis showed three separate clusters composed of SbSuSy1 and SbSuSy2, SbSuSy3 and SbSuSy5, and SbSuSy4. The five members of the SuSy gene family evolved before the divergence of the genera in the tribe Andropogoneae at least 12 MYA. Each ScSuSy gene showed at least one non-synonymous substitution in SNP haplotypes. The SNP frequency is the lowest in S. officinarum, intermediate in S. robustum, and the highest in S. spontaneum, which may reflect the timing of the two rounds of whole genome duplication in these octoploids. The higher rate of shared SNP frequency between S. officinarum and S. robustum than between S. officinarum and in S. spontaneum confirmed that the speciation event separating S. officinarum and S. robustum occurred after their common ancestor diverged from S. spontaneum. The SNP and haplotype frequencies in three Saccharum species provide fundamental information for designing strategies to sequence these autopolyploid genomes.
Determining Semantically Related Significant Genes.
Taha, Kamal
2014-01-01
GO relation embodies some aspects of existence dependency. If GO term xis existence-dependent on GO term y, the presence of y implies the presence of x. Therefore, the genes annotated with the function of the GO term y are usually functionally and semantically related to the genes annotated with the function of the GO term x. A large number of gene set enrichment analysis methods have been developed in recent years for analyzing gene sets enrichment. However, most of these methods overlook the structural dependencies between GO terms in GO graph by not considering the concept of existence dependency. We propose in this paper a biological search engine called RSGSearch that identifies enriched sets of genes annotated with different functions using the concept of existence dependency. We observe that GO term xcannot be existence-dependent on GO term y, if x- and y- have the same specificity (biological characteristics). After encoding into a numeric format the contributions of GO terms annotating target genes to the semantics of their lowest common ancestors (LCAs), RSGSearch uses microarray experiment to identify the most significant LCA that annotates the result genes. We evaluated RSGSearch experimentally and compared it with five gene set enrichment systems. Results showed marked improvement.
De novo RNA-seq and functional annotation of Ornithonyssus bacoti.
Niu, DongLing; Wang, RuiLing; Zhao, YaE; Yang, Rui; Hu, Li
2018-06-01
Ornithonyssus bacoti (Hirst) (Acari: Macronyssidae) is a vector and reservoir of pathogens causing serious infectious diseases, such as epidemic hemorrhagic fever, endemic typhus, tularemia, and leptospirosis. Its genome and transcriptome data are lacking in public databases. In this study, total RNA was extracted from live O. bacoti to conduct RNA-seq, functional annotation, coding domain sequence (CDS) prediction and simple sequence repeats (SSRs) detection. The results showed that 65.8 million clean reads were generated and assembled into 72,185 unigenes, of which 49.4% were annotated by seven functional databases. 23,121 unigenes were annotated and assigned to 457 species by non-redundant protein sequence database. The BLAST top-two hit species were Metaseiulus occidentalis and Ixodes scapularis. The procedure detected 12,426 SSRs, of which tri- and di-nucleotides were the most abundant types and the representative motifs were AAT/ATT and AC/GT. 26,936 CDS were predicted with a mean length of 711 bp. 87 unigenes of 30 functional genes, which are usually involved in stress responses, drug resistance, movement, metabolism and allergy, were further identified by bioinformatics methods. The unigenes putatively encoding cytochrome P450 proteins were further analyzed phylogenetically. In conclusion, this study completed the RNA-seq and functional annotation of O. bacoti successfully, which provides reliable molecular data for its future studies of gene function and molecular markers.
Computer Applications in Marketing. An Annotated Bibliography of Computer Software.
ERIC Educational Resources Information Center
Burrow, Jim; Schwamman, Faye
This bibliography contains annotations of 95 items of educational and business software with applications in seven marketing and business functions. The annotations, which appear in alphabetical order by title, provide this information: category (related application), title, date, source and price, equipment, supplementary materials, description…
Expanded microbial genome coverage and improved protein family annotation in the COG database.
Galperin, Michael Y; Makarova, Kira S; Wolf, Yuri I; Koonin, Eugene V
2015-01-01
Microbial genome sequencing projects produce numerous sequences of deduced proteins, only a small fraction of which have been or will ever be studied experimentally. This leaves sequence analysis as the only feasible way to annotate these proteins and assign to them tentative functions. The Clusters of Orthologous Groups of proteins (COGs) database (http://www.ncbi.nlm.nih.gov/COG/), first created in 1997, has been a popular tool for functional annotation. Its success was largely based on (i) its reliance on complete microbial genomes, which allowed reliable assignment of orthologs and paralogs for most genes; (ii) orthology-based approach, which used the function(s) of the characterized member(s) of the protein family (COG) to assign function(s) to the entire set of carefully identified orthologs and describe the range of potential functions when there were more than one; and (iii) careful manual curation of the annotation of the COGs, aimed at detailed prediction of the biological function(s) for each COG while avoiding annotation errors and overprediction. Here we present an update of the COGs, the first since 2003, and a comprehensive revision of the COG annotations and expansion of the genome coverage to include representative complete genomes from all bacterial and archaeal lineages down to the genus level. This re-analysis of the COGs shows that the original COG assignments had an error rate below 0.5% and allows an assessment of the progress in functional genomics in the past 12 years. During this time, functions of many previously uncharacterized COGs have been elucidated and tentative functional assignments of many COGs have been validated, either by targeted experiments or through the use of high-throughput methods. A particularly important development is the assignment of functions to several widespread, conserved proteins many of which turned out to participate in translation, in particular rRNA maturation and tRNA modification. The new version of the COGs is expected to become an important tool for microbial genomics. Published by Oxford University Press on behalf of Nucleic Acids Research 2014. This work is written by US Government employees and is in the public domain in the US.
AGORA : Organellar genome annotation from the amino acid and nucleotide references.
Jung, Jaehee; Kim, Jong Im; Jeong, Young-Sik; Yi, Gangman
2018-03-29
Next-generation sequencing (NGS) technologies have led to the accumulation of highthroughput sequence data from various organisms in biology. To apply gene annotation of organellar genomes for various organisms, more optimized tools for functional gene annotation are required. Almost all gene annotation tools are mainly focused on the chloroplast genome of land plants or the mitochondrial genome of animals.We have developed a web application AGORA for the fast, user-friendly, and improved annotations of organellar genomes. AGORA annotates genes based on a BLAST-based homology search and clustering with selected reference sequences from the NCBI database or user-defined uploaded data. AGORA can annotate the functional genes in almost all mitochondrion and plastid genomes of eukaryotes. The gene annotation of a genome with an exon-intron structure within a gene or inverted repeat region is also available. It provides information of start and end positions of each gene, BLAST results compared with the reference sequence, and visualization of gene map by OGDRAW. Users can freely use the software, and the accessible URL is https://bigdata.dongguk.edu/gene_project/AGORA/.The main module of the tool is implemented by the python and php, and the web page is built by the HTML and CSS to support all browsers. gangman@dongguk.edu.
Xiao, Shijun; Wang, Panpan; Dong, Linsong; Zhang, Yaguang; Han, Zhaofang; Wang, Qiurong
2016-01-01
Whole-genome single-nucleotide polymorphism (SNP) markers are valuable genetic resources for the association and conservation studies. Genome-wide SNP development in many teleost species are still challenging because of the genome complexity and the cost of re-sequencing. Genotyping-By-Sequencing (GBS) provided an efficient reduced representative method to squeeze cost for SNP detection; however, most of recent GBS applications were reported on plant organisms. In this work, we used an EcoRI-NlaIII based GBS protocol to teleost large yellow croaker, an important commercial fish in China and East-Asia, and reported the first whole-genome SNP development for the species. 69,845 high quality SNP markers that evenly distributed along genome were detected in at least 80% of 500 individuals. Nearly 95% randomly selected genotypes were successfully validated by Sequenom MassARRAY assay. The association studies with the muscle eicosapentaenoic acid (EPA) and docosahexaenoic acid (DHA) content discovered 39 significant SNP markers, contributing as high up to ∼63% genetic variance that explained by all markers. Functional genes that involved in fat digestion and absorption pathway were identified, such as APOB, CRAT and OSBPL10. Notably, PPT2 Gene, previously identified in the association study of the plasma n-3 and n-6 polyunsaturated fatty acid level in human, was re-discovered in large yellow croaker. Our study verified that EcoRI-NlaIII based GBS could produce quality SNP markers in a cost-efficient manner in teleost genome. The developed SNP markers and the EPA and DHA associated SNP loci provided invaluable resources for the population structure, conservation genetics and genomic selection of large yellow croaker and other fish organisms. PMID:28028455
AA9int: SNP Interaction Pattern Search Using Non-Hierarchical Additive Model Set.
Lin, Hui-Yi; Huang, Po-Yu; Chen, Dung-Tsa; Tung, Heng-Yuan; Sellers, Thomas A; Pow-Sang, Julio; Eeles, Rosalind; Easton, Doug; Kote-Jarai, Zsofia; Amin Al Olama, Ali; Benlloch, Sara; Muir, Kenneth; Giles, Graham G; Wiklund, Fredrik; Gronberg, Henrik; Haiman, Christopher A; Schleutker, Johanna; Nordestgaard, Børge G; Travis, Ruth C; Hamdy, Freddie; Neal, David E; Pashayan, Nora; Khaw, Kay-Tee; Stanford, Janet L; Blot, William J; Thibodeau, Stephen N; Maier, Christiane; Kibel, Adam S; Cybulski, Cezary; Cannon-Albright, Lisa; Brenner, Hermann; Kaneva, Radka; Batra, Jyotsna; Teixeira, Manuel R; Pandha, Hardev; Lu, Yong-Jie; Park, Jong Y
2018-06-07
The use of single nucleotide polymorphism (SNP) interactions to predict complex diseases is getting more attention during the past decade, but related statistical methods are still immature. We previously proposed the SNP Interaction Pattern Identifier (SIPI) approach to evaluate 45 SNP interaction patterns/patterns. SIPI is statistically powerful but suffers from a large computation burden. For large-scale studies, it is necessary to use a powerful and computation-efficient method. The objective of this study is to develop an evidence-based mini-version of SIPI as the screening tool or solitary use and to evaluate the impact of inheritance mode and model structure on detecting SNP-SNP interactions. We tested two candidate approaches: the 'Five-Full' and 'AA9int' method. The Five-Full approach is composed of the five full interaction models considering three inheritance modes (additive, dominant and recessive). The AA9int approach is composed of nine interaction models by considering non-hierarchical model structure and the additive mode. Our simulation results show that AA9int has similar statistical power compared to SIPI and is superior to the Five-Full approach, and the impact of the non-hierarchical model structure is greater than that of the inheritance mode in detecting SNP-SNP interactions. In summary, it is recommended that AA9int is a powerful tool to be used either alone or as the screening stage of a two-stage approach (AA9int+SIPI) for detecting SNP-SNP interactions in large-scale studies. The 'AA9int' and 'parAA9int' functions (standard and parallel computing version) are added in the SIPI R package, which is freely available at https://linhuiyi.github.io/LinHY_Software/. hlin1@lsuhsc.edu. Supplementary data are available at Bioinformatics online.
Defining functional distance using manifold embeddings of gene ontology annotations
Lerman, Gilad; Shakhnovich, Boris E.
2007-01-01
Although rigorous measures of similarity for sequence and structure are now well established, the problem of defining functional relationships has been particularly daunting. Here, we present several manifold embedding techniques to compute distances between Gene Ontology (GO) functional annotations and consequently estimate functional distances between protein domains. To evaluate accuracy, we correlate the functional distance to the well established measures of sequence, structural, and phylogenetic similarities. Finally, we show that manual classification of structures into folds and superfamilies is mirrored by proximity in the newly defined function space. We show how functional distances place structure–function relationships in biological context resulting in insight into divergent and convergent evolution. The methods and results in this paper can be readily generalized and applied to a wide array of biologically relevant investigations, such as accuracy of annotation transference, the relationship between sequence, structure, and function, or coherence of expression modules. PMID:17595300
Report on the development of putative functional SSR and SNP markers in passion fruits.
da Costa, Zirlane Portugal; Munhoz, Carla de Freitas; Vieira, Maria Lucia Carneiro
2017-09-06
Passionflowers Passiflora edulis and Passiflora alata are diploid, outcrossing and understudied fruit bearing species. In Brazil, passion fruit cultivation began relatively recently and has earned the country an outstanding position as the world's top producer of passion fruit. The fruit's main economic value lies in the production of juice, an essential exotic ingredient in juice blends. Currently, crop improvement strategies, including those for underexploited tropical species, tend to incorporate molecular genetic approaches. In this study, we examined a set of P. edulis transcripts expressed in response to infection by Xanthomonas axonopodis, (the passion fruit's main bacterial pathogen that attacks the vines), aiming at the development of putative functional markers, i.e. SSRs (simple sequence repeats) and SNPs (single nucleotide polymorphisms). A total of 210 microsatellites were found in 998 sequences, and trinucleotide repeats were found to be the most frequent (31.4%). Of the sequences selected for designing primers, 80.9% could be used to develop SSR markers, and 60.6% SNP markers for P. alata. SNPs were all biallelic and found within 15 gene fragments of P. alata. Overall, gene fragments generated 10,003 bp. SNP frequency was estimated as one SNP every 294 bp. Polymorphism rates revealed by SSR and SNP loci were 29.4 and 53.6%, respectively. Passiflora edulis transcripts were useful for the development of putative functional markers for P. alata, suggesting a certain level of sequence conservation between these cultivated species. The markers developed herein could be used for genetic mapping purposes and also in diversity studies.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kolker, Eugene
Our project focused primarily on analysis of different types of data produced by global high-throughput technologies, data integration of gene annotation, and gene and protein expression information, as well as on getting a better functional annotation of Shewanella genes. Specifically, four of our numerous major activities and achievements include the development of: statistical models for identification and expression proteomics, superior to currently available approaches (including our own earlier ones); approaches to improve gene annotations on the whole-organism scale; standards for annotation, transcriptomics and proteomics approaches; and generalized approaches for data integration of gene annotation, gene and protein expression information.
USDA-ARS?s Scientific Manuscript database
The increasing number of sequenced plant genomes is placing new demands on the methods applied to analyze, annotate, and model these genomes. Today's annotation pipelines result in inconsistent gene assignments that complicate comparative analyses and prevent efficient construction of metabolic mode...
Rašić, Gordana; Filipović, Igor; Weeks, Andrew R; Hoffmann, Ary A
2014-04-11
Genetic markers are widely used to understand the biology and population dynamics of disease vectors, but often markers are limited in the resolution they provide. In particular, the delineation of population structure, fine scale movement and patterns of relatedness are often obscured unless numerous markers are available. To address this issue in the major arbovirus vector, the yellow fever mosquito (Aedes aegypti), we used double digest Restriction-site Associated DNA (ddRAD) sequencing for the discovery of genome-wide single nucleotide polymorphisms (SNPs). We aimed to characterize the new SNP set and to test the resolution against previously described microsatellite markers in detecting broad and fine-scale genetic patterns in Ae. aegypti. We developed bioinformatics tools that support the customization of restriction enzyme-based protocols for SNP discovery. We showed that our approach for RAD library construction achieves unbiased genome representation that reflects true evolutionary processes. In Ae. aegypti samples from three continents we identified more than 18,000 putative SNPs. They were widely distributed across the three Ae. aegypti chromosomes, with 47.9% found in intergenic regions and 17.8% in exons of over 2,300 genes. Pattern of their imputed effects in ORFs and UTRs were consistent with those found in a recent transcriptome study. We demonstrated that individual mosquitoes from Indonesia, Australia, Vietnam and Brazil can be assigned with a very high degree of confidence to their region of origin using a large SNP panel. We also showed that familial relatedness of samples from a 0.4 km2 area could be confidently established with a subset of SNPs. Using a cost-effective customized RAD sequencing approach supported by our bioinformatics tools, we characterized over 18,000 SNPs in field samples of the dengue fever mosquito Ae. aegypti. The variants were annotated and positioned onto the three Ae. aegypti chromosomes. The new SNP set provided much greater resolution in detecting population structure and estimating fine-scale relatedness than a set of polymorphic microsatellites. RAD-based markers demonstrate great potential to advance our understanding of mosquito population processes, critical for implementing new control measures against this major disease vector.
Social networks to biological networks: systems biology of Mycobacterium tuberculosis.
Vashisht, Rohit; Bhardwaj, Anshu; Osdd Consortium; Brahmachari, Samir K
2013-07-01
Contextualizing relevant information to construct a network that represents a given biological process presents a fundamental challenge in the network science of biology. The quality of network for the organism of interest is critically dependent on the extent of functional annotation of its genome. Mostly the automated annotation pipelines do not account for unstructured information present in volumes of literature and hence large fraction of genome remains poorly annotated. However, if used, this information could substantially enhance the functional annotation of a genome, aiding the development of a more comprehensive network. Mining unstructured information buried in volumes of literature often requires manual intervention to a great extent and thus becomes a bottleneck for most of the automated pipelines. In this review, we discuss the potential of scientific social networking as a solution for systematic manual mining of data. Focusing on Mycobacterium tuberculosis, as a case study, we discuss our open innovative approach for the functional annotation of its genome. Furthermore, we highlight the strength of such collated structured data in the context of drug target prediction based on systems level analysis of pathogen.
Brettin, Thomas; Davis, James J.; Disz, Terry; ...
2015-02-10
The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offersmore » a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception.« less
DBATE: database of alternative transcripts expression.
Bianchi, Valerio; Colantoni, Alessio; Calderone, Alberto; Ausiello, Gabriele; Ferrè, Fabrizio; Helmer-Citterich, Manuela
2013-01-01
The use of high-throughput RNA sequencing technology (RNA-seq) allows whole transcriptome analysis, providing an unbiased and unabridged view of alternative transcript expression. Coupling splicing variant-specific expression with its functional inference is still an open and difficult issue for which we created the DataBase of Alternative Transcripts Expression (DBATE), a web-based repository storing expression values and functional annotation of alternative splicing variants. We processed 13 large RNA-seq panels from human healthy tissues and in disease conditions, reporting expression levels and functional annotations gathered and integrated from different sources for each splicing variant, using a variant-specific annotation transfer pipeline. The possibility to perform complex queries by cross-referencing different functional annotations permits the retrieval of desired subsets of splicing variant expression values that can be visualized in several ways, from simple to more informative. DBATE is intended as a novel tool to help appreciate how, and possibly why, the transcriptome expression is shaped. DATABASE URL: http://bioinformatica.uniroma2.it/DBATE/.
Assessment of protein set coherence using functional annotations
Chagoyen, Monica; Carazo, Jose M; Pascual-Montano, Alberto
2008-01-01
Background Analysis of large-scale experimental datasets frequently produces one or more sets of proteins that are subsequently mined for functional interpretation and validation. To this end, a number of computational methods have been devised that rely on the analysis of functional annotations. Although current methods provide valuable information (e.g. significantly enriched annotations, pairwise functional similarities), they do not specifically measure the degree of homogeneity of a protein set. Results In this work we present a method that scores the degree of functional homogeneity, or coherence, of a set of proteins on the basis of the global similarity of their functional annotations. The method uses statistical hypothesis testing to assess the significance of the set in the context of the functional space of a reference set. As such, it can be used as a first step in the validation of sets expected to be homogeneous prior to further functional interpretation. Conclusion We evaluate our method by analysing known biologically relevant sets as well as random ones. The known relevant sets comprise macromolecular complexes, cellular components and pathways described for Saccharomyces cerevisiae, which are mostly significantly coherent. Finally, we illustrate the usefulness of our approach for validating 'functional modules' obtained from computational analysis of protein-protein interaction networks. Matlab code and supplementary data are available at PMID:18937846
Louis, Ed
2011-01-01
In the early days of the yeast genome sequencing project, gene annotation was in its infancy and suffered the problem of many false positive annotations as well as missed genes. The lack of other sequences for comparison also prevented the annotation of conserved, functional sequences that were not coding. We are now in an era of comparative genomics where many closely related as well as more distantly related genomes are available for direct sequence and synteny comparisons allowing for more probable predictions of genes and other functional sequences due to conservation. We also have a plethora of functional genomics data which helps inform gene annotation for previously uncharacterised open reading frames (ORFs)/genes. For Saccharomyces cerevisiae this has resulted in a continuous updating of the gene and functional sequence annotations in the reference genome helping it retain its position as the best characterized eukaryotic organism's genome. A single reference genome for a species does not accurately describe the species and this is quite clear in the case of S. cerevisiae where the reference strain is not ideal for brewing or baking due to missing genes. Recent surveys of numerous isolates, from a variety of sources, using a variety of technologies have revealed a great deal of variation amongst isolates with genome sequence surveys providing information on novel genes, undetectable by other means. We now have a better understanding of the extant variation in S. cerevisiae as a species as well as some idea of how much we are missing from this understanding. As with gene annotation, comparative genomics enhances the discovery and description of genome variation and is providing us with the tools for understanding genome evolution, adaptation and selection, and underlying genetics of complex traits.
Bioinformatics for spermatogenesis: annotation of male reproduction based on proteomics
Zhou, Tao; Zhou, Zuo-Min; Guo, Xue-Jiang
2013-01-01
Proteomics strategies have been widely used in the field of male reproduction, both in basic and clinical research. Bioinformatics methods are indispensable in proteomics-based studies and are used for data presentation, database construction and functional annotation. In the present review, we focus on the functional annotation of gene lists obtained through qualitative or quantitative methods, summarizing the common and male reproduction specialized proteomics databases. We introduce several integrated tools used to find the hidden biological significance from the data obtained. We further describe in detail the information on male reproduction derived from Gene Ontology analyses, pathway analyses and biomedical analyses. We provide an overview of bioinformatics annotations in spermatogenesis, from gene function to biological function and from biological function to clinical application. On the basis of recently published proteomics studies and associated data, we show that bioinformatics methods help us to discover drug targets for sperm motility and to scan for cancer-testis genes. In addition, we summarize the online resources relevant to male reproduction research for the exploration of the regulation of spermatogenesis. PMID:23852026
Impact of ontology evolution on functional analyses.
Groß, Anika; Hartung, Michael; Prüfer, Kay; Kelso, Janet; Rahm, Erhard
2012-10-15
Ontologies are used in the annotation and analysis of biological data. As knowledge accumulates, ontologies and annotation undergo constant modifications to reflect this new knowledge. These modifications may influence the results of statistical applications such as functional enrichment analyses that describe experimental data in terms of ontological groupings. Here, we investigate to what degree modifications of the Gene Ontology (GO) impact these statistical analyses for both experimental and simulated data. The analysis is based on new measures for the stability of result sets and considers different ontology and annotation changes. Our results show that past changes in the GO are non-uniformly distributed over different branches of the ontology. Considering the semantic relatedness of significant categories in analysis results allows a more realistic stability assessment for functional enrichment studies. We observe that the results of term-enrichment analyses tend to be surprisingly stable despite changes in ontology and annotation.
Gene-Environment Interaction in the Etiology of Mathematical Ability Using SNP Sets
Kovas, Yulia; Plomin, Robert
2010-01-01
Mathematics ability and disability is as heritable as other cognitive abilities and disabilities, however its genetic etiology has received relatively little attention. In our recent genome-wide association study of mathematical ability in 10-year-old children, 10 SNP associations were nominated from scans of pooled DNA and validated in an individually genotyped sample. In this paper, we use a ‘SNP set’ composite of these 10 SNPs to investigate gene-environment (GE) interaction, examining whether the association between the 10-SNP set and mathematical ability differs as a function of ten environmental measures in the home and school in a sample of 1888 children with complete data. We found two significant GE interactions for environmental measures in the home and the school both in the direction of the diathesis-stress type of GE interaction: The 10-SNP set was more strongly associated with mathematical ability in chaotic homes and when parents are negative. PMID:20978832
Bonfiglio, Ferdinando; Zheng, Tenghao; Garcia-Etxebarria, Koldo; Hadizadeh, Fatemeh; Bujanda, Luis; Bresso, Francesca; Agreus, Lars; Andreasson, Anna; Dlugosz, Aldona; Lindberg, Greger; Schmidt, Peter T; Karling, Pontus; Ohlsson, Bodil; Simren, Magnus; Walter, Susanna; Nardone, Gerardo; Cuomo, Rosario; Usai-Satta, Paolo; Galeazzi, Francesca; Neri, Matteo; Portincasa, Piero; Bellini, Massimo; Barbara, Giovanni; Latiano, Anna; Hübenthal, Matthias; Thijs, Vincent; Netea, Mihai G; Jonkers, Daisy; Chang, Lin; Mayer, Emeran A; Wouters, Mira M; Boeckxstaens, Guy; Camilleri, Michael; Franke, Andre; Zhernakova, Alexandra; D'Amato, Mauro
2018-04-04
Genetic factors are believed to affect risk for irritable bowel syndrome (IBS), but there have been no sufficiently powered and adequately sized studies. To identify DNA variants associated with IBS risk, we performed a genome-wide association study (GWAS) of the large UK Biobank population-based cohort, which includes genotype and health data from 500,000 participants. We studied 7,287,191 high-quality single-nucleotide polymorphisms in individuals who self-reported a doctor's diagnosis of IBS (cases; m=9576) compared to the remainder of the cohort (controls; n=336,499) (mean age of study subjects, 40-69 years). Genome-wide significant findings were further investigated in 2045 patients with IBS from tertiary centers and 7955 population controls from Europe and the United States, and a small general population sample from Sweden (n=249). Functional annotation of GWAS results was carried out by integrating data from multiple biorepositories, to obtain biological insights from the observed associations. We identified a genome-wide significant association on chromosome 9q31.2 (SNP rs10512344; P=3.57×10 -8 ), in a region previously linked to age at menarche, and 13 additional loci of suggestive significance (P<5.0×10 -6 ). Sex-stratified analyses revealed that the variants at 9q32.1 affect risk of IBS in only women (P=4.29×10 -10 in UK Biobank) and also associate with constipation-predominant IBS in women (P=.015 in the tertiary cohort) and harder stools in women (P=.0012 in the population-based sample). Functional annotation of the 9q32.1 locus identified 8 candidate genes, including the elongator complex protein 1 gene (ELP1 or IKBKAP), which is mutated in patients with familial dysautonomia. In a sufficiently powered GWAS of IBS, we associated variants at the locus 9q32.1 with risk of IBS in women. This observation may provide additional rationale for investigating the role of sex hormones and autonomic dysfunction in IBS. Copyright © 2018 AGA Institute. Published by Elsevier Inc. All rights reserved.
RELATIONSHIP BETWEEN PHYLOGENETIC DISTRIBUTION AND GENOMIC FEATURES IN NEUROSPORA CRASSA
USDA-ARS?s Scientific Manuscript database
In the post-genome era, insufficient functional annotation of predicted genes greatly restricts the potential of mining genome data. We demonstrate that an evolutionary approach, which is independent of functional annotation, has great potential as a tool for genome analysis. We chose the genome o...
SNP discovery in candidate adaptive genes using exon capture in a free-ranging alpine ungulate
Gretchen H. Roffler; Stephen J. Amish; Seth Smith; Ted Cosart; Marty Kardos; Michael K. Schwartz; Gordon Luikart
2016-01-01
Identification of genes underlying genomic signatures of natural selection is key to understanding adaptation to local conditions. We used targeted resequencing to identify SNP markers in 5321 candidate adaptive genes associated with known immunological, metabolic and growth functions in ovids and other ungulates. We selectively targeted 8161 exons in protein-coding...
Goonesekere, Nalin C W; Shipely, Krysten; O'Connor, Kevin
2010-06-01
The Pfam database is an important tool in genome annotation, since it provides a collection of curated protein families. However, a subset of these families, known as domains of unknown function (DUFs), remains poorly characterized. We have related sequences from DUF404, DUF407, DUF482, DUF608, DUF810, DUF853, DUF976 and DUF1111 to homologs in PDB, within the midnight zone (9-20%) of sequence identity. These relationships were extended to provide functional annotation by sequence analysis and model building. Also described are examples of residue plasticity within enzyme active sites, and change of function within homologous sequences of a DUF. Copyright 2010 Elsevier Ltd. All rights reserved.
2012-01-01
The increasing size and complexity of exome/genome sequencing data requires new tools for clinical geneticists to discover disease-causing variants. Bottlenecks in identifying the causative variation include poor cross-sample querying, constantly changing functional annotation and not considering existing knowledge concerning the phenotype. We describe a methodology that facilitates exploration of patient sequencing data towards identification of causal variants under different genetic hypotheses. Annotate-it facilitates handling, analysis and interpretation of high-throughput single nucleotide variant data. We demonstrate our strategy using three case studies. Annotate-it is freely available and test data are accessible to all users at http://www.annotate-it.org. PMID:23013645
EnsMart: A Generic System for Fast and Flexible Access to Biological Data
Kasprzyk, Arek; Keefe, Damian; Smedley, Damian; London, Darin; Spooner, William; Melsopp, Craig; Hammond, Martin; Rocca-Serra, Philippe; Cox, Tony; Birney, Ewan
2004-01-01
The EnsMart system (www.ensembl.org/EnsMart) provides a generic data warehousing solution for fast and flexible querying of large biological data sets and integration with third-party data and tools. The system consists of a query-optimized database and interactive, user-friendly interfaces. EnsMart has been applied to Ensembl, where it extends its genomic browser capabilities, facilitating rapid retrieval of customized data sets. A wide variety of complex queries, on various types of annotations, for numerous species are supported. These can be applied to many research problems, ranging from SNP selection for candidate gene screening, through cross-species evolutionary comparisons, to microarray annotation. Users can group and refine biological data according to many criteria, including cross-species analyses, disease links, sequence variations, and expression patterns. Both tabulated list data and biological sequence output can be generated dynamically, in HTML, text, Microsoft Excel, and compressed formats. A wide range of sequence types, such as cDNA, peptides, coding regions, UTRs, and exons, with additional upstream and downstream regions, can be retrieved. The EnsMart database can be accessed via a public Web site, or through a Java application suite. Both implementations and the database are freely available for local installation, and can be extended or adapted to `non-Ensembl' data sets. PMID:14707178
Bryan, Kenneth; Cunningham, Pádraig
2008-01-01
Background Microarrays have the capacity to measure the expressions of thousands of genes in parallel over many experimental samples. The unsupervised classification technique of bicluster analysis has been employed previously to uncover gene expression correlations over subsets of samples with the aim of providing a more accurate model of the natural gene functional classes. This approach also has the potential to aid functional annotation of unclassified open reading frames (ORFs). Until now this aspect of biclustering has been under-explored. In this work we illustrate how bicluster analysis may be extended into a 'semi-supervised' ORF annotation approach referred to as BALBOA. Results The efficacy of the BALBOA ORF classification technique is first assessed via cross validation and compared to a multi-class k-Nearest Neighbour (kNN) benchmark across three independent gene expression datasets. BALBOA is then used to assign putative functional annotations to unclassified yeast ORFs. These predictions are evaluated using existing experimental and protein sequence information. Lastly, we employ a related semi-supervised method to predict the presence of novel functional modules within yeast. Conclusion In this paper we demonstrate how unsupervised classification methods, such as bicluster analysis, may be extended using of available annotations to form semi-supervised approaches within the gene expression analysis domain. We show that such methods have the potential to improve upon supervised approaches and shed new light on the functions of unclassified ORFs and their co-regulation. PMID:18831786
Protein Annotators' Assistant: A Novel Application of Information Retrieval Techniques.
ERIC Educational Resources Information Center
Wise, Michael J.
2000-01-01
Protein Annotators' Assistant (PAA) is a software system which assists protein annotators in assigning functions to newly sequenced proteins. PAA employs a number of information retrieval techniques in a novel setting and is thus related to text categorization, where multiple categories may be suggested, except that in this case none of the…
Rutllant, Josep
2016-01-01
Comparative genomics approaches provide a means of leveraging functional genomics information from a highly annotated model organism's genome (such as the mouse genome) in order to make physiological inferences about the role of genes and proteins in a less characterized organism's genome (such as the Burmese python). We employed a comparative genomics approach to produce the functional annotation of Python bivittatus genes encoding proteins associated with sperm phenotypes. We identify 129 gene-phenotype relationships in the python which are implicated in 10 specific sperm phenotypes. Results obtained through our systematic analysis identified subsets of python genes exhibiting associations with gene ontology annotation terms. Functional annotation data was represented in a semantic scatter plot. Together, these newly annotated Python bivittatus genome resources provide a high resolution framework from which the biology relating to reptile spermatogenesis, fertility, and reproduction can be further investigated. Applications of our research include (1) production of genetic diagnostics for assessing fertility in domestic and wild reptiles; (2) enhanced assisted reproduction technology for endangered and captive reptiles; and (3) novel molecular targets for biotechnology-based approaches aimed at reducing fertility and reproduction of invasive reptiles. Additional enhancements to reptile genomic resources will further enhance their value. PMID:27200191
Raethong, Nachon; Wong-ekkabut, Jirasak; Laoteng, Kobkul; Vongsangnak, Wanwipa
2016-01-01
Aspergillus oryzae is widely used for the industrial production of enzymes. In A. oryzae metabolism, transporters appear to play crucial roles in controlling the flux of molecules for energy generation, nutrients delivery, and waste elimination in the cell. While the A. oryzae genome sequence is available, transporter annotation remains limited and thus the connectivity of metabolic networks is incomplete. In this study, we developed a metabolic annotation strategy to understand the relationship between the sequence, structure, and function for annotation of A. oryzae metabolic transporters. Sequence-based analysis with manual curation showed that 58 genes of 12,096 total genes in the A. oryzae genome encoded metabolic transporters. Under consensus integrative databases, 55 unambiguous metabolic transporter genes were distributed into channels and pores (7 genes), electrochemical potential-driven transporters (33 genes), and primary active transporters (15 genes). To reveal the transporter functional role, a combination of homology modeling and molecular dynamics simulation was implemented to assess the relationship between sequence to structure and structure to function. As in the energy metabolism of A. oryzae, the H+-ATPase encoded by the AO090005000842 gene was selected as a representative case study of multilevel linkage annotation. Our developed strategy can be used for enhancing metabolic network reconstruction. PMID:27274991
Raethong, Nachon; Wong-Ekkabut, Jirasak; Laoteng, Kobkul; Vongsangnak, Wanwipa
2016-01-01
Aspergillus oryzae is widely used for the industrial production of enzymes. In A. oryzae metabolism, transporters appear to play crucial roles in controlling the flux of molecules for energy generation, nutrients delivery, and waste elimination in the cell. While the A. oryzae genome sequence is available, transporter annotation remains limited and thus the connectivity of metabolic networks is incomplete. In this study, we developed a metabolic annotation strategy to understand the relationship between the sequence, structure, and function for annotation of A. oryzae metabolic transporters. Sequence-based analysis with manual curation showed that 58 genes of 12,096 total genes in the A. oryzae genome encoded metabolic transporters. Under consensus integrative databases, 55 unambiguous metabolic transporter genes were distributed into channels and pores (7 genes), electrochemical potential-driven transporters (33 genes), and primary active transporters (15 genes). To reveal the transporter functional role, a combination of homology modeling and molecular dynamics simulation was implemented to assess the relationship between sequence to structure and structure to function. As in the energy metabolism of A. oryzae, the H(+)-ATPase encoded by the AO090005000842 gene was selected as a representative case study of multilevel linkage annotation. Our developed strategy can be used for enhancing metabolic network reconstruction.
Irizarry, Kristopher J L; Rutllant, Josep
2016-01-01
Comparative genomics approaches provide a means of leveraging functional genomics information from a highly annotated model organism's genome (such as the mouse genome) in order to make physiological inferences about the role of genes and proteins in a less characterized organism's genome (such as the Burmese python). We employed a comparative genomics approach to produce the functional annotation of Python bivittatus genes encoding proteins associated with sperm phenotypes. We identify 129 gene-phenotype relationships in the python which are implicated in 10 specific sperm phenotypes. Results obtained through our systematic analysis identified subsets of python genes exhibiting associations with gene ontology annotation terms. Functional annotation data was represented in a semantic scatter plot. Together, these newly annotated Python bivittatus genome resources provide a high resolution framework from which the biology relating to reptile spermatogenesis, fertility, and reproduction can be further investigated. Applications of our research include (1) production of genetic diagnostics for assessing fertility in domestic and wild reptiles; (2) enhanced assisted reproduction technology for endangered and captive reptiles; and (3) novel molecular targets for biotechnology-based approaches aimed at reducing fertility and reproduction of invasive reptiles. Additional enhancements to reptile genomic resources will further enhance their value.
Complex nature of SNP genotype effects on gene expression in primary human leucocytes.
Heap, Graham A; Trynka, Gosia; Jansen, Ritsert C; Bruinenberg, Marcel; Swertz, Morris A; Dinesen, Lotte C; Hunt, Karen A; Wijmenga, Cisca; Vanheel, David A; Franke, Lude
2009-01-07
Genome wide association studies have been hugely successful in identifying disease risk variants, yet most variants do not lead to coding changes and how variants influence biological function is usually unknown. We correlated gene expression and genetic variation in untouched primary leucocytes (n = 110) from individuals with celiac disease - a common condition with multiple risk variants identified. We compared our observations with an EBV-transformed HapMap B cell line dataset (n = 90), and performed a meta-analysis to increase power to detect non-tissue specific effects. In celiac peripheral blood, 2,315 SNP variants influenced gene expression at 765 different transcripts (< 250 kb from SNP, at FDR = 0.05, cis expression quantitative trait loci, eQTLs). 135 of the detected SNP-probe effects (reflecting 51 unique probes) were also detected in a HapMap B cell line published dataset, all with effects in the same allelic direction. Overall gene expression differences within the two datasets predominantly explain the limited overlap in observed cis-eQTLs. Celiac associated risk variants from two regions, containing genes IL18RAP and CCR3, showed significant cis genotype-expression correlations in the peripheral blood but not in the B cell line datasets. We identified 14 genes where a SNP affected the expression of different probes within the same gene, but in opposite allelic directions. By incorporating genetic variation in co-expression analyses, functional relationships between genes can be more significantly detected. In conclusion, the complex nature of genotypic effects in human populations makes the use of a relevant tissue, large datasets, and analysis of different exons essential to enable the identification of the function for many genetic risk variants in common diseases.
Xu, Jin; Lu, Zhigang; Xu, Mingming; Pan, Ling; Deng, Yi; Xie, Xiaohu; Liu, Huifen; Ding, Shixiong; Hurd, Yasmin L.; Pasternak, Gavril W.; Klein, Robert J.; Cartegni, Luca
2014-01-01
Single nucleotide polymorphisms (SNPs) in the OPRM1 gene have been associated with vulnerability to opioid dependence. The current study identifies an association of an intronic SNP (rs9479757) with the severity of heroin addiction among Han-Chinese male heroin addicts. Individual SNP analysis and haplotype-based analysis with additional SNPs in the OPRM1 locus showed that mild heroin addiction was associated with the AG genotype, whereas severe heroin addiction was associated with the GG genotype. In vitro studies such as electrophoretic mobility shift assay, minigene, siRNA, and antisense morpholino oligonucleotide studies have identified heterogeneous nuclear ribonucleoprotein H (hnRNPH) as the major binding partner for the G-containing SNP site. The G-to-A transition weakens hnRNPH binding and facilitates exon 2 skipping, leading to altered expressions of OPRM1 splice-variant mRNAs and hMOR-1 proteins. Similar changes in splicing and hMOR-1 proteins were observed in human postmortem prefrontal cortex with the AG genotype of this SNP when compared with the GG genotype. Interestingly, the altered splicing led to an increase in hMOR-1 protein levels despite decreased hMOR-1 mRNA levels, which is likely contributed by a concurrent increase in single transmembrane domain variants that have a chaperone-like function on MOR-1 protein stability. Our studies delineate the role of this SNP as a modifier of OPRM1 alternative splicing via hnRNPH interactions, and suggest a functional link between an SNP-containing splicing modifier and the severity of heroin addiction. PMID:25122903
SNP discovery in candidate adaptive genes using exon capture in a free-ranging alpine ungulate
Roffler, Gretchen H.; Amish, Stephen J.; Smith, Seth; Cosart, Ted F.; Kardos, Marty; Schwartz, Michael K.; Luikart, Gordon
2016-01-01
Identification of genes underlying genomic signatures of natural selection is key to understanding adaptation to local conditions. We used targeted resequencing to identify SNP markers in 5321 candidate adaptive genes associated with known immunological, metabolic and growth functions in ovids and other ungulates. We selectively targeted 8161 exons in protein-coding and nearby 5′ and 3′ untranslated regions of chosen candidate genes. Targeted sequences were taken from bighorn sheep (Ovis canadensis) exon capture data and directly from the domestic sheep genome (Ovis aries v. 3; oviAri3). The bighorn sheep sequences used in the Dall's sheep (Ovis dalli dalli) exon capture aligned to 2350 genes on the oviAri3 genome with an average of 2 exons each. We developed a microfluidic qPCR-based SNP chip to genotype 476 Dall's sheep from locations across their range and test for patterns of selection. Using multiple corroborating approaches (lositan and bayescan), we detected 28 SNP loci potentially under selection. We additionally identified candidate loci significantly associated with latitude, longitude, precipitation and temperature, suggesting local environmental adaptation. The three methods demonstrated consistent support for natural selection on nine genes with immune and disease-regulating functions (e.g. Ovar-DRA, APC, BATF2, MAGEB18), cell regulation signalling pathways (e.g. KRIT1, PI3K, ORRC3), and respiratory health (CYSLTR1). Characterizing adaptive allele distributions from novel genetic techniques will facilitate investigation of the influence of environmental variation on local adaptation of a northern alpine ungulate throughout its range. This research demonstrated the utility of exon capture for gene-targeted SNP discovery and subsequent SNP chip genotyping using low-quality samples in a nonmodel species.
IMG ER: a system for microbial genome annotation expert review and curation.
Markowitz, Victor M; Mavromatis, Konstantinos; Ivanova, Natalia N; Chen, I-Min A; Chu, Ken; Kyrpides, Nikos C
2009-09-01
A rapidly increasing number of microbial genomes are sequenced by organizations worldwide and are eventually included into various public genome data resources. The quality of the annotations depends largely on the original dataset providers, with erroneous or incomplete annotations often carried over into the public resources and difficult to correct. We have developed an Expert Review (ER) version of the Integrated Microbial Genomes (IMG) system, with the goal of supporting systematic and efficient revision of microbial genome annotations. IMG ER provides tools for the review and curation of annotations of both new and publicly available microbial genomes within IMG's rich integrated genome framework. New genome datasets are included into IMG ER prior to their public release either with their native annotations or with annotations generated by IMG ER's annotation pipeline. IMG ER tools allow addressing annotation problems detected with IMG's comparative analysis tools, such as genes missed by gene prediction pipelines or genes without an associated function. Over the past year, IMG ER was used for improving the annotations of about 150 microbial genomes.
A large-scale evaluation of computational protein function prediction
Radivojac, Predrag; Clark, Wyatt T; Ronnen Oron, Tal; Schnoes, Alexandra M; Wittkop, Tobias; Sokolov, Artem; Graim, Kiley; Funk, Christopher; Verspoor, Karin; Ben-Hur, Asa; Pandey, Gaurav; Yunes, Jeffrey M; Talwalkar, Ameet S; Repo, Susanna; Souza, Michael L; Piovesan, Damiano; Casadio, Rita; Wang, Zheng; Cheng, Jianlin; Fang, Hai; Gough, Julian; Koskinen, Patrik; Törönen, Petri; Nokso-Koivisto, Jussi; Holm, Liisa; Cozzetto, Domenico; Buchan, Daniel W A; Bryson, Kevin; Jones, David T; Limaye, Bhakti; Inamdar, Harshal; Datta, Avik; Manjari, Sunitha K; Joshi, Rajendra; Chitale, Meghana; Kihara, Daisuke; Lisewski, Andreas M; Erdin, Serkan; Venner, Eric; Lichtarge, Olivier; Rentzsch, Robert; Yang, Haixuan; Romero, Alfonso E; Bhat, Prajwal; Paccanaro, Alberto; Hamp, Tobias; Kassner, Rebecca; Seemayer, Stefan; Vicedo, Esmeralda; Schaefer, Christian; Achten, Dominik; Auer, Florian; Böhm, Ariane; Braun, Tatjana; Hecht, Maximilian; Heron, Mark; Hönigschmid, Peter; Hopf, Thomas; Kaufmann, Stefanie; Kiening, Michael; Krompass, Denis; Landerer, Cedric; Mahlich, Yannick; Roos, Manfred; Björne, Jari; Salakoski, Tapio; Wong, Andrew; Shatkay, Hagit; Gatzmann, Fanny; Sommer, Ingolf; Wass, Mark N; Sternberg, Michael J E; Škunca, Nives; Supek, Fran; Bošnjak, Matko; Panov, Panče; Džeroski, Sašo; Šmuc, Tomislav; Kourmpetis, Yiannis A I; van Dijk, Aalt D J; ter Braak, Cajo J F; Zhou, Yuanpeng; Gong, Qingtian; Dong, Xinran; Tian, Weidong; Falda, Marco; Fontana, Paolo; Lavezzo, Enrico; Di Camillo, Barbara; Toppo, Stefano; Lan, Liang; Djuric, Nemanja; Guo, Yuhong; Vucetic, Slobodan; Bairoch, Amos; Linial, Michal; Babbitt, Patricia C; Brenner, Steven E; Orengo, Christine; Rost, Burkhard; Mooney, Sean D; Friedberg, Iddo
2013-01-01
Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based Critical Assessment of protein Function Annotation (CAFA) experiment. Fifty-four methods representing the state-of-the-art for protein function prediction were evaluated on a target set of 866 proteins from eleven organisms. Two findings stand out: (i) today’s best protein function prediction algorithms significantly outperformed widely-used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is significant need for improvement of currently available tools. PMID:23353650
Gai, Xiaowu; Perin, Juan C; Murphy, Kevin; O'Hara, Ryan; D'arcy, Monica; Wenocur, Adam; Xie, Hongbo M; Rappaport, Eric F; Shaikh, Tamim H; White, Peter S
2010-02-04
Recent studies have shown that copy number variations (CNVs) are frequent in higher eukaryotes and associated with a substantial portion of inherited and acquired risk for various human diseases. The increasing availability of high-resolution genome surveillance platforms provides opportunity for rapidly assessing research and clinical samples for CNV content, as well as for determining the potential pathogenicity of identified variants. However, few informatics tools for accurate and efficient CNV detection and assessment currently exist. We developed a suite of software tools and resources (CNV Workshop) for automated, genome-wide CNV detection from a variety of SNP array platforms. CNV Workshop includes three major components: detection, annotation, and presentation of structural variants from genome array data. CNV detection utilizes a robust and genotype-specific extension of the Circular Binary Segmentation algorithm, and the use of additional detection algorithms is supported. Predicted CNVs are captured in a MySQL database that supports cohort-based projects and incorporates a secure user authentication layer and user/admin roles. To assist with determination of pathogenicity, detected CNVs are also annotated automatically for gene content, known disease loci, and gene-based literature references. Results are easily queried, sorted, filtered, and visualized via a web-based presentation layer that includes a GBrowse-based graphical representation of CNV content and relevant public data, integration with the UCSC Genome Browser, and tabular displays of genomic attributes for each CNV. To our knowledge, CNV Workshop represents the first cohesive and convenient platform for detection, annotation, and assessment of the biological and clinical significance of structural variants. CNV Workshop has been successfully utilized for assessment of genomic variation in healthy individuals and disease cohorts and is an ideal platform for coordinating multiple associated projects. Available on the web at: http://sourceforge.net/projects/cnv.
RATT: Rapid Annotation Transfer Tool
Otto, Thomas D.; Dillon, Gary P.; Degrave, Wim S.; Berriman, Matthew
2011-01-01
Second-generation sequencing technologies have made large-scale sequencing projects commonplace. However, making use of these datasets often requires gene function to be ascribed genome wide. Although tool development has kept pace with the changes in sequence production, for tasks such as mapping, de novo assembly or visualization, genome annotation remains a challenge. We have developed a method to rapidly provide accurate annotation for new genomes using previously annotated genomes as a reference. The method, implemented in a tool called RATT (Rapid Annotation Transfer Tool), transfers annotations from a high-quality reference to a new genome on the basis of conserved synteny. We demonstrate that a Mycobacterium tuberculosis genome or a single 2.5 Mb chromosome from a malaria parasite can be annotated in less than five minutes with only modest computational resources. RATT is available at http://ratt.sourceforge.net. PMID:21306991
USDA-ARS?s Scientific Manuscript database
The objectives of this study were to evaluate the effect of 68 SNP previously associated with genetic merit for fertility and production on phenotype for reproductive and productive traits in a population of Holstein cows. In addition, we determined which SNP had repeated effects across three studie...
Lee, Chi-Ching; Chen, Yi-Ping Phoebe; Yao, Tzu-Jung; Ma, Cheng-Yu; Lo, Wei-Cheng; Lyu, Ping-Chiang; Tang, Chuan Yi
2013-04-10
Sequencing of microbial genomes is important because of microbial-carrying antibiotic and pathogenetic activities. However, even with the help of new assembling software, finishing a whole genome is a time-consuming task. In most bacteria, pathogenetic or antibiotic genes are carried in genomic islands. Therefore, a quick genomic island (GI) prediction method is useful for ongoing sequencing genomes. In this work, we built a Web server called GI-POP (http://gipop.life.nthu.edu.tw) which integrates a sequence assembling tool, a functional annotation pipeline, and a high-performance GI predicting module, in a support vector machine (SVM)-based method called genomic island genomic profile scanning (GI-GPS). The draft genomes of the ongoing genome projects in contigs or scaffolds can be submitted to our Web server, and it provides the functional annotation and highly probable GI-predicting results. GI-POP is a comprehensive annotation Web server designed for ongoing genome project analysis. Researchers can perform annotation and obtain pre-analytic information include possible GIs, coding/non-coding sequences and functional analysis from their draft genomes. This pre-analytic system can provide useful information for finishing a genome sequencing project. Copyright © 2012 Elsevier B.V. All rights reserved.
Rosier, Arnaud; Mabo, Philippe; Chauvin, Michel; Burgun, Anita
2015-05-01
The patient population benefitting from cardiac implantable electronic devices (CIEDs) is increasing. This study introduces a device annotation method that supports the consistent description of the functional attributes of cardiac devices and evaluates how this method can detect device changes from a CIED registry. We designed the Cardiac Device Ontology, an ontology of CIEDs and device functions. We annotated 146 cardiac devices with this ontology and used it to detect therapy changes with respect to atrioventricular pacing, cardiac resynchronization therapy, and defibrillation capability in a French national registry of patients with implants (STIDEFIX). We then analyzed a set of 6905 device replacements from the STIDEFIX registry. Ontology-based identification of therapy changes (upgraded, downgraded, or similar) was accurate (6905 cases) and performed better than straightforward analysis of the registry codes (F-measure 1.00 versus 0.75 to 0.97). This study demonstrates the feasibility and effectiveness of ontology-based functional annotation of devices in the cardiac domain. Such annotation allowed a better description and in-depth analysis of STIDEFIX. This method was useful for the automatic detection of therapy changes and may be reused for analyzing data from other device registries.
Aokic, Jun-ya; Kawase, Junya; Hamada, Kazuhisa; Fujimoto, Hiroshi; Yamamoto, Ikki; Usuki, Hironori
2018-01-01
Greater amberjack (Seriola dumerili) is distributed in tropical and temperate waters worldwide and is an important aquaculture fish. We carried out de novo sequencing of the greater amberjack genome to construct a reference genome sequence to identify single nucleotide polymorphisms (SNPs) for breeding amberjack by marker-assisted or gene-assisted selection as well as to identify functional genes for biological traits. We obtained 200 times coverage and constructed a high-quality genome assembly using next generation sequencing technology. The assembled sequences were aligned onto a yellowtail (Seriola quinqueradiata) radiation hybrid (RH) physical map by sequence homology. A total of 215 of the longest amberjack sequences, with a total length of 622.8 Mbp (92% of the total length of the genome scaffolds), were lined up on the yellowtail RH map. We resequenced the whole genomes of 20 greater amberjacks and mapped the resulting sequences onto the reference genome sequence. About 186,000 nonredundant SNPs were successfully ordered on the reference genome. Further, we found differences in the genome structural variations between two greater amberjack populations using BreakDancer. We also analyzed the greater amberjack transcriptome and mapped the annotated sequences onto the reference genome sequence. PMID:29785397
SIMPLEX: Cloud-Enabled Pipeline for the Comprehensive Analysis of Exome Sequencing Data
Fischer, Maria; Snajder, Rene; Pabinger, Stephan; Dander, Andreas; Schossig, Anna; Zschocke, Johannes; Trajanoski, Zlatko; Stocker, Gernot
2012-01-01
In recent studies, exome sequencing has proven to be a successful screening tool for the identification of candidate genes causing rare genetic diseases. Although underlying targeted sequencing methods are well established, necessary data handling and focused, structured analysis still remain demanding tasks. Here, we present a cloud-enabled autonomous analysis pipeline, which comprises the complete exome analysis workflow. The pipeline combines several in-house developed and published applications to perform the following steps: (a) initial quality control, (b) intelligent data filtering and pre-processing, (c) sequence alignment to a reference genome, (d) SNP and DIP detection, (e) functional annotation of variants using different approaches, and (f) detailed report generation during various stages of the workflow. The pipeline connects the selected analysis steps, exposes all available parameters for customized usage, performs required data handling, and distributes computationally expensive tasks either on a dedicated high-performance computing infrastructure or on the Amazon cloud environment (EC2). The presented application has already been used in several research projects including studies to elucidate the role of rare genetic diseases. The pipeline is continuously tested and is publicly available under the GPL as a VirtualBox or Cloud image at http://simplex.i-med.ac.at; additional supplementary data is provided at http://www.icbi.at/exome. PMID:22870267
Analyses of MMP20 Missense Mutations in Two Families with Hypomaturation Amelogenesis Imperfecta.
Kim, Youn Jung; Kang, Jenny; Seymen, Figen; Koruyucu, Mine; Gencay, Koray; Shin, Teo Jeon; Hyun, Hong-Keun; Lee, Zang Hee; Hu, Jan C-C; Simmer, James P; Kim, Jung-Wook
2017-01-01
Amelogenesis imperfecta is a group of rare inherited disorders that affect tooth enamel formation, quantitatively and/or qualitatively. The aim of this study was to identify the genetic etiologies of two families presenting with hypomaturation amelogenesis imperfecta. DNA was isolated from peripheral blood samples obtained from participating family members. Whole exome sequencing was performed using DNA samples from the two probands. Sequencing data was aligned to the NCBI human reference genome (NCBI build 37.2, hg19) and sequence variations were annotated with the dbSNP build 138. Mutations in MMP20 were identified in both probands. A homozygous missense mutation (c.678T>A; p.His226Gln) was identified in the consanguineous Family 1. Compound heterozygous MMP20 mutations (c.540T>A, p.Tyr180 * and c.389C>T, p.Thr130Ile) were identified in the non-consanguineous Family 2. Affected persons in Family 1 showed hypomaturation AI with dark brown discoloration, which is similar to the clinical phenotype in a previous report with the same mutation. However, the dentition of the Family 2 proband exhibited slight yellowish discoloration with reduced transparency. Functional analysis showed that the p.Thr130Ile mutant protein had reduced activity of MMP20, while there was no functional MMP20 in the Family 1 proband. These results expand the mutational spectrum of the MMP20 and broaden our understanding of genotype-phenotype correlations in amelogenesis imperfecta.
USDA-ARS?s Scientific Manuscript database
Functional annotations of large plant genome projects mostly provide information on gene function and gene families based on the presence of protein domains and gene homology, but not necessarily in association with gene expression or metabolic and regulatory networks. These additional annotations a...
Arighi, Cecilia; Shamovsky, Veronica; Masci, Anna Maria; Ruttenberg, Alan; Smith, Barry; Natale, Darren A; Wu, Cathy; D'Eustachio, Peter
2015-01-01
The Protein Ontology (PRO) provides terms for and supports annotation of species-specific protein complexes in an ontology framework that relates them both to their components and to species-independent families of complexes. Comprehensive curation of experimentally known forms and annotations thereof is expected to expose discrepancies, differences, and gaps in our knowledge. We have annotated the early events of innate immune signaling mediated by Toll-Like Receptor 3 and 4 complexes in human, mouse, and chicken. The resulting ontology and annotation data set has allowed us to identify species-specific gaps in experimental data and possible functional differences between species, and to employ inferred structural and functional relationships to suggest plausible resolutions of these discrepancies and gaps.
TriAnnot: A Versatile and High Performance Pipeline for the Automated Annotation of Plant Genomes
Leroy, Philippe; Guilhot, Nicolas; Sakai, Hiroaki; Bernard, Aurélien; Choulet, Frédéric; Theil, Sébastien; Reboux, Sébastien; Amano, Naoki; Flutre, Timothée; Pelegrin, Céline; Ohyanagi, Hajime; Seidel, Michael; Giacomoni, Franck; Reichstadt, Mathieu; Alaux, Michael; Gicquello, Emmanuelle; Legeai, Fabrice; Cerutti, Lorenzo; Numa, Hisataka; Tanaka, Tsuyoshi; Mayer, Klaus; Itoh, Takeshi; Quesneville, Hadi; Feuillet, Catherine
2012-01-01
In support of the international effort to obtain a reference sequence of the bread wheat genome and to provide plant communities dealing with large and complex genomes with a versatile, easy-to-use online automated tool for annotation, we have developed the TriAnnot pipeline. Its modular architecture allows for the annotation and masking of transposable elements, the structural, and functional annotation of protein-coding genes with an evidence-based quality indexing, and the identification of conserved non-coding sequences and molecular markers. The TriAnnot pipeline is parallelized on a 712 CPU computing cluster that can run a 1-Gb sequence annotation in less than 5 days. It is accessible through a web interface for small scale analyses or through a server for large scale annotations. The performance of TriAnnot was evaluated in terms of sensitivity, specificity, and general fitness using curated reference sequence sets from rice and wheat. In less than 8 h, TriAnnot was able to predict more than 83% of the 3,748 CDS from rice chromosome 1 with a fitness of 67.4%. On a set of 12 reference Mb-sized contigs from wheat chromosome 3B, TriAnnot predicted and annotated 93.3% of the genes among which 54% were perfectly identified in accordance with the reference annotation. It also allowed the curation of 12 genes based on new biological evidences, increasing the percentage of perfect gene prediction to 63%. TriAnnot systematically showed a higher fitness than other annotation pipelines that are not improved for wheat. As it is easily adaptable to the annotation of other plant genomes, TriAnnot should become a useful resource for the annotation of large and complex genomes in the future. PMID:22645565
Sadahiro, Masato; Erickson, Connor; Lin, Wei-Jye; Shin, Andrew C; Razzoli, Maria; Jiang, Cheng; Fargali, Samira; Gurney, Allison; Kelley, Kevin A; Buettner, Christoph; Bartolomucci, Alessandro; Salton, Stephen R
2015-05-01
Targeted deletion of VGF, a secreted neuronal and endocrine peptide precursor, produces lean, hypermetabolic, and infertile mice that are resistant to diet-, lesion-, and genetically-induced obesity and diabetes. Previous studies suggest that VGF controls energy expenditure (EE), fat storage, and lipolysis, whereas VGF C-terminal peptides also regulate reproductive behavior and glucose homeostasis. To assess the functional equivalence of human VGF(1-615) (hVGF) and mouse VGF(1-617) (mVGF), and to elucidate the function of the VGF C-terminal region in the regulation of energy balance and susceptibility to obesity, we generated humanized VGF knockin mouse models expressing full-length hVGF or a C-terminally deleted human VGF(1-524) (hSNP), encoded by a single nucleotide polymorphism (rs35400704). We show that homozygous male and female hVGF and hSNP mice are fertile. hVGF female mice had significantly increased body weight compared with wild-type mice, whereas hSNP mice have reduced adiposity, increased activity- and nonactivity-related EE, and improved glucose tolerance, indicating that VGF C-terminal peptides are not required for reproductive function, but 1 or more specific VGF C-terminal peptides are likely to be critical regulators of EE. Taken together, our results suggest that human and mouse VGF proteins are largely functionally conserved but that species-specific differences in VGF peptide function, perhaps a result of known differences in receptor binding affinity, likely alter the metabolic phenotype of hVGF compared with mVGF mice, and in hSNP mice in which several C-terminal VGF peptides are ablated, result in significantly increased activity- and nonactivity-related EE.
Sadahiro, Masato; Erickson, Connor; Lin, Wei-Jye; Shin, Andrew C.; Razzoli, Maria; Jiang, Cheng; Fargali, Samira; Gurney, Allison; Kelley, Kevin A.; Buettner, Christoph
2015-01-01
Targeted deletion of VGF, a secreted neuronal and endocrine peptide precursor, produces lean, hypermetabolic, and infertile mice that are resistant to diet-, lesion-, and genetically-induced obesity and diabetes. Previous studies suggest that VGF controls energy expenditure (EE), fat storage, and lipolysis, whereas VGF C-terminal peptides also regulate reproductive behavior and glucose homeostasis. To assess the functional equivalence of human VGF1–615 (hVGF) and mouse VGF1–617 (mVGF), and to elucidate the function of the VGF C-terminal region in the regulation of energy balance and susceptibility to obesity, we generated humanized VGF knockin mouse models expressing full-length hVGF or a C-terminally deleted human VGF1–524 (hSNP), encoded by a single nucleotide polymorphism (rs35400704). We show that homozygous male and female hVGF and hSNP mice are fertile. hVGF female mice had significantly increased body weight compared with wild-type mice, whereas hSNP mice have reduced adiposity, increased activity- and nonactivity-related EE, and improved glucose tolerance, indicating that VGF C-terminal peptides are not required for reproductive function, but 1 or more specific VGF C-terminal peptides are likely to be critical regulators of EE. Taken together, our results suggest that human and mouse VGF proteins are largely functionally conserved but that species-specific differences in VGF peptide function, perhaps a result of known differences in receptor binding affinity, likely alter the metabolic phenotype of hVGF compared with mVGF mice, and in hSNP mice in which several C-terminal VGF peptides are ablated, result in significantly increased activity- and nonactivity-related EE. PMID:25675362
Kamphuis, Lars G; Hane, James K; Nelson, Matthew N; Gao, Lingling; Atkins, Craig A; Singh, Karam B
2015-01-01
Narrow-leafed lupin (NLL; Lupinus angustifolius L.) is an important grain legume crop that is valuable for sustainable farming and is becoming recognized as a human health food. NLL breeding is directed at improving grain production, disease resistance, drought tolerance and health benefits. However, genetic and genomic studies have been hindered by a lack of extensive genomic resources for the species. Here, the generation, de novo assembly and annotation of transcriptome datasets derived from five different NLL tissue types of the reference accession cv. Tanjil are described. The Tanjil transcriptome was compared to transcriptomes of an early domesticated cv. Unicrop, a wild accession P27255, as well as accession 83A:476, together being the founding parents of two recombinant inbred line (RIL) populations. In silico predictions for transcriptome-derived gene-based length and SNP polymorphic markers were conducted and corroborated using a survey assembly sequence for NLL cv. Tanjil. This yielded extensive indel and SNP polymorphic markers for the two RIL populations. A total of 335 transcriptome-derived markers and 66 BAC-end sequence-derived markers were evaluated, and 275 polymorphic markers were selected to genotype the reference NLL 83A:476 × P27255 RIL population. This significantly improved the completeness, marker density and quality of the reference NLL genetic map. PMID:25060816
Qi, Jie; Liu, Xudong; Liu, Jinxiang; Yu, Haiyang; Wang, Wenji; Wang, Zhigang; Zhang, Quanqi
2014-08-01
Ambient temperature is one of the major abiotic environmental factors determining the main parameters of fish vital activity. HSP70 plays an essential role in heat response. In this investigation, the promoter and structure of Paralichthys olivaceus hsp70 (Pohsp70) gene was cloned and predicted. 2558 bp upstream regulatory region of Pohsp70 was annotated with four potential promoter elements and four putative binding sites of transcription factors heat shock elements (HSE, nGAAn) in the upstream of the transcription start site. In addition, one intron with 454 bp in the 5'-noncoding region was found. Quantitative Real Time PCR analysis indicated that the transcript level of Pohsp70 was raised markedly after 1 h by heat shocked. Furthermore, 25 SNPs were identified in Pohsp70 by resequencing, seven of which was associated with heat resistance. In addition, two of the seven SNPs, namely SNP14 and SNP16, were observed in strong linkage disequilibrium. The haplotype with association analysis showed TAGGAG haplotype was more represented in heat susceptible group while (DEL/T) GAATA haplotype was more frequent in heat resistant group. The heat resistant SNPs and haplotype could be candidate markers potentially serving for selective breeding programs of Japanese flounder aimed at improving anti-stress and production. Copyright © 2014 Elsevier Ltd. All rights reserved.
Gao, Yangchun; Li, Shiguo; Zhan, Aibin
2018-04-01
Invasive species cause huge damages to ecology, environment and economy globally. The comprehensive understanding of invasion mechanisms, particularly genetic bases of micro-evolutionary processes responsible for invasion success, is essential for reducing potential damages caused by invasive species. The golden star tunicate, Botryllus schlosseri, has become a model species in invasion biology, mainly owing to its high invasiveness nature and small well-sequenced genome. However, the genome-wide genetic markers have not been well developed in this highly invasive species, thus limiting the comprehensive understanding of genetic mechanisms of invasion success. Using restriction site-associated DNA (RAD) tag sequencing, here we developed a high-quality resource of 14,119 out of 158,821 SNPs for B. schlosseri. These SNPs were relatively evenly distributed at each chromosome. SNP annotations showed that the majority of SNPs (63.20%) were located at intergenic regions, and 21.51% and 14.58% were located at introns and exons, respectively. In addition, the potential use of the developed SNPs for population genomics studies was primarily assessed, such as the estimate of observed heterozygosity (H O ), expected heterozygosity (H E ), nucleotide diversity (π), Wright's inbreeding coefficient (F IS ) and effective population size (Ne). Our developed SNP resource would provide future studies the genome-wide genetic markers for genetic and genomic investigations, such as genetic bases of micro-evolutionary processes responsible for invasion success.
Grötzinger, Stefan W.; Alam, Intikhab; Ba Alawi, Wail; Bajic, Vladimir B.; Stingl, Ulrich; Eppinger, Jörg
2014-01-01
Reliable functional annotation of genomic data is the key-step in the discovery of novel enzymes. Intrinsic sequencing data quality problems of single amplified genomes (SAGs) and poor homology of novel extremophile's genomes pose significant challenges for the attribution of functions to the coding sequences identified. The anoxic deep-sea brine pools of the Red Sea are a promising source of novel enzymes with unique evolutionary adaptation. Sequencing data from Red Sea brine pool cultures and SAGs are annotated and stored in the Integrated Data Warehouse of Microbial Genomes (INDIGO) data warehouse. Low sequence homology of annotated genes (no similarity for 35% of these genes) may translate into false positives when searching for specific functions. The Profile and Pattern Matching (PPM) strategy described here was developed to eliminate false positive annotations of enzyme function before progressing to labor-intensive hyper-saline gene expression and characterization. It utilizes InterPro-derived Gene Ontology (GO)-terms (which represent enzyme function profiles) and annotated relevant PROSITE IDs (which are linked to an amino acid consensus pattern). The PPM algorithm was tested on 15 protein families, which were selected based on scientific and commercial potential. An initial list of 2577 enzyme commission (E.C.) numbers was translated into 171 GO-terms and 49 consensus patterns. A subset of INDIGO-sequences consisting of 58 SAGs from six different taxons of bacteria and archaea were selected from six different brine pool environments. Those SAGs code for 74,516 genes, which were independently scanned for the GO-terms (profile filter) and PROSITE IDs (pattern filter). Following stringent reliability filtering, the non-redundant hits (106 profile hits and 147 pattern hits) are classified as reliable, if at least two relevant descriptors (GO-terms and/or consensus patterns) are present. Scripts for annotation, as well as for the PPM algorithm, are available through the INDIGO website. PMID:24778629
Ethnic differences in microvascular function in apparently healthy South African men and women.
Pienaar, P R; Micklesfield, L K; Gill, J M R; Shore, A C; Gooding, K M; Levitt, N S; Lambert, E V
2014-07-01
Microvascular dysfunction precedes the clinical manifestations of cardiovascular disease. Given the ethnic disparities in cardiovascular disease, we aimed to investigate ethnic differences in microvascular endothelial function in a group of young (18-33 years old), apparently healthy individuals (n = 33, nine Black African, 12 mixed ancestry and 12 Caucasian). Microvascular endothelium-dependent and -independent function was assessed by laser Doppler imagery and iontophoresis of ACh and sodium nitroprusside (SNP), respectively, adjusting for skin resistance. Microvascular reactivity was expressed as maximum absolute perfusion, percentage change from baseline and area under the curve (AUC). Skin resistance was significantly lower in the Caucasian group in response to ACh (Caucasian, mean 0.16 ± 0.03 Ω versus Black, 0.21 ± 0.04 Ω and mixed ancestry, 0.20 ± 0.02 Ω, P < 0.01) and SNP (Caucasian, 0.08 ± 0.01 Ω versus Black, 0.11 ± 0.02 Ω and mixed ancestry, 0.12 ± 0.01 Ω, P < 0.01). Microvascular function in response to ACh was significantly higher in the Caucasian group compared with the other two groups; however, after adjusting for skin resistance these differences were no longer significant. Conversely, the microvascular SNP response remained significantly higher in the Caucasian group, even after adjusting for skin resistance (P < 0.01). Diastolic blood pressure was inversely associated with the AUC of ACh (r = -0.4) and all SNP responses (r = -0.3 to -0.6). Skin resistance was inversely associated with AUC and maximum absolute ACh response (r = -0.59 and -0.64, respectively) and all SNP responses (r = -0.37 to -0.79). Ethnic differences in endothelium-independent microvascular function may contribute to ethnic disparities in cardiovascular disease. Moreover, skin resistance plays a significant role in the interpretation of the microvascular response to outcomes of iontophoresis in a multiethnic group. © 2014 The Authors. Experimental Physiology © 2014 The Physiological Society.
Itoh, Takeshi; Tanaka, Tsuyoshi; Barrero, Roberto A.; Yamasaki, Chisato; Fujii, Yasuyuki; Hilton, Phillip B.; Antonio, Baltazar A.; Aono, Hideo; Apweiler, Rolf; Bruskiewich, Richard; Bureau, Thomas; Burr, Frances; Costa de Oliveira, Antonio; Fuks, Galina; Habara, Takuya; Haberer, Georg; Han, Bin; Harada, Erimi; Hiraki, Aiko T.; Hirochika, Hirohiko; Hoen, Douglas; Hokari, Hiroki; Hosokawa, Satomi; Hsing, Yue; Ikawa, Hiroshi; Ikeo, Kazuho; Imanishi, Tadashi; Ito, Yukiyo; Jaiswal, Pankaj; Kanno, Masako; Kawahara, Yoshihiro; Kawamura, Toshiyuki; Kawashima, Hiroaki; Khurana, Jitendra P.; Kikuchi, Shoshi; Komatsu, Setsuko; Koyanagi, Kanako O.; Kubooka, Hiromi; Lieberherr, Damien; Lin, Yao-Cheng; Lonsdale, David; Matsumoto, Takashi; Matsuya, Akihiro; McCombie, W. Richard; Messing, Joachim; Miyao, Akio; Mulder, Nicola; Nagamura, Yoshiaki; Nam, Jongmin; Namiki, Nobukazu; Numa, Hisataka; Nurimoto, Shin; O’Donovan, Claire; Ohyanagi, Hajime; Okido, Toshihisa; OOta, Satoshi; Osato, Naoki; Palmer, Lance E.; Quetier, Francis; Raghuvanshi, Saurabh; Saichi, Naomi; Sakai, Hiroaki; Sakai, Yasumichi; Sakata, Katsumi; Sakurai, Tetsuya; Sato, Fumihiko; Sato, Yoshiharu; Schoof, Heiko; Seki, Motoaki; Shibata, Michie; Shimizu, Yuji; Shinozaki, Kazuo; Shinso, Yuji; Singh, Nagendra K.; Smith-White, Brian; Takeda, Jun-ichi; Tanino, Motohiko; Tatusova, Tatiana; Thongjuea, Supat; Todokoro, Fusano; Tsugane, Mika; Tyagi, Akhilesh K.; Vanavichit, Apichart; Wang, Aihui; Wing, Rod A.; Yamaguchi, Kaori; Yamamoto, Mayu; Yamamoto, Naoyuki; Yu, Yeisoo; Zhang, Hao; Zhao, Qiang; Higo, Kenichi; Burr, Benjamin; Gojobori, Takashi; Sasaki, Takuji
2007-01-01
We present here the annotation of the complete genome of rice Oryza sativa L. ssp. japonica cultivar Nipponbare. All functional annotations for proteins and non-protein-coding RNA (npRNA) candidates were manually curated. Functions were identified or inferred in 19,969 (70%) of the proteins, and 131 possible npRNAs (including 58 antisense transcripts) were found. Almost 5000 annotated protein-coding genes were found to be disrupted in insertional mutant lines, which will accelerate future experimental validation of the annotations. The rice loci were determined by using cDNA sequences obtained from rice and other representative cereals. Our conservative estimate based on these loci and an extrapolation suggested that the gene number of rice is ∼32,000, which is smaller than previous estimates. We conducted comparative analyses between rice and Arabidopsis thaliana and found that both genomes possessed several lineage-specific genes, which might account for the observed differences between these species, while they had similar sets of predicted functional domains among the protein sequences. A system to control translational efficiency seems to be conserved across large evolutionary distances. Moreover, the evolutionary process of protein-coding genes was examined. Our results suggest that natural selection may have played a role for duplicated genes in both species, so that duplication was suppressed or favored in a manner that depended on the function of a gene. PMID:17210932
Médigue, Claudine; Calteau, Alexandra; Cruveiller, Stéphane; Gachet, Mathieu; Gautreau, Guillaume; Josso, Adrien; Lajus, Aurélie; Langlois, Jordan; Pereira, Hugo; Planel, Rémi; Roche, David; Rollin, Johan; Rouy, Zoe; Vallenet, David
2017-09-12
The overwhelming list of new bacterial genomes becoming available on a daily basis makes accurate genome annotation an essential step that ultimately determines the relevance of thousands of genomes stored in public databanks. The MicroScope platform (http://www.genoscope.cns.fr/agc/microscope) is an integrative resource that supports systematic and efficient revision of microbial genome annotation, data management and comparative analysis. Starting from the results of our syntactic, functional and relational annotation pipelines, MicroScope provides an integrated environment for the expert annotation and comparative analysis of prokaryotic genomes. It combines tools and graphical interfaces to analyze genomes and to perform the manual curation of gene function in a comparative genomics and metabolic context. In this article, we describe the free-of-charge MicroScope services for the annotation and analysis of microbial (meta)genomes, transcriptomic and re-sequencing data. Then, the functionalities of the platform are presented in a way providing practical guidance and help to the nonspecialists in bioinformatics. Newly integrated analysis tools (i.e. prediction of virulence and resistance genes in bacterial genomes) and original method recently developed (the pan-genome graph representation) are also described. Integrated environments such as MicroScope clearly contribute, through the user community, to help maintaining accurate resources. © The Author 2017. Published by Oxford University Press.
Narsai, Reena; Devenish, James; Castleden, Ian; Narsai, Kabir; Xu, Lin; Shou, Huixia; Whelan, James
2013-01-01
Omics research in Oryza sativa (rice) relies on the use of multiple databases to obtain different types of information to define gene function. We present Rice DB, an Oryza information portal that is a functional genomics database, linking gene loci to comprehensive annotations, expression data and the subcellular location of encoded proteins. Rice DB has been designed to integrate the direct comparison of rice with Arabidopsis (Arabidopsis thaliana), based on orthology or ‘expressology’, thus using and combining available information from two pre-eminent plant models. To establish Rice DB, gene identifiers (more than 40 types) and annotations from a variety of sources were compiled, functional information based on large-scale and individual studies was manually collated, hundreds of microarrays were analysed to generate expression annotations, and the occurrences of potential functional regulatory motifs in promoter regions were calculated. A range of computational subcellular localization predictions were also run for all putative proteins encoded in the rice genome, and experimentally confirmed protein localizations have been collated, curated and linked to functional studies in rice. A single search box allows anything from gene identifiers (for rice and/or Arabidopsis), motif sequences, subcellular location, to keyword searches to be entered, with the capability of Boolean searches (such as AND/OR). To demonstrate the utility of Rice DB, several examples are presented including a rice mitochondrial proteome, which draws on a variety of sources for subcellular location data within Rice DB. Comparisons of subcellular location, functional annotations, as well as transcript expression in parallel with Arabidopsis reveals examples of conservation between rice and Arabidopsis, using Rice DB (http://ricedb.plantenergy.uwa.edu.au). PMID:24147765
Narsai, Reena; Devenish, James; Castleden, Ian; Narsai, Kabir; Xu, Lin; Shou, Huixia; Whelan, James
2013-12-01
Omics research in Oryza sativa (rice) relies on the use of multiple databases to obtain different types of information to define gene function. We present Rice DB, an Oryza information portal that is a functional genomics database, linking gene loci to comprehensive annotations, expression data and the subcellular location of encoded proteins. Rice DB has been designed to integrate the direct comparison of rice with Arabidopsis (Arabidopsis thaliana), based on orthology or 'expressology', thus using and combining available information from two pre-eminent plant models. To establish Rice DB, gene identifiers (more than 40 types) and annotations from a variety of sources were compiled, functional information based on large-scale and individual studies was manually collated, hundreds of microarrays were analysed to generate expression annotations, and the occurrences of potential functional regulatory motifs in promoter regions were calculated. A range of computational subcellular localization predictions were also run for all putative proteins encoded in the rice genome, and experimentally confirmed protein localizations have been collated, curated and linked to functional studies in rice. A single search box allows anything from gene identifiers (for rice and/or Arabidopsis), motif sequences, subcellular location, to keyword searches to be entered, with the capability of Boolean searches (such as AND/OR). To demonstrate the utility of Rice DB, several examples are presented including a rice mitochondrial proteome, which draws on a variety of sources for subcellular location data within Rice DB. Comparisons of subcellular location, functional annotations, as well as transcript expression in parallel with Arabidopsis reveals examples of conservation between rice and Arabidopsis, using Rice DB (http://ricedb.plantenergy.uwa.edu.au). © 2013 The Authors The Plant Journal © 2013 John Wiley & Sons Ltd.
Protein sequence annotation in the genome era: the annotation concept of SWISS-PROT+TREMBL.
Apweiler, R; Gateau, A; Contrino, S; Martin, M J; Junker, V; O'Donovan, C; Lang, F; Mitaritonna, N; Kappus, S; Bairoch, A
1997-01-01
SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation, a minimal level of redundancy and high level of integration with other databases. Ongoing genome sequencing projects have dramatically increased the number of protein sequences to be incorporated into SWISS-PROT. Since we do not want to dilute the quality standards of SWISS-PROT by incorporating sequences without proper sequence analysis and annotation, we cannot speed up the incorporation of new incoming data indefinitely. However, as we also want to make the sequences available as fast as possible, we introduced TREMBL (TRanslation of EMBL nucleotide sequence database), a supplement to SWISS-PROT. TREMBL consists of computer-annotated entries in SWISS-PROT format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except for CDS already included in SWISS-PROT. While TREMBL is already of immense value, its computer-generated annotation does not match the quality of SWISS-PROTs. The main difference is in the protein functional information attached to sequences. With this in mind, we are dedicating substantial effort to develop and apply computer methods to enhance the functional information attached to TREMBL entries.
The Biological Reference Repository (BioR): a rapid and flexible system for genomics annotation.
Kocher, Jean-Pierre A; Quest, Daniel J; Duffy, Patrick; Meiners, Michael A; Moore, Raymond M; Rider, David; Hossain, Asif; Hart, Steven N; Dinu, Valentin
2014-07-01
The Biological Reference Repository (BioR) is a toolkit for annotating variants. BioR stores public and user-specific annotation sources in indexed JSON-encoded flat files (catalogs). The BioR toolkit provides the functionality to combine and retrieve annotation from these catalogs via the command-line interface. Several catalogs from commonly used annotation sources and instructions for creating user-specific catalogs are provided. Commands from the toolkit can be combined with other UNIX commands for advanced annotation processing. We also provide instructions for the development of custom annotation pipelines. The package is implemented in Java and makes use of external tools written in Java and Perl. The toolkit can be executed on Mac OS X 10.5 and above or any Linux distribution. The BioR application, quickstart, and user guide documents and many biological examples are available at http://bioinformaticstools.mayo.edu. © The Author 2014. Published by Oxford University Press.
Silver nano fabrication using leaf disc of Passiflora foetida Linn
NASA Astrophysics Data System (ADS)
Lade, Bipin D.; Patil, Anita S.
2017-06-01
The main purpose of the experiment is to develop a greener low cost SNP fabrication steps using factories of secondary metabolites from Passiflora leaf extract. Here, the leaf extraction process is omitted, and instead a leaf disc was used for stable SNP fabricated by optimizing parameters such as a circular leaf disc of 2 cm (1, 2, 3, 4, 5) instead of leaf extract and grade of pH (7, 8, 9, 11). The SNP synthesis reaction is tried under room temperature, sun, UV and dark condition. The leaf disc preparation steps are also discussed in details. The SNP obtained using (1 mM: 100 ml AgNO3+ singular leaf disc: pH 9, 11) is applied against featured room temperature and sun condition. The UV spectroscopic analysis confirms that sun rays synthesized SNP yields stable nano particles. The FTIR analysis confirms a large number of functional groups such as alkanes, alkyne, amines, aliphatic amine, carboxylic acid; nitro-compound, alcohol, saturated aldehyde and phenols involved in reduction of silver salt to zero valent ions. The leaf disc mediated synthesis of silver nanoparticles, minimizes leaf extract preparation step and eligible for stable SNP synthesis. The methods sun and room temperature based nano particles synthesized within 10 min would be use certainly for antimicrobial activity.
Dictionary-driven protein annotation.
Rigoutsos, Isidore; Huynh, Tien; Floratos, Aris; Parida, Laxmi; Platt, Daniel
2002-09-01
Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/ bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were released publicly after we built the Bio-Dictionary that is used in our experiments. Finally, we have computed the annotations of more than 70 complete genomes and made them available on the World Wide Web at http://cbcsrv.watson.ibm.com/Annotations/.
Alexiadis, Anastasios; Delidakis, Christos; Kalantidis, Kriton
2017-07-01
The conserved 3'-5' RNA exonuclease ERI1 is implicated in RNA interference inhibition, 5.8S rRNA maturation and histone mRNA maturation and turnover. The single ERI1 homologue in Drosophila melanogaster Snipper (Snp) is a 3'-5' exonuclease, but its in vivo function remains elusive. Here, we report Snp requirement for normal Drosophila development, since its perturbation leads to larval arrest and tissue-specific downregulation results in abnormal tissue development. Additionally, Snp directly interacts with histone mRNA, and its depletion results in drastic reduction in histone transcript levels. We propose that Snp protects the 3'-ends of histone mRNAs and upon its absence, histone transcripts are readily degraded. This in turn may lead to cell cycle delay or arrest, causing growth arrest and developmental perturbations. © 2017 Federation of European Biochemical Societies.
Livingstone, Donald; Stack, Conrad; Mustiga, Guiliana M; Rodezno, Dayana C; Suarez, Carmen; Amores, Freddy; Feltus, Frank A; Mockaitis, Keithanne; Cornejo, Omar E; Motamayor, Juan C
2017-01-01
Cacao ( Theobroma cacao L.) is an important cash crop in tropical regions around the world and has a rich agronomic history in South America. As a key component in the cosmetic and confectionary industries, millions of people worldwide use products made from cacao, ranging from shampoo to chocolate. An Illumina Infinity II array was created using 13,530 SNPs identified within a small diversity panel of cacao. Of these SNPs, 12,643 derive from variation within annotated cacao genes. The genotypes of 3,072 trees were obtained, including two mapping populations from Ecuador. High-density linkage maps for these two populations were generated and compared to the cacao genome assembly. Phenotypic data from these populations were combined with the linkage maps to identify the QTLs for yield and disease resistance.
Ajayi, Oyeyemi O; Adefenwa, Mufliat A; Agaviezor, Brilliant O; Ikeobi, Christian O N; Wheto, Matthew; Okpeku, Moses; Amusan, Samuel A; Yakubu, Abdulmojeed; De Donato, Marcos; Peters, Sunday O; Imumorin, Ikhide G
2014-02-01
The tenascin-XB (TNXB) gene has antiadhesive effects, functions in matrix maturation in connective tissues, and localizes to the major histocompatibility complex class III region. We hypothesized that it may influence adaptive physiological response through an effect on blood vessel function. We identified a novel g.1324 A→G polymorphism at a TaqI recognition site in a 454 bp fragment of ovine TNXB and genotyped it in 150 Nigerian sheep using PCR-RFLP. The missense mutation changes glutamic acid (GAA) to glycine (GGA). Among SNP genotypes, significant differences (P < 0.05) were observed in body weight and fore cannon bone length. Interaction effects of breed, SNP genotype, and geographic location had a significant effect (P < 0.05) on chest girth. The SNP genotype was significantly (P < 0.05) associated with physiological traits of pulse rate and skin temperature. The observed effect of this novel polymorphism may be mediated through its role in connective tissue biology, requiring further association and functional studies.
Jadhav, Sachin; Sattar, Naveed; Petrie, John R; Cobbe, Stuart M; Ferrell, William R
2007-09-01
Interrogation of peripheral vascular function is increasingly recognized as a noninvasive surrogate marker for coronary vascular function and carries with it important prognostic information regarding future cardiovascular risk. Laser Doppler imaging (LDI) is a completely noninvasive method for looking at peripheral microvascular function. We sought to look at reproducibility and repeatability of LDI-derived assessment of peripheral microvascular function between arms and 8 weeks apart. We used LDI in conjunction with iontophoretic application of ACh and SNP to look at endothelium-dependent and -independent microvascular function, respectively, in a mixture of women with cardiac syndrome X and healthy volunteers. We looked at variation between arms (n = 40) and variation at 8 weeks apart (n = 22). When measurements were corrected for skin resistance, there was nonsignificant variation between arms for ACh (2.7%) and SNP (3.8%) and nonsignificant temporal variation for ACh (3.5%) and SNP (4.7%). Construction of Bland-Altman plots reinforce that measurements have good repeatability. Elimination of the baseline perfusion response had deleterious effects on repeatability. LDI can be used to assess peripheral vascular response with good repeatability as long as measurements are corrected for skin resistance, which affects drug delivery. This has important implications for the future use of LDI.
Chado controller: advanced annotation management with a community annotation system.
Guignon, Valentin; Droc, Gaëtan; Alaux, Michael; Baurens, Franc-Christophe; Garsmeur, Olivier; Poiron, Claire; Carver, Tim; Rouard, Mathieu; Bocs, Stéphanie
2012-04-01
We developed a controller that is compliant with the Chado database schema, GBrowse and genome annotation-editing tools such as Artemis and Apollo. It enables the management of public and private data, monitors manual annotation (with controlled vocabularies, structural and functional annotation controls) and stores versions of annotation for all modified features. The Chado controller uses PostgreSQL and Perl. The Chado Controller package is available for download at http://www.gnpannot.org/content/chado-controller and runs on any Unix-like operating system, and documentation is available at http://www.gnpannot.org/content/chado-controller-doc The system can be tested using the GNPAnnot Sandbox at http://www.gnpannot.org/content/gnpannot-sandbox-form valentin.guignon@cirad.fr; stephanie.sidibe-bocs@cirad.fr Supplementary data are available at Bioinformatics online.
ERIC Educational Resources Information Center
Kimmel, Melvin J.
This technical report provides an annotated bibliography of senior leadership literature with an emphasis on necessary skills and functions. It was compiled through a literature search that determined the state of the art of research and theory on senior leadership skills, functions, activities, and other job related characteristics. One hundred…
Functional evaluation of genetic variants associated with endometriosis near GREB1.
Fung, Jenny N; Holdsworth-Carson, Sarah J; Sapkota, Yadav; Zhao, Zhen Zhen; Jones, Lincoln; Girling, Jane E; Paiva, Premila; Healey, Martin; Nyholt, Dale R; Rogers, Peter A W; Montgomery, Grant W
2015-05-01
Do DNA variants in the growth regulation by estrogen in breast cancer 1 (GREB1) region regulate endometrial GREB1 expression and increase the risk of developing endometriosis in women? We identified new single nucleotide polymorphisms (SNPs) with strong association with endometriosis at the GREB1 locus although we did not detect altered GREB1 expression in endometriosis patients with defined genotypes. Genome-wide association studies have identified the GREB1 region on chromosome 2p25.1 for increasing endometriosis risk. The differential expression of GREB1 has also been reported by others in association with endometriosis disease phenotype. Fine mapping studies comprehensively evaluated SNPs within the GREB1 region in a large-scale data set (>2500 cases and >4000 controls). Publicly available bioinformatics tools were employed to functionally annotate SNPs showing the strongest association signal with endometriosis risk. Endometrial GREB1 mRNA and protein expression was studied with respect to phases of the menstrual cycle (n = 2-45 per cycle stage) and expression quantitative trait loci (eQTL) analysis for significant SNPs were undertaken for GREB1 [mRNA (n = 94) and protein (n = 44) in endometrium]. Participants in this study are females who provided blood and/or endometrial tissue samples in a hospital setting. The key SNPs were genotyped using Sequenom MassARRAY. The functional roles and regulatory annotations for identified SNPs are predicted by various publicly available bioinformatics tools. Endometrial GREB1 expression work employed qRT-PCR, western blotting and immunohistochemistry studies. Fine mapping results identified a number of SNPs showing stronger association (0.004 < P < 0.032) with endometriosis risk than the original GWAS SNP (rs13394619) (P = 0.034). Some of these SNPs were predicted to have functional roles, for example, interaction with transcription factor motifs. The haplotype (a combination of alleles) formed by the risk alleles from two common SNPs showed significant association (P = 0.026) with endometriosis and epistasis analysis showed no evidence for interaction between the two SNPs, suggesting an additive effect of SNPs on endometriosis risk. In normal human endometrium, GREB1 protein expression was altered depending on the cycle stage (significantly different in late proliferative versus late secretory, P < 0.05) and cell type (glandular epithelium, not stromal cells). However, GREB1 expression in endometriosis cases versus controls and eQTL analyses did not reveal any significant changes. In silico prediction tools are generally based on cell lines different to our tissue and disease of interest. Functional annotations drawn from these analyses should be considered with this limitation in mind. We identified cell-specific and hormone-specific changes in GREB1 protein expression. The lack of a significant difference observed following our GREB1 expression studies may be the result of moderate power on mixed cell populations in the endometrial tissue samples. This study further implicates the GREB1 region on chromosome 2p25.1 and the GREB1 gene with involvement in endometriosis risk. More detailed functional studies are required to determine the role of the novel GREB1 transcripts in endometriosis pathophysiology. Funding for this work was provided by NHMRC Project Grants APP1012245, APP1026033, APP1049472 and APP1046880. There are no competing interests. © The Author 2015. Published by Oxford University Press on behalf of the European Society of Human Reproduction and Embryology. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
Duellman, Tyler; Warren, Christopher; Yang, Jay
2014-01-01
Microribonucleic acids (miRNAs) work with exquisite specificity and are able to distinguish a target from a non-target based on a single nucleotide mismatch in the core nucleotide domain. We questioned whether miRNA regulation of gene expression could occur in a single nucleotide polymorphism (SNP)-specific manner, manifesting as a post-transcriptional control of expression of genetic polymorphisms. In our recent study of the functional consequences of matrix metalloproteinase (MMP)-9 SNPs, we discovered that expression of a coding exon SNP in the pro-domain of the protein resulted in a profound decrease in the secreted protein. This missense SNP results in the N38S amino acid change and a loss of an N-glycosylation site. A systematic study demonstrated that the loss of secreted protein was due not to the loss of an N-glycosylation site, but rather an SNP-specific targeting by miR-671-3p and miR-657. Bioinformatics analysis identified 41 SNP-specific miRNA targeting MMP-9 SNPs, mostly in the coding exon and an extension of the analysis to chromosome 20, where the MMP-9 gene is located, suggesting that SNP-specific miRNAs targeting the coding exon are prevalent. This selective post-transcriptional regulation of a target messenger RNA harboring genetic polymorphisms by miRNAs offers an SNP-dependent post-transcriptional regulatory mechanism, allowing for polymorphic-specific differential gene regulation. PMID:24627221
Prioritizing individual genetic variants after kernel machine testing using variable selection.
He, Qianchuan; Cai, Tianxi; Liu, Yang; Zhao, Ni; Harmon, Quaker E; Almli, Lynn M; Binder, Elisabeth B; Engel, Stephanie M; Ressler, Kerry J; Conneely, Karen N; Lin, Xihong; Wu, Michael C
2016-12-01
Kernel machine learning methods, such as the SNP-set kernel association test (SKAT), have been widely used to test associations between traits and genetic polymorphisms. In contrast to traditional single-SNP analysis methods, these methods are designed to examine the joint effect of a set of related SNPs (such as a group of SNPs within a gene or a pathway) and are able to identify sets of SNPs that are associated with the trait of interest. However, as with many multi-SNP testing approaches, kernel machine testing can draw conclusion only at the SNP-set level, and does not directly inform on which one(s) of the identified SNP set is actually driving the associations. A recently proposed procedure, KerNel Iterative Feature Extraction (KNIFE), provides a general framework for incorporating variable selection into kernel machine methods. In this article, we focus on quantitative traits and relatively common SNPs, and adapt the KNIFE procedure to genetic association studies and propose an approach to identify driver SNPs after the application of SKAT to gene set analysis. Our approach accommodates several kernels that are widely used in SNP analysis, such as the linear kernel and the Identity by State (IBS) kernel. The proposed approach provides practically useful utilities to prioritize SNPs, and fills the gap between SNP set analysis and biological functional studies. Both simulation studies and real data application are used to demonstrate the proposed approach. © 2016 WILEY PERIODICALS, INC.
Aviation medicine translations : annotated bibliography of recently translated material, III.
DOT National Transportation Integrated Search
1965-04-01
An annotated bibliography of translations of foreignlanguage research articles is presented. The 26 listed entries are concerned with studies of aviation medicine, periodicity, optokinetic nystagmus, vision, vestibular function, and physical science....
Current challenges in genome annotation through structural biology and bioinformatics.
Furnham, Nicholas; de Beer, Tjaart A P; Thornton, Janet M
2012-10-01
With the huge volume in genomic sequences being generated from high-throughout sequencing projects the requirement for providing accurate and detailed annotations of gene products has never been greater. It is proving to be a huge challenge for computational biologists to use as much information as possible from experimental data to provide annotations for genome data of unknown function. A central component to this process is to use experimentally determined structures, which provide a means to detect homology that is not discernable from just the sequence and permit the consequences of genomic variation to be realized at the molecular level. In particular, structures also form the basis of many bioinformatics methods for improving the detailed functional annotations of enzymes in combination with similarities in sequence and chemistry. Copyright © 2012. Published by Elsevier Ltd.
GAP Final Technical Report 12-14-04
DOE Office of Scientific and Technical Information (OSTI.GOV)
Andrew J. Bordner, PhD, Senior Research Scientist
2004-12-14
The Genomics Annotation Platform (GAP) was designed to develop new tools for high throughput functional annotation and characterization of protein sequences and structures resulting from genomics and structural proteomics, benchmarking and application of those tools. Furthermore, this platform integrated the genomic scale sequence and structural analysis and prediction tools with the advanced structure prediction and bioinformatics environment of ICM. The development of GAP was primarily oriented towards the annotation of new biomolecular structures using both structural and sequence data. Even though the amount of protein X-ray crystal data is growing exponentially, the volume of sequence data is growing even moremore » rapidly. This trend was exploited by leveraging the wealth of sequence data to provide functional annotation for protein structures. The additional information provided by GAP is expected to assist the majority of the commercial users of ICM, who are involved in drug discovery, in identifying promising drug targets as well in devising strategies for the rational design of therapeutics directed at the protein of interest. The GAP also provided valuable tools for biochemistry education, and structural genomics centers. In addition, GAP incorporates many novel prediction and analysis methods not available in other molecular modeling packages. This development led to signing the first Molsoft agreement in the structural genomics annotation area with the University of oxford Structural Genomics Center. This commercial agreement validated the Molsoft efforts under the GAP project and provided the basis for further development of the large scale functional annotation platform.« less
2010-01-01
Background Thoroughbred horses have been selected for traits contributing to speed and stamina for centuries. It is widely recognized that inherited variation in physical and physiological characteristics is responsible for variation in individual aptitude for race distance, and that muscle phenotypes in particular are important. Results A genome-wide SNP-association study for optimum racing distance was performed using the EquineSNP50 Bead Chip genotyping array in a cohort of n = 118 elite Thoroughbred racehorses divergent for race distance aptitude. In a cohort-based association test we evaluated genotypic variation at 40,977 SNPs between horses suited to short distance (≤ 8 f) and middle-long distance (> 8 f) races. The most significant SNP was located on chromosome 18: BIEC2-417495 ~690 kb from the gene encoding myostatin (MSTN) [Punadj. = 6.96 × 10-6]. Considering best race distance as a quantitative phenotype, a peak of association on chromosome 18 (chr18:65809482-67545806) comprising eight SNPs encompassing a 1.7 Mb region was observed. Again, similar to the cohort-based analysis, the most significant SNP was BIEC2-417495 (Punadj. = 1.61 × 10-9; PBonf. = 6.58 × 10-5). In a candidate gene study we have previously reported a SNP (g.66493737C>T) in MSTN associated with best race distance in Thoroughbreds; however, its functional and genome-wide relevance were uncertain. Additional re-sequencing in the flanking regions of the MSTN gene revealed four novel 3' UTR SNPs and a 227 bp SINE insertion polymorphism in the 5' UTR promoter sequence. Linkage disequilibrium was highest between g.66493737C>T and BIEC2-417495 (r2 = 0.86). Conclusions Comparative association tests consistently demonstrated the g.66493737C>T SNP as the superior variant in the prediction of distance aptitude in racehorses (g.66493737C>T, P = 1.02 × 10-10; BIEC2-417495, Punadj. = 1.61 × 10-9). Functional investigations will be required to determine whether this polymorphism affects putative transcription-factor binding and gives rise to variation in gene and protein expression. Nonetheless, this study demonstrates that the g.66493737C>T SNP provides the most powerful genetic marker for prediction of race distance aptitude in Thoroughbreds. PMID:20932346
Hill, Emmeline W; McGivney, Beatrice A; Gu, Jingjing; Whiston, Ronan; Machugh, David E
2010-10-11
Thoroughbred horses have been selected for traits contributing to speed and stamina for centuries. It is widely recognized that inherited variation in physical and physiological characteristics is responsible for variation in individual aptitude for race distance, and that muscle phenotypes in particular are important. A genome-wide SNP-association study for optimum racing distance was performed using the EquineSNP50 Bead Chip genotyping array in a cohort of n = 118 elite Thoroughbred racehorses divergent for race distance aptitude. In a cohort-based association test we evaluated genotypic variation at 40,977 SNPs between horses suited to short distance (≤ 8 f) and middle-long distance (> 8 f) races. The most significant SNP was located on chromosome 18: BIEC2-417495 ~690 kb from the gene encoding myostatin (MSTN) [P(unadj.) = 6.96 x 10⁻⁶]. Considering best race distance as a quantitative phenotype, a peak of association on chromosome 18 (chr18:65809482-67545806) comprising eight SNPs encompassing a 1.7 Mb region was observed. Again, similar to the cohort-based analysis, the most significant SNP was BIEC2-417495 (P(unadj.) = 1.61 x 10⁻⁹; P(Bonf.) = 6.58 x 10⁻⁵). In a candidate gene study we have previously reported a SNP (g.66493737C>T) in MSTN associated with best race distance in Thoroughbreds; however, its functional and genome-wide relevance were uncertain. Additional re-sequencing in the flanking regions of the MSTN gene revealed four novel 3' UTR SNPs and a 227 bp SINE insertion polymorphism in the 5' UTR promoter sequence. Linkage disequilibrium was highest between g.66493737C>T and BIEC2-417495 (r² = 0.86). Comparative association tests consistently demonstrated the g.66493737C>T SNP as the superior variant in the prediction of distance aptitude in racehorses (g.66493737C>T, P = 1.02 x 10⁻¹⁰; BIEC2-417495, P(unadj.) = 1.61 x 10⁻⁹). Functional investigations will be required to determine whether this polymorphism affects putative transcription-factor binding and gives rise to variation in gene and protein expression. Nonetheless, this study demonstrates that the g.66493737C>T SNP provides the most powerful genetic marker for prediction of race distance aptitude in Thoroughbreds.
ERIC Educational Resources Information Center
Nist, Sherrie L.
Of all the effective strategies available to college developmental reading students, annotating (noting important ideas or examples in text margins) and underlining have the widest appeal among students and the most practical application in any course. Annotating/underlining serves a dual function: students can isolate key ideas at the time of the…
Mudgal, Richa; Srinivasan, Narayanaswamy; Chandra, Nagasuma
2017-07-01
Functional annotation is seldom straightforward with complexities arising due to functional divergence in protein families or functional convergence between non-homologous protein families, leading to mis-annotations. An enzyme may contain multiple domains and not all domains may be involved in a given function, adding to the complexity in function annotation. To address this, we use binding site information from bound cognate ligands and catalytic residues, since it can help in resolving fold-function relationships at a finer level and with higher confidence. A comprehensive database of 2,020 fold-function-binding site relationships has been systematically generated. A network-based approach is employed to capture the complexity in these relationships, from which different types of associations are deciphered, that identify versatile protein folds performing diverse functions, same function associated with multiple folds and one-to-one relationships. Binding site similarity networks integrated with fold, function, and ligand similarity information are generated to understand the depth of these relationships. Apart from the observed continuity in the functional site space, network properties of these revealed versatile families with topologically different or dissimilar binding sites and structural families that perform very similar functions. As a case study, subtle changes in the active site of a set of evolutionarily related superfamilies are studied using these networks. Tracing of such similarities in evolutionarily related proteins provide clues into the transition and evolution of protein functions. Insights from this study will be helpful in accurate and reliable functional annotations of uncharacterized proteins, poly-pharmacology, and designing enzymes with new functional capabilities. Proteins 2017; 85:1319-1335. © 2017 Wiley Periodicals, Inc. © 2017 Wiley Periodicals, Inc.
Improved annotation through genome-scale metabolic modeling of Aspergillus oryzae
Vongsangnak, Wanwipa; Olsen, Peter; Hansen, Kim; Krogsgaard, Steen; Nielsen, Jens
2008-01-01
Background Since ancient times the filamentous fungus Aspergillus oryzae has been used in the fermentation industry for the production of fermented sauces and the production of industrial enzymes. Recently, the genome sequence of A. oryzae with 12,074 annotated genes was released but the number of hypothetical proteins accounted for more than 50% of the annotated genes. Considering the industrial importance of this fungus, it is therefore valuable to improve the annotation and further integrate genomic information with biochemical and physiological information available for this microorganism and other related fungi. Here we proposed the gene prediction by construction of an A. oryzae Expressed Sequence Tag (EST) library, sequencing and assembly. We enhanced the function assignment by our developed annotation strategy. The resulting better annotation was used to reconstruct the metabolic network leading to a genome scale metabolic model of A. oryzae. Results Our assembled EST sequences we identified 1,046 newly predicted genes in the A. oryzae genome. Furthermore, it was possible to assign putative protein functions to 398 of the newly predicted genes. Noteworthy, our annotation strategy resulted in assignment of new putative functions to 1,469 hypothetical proteins already present in the A. oryzae genome database. Using the substantially improved annotated genome we reconstructed the metabolic network of A. oryzae. This network contains 729 enzymes, 1,314 enzyme-encoding genes, 1,073 metabolites and 1,846 (1,053 unique) biochemical reactions. The metabolic reactions are compartmentalized into the cytosol, the mitochondria, the peroxisome and the extracellular space. Transport steps between the compartments and the extracellular space represent 281 reactions, of which 161 are unique. The metabolic model was validated and shown to correctly describe the phenotypic behavior of A. oryzae grown on different carbon sources. Conclusion A much enhanced annotation of the A. oryzae genome was performed and a genome-scale metabolic model of A. oryzae was reconstructed. The model accurately predicted the growth and biomass yield on different carbon sources. The model serves as an important resource for gaining further insight into our understanding of A. oryzae physiology. PMID:18500999
Ordovás, Laura; Roy, Rosa; Pampín, Sandra; Zaragoza, Pilar; Osta, Rosario; Rodríguez-Rey, Jose Carlos; Rodellar, Clementina
2008-07-15
Fatty acid synthase (FASN) is an enzyme that catalyzes de novo synthesis of fatty acids in cells. The bovine FASN gene maps to BTA 19, where several quantitative trait loci for fat-related traits have been described. Our group recently reported the identification of a single nucleotide polymorphism (SNP), g.763G>C, in the bovine FASN 5' flanking region that was significantly associated with milk fat content in dairy cattle. The g.763G>C SNP was part of a GC-rich region that may constitute a cis element for members of the Sp transcription factor family. Thus the SNP could alter the transcription factor binding ability of the FASN promoter and consequently affect the promoter activity of the gene. However, the functional consequences of the SNP on FASN gene expression are unknown. The present study was therefore directed at elucidating the underlying molecular mechanism that could explain the association of the SNP with milk fat content. Three cellular lines (3T3L1, HepG2, and MCF-7) were used to test the promoter and the transcription factor binding activities by luciferase reporter assays and electrophoretic mobility shift assays, respectively. Band shift assays were also carried out with nuclear extracts from lactating mammary gland (LMG) to further investigate the role of the SNP in this tissue. Our results demonstrate that the SNP alters the bovine FASN promoter activity in vitro and the Sp1/Sp3 binding ability of the sequence. In bovine LMG, the specific binding of Sp3 may account for the association with milk fat content.
Leyva-Corona, Jose C; Reyna-Granados, Javier R; Zamorano-Algandar, Ricardo; Sanchez-Castro, Miguel A; Thomas, Milton G; Enns, R Mark; Speidel, Scott E; Medrano, Juan F; Rincon, Gonzalo; Luna-Nevarez, Pablo
2018-06-20
Prolactin (PRL), growth hormone (GH), and insulin-like growth factor-1 (IGF-1) are in hormone-response pathways involved in energy metabolism during thermoregulation processes in cattle. Objective herein was to study the association between single nucleotide polymorphisms (SNP) within genes of the PRL and GH/IGF-1 pathways with fertility traits such as services per conception (SPC) and days open (DO) in Holstein cattle lactating under a hot-humid climate. Ambient temperature and relative humidity were used to calculate the temperature-humidity index (THI) which revealed that the cows were exposed to heat stress conditions from June to November of 2012 in southern Sonora, Mexico. Individual blood samples from all cows were collected, spotted on FTA cards, and used to genotype a 179 tag SNP panel within 44 genes from the PRL and GH/IGF-1 pathways. The associative analyses among SNP genotypes and fertility traits were performed using mixed-effect models. Allele substitution effects were calculated using a regression model that included the genotype term as covariate. Single-SNP association analyses indicated that eight SNP within the genes IGF-1, IGF-1R, IGFBP5, PAPPA1, PMCH, PRLR, SOCS5, and SSTR2 were associated with SPC (P < 0.05), whereas four SNP in the genes GHR, PAPPA2, PRLR, and SOCS4 were associated with DO (P < 0.05). In conclusion, SNP within genes of the PRL and GH/IGF-1 pathways resulted as predictors of reproductive phenotypes in heat-stressed Holstein cows, and these SNP are proposed as candidates for a marker-assisted selection program intended to improve fertility of dairy cattle raised in warm climates.
Evaluating Hierarchical Structure in Music Annotations
McFee, Brian; Nieto, Oriol; Farbood, Morwaread M.; Bello, Juan Pablo
2017-01-01
Music exhibits structure at multiple scales, ranging from motifs to large-scale functional components. When inferring the structure of a piece, different listeners may attend to different temporal scales, which can result in disagreements when they describe the same piece. In the field of music informatics research (MIR), it is common to use corpora annotated with structural boundaries at different levels. By quantifying disagreements between multiple annotators, previous research has yielded several insights relevant to the study of music cognition. First, annotators tend to agree when structural boundaries are ambiguous. Second, this ambiguity seems to depend on musical features, time scale, and genre. Furthermore, it is possible to tune current annotation evaluation metrics to better align with these perceptual differences. However, previous work has not directly analyzed the effects of hierarchical structure because the existing methods for comparing structural annotations are designed for “flat” descriptions, and do not readily generalize to hierarchical annotations. In this paper, we extend and generalize previous work on the evaluation of hierarchical descriptions of musical structure. We derive an evaluation metric which can compare hierarchical annotations holistically across multiple levels. sing this metric, we investigate inter-annotator agreement on the multilevel annotations of two different music corpora, investigate the influence of acoustic properties on hierarchical annotations, and evaluate existing hierarchical segmentation algorithms against the distribution of inter-annotator agreement. PMID:28824514
Neuhaus, Klaus; Landstorfer, Richard; Fellner, Lea; Simon, Svenja; Schafferhans, Andrea; Goldberg, Tatyana; Marx, Harald; Ozoline, Olga N; Rost, Burkhard; Kuster, Bernhard; Keim, Daniel A; Scherer, Siegfried
2016-02-24
Genomes of E. coli, including that of the human pathogen Escherichia coli O157:H7 (EHEC) EDL933, still harbor undetected protein-coding genes which, apparently, have escaped annotation due to their small size and non-essential function. To find such genes, global gene expression of EHEC EDL933 was examined, using strand-specific RNAseq (transcriptome), ribosomal footprinting (translatome) and mass spectrometry (proteome). Using the above methods, 72 short, non-annotated protein-coding genes were detected. All of these showed signals in the ribosomal footprinting assay indicating mRNA translation. Seven were verified by mass spectrometry. Fifty-seven genes are annotated in other enterobacteriaceae, mainly as hypothetical genes; the remaining 15 genes constitute novel discoveries. In addition, protein structure and function were predicted computationally and compared between EHEC-encoded proteins and 100-times randomly shuffled proteins. Based on this comparison, 61 of the 72 novel proteins exhibit predicted structural and functional features similar to those of annotated proteins. Many of the novel genes show differential transcription when grown under eleven diverse growth conditions suggesting environmental regulation. Three genes were found to confer a phenotype in previous studies, e.g., decreased cattle colonization. These findings demonstrate that ribosomal footprinting can be used to detect novel protein coding genes, contributing to the growing body of evidence that hypothetical genes are not annotation artifacts and opening an additional way to study their functionality. All 72 genes are taxonomically restricted and, therefore, appear to have evolved relatively recently de novo.
Naqvi, Ahmad Abu Turab; Shahbaaz, Mohd; Ahmad, Faizan; Hassan, Md Imtaiyaz
2015-01-01
Syphilis is a globally occurring venereal disease, and its infection is propagated through sexual contact. The causative agent of syphilis, Treponema pallidum ssp. pallidum, a Gram-negative sphirochaete, is an obligate human parasite. Genome of T. pallidum ssp. pallidum SS14 strain (RefSeq NC_010741.1) encodes 1,027 proteins, of which 444 proteins are known as hypothetical proteins (HPs), i.e., proteins of unknown functions. Here, we performed functional annotation of HPs of T. pallidum ssp. pallidum using various database, domain architecture predictors, protein function annotators and clustering tools. We have analyzed the sequences of 444 HPs of T. pallidum ssp. pallidum and subsequently predicted the function of 207 HPs with a high level of confidence. However, functions of 237 HPs are predicted with less accuracy. We found various enzymes, transporters, binding proteins in the annotated group of HPs that may be possible molecular targets, facilitating for the survival of pathogen. Our comprehensive analysis helps to understand the mechanism of pathogenesis to provide many novel potential therapeutic interventions.
Aviation medicine translations : annotated bibliography of recently translated material, VI.
DOT National Transportation Integrated Search
1971-01-01
An annotated bibliography of translations of foreign-language articles is presented. The 22 entries are concerned with studies in aviation medicine, vestibular function, body temperature, color vision, cholinesterase, nystagmus, alcohol, vestibulo-oc...
Aviation medicine translations : annotated bibliography of recently translated material, V .
DOT National Transportation Integrated Search
1968-04-01
An annotated bibliography of translations of foreign-language articles is presented. The 24 entries are concerned with studies in aviation medicine, vestibular function, hearing, intercontinental flight, visual illusions, aviation visual aids, body t...
DePietro, Paul J; Julfayev, Elchin S; McLaughlin, William A
2013-10-21
Protein Structure Initiative:Biology (PSI:Biology) is the third phase of PSI where protein structures are determined in high-throughput to characterize their biological functions. The transition to the third phase entailed the formation of PSI:Biology Partnerships which are composed of structural genomics centers and biomedical science laboratories. We present a method to examine the impact of protein structures determined under the auspices of PSI:Biology by measuring their rates of annotations. The mean numbers of annotations per structure and per residue are examined. These are designed to provide measures of the amount of structure to function connections that can be leveraged from each structure. One result is that PSI:Biology structures are found to have a higher rate of annotations than structures determined during the first two phases of PSI. A second result is that the subset of PSI:Biology structures determined through PSI:Biology Partnerships have a higher rate of annotations than those determined exclusive of those partnerships. Both results hold when the annotation rates are examined either at the level of the entire protein or for annotations that are known to fall at specific residues within the portion of the protein that has a determined structure. We conclude that PSI:Biology determines structures that are estimated to have a higher degree of biomedical interest than those determined during the first two phases of PSI based on a broad array of biomedical annotations. For the PSI:Biology Partnerships, we see that there is an associated added value that represents part of the progress toward the goals of PSI:Biology. We interpret the added value to mean that team-based structural biology projects that utilize the expertise and technologies of structural genomics centers together with biological laboratories in the community are conducted in a synergistic manner. We show that the annotation rates can be used in conjunction with established metrics, i.e. the numbers of structures and impact of publication records, to monitor the progress of PSI:Biology towards its goals of examining structure to function connections of high biomedical relevance. The metric provides an objective means to quantify the overall impact of PSI:Biology as it uses biomedical annotations from external sources.
2013-01-01
Background Protein Structure Initiative:Biology (PSI:Biology) is the third phase of PSI where protein structures are determined in high-throughput to characterize their biological functions. The transition to the third phase entailed the formation of PSI:Biology Partnerships which are composed of structural genomics centers and biomedical science laboratories. We present a method to examine the impact of protein structures determined under the auspices of PSI:Biology by measuring their rates of annotations. The mean numbers of annotations per structure and per residue are examined. These are designed to provide measures of the amount of structure to function connections that can be leveraged from each structure. Results One result is that PSI:Biology structures are found to have a higher rate of annotations than structures determined during the first two phases of PSI. A second result is that the subset of PSI:Biology structures determined through PSI:Biology Partnerships have a higher rate of annotations than those determined exclusive of those partnerships. Both results hold when the annotation rates are examined either at the level of the entire protein or for annotations that are known to fall at specific residues within the portion of the protein that has a determined structure. Conclusions We conclude that PSI:Biology determines structures that are estimated to have a higher degree of biomedical interest than those determined during the first two phases of PSI based on a broad array of biomedical annotations. For the PSI:Biology Partnerships, we see that there is an associated added value that represents part of the progress toward the goals of PSI:Biology. We interpret the added value to mean that team-based structural biology projects that utilize the expertise and technologies of structural genomics centers together with biological laboratories in the community are conducted in a synergistic manner. We show that the annotation rates can be used in conjunction with established metrics, i.e. the numbers of structures and impact of publication records, to monitor the progress of PSI:Biology towards its goals of examining structure to function connections of high biomedical relevance. The metric provides an objective means to quantify the overall impact of PSI:Biology as it uses biomedical annotations from external sources. PMID:24139526
Genomic Sequence Variation Markup Language (GSVML).
Nakaya, Jun; Kimura, Michio; Hiroi, Kaei; Ido, Keisuke; Yang, Woosung; Tanaka, Hiroshi
2010-02-01
With the aim of making good use of internationally accumulated genomic sequence variation data, which is increasing rapidly due to the explosive amount of genomic research at present, the development of an interoperable data exchange format and its international standardization are necessary. Genomic Sequence Variation Markup Language (GSVML) will focus on genomic sequence variation data and human health applications, such as gene based medicine or pharmacogenomics. We developed GSVML through eight steps, based on case analysis and domain investigations. By focusing on the design scope to human health applications and genomic sequence variation, we attempted to eliminate ambiguity and to ensure practicability. We intended to satisfy the requirements derived from the use case analysis of human-based clinical genomic applications. Based on database investigations, we attempted to minimize the redundancy of the data format, while maximizing the data covering range. We also attempted to ensure communication and interface ability with other Markup Languages, for exchange of omics data among various omics researchers or facilities. The interface ability with developing clinical standards, such as the Health Level Seven Genotype Information model, was analyzed. We developed the human health-oriented GSVML comprising variation data, direct annotation, and indirect annotation categories; the variation data category is required, while the direct and indirect annotation categories are optional. The annotation categories contain omics and clinical information, and have internal relationships. For designing, we examined 6 cases for three criteria as human health application and 15 data elements for three criteria as data formats for genomic sequence variation data exchange. The data format of five international SNP databases and six Markup Languages and the interface ability to the Health Level Seven Genotype Model in terms of 317 items were investigated. GSVML was developed as a potential data exchanging format for genomic sequence variation data exchange focusing on human health applications. The international standardization of GSVML is necessary, and is currently underway. GSVML can be applied to enhance the utilization of genomic sequence variation data worldwide by providing a communicable platform between clinical and research applications. Copyright 2009 Elsevier Ireland Ltd. All rights reserved.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ansong, Charles; Tolic, Nikola; Purvine, Samuel O.
Complete and accurate genome annotation is crucial for comprehensive and systematic studies of biological systems. For example systems biology-oriented genome scale modeling efforts greatly benefit from accurate annotation of protein-coding genes to develop proper functioning models. However, determining protein-coding genes for most new genomes is almost completely performed by inference, using computational predictions with significant documented error rates (> 15%). Furthermore, gene prediction programs provide no information on biologically important post-translational processing events critical for protein function. With the ability to directly measure peptides arising from expressed proteins, mass spectrometry-based proteomics approaches can be used to augment and verify codingmore » regions of a genomic sequence and importantly detect post-translational processing events. In this study we utilized “shotgun” proteomics to guide accurate primary genome annotation of the bacterial pathogen Salmonella Typhimurium 14028 to facilitate a systems-level understanding of Salmonella biology. The data provides protein-level experimental confirmation for 44% of predicted protein-coding genes, suggests revisions to 48 genes assigned incorrect translational start sites, and uncovers 13 non-annotated genes missed by gene prediction programs. We also present a comprehensive analysis of post-translational processing events in Salmonella, revealing a wide range of complex chemical modifications (70 distinct modifications) and confirming more than 130 signal peptide and N-terminal methionine cleavage events in Salmonella. This study highlights several ways in which proteomics data applied during the primary stages of annotation can improve the quality of genome annotations, especially with regards to the annotation of mature protein products.« less
The CTD2 Center at the University of Texas MD Anderson Cancer Center utilized a functional annotation of mutations and fusions found in human cancers using two cell models, Ba/F3 (murine pro-B suspension cells) and MCF10A (human non-tumorigenic mammary epithelial cells). Read the abstract
The CTD2 Center at the University of Texas MD Anderson Cancer Center utilized a functional annotation of mutations and fusions found in human cancers using two cell models, Ba/F3 (murine pro-B suspension cells) and MCF10A (human non-tumorigenic mammary epithelial cells). Read the abstract
Report on the 2011 Critical Assessment of Function Annotation (CAFA) meeting
DOE Office of Scientific and Technical Information (OSTI.GOV)
Friedberg, Iddo
2015-01-21
The Critical Assessment of Function Annotation meeting was held July 14-15, 2011 at the Austria Conference Center in Vienna, Austria. There were 73 registered delegates at the meeting. We thank the DOE for this award. It helped us organize and support a scientific meeting AFP 2011 as a special interest group (SIG) meeting associated with the ISMB 2011 conference. The conference was held in Vienna, Austria, in July 2011. The AFP SIG was held on July 15-16, 2011 (immediately preceding the conference). The meeting consisted of two components, the first being a series of talks (invited and contributed) and discussionmore » sections dedicated to protein function research, with an emphasis on the theory and practice of computational methods utilized in functional annotation. The second component provided a large-scale assessment of computational methods through participation in the Critical Assessment of Functional Annotation (CAFA). The meeting was exciting and, based on feedback, quite successful. There were 73 registered participants. The schedule was only slightly different from the one proposed, due to two cancellations. Dr. Olga Troyanskaya has canceled and we invited Dr. David Jones instead. Similarly, instead of Dr. Richard Roberts, Dr. Simon Kasif gave a closing keynote. The remaining invited speakers were Janet Thornton (EBI) and Amos Bairoch (University of Geneva).« less
Genome-Wide SNP Detection, Validation, and Development of an 8K SNP Array for Apple
Chagné, David; Crowhurst, Ross N.; Troggio, Michela; Davey, Mark W.; Gilmore, Barbara; Lawley, Cindy; Vanderzande, Stijn; Hellens, Roger P.; Kumar, Satish; Cestaro, Alessandro; Velasco, Riccardo; Main, Dorrie; Rees, Jasper D.; Iezzoni, Amy; Mockler, Todd; Wilhelm, Larry; Van de Weg, Eric; Gardiner, Susan E.; Bassil, Nahla; Peace, Cameron
2012-01-01
As high-throughput genetic marker screening systems are essential for a range of genetics studies and plant breeding applications, the International RosBREED SNP Consortium (IRSC) has utilized the Illumina Infinium® II system to develop a medium- to high-throughput SNP screening tool for genome-wide evaluation of allelic variation in apple (Malus×domestica) breeding germplasm. For genome-wide SNP discovery, 27 apple cultivars were chosen to represent worldwide breeding germplasm and re-sequenced at low coverage with the Illumina Genome Analyzer II. Following alignment of these sequences to the whole genome sequence of ‘Golden Delicious’, SNPs were identified using SoapSNP. A total of 2,113,120 SNPs were detected, corresponding to one SNP to every 288 bp of the genome. The Illumina GoldenGate® assay was then used to validate a subset of 144 SNPs with a range of characteristics, using a set of 160 apple accessions. This validation assay enabled fine-tuning of the final subset of SNPs for the Illumina Infinium® II system. The set of stringent filtering criteria developed allowed choice of a set of SNPs that not only exhibited an even distribution across the apple genome and a range of minor allele frequencies to ensure utility across germplasm, but also were located in putative exonic regions to maximize genotyping success rate. A total of 7867 apple SNPs was established for the IRSC apple 8K SNP array v1, of which 5554 were polymorphic after evaluation in segregating families and a germplasm collection. This publicly available genomics resource will provide an unprecedented resolution of SNP haplotypes, which will enable marker-locus-trait association discovery, description of the genetic architecture of quantitative traits, investigation of genetic variation (neutral and functional), and genomic selection in apple. PMID:22363718
Cruz, Mercedes Cecilia; Ruano, Gustavo; Wolf, Marcus; Hecker, Dominic; Vidaurre, Elza Castro; Schmittgens, Ralph; Rajal, Verónica Beatriz
2015-02-01
A novel and versatile plasma reactor was used to modify Polyethersulphone commercial membranes. The equipment was applied to: i) functionalize the membranes with low-temperature plasmas, ii) deposit a film of poly(methyl methacrylate) (PMMA) by Plasma Enhanced Chemical Vapor Deposition (PECVD) and, iii) deposit silver nanoparticles (SNP) by Gas Flow Sputtering. Each modification process was performed in the same reactor consecutively, without exposure of the membranes to atmospheric air. Scanning electron microscopy and transmission electron microscopy were used to characterize the particles and modified membranes. SNP are evenly distributed on the membrane surface. Particle fixation and transport inside membranes were assessed before- and after-washing assays by X-ray photoelectron spectroscopy depth profiling analysis. PMMA addition improved SNP fixation. Plasma-treated membranes showed higher hydrophilicity. Anti-biofouling activity was successfully achieved against Gram-positive ( Enterococcus faecalis ) and -negative ( Salmonella Typhimurium) bacteria. Therefore, disinfection by ultrafiltration showed substantial resistance to biofouling. The post-synthesis functionalization process developed provides a more efficient fabrication route for anti-biofouling and anti-bacterial membranes used in the water treatment field. To the best of our knowledge, this is the first report of a gas phase condensation process combined with a PECVD procedure in order to deposit SNP on commercial membranes to inhibit biofouling formation.
Cruz, Mercedes Cecilia; Ruano, Gustavo; Wolf, Marcus; Hecker, Dominic; Vidaurre, Elza Castro; Schmittgens, Ralph; Rajal, Verónica Beatriz
2015-01-01
A novel and versatile plasma reactor was used to modify Polyethersulphone commercial membranes. The equipment was applied to: i) functionalize the membranes with low-temperature plasmas, ii) deposit a film of poly(methyl methacrylate) (PMMA) by Plasma Enhanced Chemical Vapor Deposition (PECVD) and, iii) deposit silver nanoparticles (SNP) by Gas Flow Sputtering. Each modification process was performed in the same reactor consecutively, without exposure of the membranes to atmospheric air. Scanning electron microscopy and transmission electron microscopy were used to characterize the particles and modified membranes. SNP are evenly distributed on the membrane surface. Particle fixation and transport inside membranes were assessed before- and after-washing assays by X-ray photoelectron spectroscopy depth profiling analysis. PMMA addition improved SNP fixation. Plasma-treated membranes showed higher hydrophilicity. Anti-biofouling activity was successfully achieved against Gram-positive (Enterococcus faecalis) and -negative (Salmonella Typhimurium) bacteria. Therefore, disinfection by ultrafiltration showed substantial resistance to biofouling. The post-synthesis functionalization process developed provides a more efficient fabrication route for anti-biofouling and anti-bacterial membranes used in the water treatment field. To the best of our knowledge, this is the first report of a gas phase condensation process combined with a PECVD procedure in order to deposit SNP on commercial membranes to inhibit biofouling formation. PMID:26166926
The what, where, how and why of gene ontology—a primer for bioinformaticians
du Plessis, Louis; Škunca, Nives
2011-01-01
With high-throughput technologies providing vast amounts of data, it has become more important to provide systematic, quality annotations. The Gene Ontology (GO) project is the largest resource for cataloguing gene function. Nonetheless, its use is not yet ubiquitous and is still fraught with pitfalls. In this review, we provide a short primer to the GO for bioinformaticians. We summarize important aspects of the structure of the ontology, describe sources and types of functional annotations, survey measures of GO annotation similarity, review typical uses of GO and discuss other important considerations pertaining to the use of GO in bioinformatics applications. PMID:21330331
Chado Controller: advanced annotation management with a community annotation system
Guignon, Valentin; Droc, Gaëtan; Alaux, Michael; Baurens, Franc-Christophe; Garsmeur, Olivier; Poiron, Claire; Carver, Tim; Rouard, Mathieu; Bocs, Stéphanie
2012-01-01
Summary: We developed a controller that is compliant with the Chado database schema, GBrowse and genome annotation-editing tools such as Artemis and Apollo. It enables the management of public and private data, monitors manual annotation (with controlled vocabularies, structural and functional annotation controls) and stores versions of annotation for all modified features. The Chado controller uses PostgreSQL and Perl. Availability: The Chado Controller package is available for download at http://www.gnpannot.org/content/chado-controller and runs on any Unix-like operating system, and documentation is available at http://www.gnpannot.org/content/chado-controller-doc The system can be tested using the GNPAnnot Sandbox at http://www.gnpannot.org/content/gnpannot-sandbox-form Contact: valentin.guignon@cirad.fr; stephanie.sidibe-bocs@cirad.fr Supplementary information: Supplementary data are available at Bioinformatics online. PMID:22285827
Ten steps to get started in Genome Assembly and Annotation
Dominguez Del Angel, Victoria; Hjerde, Erik; Sterck, Lieven; Capella-Gutierrez, Salvadors; Notredame, Cederic; Vinnere Pettersson, Olga; Amselem, Joelle; Bouri, Laurent; Bocs, Stephanie; Klopp, Christophe; Gibrat, Jean-Francois; Vlasova, Anna; Leskosek, Brane L.; Soler, Lucile; Binzer-Panchal, Mahesh; Lantz, Henrik
2018-01-01
As a part of the ELIXIR-EXCELERATE efforts in capacity building, we present here 10 steps to facilitate researchers getting started in genome assembly and genome annotation. The guidelines given are broadly applicable, intended to be stable over time, and cover all aspects from start to finish of a general assembly and annotation project. Intrinsic properties of genomes are discussed, as is the importance of using high quality DNA. Different sequencing technologies and generally applicable workflows for genome assembly are also detailed. We cover structural and functional annotation and encourage readers to also annotate transposable elements, something that is often omitted from annotation workflows. The importance of data management is stressed, and we give advice on where to submit data and how to make your results Findable, Accessible, Interoperable, and Reusable (FAIR). PMID:29568489
Liu, Jindong; He, Zhonghu; Wu, Ling; Bai, Bin; Wen, Weie; Xie, Chaojie; Xia, Xianchun
2015-01-01
Stripe rust is one of the most devastating diseases of wheat (Triticum aestivum) worldwide. Adult-plant resistance (APR) is an efficient approach to provide long-term protection of wheat from the disease. The Chinese winter wheat cultivar Zhong 892 has a moderate level of APR to stripe rust in the field. To determine the inheritance of the APR resistance in this cultivar, 273 F6 recombinant inbred lines (RILs) were developed from a cross between Linmai 2 and Zhong 892. The RILs were evaluated for maximum disease severity (MDS) in two sites during the 2011–2012, 2012–2013 and 2013–2014 cropping seasons, providing data for five environments. Illumina 90k SNP (single nucleotide polymorphism) chips were used to genotype the RILs and their parents. Composite interval mapping (CIM) detected eight QTL, namely QYr.caas-2AL, QYr.caas-2BL.3, QYr.caas-3AS, QYr.caas-3BS, QYr.caas-5DL, QYr.caas-6AL, QYr.caas-7AL and QYr.caas-7DS.1, respectively. All except QYr.caas-2BL.3 resistance alleles were contributed by Zhong 892. QYr.caas-3AS and QYr.caas-3BS conferred stable resistance to stripe rust in all environments, explaining 6.2–17.4% and 5.0–11.5% of the phenotypic variances, respectively. The genome scan of SNP sequences tightly linked to QTL for APR against annotated proteins in wheat and related cereals genomes identified two candidate genes (autophagy-related gene and disease resistance gene RGA1), significantly associated with stripe rust resistance. These QTL and their closely linked SNP markers, in combination with kompetitive allele specific PCR (KASP) technology, are potentially useful for improving stripe rust resistances in wheat breeding. PMID:26714310
Aravind Kumar, M; Singh, Vineeta; Naushad, Shaik Mohammad; Shanker, Uday; Lakshmi Narasu, M
2018-05-01
In the view of aggressive nature of Triple-Negative Breast cancer (TNBC) due to the lack of receptors (ER, PR, HER2) and high incidence of drug resistance associated with it, a case-control association study was conducted to identify the contributing genetic risk factors for Triple-negative breast cancer (TNBC). A total of 30 TNBC patients and 50 age and gender-matched controls of Indian origin were screened for 9,00,000 SNP markers using microarray-based SNP genotyping approach. The initial PLINK association analysis (p < 0.01, MAF 0.14-0.44, OR 10-24) identified 28 non-synonymous SNPs and one stop gain mutation in the exonic region as possible determinants of TNBC risk. All the 29 SNPs were annotated using ANNOVAR. The interactions between these markers were evaluated using Multifactor dimensionality reduction (MDR) analysis. The interactions were in the following order: exm408776 > exm1278309 > rs316389 > rs1651654 > rs635538 > exm1292477. Recursive partitioning analysis (RPA) was performed to construct decision tree useful in predicting TNBC risk. As shown in this analysis, rs1651654 and exm585172 SNPs are found to be determinants of TNBC risk. Artificial neural network model was used to generate the Receiver operating characteristic curves (ROC), which showed high sensitivity and specificity (AUC-0.94) of these markers. To conclude, among the 9,00,000 SNPs tested, CCDC42 exm1292477, ANXA3 exm408776, SASH1 exm585172 are found to be the most significant genetic predicting factors for TNBC. The interactions among exm408776, exm1278309, rs316389, rs1651654, rs635538, exm1292477 SNPs inflate the risk for TNBC further. Targeted analysis of these SNPs and genes alone also will have similar clinical utility in predicting TNBC.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chai, Juanjuan; Kora, Guruprasad; Ahn, Tae-Hyuk
2014-10-09
To supply some background, phylogenetic studies have provided detailed knowledge on the evolutionary mechanisms of genes and species in Bacteria and Archaea. However, the evolution of cellular functions, represented by metabolic pathways and biological processes, has not been systematically characterized. Many clades in the prokaryotic tree of life have now been covered by sequenced genomes in GenBank. This enables a large-scale functional phylogenomics study of many computationally inferred cellular functions across all sequenced prokaryotes. Our results show a total of 14,727 GenBank prokaryotic genomes were re-annotated using a new protein family database, UniFam, to obtain consistent functional annotations for accuratemore » comparison. The functional profile of a genome was represented by the biological process Gene Ontology (GO) terms in its annotation. The GO term enrichment analysis differentiated the functional profiles between selected archaeal taxa. 706 prokaryotic metabolic pathways were inferred from these genomes using Pathway Tools and MetaCyc. The consistency between the distribution of metabolic pathways in the genomes and the phylogenetic tree of the genomes was measured using parsimony scores and retention indices. The ancestral functional profiles at the internal nodes of the phylogenetic tree were reconstructed to track the gains and losses of metabolic pathways in evolutionary history. In conclusion, our functional phylogenomics analysis shows divergent functional profiles of taxa and clades. Such function-phylogeny correlation stems from a set of clade-specific cellular functions with low parsimony scores. On the other hand, many cellular functions are sparsely dispersed across many clades with high parsimony scores. These different types of cellular functions have distinct evolutionary patterns reconstructed from the prokaryotic tree.« less
Mosquera, Teresa; Alvarez, Maria Fernanda; Jiménez-Gómez, José M; Muktar, Meki Shehabu; Paulo, Maria João; Steinemann, Sebastian; Li, Jinquan; Draffehn, Astrid; Hofmann, Andrea; Lübeck, Jens; Strahwald, Josef; Tacke, Eckhard; Hofferbert, Hans-Reinhardt; Walkemeier, Birgit; Gebhardt, Christiane
2016-01-01
The oomycete Phytophthora infestans causes late blight of potato, which can completely destroy the crop. Therefore, for the past 160 years, late blight has been the most important potato disease worldwide. The identification of cultivars with high and durable field resistance to P. infestans is an objective of most potato breeding programs. This type of resistance is polygenic and therefore quantitative. Its evaluation requires multi-year and location trials. Furthermore, quantitative resistance to late blight correlates with late plant maturity, a negative agricultural trait. Knowledge of the molecular genetic basis of quantitative resistance to late blight not compromised by late maturity is very limited. It is however essential for developing diagnostic DNA markers that facilitate the efficient combination of superior resistance alleles in improved cultivars. We used association genetics in a population of 184 tetraploid potato cultivars in order to identify single nucleotide polymorphisms (SNPs) that are associated with maturity corrected resistance (MCR) to late blight. The population was genotyped for almost 9000 SNPs from three different sources. The first source was candidate genes specifically selected for their function in the jasmonate pathway. The second source was novel candidate genes selected based on comparative transcript profiling (RNA-Seq) of groups of genotypes with contrasting levels of quantitative resistance to P. infestans. The third source was the first generation 8.3k SolCAP SNP genotyping array available in potato for genome wide association studies (GWAS). Twenty seven SNPs from all three sources showed robust association with MCR. Some of those were located in genes that are strong candidates for directly controlling quantitative resistance, based on functional annotation. Most important were: a lipoxygenase (jasmonate pathway), a 3-hydroxy-3-methylglutaryl coenzyme A reductase (mevalonate pathway), a P450 protein (terpene biosynthesis), a transcription factor and a homolog of a major gene for resistance to P. infestans from the wild potato species Solanum venturii. The candidate gene approach and GWAS complemented each other as they identified different genes. The results of this study provide new insight in the molecular genetic basis of quantitative resistance in potato and a toolbox of diagnostic SNP markers for breeding applications.
Jiménez-Gómez, José M.; Muktar, Meki Shehabu; Paulo, Maria João; Steinemann, Sebastian; Li, Jinquan; Draffehn, Astrid; Hofmann, Andrea; Lübeck, Jens; Strahwald, Josef; Tacke, Eckhard; Hofferbert, Hans-Reinhardt; Walkemeier, Birgit; Gebhardt, Christiane
2016-01-01
The oomycete Phytophthora infestans causes late blight of potato, which can completely destroy the crop. Therefore, for the past 160 years, late blight has been the most important potato disease worldwide. The identification of cultivars with high and durable field resistance to P. infestans is an objective of most potato breeding programs. This type of resistance is polygenic and therefore quantitative. Its evaluation requires multi-year and location trials. Furthermore, quantitative resistance to late blight correlates with late plant maturity, a negative agricultural trait. Knowledge of the molecular genetic basis of quantitative resistance to late blight not compromised by late maturity is very limited. It is however essential for developing diagnostic DNA markers that facilitate the efficient combination of superior resistance alleles in improved cultivars. We used association genetics in a population of 184 tetraploid potato cultivars in order to identify single nucleotide polymorphisms (SNPs) that are associated with maturity corrected resistance (MCR) to late blight. The population was genotyped for almost 9000 SNPs from three different sources. The first source was candidate genes specifically selected for their function in the jasmonate pathway. The second source was novel candidate genes selected based on comparative transcript profiling (RNA-Seq) of groups of genotypes with contrasting levels of quantitative resistance to P. infestans. The third source was the first generation 8.3k SolCAP SNP genotyping array available in potato for genome wide association studies (GWAS). Twenty seven SNPs from all three sources showed robust association with MCR. Some of those were located in genes that are strong candidates for directly controlling quantitative resistance, based on functional annotation. Most important were: a lipoxygenase (jasmonate pathway), a 3-hydroxy-3-methylglutaryl coenzyme A reductase (mevalonate pathway), a P450 protein (terpene biosynthesis), a transcription factor and a homolog of a major gene for resistance to P. infestans from the wild potato species Solanum venturii. The candidate gene approach and GWAS complemented each other as they identified different genes. The results of this study provide new insight in the molecular genetic basis of quantitative resistance in potato and a toolbox of diagnostic SNP markers for breeding applications. PMID:27281327
N-acetylcysteine improves coronary and peripheral vascular function.
Andrews, N P; Prasad, A; Quyyumi, A A
2001-01-01
We investigated whether N-acetylcysteine (NAC), a reduced thiol that modulates redox state and forms adducts of nitric oxide (NO), improves endothelium-dependent vasomotion. Coronary atherosclerosis is associated with endothelial dysfunction and reduced NO activity. In 16 patients undergoing cardiac catheterization, seven with and nine without atherosclerosis, we assessed endothelium-dependent vasodilation with acetylcholine (ACH) and endothelium-independent vasodilation with nitroglycerin (NTG) and sodium nitroprusside (SNP) before and after intracoronary NAC. In 14 patients femoral vascular responses to ACH, NTG and SNP were measured before and after NAC. Intraarterial NAC did not change resting coronary or peripheral vascular tone. N-acetylcysteine potentiated ACH-mediated coronary vasodilation; coronary blood flow was 36 +/- 11% higher (p < 0.02), and epicardial diameter changed from -1.2 +/- 2% constriction to 4.7 +/- 2% dilation after NAC (p = 0.03). Acetylcholine-mediated femoral vasodilation was similarly potentiated by NAC (p = 0.001). Augmentation of the ACH response was similar in patients with or without atherosclerosis. N-acetylcysteine did not affect NTG-mediated vasodilation in either the femoral or coronary circulations and did not alter SNP responses in the femoral circulation. In contrast, coronary vasodilation with SNP was significantly greater after NAC (p < 0.05). Thiol supplementation with NAC improves human coronary and peripheral endothelium-dependent vasodilation. Nitroglycerin responses are not enhanced, but SNP-mediated responses are potentiated only in the coronary circulation. These NO-enhancing effects of thiols reflect the importance of the redox state in the control of vascular function and may be of therapeutic benefit in treating acute and chronic manifestations of atherosclerosis.
Singh, Kh Dhanachandra; Karthikeyan, Muthusamy
2014-12-01
The renin-angiotensin-aldosterone system (RAAS) plays a key role in the regulation of blood pressure (BP). Mutations on the genes that encode components of the RAAS have played a significant role in genetic susceptibility to hypertension and have been intensively scrutinized. The identification of such probably causal mutations not only provides insight into the RAAS but may also serve as antihypertensive therapeutic targets and diagnostic markers. The methods for analyzing the SNPs from the huge dataset of SNPs, containing both functional and neutral SNPs is challenging by the experimental approach on every SNPs to determine their biological significance. To explore the functional significance of genetic mutation (SNPs), we adopted combined sequence and sequence-structure-based SNP analysis algorithm. Out of 3864 SNPs reported in dbSNP, we found 108 missense SNPs in the coding region and remaining in the non-coding region. In this study, we are reporting only those SNPs in coding region to be deleterious when three or more tools are predicted to be deleterious and which have high RMSD from the native structure. Based on these analyses, we have identified two SNPs of REN gene, eight SNPs of AGT gene, three SNPs of ACE gene, two SNPs of AT1R gene, three SNPs of CYP11B2 gene and three SNPs of CMA1 gene in the coding region were found to be deleterious. Further this type of study will be helpful in reducing the cost and time for identification of potential SNP and also helpful in selecting potential SNP for experimental study out of SNP pool.
Sablok, Gaurav; Pérez-Pulido, Antonio J.; Do, Thac; Seong, Tan Y.; Casimiro-Soriguer, Carlos S.; La Porta, Nicola; Ralph, Peter J.; Squartini, Andrea; Muñoz-Merida, Antonio; Harikrishna, Jennifer A.
2016-01-01
Analysis of repetitive DNA sequence content and divergence among the repetitive functional classes is a well-accepted approach for estimation of inter- and intra-generic differences in plant genomes. Among these elements, microsatellites, or Simple Sequence Repeats (SSRs), have been widely demonstrated as powerful genetic markers for species and varieties discrimination. We present PlantFuncSSRs platform having more than 364 plant species with more than 2 million functional SSRs. They are provided with detailed annotations for easy functional browsing of SSRs and with information on primer pairs and associated functional domains. PlantFuncSSRs can be leveraged to identify functional-based genic variability among the species of interest, which might be of particular interest in developing functional markers in plants. This comprehensive on-line portal unifies mining of SSRs from first and next generation sequencing datasets, corresponding primer pairs and associated in-depth functional annotation such as gene ontology annotation, gene interactions and its identification from reference protein databases. PlantFuncSSRs is freely accessible at: http://www.bioinfocabd.upo.es/plantssr. PMID:27446111
A graph-based semantic similarity measure for the gene ontology.
Alvarez, Marco A; Yan, Changhui
2011-12-01
Existing methods for calculating semantic similarities between pairs of Gene Ontology (GO) terms and gene products often rely on external databases like Gene Ontology Annotation (GOA) that annotate gene products using the GO terms. This dependency leads to some limitations in real applications. Here, we present a semantic similarity algorithm (SSA), that relies exclusively on the GO. When calculating the semantic similarity between a pair of input GO terms, SSA takes into account the shortest path between them, the depth of their nearest common ancestor, and a novel similarity score calculated between the definitions of the involved GO terms. In our work, we use SSA to calculate semantic similarities between pairs of proteins by combining pairwise semantic similarities between the GO terms that annotate the involved proteins. The reliability of SSA was evaluated by comparing the resulting semantic similarities between proteins with the functional similarities between proteins derived from expert annotations or sequence similarity. Comparisons with existing state-of-the-art methods showed that SSA is highly competitive with the other methods. SSA provides a reliable measure for semantics similarity independent of external databases of functional-annotation observations.
Optimizing high performance computing workflow for protein functional annotation.
Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene
2014-09-10
Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optmized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.
Optimizing high performance computing workflow for protein functional annotation
Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene
2014-01-01
Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optmized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data. PMID:25313296
The Protein Information Resource: an integrated public resource of functional annotation of proteins
Wu, Cathy H.; Huang, Hongzhan; Arminski, Leslie; Castro-Alvear, Jorge; Chen, Yongxing; Hu, Zhang-Zhi; Ledley, Robert S.; Lewis, Kali C.; Mewes, Hans-Werner; Orcutt, Bruce C.; Suzek, Baris E.; Tsugita, Akira; Vinayaka, C. R.; Yeh, Lai-Su L.; Zhang, Jian; Barker, Winona C.
2002-01-01
The Protein Information Resource (PIR) serves as an integrated public resource of functional annotation of protein data to support genomic/proteomic research and scientific discovery. The PIR, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the PIR-International Protein Sequence Database (PSD), the major annotated protein sequence database in the public domain, containing about 250 000 proteins. To improve protein annotation and the coverage of experimentally validated data, a bibliography submission system is developed for scientists to submit, categorize and retrieve literature information. Comprehensive protein information is available from iProClass, which includes family classification at the superfamily, domain and motif levels, structural and functional features of proteins, as well as cross-references to over 40 biological databases. To provide timely and comprehensive protein data with source attribution, we have introduced a non-redundant reference protein database, PIR-NREF. The database consists of about 800 000 proteins collected from PIR-PSD, SWISS-PROT, TrEMBL, GenPept, RefSeq and PDB, with composite protein names and literature data. To promote database interoperability, we provide XML data distribution and open database schema, and adopt common ontologies. The PIR web site (http://pir.georgetown.edu/) features data mining and sequence analysis tools for information retrieval and functional identification of proteins based on both sequence and annotation information. The PIR databases and other files are also available by FTP (ftp://nbrfa.georgetown.edu/pir_databases). PMID:11752247
Coding gestural behavior with the NEUROGES--ELAN system.
Lausberg, Hedda; Sloetjes, Han
2009-08-01
We present a coding system combined with an annotation tool for the analysis of gestural behavior. The NEUROGES coding system consists of three modules that progress from gesture kinetics to gesture function. Grounded on empirical neuropsychological and psychological studies, the theoretical assumption behind NEUROGES is that its main kinetic and functional movement categories are differentially associated with specific cognitive, emotional, and interactive functions. ELAN is a free, multimodal annotation tool for digital audio and video media. It supports multileveled transcription and complies with such standards as XML and Unicode. ELAN allows gesture categories to be stored with associated vocabularies that are reusable by means of template files. The combination of the NEUROGES coding system and the annotation tool ELAN creates an effective tool for empirical research on gestural behavior.
Tian, Wenlan; Paudel, Dev
2017-01-01
Jatropha (Jatropha curcas L.) is an economically important species with a great potential for biodiesel production. To enrich the jatropha genomic databases and resources for microgravity studies, we sequenced and annotated the transcriptome of jatropha and developed SSR and SNP markers from the transcriptome sequences. In total 1,714,433 raw reads with an average length of 441.2 nucleotides were generated. De novo assembling and clustering resulted in 115,611 uniquely assembled sequences (UASs) including 21,418 full-length cDNAs and 23,264 new jatropha transcript sequences. The whole set of UASs were fully annotated, out of which 59,903 (51.81%) were assigned with gene ontology (GO) term, 12,584 (10.88%) had orthologs in Eukaryotic Orthologous Groups (KOG), and 8,822 (7.63%) were mapped to 317 pathways in six different categories in Kyoto Encyclopedia of Genes and Genome (KEGG) database, and it contained 3,588 putative transcription factors. From the UASs, 9,798 SSRs were discovered with AG/CT as the most frequent (45.8%) SSR motif type. Further 38,693 SNPs were detected and 7,584 remained after filtering. This UAS set has enriched the current jatropha genomic databases and provided a large number of genetic markers, which can facilitate jatropha genetic improvement and many other genetic and biological studies. PMID:28154822
Harris, Stephen E; Munshi-South, Jason
2017-11-01
Urbanization significantly alters natural ecosystems and has accelerated globally. Urban wildlife populations are often highly fragmented by human infrastructure, and isolated populations may adapt in response to local urban pressures. However, relatively few studies have identified genomic signatures of adaptation in urban animals. We used a landscape genomic approach to examine signatures of selection in urban populations of white-footed mice (Peromyscus leucopus) in New York City. We analysed 154,770 SNPs identified from transcriptome data from 48 P. leucopus individuals from three urban and three rural populations and used outlier tests to identify evidence of urban adaptation. We accounted for demography by simulating a neutral SNP data set under an inferred demographic history as a null model for outlier analysis. We also tested whether candidate genes were associated with environmental variables related to urbanization. In total, we detected 381 outlier loci and after stringent filtering, identified and annotated 19 candidate loci. Many of the candidate genes were involved in metabolic processes and have well-established roles in metabolizing lipids and carbohydrates. Our results indicate that white-footed mice in New York City are adapting at the biomolecular level to local selective pressures in urban habitats. Annotation of outlier loci suggests selection is acting on metabolic pathways in urban populations, likely related to novel diets in cities that differ from diets in less disturbed areas. © 2017 John Wiley & Sons Ltd.
Functional annotation of chemical libraries across diverse biological processes.
Piotrowski, Jeff S; Li, Sheena C; Deshpande, Raamesh; Simpkins, Scott W; Nelson, Justin; Yashiroda, Yoko; Barber, Jacqueline M; Safizadeh, Hamid; Wilson, Erin; Okada, Hiroki; Gebre, Abraham A; Kubo, Karen; Torres, Nikko P; LeBlanc, Marissa A; Andrusiak, Kerry; Okamoto, Reika; Yoshimura, Mami; DeRango-Adem, Eva; van Leeuwen, Jolanda; Shirahige, Katsuhiko; Baryshnikova, Anastasia; Brown, Grant W; Hirano, Hiroyuki; Costanzo, Michael; Andrews, Brenda; Ohya, Yoshikazu; Osada, Hiroyuki; Yoshida, Minoru; Myers, Chad L; Boone, Charles
2017-09-01
Chemical-genetic approaches offer the potential for unbiased functional annotation of chemical libraries. Mutations can alter the response of cells in the presence of a compound, revealing chemical-genetic interactions that can elucidate a compound's mode of action. We developed a highly parallel, unbiased yeast chemical-genetic screening system involving three key components. First, in a drug-sensitive genetic background, we constructed an optimized diagnostic mutant collection that is predictive for all major yeast biological processes. Second, we implemented a multiplexed (768-plex) barcode-sequencing protocol, enabling the assembly of thousands of chemical-genetic profiles. Finally, based on comparison of the chemical-genetic profiles with a compendium of genome-wide genetic interaction profiles, we predicted compound functionality. Applying this high-throughput approach, we screened seven different compound libraries and annotated their functional diversity. We further validated biological process predictions, prioritized a diverse set of compounds, and identified compounds that appear to have dual modes of action.
Lynx web services for annotations and systems analysis of multi-gene disorders.
Sulakhe, Dinanath; Taylor, Andrew; Balasubramanian, Sandhya; Feng, Bo; Xie, Bingqing; Börnigen, Daniela; Dave, Utpal J; Foster, Ian T; Gilliam, T Conrad; Maltsev, Natalia
2014-07-01
Lynx is a web-based integrated systems biology platform that supports annotation and analysis of experimental data and generation of weighted hypotheses on molecular mechanisms contributing to human phenotypes and disorders of interest. Lynx has integrated multiple classes of biomedical data (genomic, proteomic, pathways, phenotypic, toxicogenomic, contextual and others) from various public databases as well as manually curated data from our group and collaborators (LynxKB). Lynx provides tools for gene list enrichment analysis using multiple functional annotations and network-based gene prioritization. Lynx provides access to the integrated database and the analytical tools via REST based Web Services (http://lynx.ci.uchicago.edu/webservices.html). This comprises data retrieval services for specific functional annotations, services to search across the complete LynxKB (powered by Lucene), and services to access the analytical tools built within the Lynx platform. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
An Atlas of annotations of Hydra vulgaris transcriptome.
Evangelista, Daniela; Tripathi, Kumar Parijat; Guarracino, Mario Rosario
2016-09-22
RNA sequencing takes advantage of the Next Generation Sequencing (NGS) technologies for analyzing RNA transcript counts with an excellent accuracy. Trying to interpret this huge amount of data in biological information is still a key issue, reason for which the creation of web-resources useful for their analysis is highly desiderable. Starting from a previous work, Transcriptator, we present the Atlas of Hydra's vulgaris, an extensible web tool in which its complete transcriptome is annotated. In order to provide to the users an advantageous resource that include the whole functional annotated transcriptome of Hydra vulgaris water polyp, we implemented the Atlas web-tool contains 31.988 accesible and downloadable transcripts of this non-reference model organism. Atlas, as a freely available resource, can be considered a valuable tool to rapidly retrieve functional annotation for transcripts differentially expressed in Hydra vulgaris exposed to the distinct experimental treatments. WEB RESOURCE URL: http://www-labgtp.na.icar.cnr.it/Atlas .
Spencer, Amy V; Cox, Angela; Lin, Wei-Yu; Easton, Douglas F; Michailidou, Kyriaki; Walters, Kevin
2016-04-01
There is a large amount of functional genetic data available, which can be used to inform fine-mapping association studies (in diseases with well-characterised disease pathways). Single nucleotide polymorphism (SNP) prioritization via Bayes factors is attractive because prior information can inform the effect size or the prior probability of causal association. This approach requires the specification of the effect size. If the information needed to estimate a priori the probability density for the effect sizes for causal SNPs in a genomic region isn't consistent or isn't available, then specifying a prior variance for the effect sizes is challenging. We propose both an empirical method to estimate this prior variance, and a coherent approach to using SNP-level functional data, to inform the prior probability of causal association. Through simulation we show that when ranking SNPs by our empirical Bayes factor in a fine-mapping study, the causal SNP rank is generally as high or higher than the rank using Bayes factors with other plausible values of the prior variance. Importantly, we also show that assigning SNP-specific prior probabilities of association based on expert prior functional knowledge of the disease mechanism can lead to improved causal SNPs ranks compared to ranking with identical prior probabilities of association. We demonstrate the use of our methods by applying the methods to the fine mapping of the CASP8 region of chromosome 2 using genotype data from the Collaborative Oncological Gene-Environment Study (COGS) Consortium. The data we analysed included approximately 46,000 breast cancer case and 43,000 healthy control samples. © 2016 The Authors. *Genetic Epidemiology published by Wiley Periodicals, Inc.
Trampush, J W; Yang, M L Z; Yu, J; Knowles, E; Davies, G; Liewald, D C; Starr, J M; Djurovic, S; Melle, I; Sundet, K; Christoforou, A; Reinvang, I; DeRosse, P; Lundervold, A J; Steen, V M; Espeseth, T; Räikkönen, K; Widen, E; Palotie, A; Eriksson, J G; Giegling, I; Konte, B; Roussos, P; Giakoumaki, S; Burdick, K E; Payton, A; Ollier, W; Horan, M; Chiba-Falek, O; Attix, D K; Need, A C; Cirulli, E T; Voineskos, A N; Stefanis, N C; Avramopoulos, D; Hatzimanolis, A; Arking, D E; Smyrnis, N; Bilder, R M; Freimer, N A; Cannon, T D; London, E; Poldrack, R A; Sabb, F W; Congdon, E; Conley, E D; Scult, M A; Dickinson, D; Straub, R E; Donohoe, G; Morris, D; Corvin, A; Gill, M; Hariri, A R; Weinberger, D R; Pendleton, N; Bitsios, P; Rujescu, D; Lahti, J; Le Hellard, S; Keller, M C; Andreassen, O A; Deary, I J; Glahn, D C; Malhotra, A K; Lencz, T
2017-03-01
The complex nature of human cognition has resulted in cognitive genomics lagging behind many other fields in terms of gene discovery using genome-wide association study (GWAS) methods. In an attempt to overcome these barriers, the current study utilized GWAS meta-analysis to examine the association of common genetic variation (~8M single-nucleotide polymorphisms (SNP) with minor allele frequency ⩾1%) to general cognitive function in a sample of 35 298 healthy individuals of European ancestry across 24 cohorts in the Cognitive Genomics Consortium (COGENT). In addition, we utilized individual SNP lookups and polygenic score analyses to identify genetic overlap with other relevant neurobehavioral phenotypes. Our primary GWAS meta-analysis identified two novel SNP loci (top SNPs: rs76114856 in the CENPO gene on chromosome 2 and rs6669072 near LOC105378853 on chromosome 1) associated with cognitive performance at the genome-wide significance level (P<5 × 10 -8 ). Gene-based analysis identified an additional three Bonferroni-corrected significant loci at chromosomes 17q21.31, 17p13.1 and 1p13.3. Altogether, common variation across the genome resulted in a conservatively estimated SNP heritability of 21.5% (s.e.=0.01%) for general cognitive function. Integration with prior GWAS of cognitive performance and educational attainment yielded several additional significant loci. Finally, we found robust polygenic correlations between cognitive performance and educational attainment, several psychiatric disorders, birth length/weight and smoking behavior, as well as a novel genetic association to the personality trait of openness. These data provide new insight into the genetics of neurocognitive function with relevance to understanding the pathophysiology of neuropsychiatric illness.
Trampush, J W; Yang, M L Z; Yu, J; Knowles, E; Davies, G; Liewald, D C; Starr, J M; Djurovic, S; Melle, I; Sundet, K; Christoforou, A; Reinvang, I; DeRosse, P; Lundervold, A J; Steen, V M; Espeseth, T; Räikkönen, K; Widen, E; Palotie, A; Eriksson, J G; Giegling, I; Konte, B; Roussos, P; Giakoumaki, S; Burdick, K E; Payton, A; Ollier, W; Horan, M; Chiba-Falek, O; Attix, D K; Need, A C; Cirulli, E T; Voineskos, A N; Stefanis, N C; Avramopoulos, D; Hatzimanolis, A; Arking, D E; Smyrnis, N; Bilder, R M; Freimer, N A; Cannon, T D; London, E; Poldrack, R A; Sabb, F W; Congdon, E; Conley, E D; Scult, M A; Dickinson, D; Straub, R E; Donohoe, G; Morris, D; Corvin, A; Gill, M; Hariri, A R; Weinberger, D R; Pendleton, N; Bitsios, P; Rujescu, D; Lahti, J; Le Hellard, S; Keller, M C; Andreassen, O A; Deary, I J; Glahn, D C; Malhotra, A K; Lencz, T
2017-01-01
The complex nature of human cognition has resulted in cognitive genomics lagging behind many other fields in terms of gene discovery using genome-wide association study (GWAS) methods. In an attempt to overcome these barriers, the current study utilized GWAS meta-analysis to examine the association of common genetic variation (~8M single-nucleotide polymorphisms (SNP) with minor allele frequency ⩾1%) to general cognitive function in a sample of 35 298 healthy individuals of European ancestry across 24 cohorts in the Cognitive Genomics Consortium (COGENT). In addition, we utilized individual SNP lookups and polygenic score analyses to identify genetic overlap with other relevant neurobehavioral phenotypes. Our primary GWAS meta-analysis identified two novel SNP loci (top SNPs: rs76114856 in the CENPO gene on chromosome 2 and rs6669072 near LOC105378853 on chromosome 1) associated with cognitive performance at the genome-wide significance level (P<5 × 10−8). Gene-based analysis identified an additional three Bonferroni-corrected significant loci at chromosomes 17q21.31, 17p13.1 and 1p13.3. Altogether, common variation across the genome resulted in a conservatively estimated SNP heritability of 21.5% (s.e.=0.01%) for general cognitive function. Integration with prior GWAS of cognitive performance and educational attainment yielded several additional significant loci. Finally, we found robust polygenic correlations between cognitive performance and educational attainment, several psychiatric disorders, birth length/weight and smoking behavior, as well as a novel genetic association to the personality trait of openness. These data provide new insight into the genetics of neurocognitive function with relevance to understanding the pathophysiology of neuropsychiatric illness. PMID:28093568
Annotating Socio-Cultural Structures in Text
2012-10-31
parts of speech (POS) within text, using the Stanford Part of Speech Tagger (Stanford Log-Linear, 2011). The ERDC-CERL taxonomy is then used to...annotated NP/VP Pane: Shows the sentence parsed using the Parts of Speech tagger Document View Pane: Specifies the document (being annotated) in three...first parsed using the Stanford Parts of Speech tagger and converted to an XML document both components which are done through the Import function
Deutschbauer, Adam; Price, Morgan N.; Wetmore, Kelly M.; Shao, Wenjun; Baumohl, Jason K.; Xu, Zhuchen; Nguyen, Michelle; Tamse, Raquel; Davis, Ronald W.; Arkin, Adam P.
2011-01-01
Most genes in bacteria are experimentally uncharacterized and cannot be annotated with a specific function. Given the great diversity of bacteria and the ease of genome sequencing, high-throughput approaches to identify gene function experimentally are needed. Here, we use pools of tagged transposon mutants in the metal-reducing bacterium Shewanella oneidensis MR-1 to probe the mutant fitness of 3,355 genes in 121 diverse conditions including different growth substrates, alternative electron acceptors, stresses, and motility. We find that 2,350 genes have a pattern of fitness that is significantly different from random and 1,230 of these genes (37% of our total assayed genes) have enough signal to show strong biological correlations. We find that genes in all functional categories have phenotypes, including hundreds of hypotheticals, and that potentially redundant genes (over 50% amino acid identity to another gene in the genome) are also likely to have distinct phenotypes. Using fitness patterns, we were able to propose specific molecular functions for 40 genes or operons that lacked specific annotations or had incomplete annotations. In one example, we demonstrate that the previously hypothetical gene SO_3749 encodes a functional acetylornithine deacetylase, thus filling a missing step in S. oneidensis metabolism. Additionally, we demonstrate that the orphan histidine kinase SO_2742 and orphan response regulator SO_2648 form a signal transduction pathway that activates expression of acetyl-CoA synthase and is required for S. oneidensis to grow on acetate as a carbon source. Lastly, we demonstrate that gene expression and mutant fitness are poorly correlated and that mutant fitness generates more confident predictions of gene function than does gene expression. The approach described here can be applied generally to create large-scale gene-phenotype maps for evidence-based annotation of gene function in prokaryotes. PMID:22125499
Wang, Shur-Jen; Laulederkind, Stanley J F; Hayman, G Thomas; Petri, Victoria; Smith, Jennifer R; Tutaj, Marek; Nigam, Rajni; Dwinell, Melinda R; Shimoyama, Mary
2016-08-01
Cardiovascular diseases are complex diseases caused by a combination of genetic and environmental factors. To facilitate progress in complex disease research, the Rat Genome Database (RGD) provides the community with a disease portal where genome objects and biological data related to cardiovascular diseases are systematically organized. The purpose of this study is to present biocuration at RGD, including disease, genetic, and pathway data. The RGD curation team uses controlled vocabularies/ontologies to organize data curated from the published literature or imported from disease and pathway databases. These organized annotations are associated with genes, strains, and quantitative trait loci (QTLs), thus linking functional annotations to genome objects. Screen shots from the web pages are used to demonstrate the organization of annotations at RGD. The human cardiovascular disease genes identified by annotations were grouped according to data sources and their annotation profiles were compared by in-house tools and other enrichment tools available to the public. The analysis results show that the imported cardiovascular disease genes from ClinVar and OMIM are functionally different from the RGD manually curated genes in terms of pathway and Gene Ontology annotations. The inclusion of disease genes from other databases enriches the collection of disease genes not only in quantity but also in quality. Copyright © 2016 the American Physiological Society.
AnnoLnc: a web server for systematically annotating novel human lncRNAs.
Hou, Mei; Tang, Xing; Tian, Feng; Shi, Fangyuan; Liu, Fenglin; Gao, Ge
2016-11-16
Long noncoding RNAs (lncRNAs) have been shown to play essential roles in almost every important biological process through multiple mechanisms. Although the repertoire of human lncRNAs has rapidly expanded, their biological function and regulation remain largely elusive, calling for a systematic and integrative annotation tool. Here we present AnnoLnc ( http://annolnc.cbi.pku.edu.cn ), a one-stop portal for systematically annotating novel human lncRNAs. Based on more than 700 data sources and various tool chains, AnnoLnc enables a systematic annotation covering genomic location, secondary structure, expression patterns, transcriptional regulation, miRNA interaction, protein interaction, genetic association and evolution. An intuitive web interface is available for interactive analysis through both desktops and mobile devices, and programmers can further integrate AnnoLnc into their pipeline through standard JSON-based Web Service APIs. To the best of our knowledge, AnnoLnc is the only web server to provide on-the-fly and systematic annotation for newly identified human lncRNAs. Compared with similar tools, the annotation generated by AnnoLnc covers a much wider spectrum with intuitive visualization. Case studies demonstrate the power of AnnoLnc in not only rediscovering known functions of human lncRNAs but also inspiring novel hypotheses.
[Transcriptome analysis of Dunaliella viridis].
Zhu, Shuai-qi; Gong, Yi-fu; Hang, Yu-qing; Liu, Hao; Wang, He-yu
2015-08-01
In order to understand the gene information, function, haloduric pathway (glycerolipid metabolism) and related key genes for Dunaliella viridis, we used Illumina HiSeqTM 2000 high-throughput sequencing technology to sequence its transcriptome. Trinity soft was used to assemble the data to form transcripts. Based on the Clusters of Orthologous Groups (COG), Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG ) databases, we carried out functional annotation and classification, pathway annotation, and the opening reading fragment (ORF) sequence prediction of transcripts. The key genes in the glycerolipid metabolism were analyzed. The results suggested that 81,593 transcripts were found, and 77,117 ORF sequences were predicted, accounting for 94.50% of all transcripts. COG classification results showed that 16,569 transcripts were assigned to 24 categories. GO classification annotated 76,436 transcripts. The number of transcripts for biologcial processes was 30,678, accounting for 40.14% of all transcripts. KEGG pathway analysis showed that 26,428 transcripts were annotated to 317 pathways, and 131 pathways were related to metabolism, accounting for 41.32% of all annotated pathways. Only one transcript was annotated as coding the key enzyme dihydroxyacetone kinase involved in the glycerolipid pathway. This enzyme could be related to glycerol biosynthesis under salt stress. This study further improved the gene information and laid the foundation of metabolic pathway research for Dunaliella viridis.
Using comparative genome analysis to identify problems in annotated microbial genomes.
Poptsova, Maria S; Gogarten, J Peter
2010-07-01
Genome annotation is a tedious task that is mostly done by automated methods; however, the accuracy of these approaches has been questioned since the beginning of the sequencing era. Genome annotation is a multilevel process, and errors can emerge at different stages: during sequencing, as a result of gene-calling procedures, and in the process of assigning gene functions. Missed or wrongly annotated genes differentially impact different types of analyses. Here we discuss and demonstrate how the methods of comparative genome analysis can refine annotations by locating missing orthologues. We also discuss possible reasons for errors and show that the second-generation annotation systems, which combine multiple gene-calling programs with similarity-based methods, perform much better than the first annotation tools. Since old errors may propagate to the newly sequenced genomes, we emphasize that the problem of continuously updating popular public databases is an urgent and unresolved one. Due to the progress in genome-sequencing technologies, automated annotation techniques will remain the main approach in the future. Researchers need to be aware of the existing errors in the annotation of even well-studied genomes, such as Escherichia coli, and consider additional quality control for their results.
Beijbom, Oscar; Edmunds, Peter J.; Roelfsema, Chris; Smith, Jennifer; Kline, David I.; Neal, Benjamin P.; Dunlap, Matthew J.; Moriarty, Vincent; Fan, Tung-Yung; Tan, Chih-Jui; Chan, Stephen; Treibitz, Tali; Gamst, Anthony; Mitchell, B. Greg; Kriegman, David
2015-01-01
Global climate change and other anthropogenic stressors have heightened the need to rapidly characterize ecological changes in marine benthic communities across large scales. Digital photography enables rapid collection of survey images to meet this need, but the subsequent image annotation is typically a time consuming, manual task. We investigated the feasibility of using automated point-annotation to expedite cover estimation of the 17 dominant benthic categories from survey-images captured at four Pacific coral reefs. Inter- and intra- annotator variability among six human experts was quantified and compared to semi- and fully- automated annotation methods, which are made available at coralnet.ucsd.edu. Our results indicate high expert agreement for identification of coral genera, but lower agreement for algal functional groups, in particular between turf algae and crustose coralline algae. This indicates the need for unequivocal definitions of algal groups, careful training of multiple annotators, and enhanced imaging technology. Semi-automated annotation, where 50% of the annotation decisions were performed automatically, yielded cover estimate errors comparable to those of the human experts. Furthermore, fully-automated annotation yielded rapid, unbiased cover estimates but with increased variance. These results show that automated annotation can increase spatial coverage and decrease time and financial outlay for image-based reef surveys. PMID:26154157
Zeron-Medina, Jorge; Wang, Xuting; Repapi, Emmanouela; Campbell, Michelle R.; Su, Dan; Castro-Giner, Francesc; Davies, Benjamin; Peterse, Elisabeth F.P.; Sacilotto, Natalia; Walker, Graeme J.; Terzian, Tamara; Tomlinson, Ian P.; Box, Neil F.; Meinshausen, Nicolai; De Val, Sarah; Bell, Douglas A.; Bond, Gareth L.
2014-01-01
SUMMARY The ability of p53 to regulate transcription is crucial for tumor suppression and implies that inherited polymorphisms in functional p53-binding sites could influence cancer. Here, we identify a polymorphic p53 responsive element and demonstrate its influence on cancer risk using genome-wide data sets of cancer susceptibility loci, genetic variation, p53 occupancy, and p53-binding sites. We uncover a single-nucleotide polymorphism (SNP) in a functional p53-binding site and establish its influence on the ability of p53 to bind to and regulate transcription of the KITLG gene. The SNP resides in KITLG and associates with one of the largest risks identified among cancer genome-wide association studies. We establish that the SNP has undergone positive selection throughout evolution, signifying a selective benefit, but go on to show that similar SNPs are rare in the genome due to negative selection, indicating that polymorphisms in p53-binding sites are primarily detrimental to humans. PMID:24120139
Castro-Martínez, Anna Gabriela; Sánchez-Corona, José; Vázquez-Vargas, Adriana Patricia; García-Zapién, Alejandra Guadalupe; López-Quintero, Andres; Villalpando-Velazco, Héctor Javier; Flores-Martínez, Silvia Esperanza
2018-02-28
Gestational diabetes mellitus (GDM) is a metabolically complex disease with major genetic determinants. GDM has been associated with insulin resistance and dysfunction of pancreatic beta cells, so the GDM candidate genes are those that encode proteins modulating the function and secretion of insulin, such as that for calpain 10 (CAPN10). This study aimed to assess whether single nucleotide polymorphism (SNP)-43, SNP-44, SNP-63, and the indel-19 variant, and specific haplotypes of the CAPN10 gene were associated with gestational diabetes mellitus. We studied 116 patients with gestational diabetes mellitus and 83 women with normal glucose tolerance. Measurements of anthropometric and biochemical parameters were performed. SNP-43, SNP-44, and SNP-63 were identified by polymerase chain reaction (PCR)-restriction fragment length polymorphisms, while the indel-19 variant was detected by TaqMan qPCR assays. The allele, genotype, and haplotype frequencies of the four variants did not differ significantly between women with gestational diabetes mellitus and controls. However, in women with gestational diabetes mellitus, glucose levels were significantly higher bearing the 3R/3R genotype than in carriers of the 3R/2R genotype of the indel-19 variant (p = 0.006). In conclusion, the 3R/3R genotype of the indel-19 variant of the CAPN-10 gene influenced increased glucose levels in these Mexican women with gestational diabetes mellitus.
A draft annotation and overview of the human genome
Wright, Fred A; Lemon, William J; Zhao, Wei D; Sears, Russell; Zhuo, Degen; Wang, Jian-Ping; Yang, Hee-Yung; Baer, Troy; Stredney, Don; Spitzner, Joe; Stutz, Al; Krahe, Ralf; Yuan, Bo
2001-01-01
Background The recent draft assembly of the human genome provides a unified basis for describing genomic structure and function. The draft is sufficiently accurate to provide useful annotation, enabling direct observations of previously inferred biological phenomena. Results We report here a functionally annotated human gene index placed directly on the genome. The index is based on the integration of public transcript, protein, and mapping information, supplemented with computational prediction. We describe numerous global features of the genome and examine the relationship of various genetic maps with the assembly. In addition, initial sequence analysis reveals highly ordered chromosomal landscapes associated with paralogous gene clusters and distinct functional compartments. Finally, these annotation data were synthesized to produce observations of gene density and number that accord well with historical estimates. Such a global approach had previously been described only for chromosomes 21 and 22, which together account for 2.2% of the genome. Conclusions We estimate that the genome contains 65,000-75,000 transcriptional units, with exon sequences comprising 4%. The creation of a comprehensive gene index requires the synthesis of all available computational and experimental evidence. PMID:11516338
Maldonado-Aguayo, Waleska; Chávez-Mardones, Jacqueline; Gonçalves, Ana Teresa; Gallardo-Escárate, Cristian
2015-01-01
Cathepsins are proteases involved in the ability of parasites to overcome and/or modulate host defenses so as to complete their own lifecycle. However, the mechanisms underlying this ability of cathepsins are still poorly understood. One excellent model for identifying and exploring the molecular functions of cathepsins is the marine ectoparasitic copepod Caligus rogercresseyi that currently affects the Chilean salmon industry. Using high-throughput transcriptome sequencing, 56 cathepsin-like sequences were found distributed in five cysteine protease groups (B, F, L, Z, and S) as well as in an aspartic protease group (D). Ontogenic transcriptome analysis evidenced that L cathepsins were the most abundant during the lifecycle, while cathepsins B and K were mostly expressed in the larval stages and adult females, thus suggesting participation in the molting processes and embryonic development, respectively. Interestingly, a variety of cathepsins from groups Z, L, D, B, K, and S were upregulated in the infective stage of copepodid, corroborating the complexity of the processes involved in the parasitic success of this copepod. Putative functional roles of cathepsins were conjectured based on the differential expressions found and on roles previously described in other phylogenetically related species. Moreover, 140 single nucleotide polymorphisms (SNP) were identified in transcripts annotated for cysteine and aspartic proteases located into untranslated regions, or the coding region. This study reports for the first time the presence of cathepsin-like genes and differential expressions throughout a copepod lifecycle. The identification of cathepsins together with functional validations represents a valuable strategy for pinpointing target molecules that could be used in the development of new delousing drugs or vaccines against C. rogercresseyi. PMID:25923525
Maldonado-Aguayo, Waleska; Chávez-Mardones, Jacqueline; Gonçalves, Ana Teresa; Gallardo-Escárate, Cristian
2015-01-01
Cathepsins are proteases involved in the ability of parasites to overcome and/or modulate host defenses so as to complete their own lifecycle. However, the mechanisms underlying this ability of cathepsins are still poorly understood. One excellent model for identifying and exploring the molecular functions of cathepsins is the marine ectoparasitic copepod Caligus rogercresseyi that currently affects the Chilean salmon industry. Using high-throughput transcriptome sequencing, 56 cathepsin-like sequences were found distributed in five cysteine protease groups (B, F, L, Z, and S) as well as in an aspartic protease group (D). Ontogenic transcriptome analysis evidenced that L cathepsins were the most abundant during the lifecycle, while cathepsins B and K were mostly expressed in the larval stages and adult females, thus suggesting participation in the molting processes and embryonic development, respectively. Interestingly, a variety of cathepsins from groups Z, L, D, B, K, and S were upregulated in the infective stage of copepodid, corroborating the complexity of the processes involved in the parasitic success of this copepod. Putative functional roles of cathepsins were conjectured based on the differential expressions found and on roles previously described in other phylogenetically related species. Moreover, 140 single nucleotide polymorphisms (SNP) were identified in transcripts annotated for cysteine and aspartic proteases located into untranslated regions, or the coding region. This study reports for the first time the presence of cathepsin-like genes and differential expressions throughout a copepod lifecycle. The identification of cathepsins together with functional validations represents a valuable strategy for pinpointing target molecules that could be used in the development of new delousing drugs or vaccines against C. rogercresseyi.
Fu, Yong-Bi; Yang, Mo-Hua; Zeng, Fangqin; Biligetu, Bill
2017-01-01
Molecular plant breeding with the aid of molecular markers has played an important role in modern plant breeding over the last two decades. Many marker-based predictions for quantitative traits have been made to enhance parental selection, but the trait prediction accuracy remains generally low, even with the aid of dense, genome-wide SNP markers. To search for more accurate trait-specific prediction with informative SNP markers, we conducted a literature review on the prediction issues in molecular plant breeding and on the applicability of an RNA-Seq technique for developing function-associated specific trait (FAST) SNP markers. To understand whether and how FAST SNP markers could enhance trait prediction, we also performed a theoretical reasoning on the effectiveness of these markers in a trait-specific prediction, and verified the reasoning through computer simulation. To the end, the search yielded an alternative to regular genomic selection with FAST SNP markers that could be explored to achieve more accurate trait-specific prediction. Continuous search for better alternatives is encouraged to enhance marker-based predictions for an individual quantitative trait in molecular plant breeding. PMID:28729875
Altermann, Eric; Lu, Jingli; McCulloch, Alan
2017-01-01
Expert curated annotation remains one of the critical steps in achieving a reliable biological relevant annotation. Here we announce the release of GAMOLA2, a user friendly and comprehensive software package to process, annotate and curate draft and complete bacterial, archaeal, and viral genomes. GAMOLA2 represents a wrapping tool to combine gene model determination, functional Blast, COG, Pfam, and TIGRfam analyses with structural predictions including detection of tRNAs, rRNA genes, non-coding RNAs, signal protein cleavage sites, transmembrane helices, CRISPR repeats and vector sequence contaminations. GAMOLA2 has already been validated in a wide range of bacterial and archaeal genomes, and its modular concept allows easy addition of further functionality in future releases. A modified and adapted version of the Artemis Genome Viewer (Sanger Institute) has been developed to leverage the additional features and underlying information provided by the GAMOLA2 analysis, and is part of the software distribution. In addition to genome annotations, GAMOLA2 features, among others, supplemental modules that assist in the creation of custom Blast databases, annotation transfers between genome versions, and the preparation of Genbank files for submission via the NCBI Sequin tool. GAMOLA2 is intended to be run under a Linux environment, whereas the subsequent visualization and manual curation in Artemis is mobile and platform independent. The development of GAMOLA2 is ongoing and community driven. New functionality can easily be added upon user requests, ensuring that GAMOLA2 provides information relevant to microbiologists. The software is available free of charge for academic use. PMID:28386247
Altermann, Eric; Lu, Jingli; McCulloch, Alan
2017-01-01
Expert curated annotation remains one of the critical steps in achieving a reliable biological relevant annotation. Here we announce the release of GAMOLA2, a user friendly and comprehensive software package to process, annotate and curate draft and complete bacterial, archaeal, and viral genomes. GAMOLA2 represents a wrapping tool to combine gene model determination, functional Blast, COG, Pfam, and TIGRfam analyses with structural predictions including detection of tRNAs, rRNA genes, non-coding RNAs, signal protein cleavage sites, transmembrane helices, CRISPR repeats and vector sequence contaminations. GAMOLA2 has already been validated in a wide range of bacterial and archaeal genomes, and its modular concept allows easy addition of further functionality in future releases. A modified and adapted version of the Artemis Genome Viewer (Sanger Institute) has been developed to leverage the additional features and underlying information provided by the GAMOLA2 analysis, and is part of the software distribution. In addition to genome annotations, GAMOLA2 features, among others, supplemental modules that assist in the creation of custom Blast databases, annotation transfers between genome versions, and the preparation of Genbank files for submission via the NCBI Sequin tool. GAMOLA2 is intended to be run under a Linux environment, whereas the subsequent visualization and manual curation in Artemis is mobile and platform independent. The development of GAMOLA2 is ongoing and community driven. New functionality can easily be added upon user requests, ensuring that GAMOLA2 provides information relevant to microbiologists. The software is available free of charge for academic use.
Vallenet, David; Calteau, Alexandra; Cruveiller, Stéphane; Gachet, Mathieu; Lajus, Aurélie; Josso, Adrien; Mercier, Jonathan; Renaux, Alexandre; Rollin, Johan; Rouy, Zoe; Roche, David; Scarpelli, Claude; Médigue, Claudine
2017-01-01
The annotation of genomes from NGS platforms needs to be automated and fully integrated. However, maintaining consistency and accuracy in genome annotation is a challenging problem because millions of protein database entries are not assigned reliable functions. This shortcoming limits the knowledge that can be extracted from genomes and metabolic models. Launched in 2005, the MicroScope platform (http://www.genoscope.cns.fr/agc/microscope) is an integrative resource that supports systematic and efficient revision of microbial genome annotation, data management and comparative analysis. Effective comparative analysis requires a consistent and complete view of biological data, and therefore, support for reviewing the quality of functional annotation is critical. MicroScope allows users to analyze microbial (meta)genomes together with post-genomic experiment results if any (i.e. transcriptomics, re-sequencing of evolved strains, mutant collections, phenotype data). It combines tools and graphical interfaces to analyze genomes and to perform the expert curation of gene functions in a comparative context. Starting with a short overview of the MicroScope system, this paper focuses on some major improvements of the Web interface, mainly for the submission of genomic data and on original tools and pipelines that have been developed and integrated in the platform: computation of pan-genomes and prediction of biosynthetic gene clusters. Today the resource contains data for more than 6000 microbial genomes, and among the 2700 personal accounts (65% of which are now from foreign countries), 14% of the users are performing expert annotations, on at least a weekly basis, contributing to improve the quality of microbial genome annotations. PMID:27899624
Chiu, Shih-Hau; Chen, Chien-Chi; Yuan, Gwo-Fang; Lin, Thy-Hou
2006-06-15
The number of sequences compiled in many genome projects is growing exponentially, but most of them have not been characterized experimentally. An automatic annotation scheme must be in an urgent need to reduce the gap between the amount of new sequences produced and reliable functional annotation. This work proposes rules for automatically classifying the fungus genes. The approach involves elucidating the enzyme classifying rule that is hidden in UniProt protein knowledgebase and then applying it for classification. The association algorithm, Apriori, is utilized to mine the relationship between the enzyme class and significant InterPro entries. The candidate rules are evaluated for their classificatory capacity. There were five datasets collected from the Swiss-Prot for establishing the annotation rules. These were treated as the training sets. The TrEMBL entries were treated as the testing set. A correct enzyme classification rate of 70% was obtained for the prokaryote datasets and a similar rate of about 80% was obtained for the eukaryote datasets. The fungus training dataset which lacks an enzyme class description was also used to evaluate the fungus candidate rules. A total of 88 out of 5085 test entries were matched with the fungus rule set. These were otherwise poorly annotated using their functional descriptions. The feasibility of using the method presented here to classify enzyme classes based on the enzyme domain rules is evident. The rules may be also employed by the protein annotators in manual annotation or implemented in an automatic annotation flowchart.
NCBI prokaryotic genome annotation pipeline.
Tatusova, Tatiana; DiCuccio, Michael; Badretdin, Azat; Chetvernin, Vyacheslav; Nawrocki, Eric P; Zaslavsky, Leonid; Lomsadze, Alexandre; Pruitt, Kim D; Borodovsky, Mark; Ostell, James
2016-08-19
Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, the new NCBI's Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/. Published by Oxford University Press on behalf of Nucleic Acids Research 2016. This work is written by (a) US Government employee(s) and is in the public domain in the US.
GeneTools--application for functional annotation and statistical hypothesis testing.
Beisvag, Vidar; Jünge, Frode K R; Bergum, Hallgeir; Jølsum, Lars; Lydersen, Stian; Günther, Clara-Cecilie; Ramampiaro, Heri; Langaas, Mette; Sandvik, Arne K; Laegreid, Astrid
2006-10-24
Modern biology has shifted from "one gene" approaches to methods for genomic-scale analysis like microarray technology, which allow simultaneous measurement of thousands of genes. This has created a need for tools facilitating interpretation of biological data in "batch" mode. However, such tools often leave the investigator with large volumes of apparently unorganized information. To meet this interpretation challenge, gene-set, or cluster testing has become a popular analytical tool. Many gene-set testing methods and software packages are now available, most of which use a variety of statistical tests to assess the genes in a set for biological information. However, the field is still evolving, and there is a great need for "integrated" solutions. GeneTools is a web-service providing access to a database that brings together information from a broad range of resources. The annotation data are updated weekly, guaranteeing that users get data most recently available. Data submitted by the user are stored in the database, where it can easily be updated, shared between users and exported in various formats. GeneTools provides three different tools: i) NMC Annotation Tool, which offers annotations from several databases like UniGene, Entrez Gene, SwissProt and GeneOntology, in both single- and batch search mode. ii) GO Annotator Tool, where users can add new gene ontology (GO) annotations to genes of interest. These user defined GO annotations can be used in further analysis or exported for public distribution. iii) eGOn, a tool for visualization and statistical hypothesis testing of GO category representation. As the first GO tool, eGOn supports hypothesis testing for three different situations (master-target situation, mutually exclusive target-target situation and intersecting target-target situation). An important additional function is an evidence-code filter that allows users, to select the GO annotations for the analysis. GeneTools is the first "all in one" annotation tool, providing users with a rapid extraction of highly relevant gene annotation data for e.g. thousands of genes or clones at once. It allows a user to define and archive new GO annotations and it supports hypothesis testing related to GO category representations. GeneTools is freely available through www.genetools.no
The Community Junior College: An Annotated Bibliography.
ERIC Educational Resources Information Center
Rarig, Emory W., Jr., Ed.
This annotated bibliography on the junior college is arranged by topic: research tools, history, functions and purposes, organization and administration, students, programs, personnel, facilities, and research. It covers publications through the fall of 1965 and has an author index. (HH)
PipeOnline 2.0: automated EST processing and functional data sorting.
Ayoubi, Patricia; Jin, Xiaojing; Leite, Saul; Liu, Xianghui; Martajaja, Jeson; Abduraham, Abdurashid; Wan, Qiaolan; Yan, Wei; Misawa, Eduardo; Prade, Rolf A
2002-11-01
Expressed sequence tags (ESTs) are generated and deposited in the public domain, as redundant, unannotated, single-pass reactions, with virtually no biological content. PipeOnline automatically analyses and transforms large collections of raw DNA-sequence data from chromatograms or FASTA files by calling the quality of bases, screening and removing vector sequences, assembling and rewriting consensus sequences of redundant input files into a unigene EST data set and finally through translation, amino acid sequence similarity searches, annotation of public databases and functional data. PipeOnline generates an annotated database, retaining the processed unigene sequence, clone/file history, alignments with similar sequences, and proposed functional classification, if available. Functional annotation is automatic and based on a novel method that relies on homology of amino acid sequence multiplicity within GenBank records. Records are examined through a function ordered browser or keyword queries with automated export of results. PipeOnline offers customization for individual projects (MyPipeOnline), automated updating and alert service. PipeOnline is available at http://stress-genomics.org.
Leuthaeuser, Janelle B; Knutson, Stacy T; Kumar, Kiran; Babbitt, Patricia C; Fetrow, Jacquelyn S
2015-09-01
The development of accurate protein function annotation methods has emerged as a major unsolved biological problem. Protein similarity networks, one approach to function annotation via annotation transfer, group proteins into similarity-based clusters. An underlying assumption is that the edge metric used to identify such clusters correlates with functional information. In this contribution, this assumption is evaluated by observing topologies in similarity networks using three different edge metrics: sequence (BLAST), structure (TM-Align), and active site similarity (active site profiling, implemented in DASP). Network topologies for four well-studied protein superfamilies (enolase, peroxiredoxin (Prx), glutathione transferase (GST), and crotonase) were compared with curated functional hierarchies and structure. As expected, network topology differs, depending on edge metric; comparison of topologies provides valuable information on structure/function relationships. Subnetworks based on active site similarity correlate with known functional hierarchies at a single edge threshold more often than sequence- or structure-based networks. Sequence- and structure-based networks are useful for identifying sequence and domain similarities and differences; therefore, it is important to consider the clustering goal before deciding appropriate edge metric. Further, conserved active site residues identified in enolase and GST active site subnetworks correspond with published functionally important residues. Extension of this analysis yields predictions of functionally determinant residues for GST subgroups. These results support the hypothesis that active site similarity-based networks reveal clusters that share functional details and lay the foundation for capturing functionally relevant hierarchies using an approach that is both automatable and can deliver greater precision in function annotation than current similarity-based methods. © 2015 The Authors Protein Science published by Wiley Periodicals, Inc. on behalf of The Protein Society.
Leuthaeuser, Janelle B; Knutson, Stacy T; Kumar, Kiran; Babbitt, Patricia C; Fetrow, Jacquelyn S
2015-01-01
The development of accurate protein function annotation methods has emerged as a major unsolved biological problem. Protein similarity networks, one approach to function annotation via annotation transfer, group proteins into similarity-based clusters. An underlying assumption is that the edge metric used to identify such clusters correlates with functional information. In this contribution, this assumption is evaluated by observing topologies in similarity networks using three different edge metrics: sequence (BLAST), structure (TM-Align), and active site similarity (active site profiling, implemented in DASP). Network topologies for four well-studied protein superfamilies (enolase, peroxiredoxin (Prx), glutathione transferase (GST), and crotonase) were compared with curated functional hierarchies and structure. As expected, network topology differs, depending on edge metric; comparison of topologies provides valuable information on structure/function relationships. Subnetworks based on active site similarity correlate with known functional hierarchies at a single edge threshold more often than sequence- or structure-based networks. Sequence- and structure-based networks are useful for identifying sequence and domain similarities and differences; therefore, it is important to consider the clustering goal before deciding appropriate edge metric. Further, conserved active site residues identified in enolase and GST active site subnetworks correspond with published functionally important residues. Extension of this analysis yields predictions of functionally determinant residues for GST subgroups. These results support the hypothesis that active site similarity-based networks reveal clusters that share functional details and lay the foundation for capturing functionally relevant hierarchies using an approach that is both automatable and can deliver greater precision in function annotation than current similarity-based methods. PMID:26073648
SNP_tools: A compact tool package for analysis and conversion of genotype data for MS-Excel
Chen, Bowang; Wilkening, Stefan; Drechsel, Marion; Hemminki, Kari
2009-01-01
Background Single nucleotide polymorphism (SNP) genotyping is a major activity in biomedical research. Scientists prefer to have a facile access to the results which may require conversions between data formats. First hand SNP data is often entered in or saved in the MS-Excel format, but this software lacks genetic and epidemiological related functions. A general tool to do basic genetic and epidemiological analysis and data conversion for MS-Excel is needed. Findings The SNP_tools package is prepared as an add-in for MS-Excel. The code is written in Visual Basic for Application, embedded in the Microsoft Office package. This add-in is an easy to use tool for users with basic computer knowledge (and requirements for basic statistical analysis). Conclusion Our implementation for Microsoft Excel 2000-2007 in Microsoft Windows 2000, XP, Vista and Windows 7 beta can handle files in different formats and converts them into other formats. It is a free software. PMID:19852806
SNP_tools: A compact tool package for analysis and conversion of genotype data for MS-Excel.
Chen, Bowang; Wilkening, Stefan; Drechsel, Marion; Hemminki, Kari
2009-10-23
Single nucleotide polymorphism (SNP) genotyping is a major activity in biomedical research. Scientists prefer to have a facile access to the results which may require conversions between data formats. First hand SNP data is often entered in or saved in the MS-Excel format, but this software lacks genetic and epidemiological related functions. A general tool to do basic genetic and epidemiological analysis and data conversion for MS-Excel is needed. The SNP_tools package is prepared as an add-in for MS-Excel. The code is written in Visual Basic for Application, embedded in the Microsoft Office package. This add-in is an easy to use tool for users with basic computer knowledge (and requirements for basic statistical analysis). Our implementation for Microsoft Excel 2000-2007 in Microsoft Windows 2000, XP, Vista and Windows 7 beta can handle files in different formats and converts them into other formats. It is a free software.
Variation in the X-Linked EFHC2 Gene Is Associated with Social Cognitive Abilities in Males
Startin, Carla M.; Fiorentini, Chiara; de Haan, Michelle; Skuse, David H.
2015-01-01
Females outperform males on many social cognitive tasks. X-linked genes may contribute to this sex difference. Males possess one X chromosome, while females possess two X chromosomes. Functional variations in X-linked genes are therefore likely to impact more on males than females. Previous studies of X-monosomic women with Turner syndrome suggest a genetic association with facial fear recognition abilities at Xp11.3, specifically at a single nucleotide polymorphism (SNP rs7055196) within the EFHC2 gene. Based on a strong hypothesis, we investigated an association between variation at SNP rs7055196 and facial fear recognition and theory of mind abilities in males. As predicted, males possessing the G allele had significantly poorer facial fear detection accuracy and theory of mind abilities than males possessing the A allele (with SNP variant accounting for up to 4.6% of variance). Variation in the X-linked EFHC2 gene at SNP rs7055196 is therefore associated with social cognitive abilities in males. PMID:26107779
Takasaki, Shigeru
2012-01-01
This paper first explains how the relations between Japanese Alzheimer's disease (AD) patients and their mitochondrial SNP frequencies at individual mtDNA positions examined using the radial basis function (RBF) network and a method based on RBF network predictions and that Japanese AD patients are associated with the haplogroups G2a and N9b1. It then describes a method for the initial diagnosis of Alzheimer's disease that is based on the mtSNP haplogroups of the AD patients. The method examines the relations between someone's mtDNA mutations and the mtSNPs of AD patients. As the mtSNP haplogroups thus obtained indicate which nucleotides of mtDNA loci are changed in the Alzheimer's patients, a person's probability of becoming an AD patient can be predicted by comparing those mtDNA mutations with that person's mtDNA mutations. The proposed method can also be used to diagnose diseases such as Parkinson's disease and type 2 diabetes and to identify people likely to become centenarians. PMID:22848858
Quantitative phenotyping via deep barcode sequencing.
Smith, Andrew M; Heisler, Lawrence E; Mellor, Joseph; Kaper, Fiona; Thompson, Michael J; Chee, Mark; Roth, Frederick P; Giaever, Guri; Nislow, Corey
2009-10-01
Next-generation DNA sequencing technologies have revolutionized diverse genomics applications, including de novo genome sequencing, SNP detection, chromatin immunoprecipitation, and transcriptome analysis. Here we apply deep sequencing to genome-scale fitness profiling to evaluate yeast strain collections in parallel. This method, Barcode analysis by Sequencing, or "Bar-seq," outperforms the current benchmark barcode microarray assay in terms of both dynamic range and throughput. When applied to a complex chemogenomic assay, Bar-seq quantitatively identifies drug targets, with performance superior to the benchmark microarray assay. We also show that Bar-seq is well-suited for a multiplex format. We completely re-sequenced and re-annotated the yeast deletion collection using deep sequencing, found that approximately 20% of the barcodes and common priming sequences varied from expectation, and used this revised list of barcode sequences to improve data quality. Together, this new assay and analysis routine provide a deep-sequencing-based toolkit for identifying gene-environment interactions on a genome-wide scale.
Livingstone, Donald; Stack, Conrad; Mustiga, Guiliana M.; Rodezno, Dayana C.; Suarez, Carmen; Amores, Freddy; Feltus, Frank A.; Mockaitis, Keithanne; Cornejo, Omar E.; Motamayor, Juan C.
2017-01-01
Cacao (Theobroma cacao L.) is an important cash crop in tropical regions around the world and has a rich agronomic history in South America. As a key component in the cosmetic and confectionary industries, millions of people worldwide use products made from cacao, ranging from shampoo to chocolate. An Illumina Infinity II array was created using 13,530 SNPs identified within a small diversity panel of cacao. Of these SNPs, 12,643 derive from variation within annotated cacao genes. The genotypes of 3,072 trees were obtained, including two mapping populations from Ecuador. High-density linkage maps for these two populations were generated and compared to the cacao genome assembly. Phenotypic data from these populations were combined with the linkage maps to identify the QTLs for yield and disease resistance. PMID:29259608
Dictionary-driven protein annotation
Rigoutsos, Isidore; Huynh, Tien; Floratos, Aris; Parida, Laxmi; Platt, Daniel
2002-01-01
Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were released publicly after we built the Bio-Dictionary that is used in our experiments. Finally, we have computed the annotations of more than 70 complete genomes and made them available on the World Wide Web at http://cbcsrv.watson.ibm.com/Annotations/. PMID:12202776
Alg, Varinder S; Ke, Xiayi; Grieve, Joan; Bonner, Stephen; Walsh, Daniel C; Bulters, Diederik; Kitchen, Neil; Houlden, Henry; Werring, David J
2018-01-15
Abnormalities in Matrix Metalloproteinase (MMP) genes, which are important in extracellular matrix (ECM) maintenance and therefore arterial wall integrity are a plausible underlying mechanism of intracranial aneurysm (IA) formation, growth and subsequent rupture. We investigated whether the rs243865 C > T SNP (single nucleotide polymorphism) within the MMP-2 gene (which influences gene transcription) is associated with IA compared to matched controls. We conducted a case-control genetic association study, adjusted for known IA risk factors (smoking and hypertension), in a UK Caucasian population of 1409 patients with intracranial aneurysms (IA), and 1290 matched controls, to determine the association of the rs243865 C > T functional MMP-2 gene SNP with IA (overall, and classified as ruptured and unruptured). We also undertook a meta-analysis of two previous studies examining this SNP. The rs243865 T allele was associated with IA presence in univariate (OR 1.18 [95% CI 1.04-1.33], p = .01) and in multi-variable analyses adjusted for smoking and hypertension status (OR 1.16 [95% CI 1.01-1.35], p = .042). Subgroup analysis demonstrated an association of the rs243865 SNP with ruptured IA (OR 1.18 [95% CI 1.03-1.34] p = .017), but, not unruptured IA (OR 1.17 [95% CI 0.97-1.42], p = .11). Our study demonstrated an association between the functional MMP-2 rs243865 variant and IAs. Our findings suggest a genetic role for altered extracellular matrix integrity in the pathogenesis of IA development and rupture.
Genome-wide association and network analysis of lung function in the Framingham Heart Study.
Liao, Shu-Yi; Lin, Xihong; Christiani, David C
2014-09-01
Single nucleotide polymorphisms have been found to be associated with pulmonary function using genome-wide association studies. However, lung function is a complex trait that is likely to be influenced by multiple gene-gene interactions besides individual genes. Our goal is to build a cellular network to explore the relationship between pulmonary function and genotypes by combining SNP level and network analyses using longitudinal lung function data from the Framingham Heart Study. We analyzed 2,698 genotyped participants from the Offspring cohort that had an average of 3.35 spirometry measurements per person for a mean length of 13 years. Repeated forced expiratory volume in one second (FEV1 ) and the ratio of FEV1 to forced vital capacity (FVC) were used as outcomes. Data were analyzed using linear-mixed models for the association between lung function and alleles by accounting for the correlation among repeated measures over time within the same subject and within-family correlation. Network analyses were performed using dmGWAS and validated with data from the Third Generation cohort. Analyses identified SMAD3, TGFBR2, CD44, CTGF, VCAN, CTNNB1, SCGB1A1, PDE4D, NRG1, EPHB1, and LYN as contributors to pulmonary function. Most of these genes were novel that were not found previously using solely SNP-level analysis. These novel genes are involving the transforming growth factor beta (TGFB)-SMAD pathway, Wnt/beta-catenin pathway, etc. Therefore, combining SNP-level and network analyses using longitudinal lung function data is a useful alternative strategy to identify risk genes. © 2014 WILEY PERIODICALS, INC.
Schweizer, Rena M; Robinson, Jacqueline; Harrigan, Ryan; Silva, Pedro; Galverni, Marco; Musiani, Marco; Green, Richard E; Novembre, John; Wayne, Robert K
2016-01-01
In an era of ever-increasing amounts of whole-genome sequence data for individuals and populations, the utility of traditional single nucleotide polymorphisms (SNPs) array-based genome scans is uncertain. We previously performed a SNP array-based genome scan to identify candidate genes under selection in six distinct grey wolf (Canis lupus) ecotypes. Using this information, we designed a targeted capture array for 1040 genes, including all exons and flanking regions, as well as 5000 1-kb nongenic neutral regions, and resequenced these regions in 107 wolves. Selection tests revealed striking patterns of variation within candidate genes relative to noncandidate regions and identified potentially functional variants related to local adaptation. We found 27% and 47% of candidate genes from the previous SNP array study had functional changes that were outliers in sweed and bayenv analyses, respectively. This result verifies the use of genomewide SNP surveys to tag genes that contain functional variants between populations. We highlight nonsynonymous variants in APOB, LIPG and USH2A that occur in functional domains of these proteins, and that demonstrate high correlation with precipitation seasonality and vegetation. We find Arctic and High Arctic wolf ecotypes have higher numbers of genes under selection, which highlight their conservation value and heightened threat due to climate change. This study demonstrates that combining genomewide genotyping arrays with large-scale resequencing and environmental data provides a powerful approach to discern candidate functional variants in natural populations. © 2015 John Wiley & Sons Ltd.
Mendoza Lopez, Pablo; Golby, Paul; Wooff, Esen; Garcia, Javier Nunez; Garcia Pelayo, M. Carmen; Conlon, Kevin; Gema Camacho, Ana; Hewinson, R. Glyn; Polaina, Julio; Suárez García, Antonio; Gordon, Stephen V.
2010-01-01
A number of single-nucleotide polymorphisms (SNPs) have been identified in the genome of Mycobacterium bovis BCG Pasteur compared with the sequenced strain M. bovis 2122/97. The functional consequences of many of these mutations remain to be described; however, mutations in genes encoding regulators may be particularly relevant to global phenotypic changes such as loss of virulence, since alteration of a regulator's function will affect the expression of a wide range of genes. One such SNP falls in bcg3145, encoding a member of the AfsR/DnrI/SARP class of global transcriptional regulators, that replaces a highly conserved glutamic acid residue at position 159 (E159G) with glycine in a tetratricopeptide repeat (TPR) located in the bacterial transcriptional activation (BTA) domain of BCG3145. TPR domains are associated with protein–protein interactions, and a conserved core (helices T1–T7) of the BTA domain seems to be required for proper function of SARP-family proteins. Structural modelling predicted that the E159G mutation perturbs the third α-helix of the BTA domain and could therefore have functional consequences. The E159G SNP was found to be present in all BCG strains, but absent from virulent M. bovis and Mycobacterium tuberculosis strains. By overexpressing BCG3145 and Rv3124 in BCG and H37Rv and monitoring transcriptome changes using microarrays, we determined that BCG3145/Rv3124 acts as a positive transcriptional regulator of the molybdopterin biosynthesis moa1 locus, and we suggest that rv3124 be renamed moaR1. The SNP in bcg3145 was found to have a subtle effect on the activity of MoaR1, suggesting that this mutation is not a key event in the attenuation of BCG. PMID:20378651
Wood, David L. A.; Nones, Katia; Steptoe, Anita; Christ, Angelika; Harliwong, Ivon; Newell, Felicity; Bruxner, Timothy J. C.; Miller, David; Cloonan, Nicole; Grimmond, Sean M.
2015-01-01
Genetic variation modulates gene expression transcriptionally or post-transcriptionally, and can profoundly alter an individual’s phenotype. Measuring allelic differential expression at heterozygous loci within an individual, a phenomenon called allele-specific expression (ASE), can assist in identifying such factors. Massively parallel DNA and RNA sequencing and advances in bioinformatic methodologies provide an outstanding opportunity to measure ASE genome-wide. In this study, matched DNA and RNA sequencing, genotyping arrays and computationally phased haplotypes were integrated to comprehensively and conservatively quantify ASE in a single human brain and liver tissue sample. We describe a methodological evaluation and assessment of common bioinformatic steps for ASE quantification, and recommend a robust approach to accurately measure SNP, gene and isoform ASE through the use of personalized haplotype genome alignment, strict alignment quality control and intragenic SNP aggregation. Our results indicate that accurate ASE quantification requires careful bioinformatic analyses and is adversely affected by sample specific alignment confounders and random sampling even at moderate sequence depths. We identified multiple known and several novel ASE genes in liver, including WDR72, DSP and UBD, as well as genes that contained ASE SNPs with imbalance direction discordant with haplotype phase, explainable by annotated transcript structure, suggesting isoform derived ASE. The methods evaluated in this study will be of use to researchers performing highly conservative quantification of ASE, and the genes and isoforms identified as ASE of interest to researchers studying those loci. PMID:25965996
Gilling, Mette; Rasmussen, Hanne B; Calloe, Kirstine; Sequeira, Ana F; Baretto, Marta; Oliveira, Guiomar; Almeida, Joana; Lauritsen, Marlene B; Ullmann, Reinhard; Boonen, Susanne E; Brondum-Nielsen, Karen; Kalscheuer, Vera M; Tümer, Zeynep; Vicente, Astrid M; Schmitt, Nicole; Tommerup, Niels
2013-01-01
Heterozygous mutations in the KCNQ3 gene on chromosome 8q24 encoding the voltage-gated potassium channel KV7.3 subunit have previously been associated with rolandic epilepsy and idiopathic generalized epilepsy (IGE) including benign neonatal convulsions. We identified a de novo t(3;8) (q21;q24) translocation truncating KCNQ3 in a boy with childhood autism. In addition, we identified a c.1720C > T [p.P574S] nucleotide change in three unrelated individuals with childhood autism and no history of convulsions. This nucleotide change was previously reported in patients with rolandic epilepsy or IGE and has now been annotated as a very rare SNP (rs74582884) in dbSNP. The p.P574S KV7.3 variant significantly reduced potassium current amplitude in Xenopus laevis oocytes when co-expressed with KV7.5 but not with KV7.2 or KV7.4. The nucleotide change did not affect trafficking of heteromeric mutant KV7.3/2, KV7.3/4, or KV7.3/5 channels in HEK 293 cells or primary rat hippocampal neurons. Our results suggest that dysfunction of the heteromeric KV7.3/5 channel is implicated in the pathogenesis of some forms of autism spectrum disorders, epilepsy, and possibly other psychiatric disorders and therefore, KCNQ3 and KCNQ5 are suggested as candidate genes for these disorders.
Gilling, Mette; Rasmussen, Hanne B.; Calloe, Kirstine; Sequeira, Ana F.; Baretto, Marta; Oliveira, Guiomar; Almeida, Joana; Lauritsen, Marlene B.; Ullmann, Reinhard; Boonen, Susanne E.; Brondum-Nielsen, Karen; Kalscheuer, Vera M.; Tümer, Zeynep; Vicente, Astrid M.; Schmitt, Nicole; Tommerup, Niels
2012-01-01
Heterozygous mutations in the KCNQ3 gene on chromosome 8q24 encoding the voltage-gated potassium channel KV7.3 subunit have previously been associated with rolandic epilepsy and idiopathic generalized epilepsy (IGE) including benign neonatal convulsions. We identified a de novo t(3;8) (q21;q24) translocation truncating KCNQ3 in a boy with childhood autism. In addition, we identified a c.1720C > T [p.P574S] nucleotide change in three unrelated individuals with childhood autism and no history of convulsions. This nucleotide change was previously reported in patients with rolandic epilepsy or IGE and has now been annotated as a very rare SNP (rs74582884) in dbSNP. The p.P574S KV7.3 variant significantly reduced potassium current amplitude in Xenopus laevis oocytes when co-expressed with KV7.5 but not with KV7.2 or KV7.4. The nucleotide change did not affect trafficking of heteromeric mutant KV7.3/2, KV7.3/4, or KV7.3/5 channels in HEK 293 cells or primary rat hippocampal neurons. Our results suggest that dysfunction of the heteromeric KV7.3/5 channel is implicated in the pathogenesis of some forms of autism spectrum disorders, epilepsy, and possibly other psychiatric disorders and therefore, KCNQ3 and KCNQ5 are suggested as candidate genes for these disorders. PMID:23596459
Wang, Juan; Xue, Dong-Xiu; Zhang, Bai-Dong; Li, Yu-Long; Liu, Bing-Jian; Liu, Jin-Xian
2016-01-01
Next-generation sequencing and the collection of genome-wide single-nucleotide polymorphisms (SNPs) allow identifying fine-scale population genetic structure and genomic regions under selection. The spotted sea bass (Lateolabrax maculatus) is a non-model species of ecological and commercial importance and widely distributed in northwestern Pacific. A total of 22 648 SNPs was discovered across the genome of L. maculatus by paired-end sequencing of restriction-site associated DNA (RAD-PE) for 30 individuals from two populations. The nucleotide diversity (π) for each population was 0.0028±0.0001 in Dandong and 0.0018±0.0001 in Beihai, respectively. Shallow but significant genetic differentiation was detected between the two populations analyzed by using both the whole data set (FST = 0.0550, P < 0.001) and the putatively neutral SNPs (FST = 0.0347, P < 0.001). However, the two populations were highly differentiated based on the putatively adaptive SNPs (FST = 0.6929, P < 0.001). Moreover, a total of 356 SNPs representing 298 unique loci were detected as outliers putatively under divergent selection by FST-based outlier tests as implemented in BAYESCAN and LOSITAN. Functional annotation of the contigs containing putatively adaptive SNPs yielded hits for 22 of 55 (40%) significant BLASTX matches. Candidate genes for local selection constituted a wide array of functions, including binding, catalytic and metabolic activities, etc. The analyses with the SNPs developed in the present study highlighted the importance of genome-wide genetic variation for inference of population structure and local adaptation in L. maculatus. PMID:27336696
Schreiber, Lena; Nader-Nieto, Anna Camila; Schönhals, Elske Maria; Walkemeier, Birgit; Gebhardt, Christiane
2014-07-31
Starch accumulation and breakdown are vital processes in plant storage organs such as seeds, roots, and tubers. In tubers of potato (Solanum tuberosum L.) a small fraction of starch is converted into the reducing sugars glucose and fructose. Reducing sugars accumulate in response to cold temperatures. Even small quantities of reducing sugars affect negatively the quality of processed products such as chips and French fries. Tuber starch and sugar content are inversely correlated complex traits that are controlled by multiple genetic and environmental factors. Based on in silico annotation of the potato genome sequence, 123 loci are involved in starch-sugar interconversion, approximately half of which have been previously cloned and characterized. By means of candidate gene association mapping, we identified single-nucleotide polymorphisms (SNPs) in eight genes known to have key functions in starch-sugar interconversion, which were diagnostic for increased tuber starch and/or decreased sugar content and vice versa. Most positive or negative effects of SNPs on tuber-reducing sugar content were reproducible in two different collections of potato cultivars. The diagnostic SNP markers are useful for breeding applications. An allele of the plastidic starch phosphorylase PHO1a associated with increased tuber starch content was cloned as full-length cDNA and characterized. The PHO1a-HA allele has several amino acid changes, one of which is unique among all known starch/glycogen phosphorylases. This mutation might cause reduced enzyme activity due to impaired formation of the active dimers, thereby limiting starch breakdown. Copyright © 2014 Schreiber et al.
Schreiber, Lena; Nader-Nieto, Anna Camila; Schönhals, Elske Maria; Walkemeier, Birgit; Gebhardt, Christiane
2014-01-01
Starch accumulation and breakdown are vital processes in plant storage organs such as seeds, roots, and tubers. In tubers of potato (Solanum tuberosum L.) a small fraction of starch is converted into the reducing sugars glucose and fructose. Reducing sugars accumulate in response to cold temperatures. Even small quantities of reducing sugars affect negatively the quality of processed products such as chips and French fries. Tuber starch and sugar content are inversely correlated complex traits that are controlled by multiple genetic and environmental factors. Based on in silico annotation of the potato genome sequence, 123 loci are involved in starch-sugar interconversion, approximately half of which have been previously cloned and characterized. By means of candidate gene association mapping, we identified single-nucleotide polymorphisms (SNPs) in eight genes known to have key functions in starch-sugar interconversion, which were diagnostic for increased tuber starch and/or decreased sugar content and vice versa. Most positive or negative effects of SNPs on tuber-reducing sugar content were reproducible in two different collections of potato cultivars. The diagnostic SNP markers are useful for breeding applications. An allele of the plastidic starch phosphorylase PHO1a associated with increased tuber starch content was cloned as full-length cDNA and characterized. The PHO1a-HA allele has several amino acid changes, one of which is unique among all known starch/glycogen phosphorylases. This mutation might cause reduced enzyme activity due to impaired formation of the active dimers, thereby limiting starch breakdown. PMID:25081979
SZDB: A Database for Schizophrenia Genetic Research
Wu, Yong; Yao, Yong-Gang
2017-01-01
Abstract Schizophrenia (SZ) is a debilitating brain disorder with a complex genetic architecture. Genetic studies, especially recent genome-wide association studies (GWAS), have identified multiple variants (loci) conferring risk to SZ. However, how to efficiently extract meaningful biological information from bulk genetic findings of SZ remains a major challenge. There is a pressing need to integrate multiple layers of data from various sources, eg, genetic findings from GWAS, copy number variations (CNVs), association and linkage studies, gene expression, protein–protein interaction (PPI), co-expression, expression quantitative trait loci (eQTL), and Encyclopedia of DNA Elements (ENCODE) data, to provide a comprehensive resource to facilitate the translation of genetic findings into SZ molecular diagnosis and mechanism study. Here we developed the SZDB database (http://www.szdb.org/), a comprehensive resource for SZ research. SZ genetic data, gene expression data, network-based data, brain eQTL data, and SNP function annotation information were systematically extracted, curated and deposited in SZDB. In-depth analyses and systematic integration were performed to identify top prioritized SZ genes and enriched pathways. Multiple types of data from various layers of SZ research were systematically integrated and deposited in SZDB. In-depth data analyses and integration identified top prioritized SZ genes and enriched pathways. We further showed that genes implicated in SZ are highly co-expressed in human brain and proteins encoded by the prioritized SZ risk genes are significantly interacted. The user-friendly SZDB provides high-confidence candidate variants and genes for further functional characterization. More important, SZDB provides convenient online tools for data search and browse, data integration, and customized data analyses. PMID:27451428
Wang, Juan; Xue, Dong-Xiu; Zhang, Bai-Dong; Li, Yu-Long; Liu, Bing-Jian; Liu, Jin-Xian
2016-01-01
Next-generation sequencing and the collection of genome-wide single-nucleotide polymorphisms (SNPs) allow identifying fine-scale population genetic structure and genomic regions under selection. The spotted sea bass (Lateolabrax maculatus) is a non-model species of ecological and commercial importance and widely distributed in northwestern Pacific. A total of 22 648 SNPs was discovered across the genome of L. maculatus by paired-end sequencing of restriction-site associated DNA (RAD-PE) for 30 individuals from two populations. The nucleotide diversity (π) for each population was 0.0028±0.0001 in Dandong and 0.0018±0.0001 in Beihai, respectively. Shallow but significant genetic differentiation was detected between the two populations analyzed by using both the whole data set (FST = 0.0550, P < 0.001) and the putatively neutral SNPs (FST = 0.0347, P < 0.001). However, the two populations were highly differentiated based on the putatively adaptive SNPs (FST = 0.6929, P < 0.001). Moreover, a total of 356 SNPs representing 298 unique loci were detected as outliers putatively under divergent selection by FST-based outlier tests as implemented in BAYESCAN and LOSITAN. Functional annotation of the contigs containing putatively adaptive SNPs yielded hits for 22 of 55 (40%) significant BLASTX matches. Candidate genes for local selection constituted a wide array of functions, including binding, catalytic and metabolic activities, etc. The analyses with the SNPs developed in the present study highlighted the importance of genome-wide genetic variation for inference of population structure and local adaptation in L. maculatus.
Dwivedi, Upendra N; Tiwari, Sameeksha; Prasanna, Pragya; Awasthi, Manika; Singh, Swati; Pandey, Veda P
2016-08-01
Citrus are among the economically most important fruit tree crops in the world. Citrus variegated chlorosis (CVC), caused by Xylella fastidiosa infection, is a serious disease limiting citrus production at a global scale. With availability of citrus genomic resources, it is now possible to compare citrus expressed sequence tag (EST) data sets and identify single-nucleotide polymorphisms (SNPs) within and among different citrus cultivars that can be exploited for citrus resistance to infections, citrus breeding, among others. We report here, for the first time, SNPs in the EST data sets of X. fastidiosa-infected Citrus sinensis (sweet orange) and their functional annotation that revealed the involvement of eight C. sinensis candidate genes in CVC pathogenesis. Among these genes were xyloglucan endotransglycosylase, myo-inositol-1-phosphate synthase, and peroxidase were found to be involved in plant cell wall metabolism. These have been further investigated by molecular modeling for their role in CVC infection and defense. Molecular docking analyses of the wild and the mutant (SNP containing) types of the selected three enzymes with their respective substrates revealed a significant decrease in the binding affinity of substrates for the mutant enzymes, thus suggesting a decrease in the catalytic efficiency of these enzymes during infection, thereby facilitating a favorable condition for infection by the pathogen. These findings offer novel agrigenomics insights in developing future molecular targets and strategies for citrus fruit cultivation in ways that are resistant to X. fastidiosa infection, and by extension, with greater harvesting efficiency and economic value.
Annotations for the Collaboration of the Health Professionals
Bringay, Sandra; Barry, Catherine; Charlet, Jean
2006-01-01
In the French DocPatient project, we work on documentary functionalities to improve the use of the electronic medical record. We suggest that integration of specific uses for paper medical documents in the design of the electronic medical record will improve its utility, use and acceptance. We propose in this paper to add a functionality of annotations in the electronic medical record to reinforce collaboration, coordination and awareness. PMID:17238309
Sma3s: A universal tool for easy functional annotation of proteomes and transcriptomes.
Casimiro-Soriguer, Carlos S; Muñoz-Mérida, Antonio; Pérez-Pulido, Antonio J
2017-06-01
The current cheapening of next-generation sequencing has led to an enormous growth in the number of sequenced genomes and transcriptomes, allowing wet labs to get the sequences from their organisms of study. To make the most of these data, one of the first things that should be done is the functional annotation of the protein-coding genes. But it used to be a slow and tedious step that can involve the characterization of thousands of sequences. Sma3s is an accurate computational tool for annotating proteins in an unattended way. Now, we have developed a completely new version, which includes functionalities that will be of utility for fundamental and applied science. Currently, the results provide functional categories such as biological processes, which become useful for both characterizing particular sequence datasets and comparing results from different projects. But one of the most important implemented innovations is that it has now low computational requirements, and the complete annotation of a simple proteome or transcriptome usually takes around 24 hours in a personal computer. Sma3s has been tested with a large amount of complete proteomes and transcriptomes, and it has demonstrated its potential in health science and other specific projects. © 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Masseroli, Marco
2007-07-01
The growing available genomic information provides new opportunities for novel research approaches and original biomedical applications that can provide effective data management and analysis support. In fact, integration and comprehensive evaluation of available controlled data can highlight information patterns leading to unveil new biomedical knowledge. Here, we describe Genome Function INtegrated Discover (GFINDer), a Web-accessible three-tier multidatabase system we developed to automatically enrich lists of user-classified genes with several functional and phenotypic controlled annotations, and to statistically evaluate them in order to identify annotation categories significantly over- or underrepresented in each considered gene class. Genomic controlled annotations from Gene Ontology (GO), KEGG, Pfam, InterPro, and Online Mendelian Inheritance in Man (OMIM) were integrated in GFINDer and several categorical tests were implemented for their analysis. A controlled vocabulary of inherited disorder phenotypes was obtained by normalizing and hierarchically structuring disease accompanying signs and symptoms from OMIM Clinical Synopsis sections. GFINDer modular architecture is well suited for further system expansion and for sustaining increasing workload. Testing results showed that GFINDer analyses can highlight gene functional and phenotypic characteristics and differences, demonstrating its value in supporting genomic biomedical approaches aiming at understanding the complex biomolecular mechanisms underlying patho-physiological phenotypes, and in helping the transfer of genomic results to medical practice.
Identification of Functionally Related Enzymes by Learning-to-Rank Methods.
Stock, Michiel; Fober, Thomas; Hüllermeier, Eyke; Glinca, Serghei; Klebe, Gerhard; Pahikkala, Tapio; Airola, Antti; De Baets, Bernard; Waegeman, Willem
2014-01-01
Enzyme sequences and structures are routinely used in the biological sciences as queries to search for functionally related enzymes in online databases. To this end, one usually departs from some notion of similarity, comparing two enzymes by looking for correspondences in their sequences, structures or surfaces. For a given query, the search operation results in a ranking of the enzymes in the database, from very similar to dissimilar enzymes, while information about the biological function of annotated database enzymes is ignored. In this work, we show that rankings of that kind can be substantially improved by applying kernel-based learning algorithms. This approach enables the detection of statistical dependencies between similarities of the active cleft and the biological function of annotated enzymes. This is in contrast to search-based approaches, which do not take annotated training data into account. Similarity measures based on the active cleft are known to outperform sequence-based or structure-based measures under certain conditions. We consider the Enzyme Commission (EC) classification hierarchy for obtaining annotated enzymes during the training phase. The results of a set of sizeable experiments indicate a consistent and significant improvement for a set of similarity measures that exploit information about small cavities in the surface of enzymes.
Deng, Lei; Wu, Hongjie; Liu, Chuyao; Zhan, Weihua; Zhang, Jingpu
2018-06-01
Long non-coding RNAs (lncRNAs) are involved in many biological processes, such as immune response, development, differentiation and gene imprinting and are associated with diseases and cancers. But the functions of the vast majority of lncRNAs are still unknown. Predicting the biological functions of lncRNAs is one of the key challenges in the post-genomic era. In our work, We first build a global network including a lncRNA similarity network, a lncRNA-protein association network and a protein-protein interaction network according to the expressions and interactions, then extract the topological feature vectors of the global network. Using these features, we present an SVM-based machine learning approach, PLNRGO, to annotate human lncRNAs. In PLNRGO, we construct a training data set according to the proteins with GO annotations and train a binary classifier for each GO term. We assess the performance of PLNRGO on our manually annotated lncRNA benchmark and a protein-coding gene benchmark with known functional annotations. As a result, the performance of our method is significantly better than that of other state-of-the-art methods in terms of maximum F-measure and coverage. Copyright © 2018 Elsevier Ltd. All rights reserved.
BG7: A New Approach for Bacterial Genome Annotation Designed for Next Generation Sequencing Data
Pareja-Tobes, Pablo; Manrique, Marina; Pareja-Tobes, Eduardo; Pareja, Eduardo; Tobes, Raquel
2012-01-01
BG7 is a new system for de novo bacterial, archaeal and viral genome annotation based on a new approach specifically designed for annotating genomes sequenced with next generation sequencing technologies. The system is versatile and able to annotate genes even in the step of preliminary assembly of the genome. It is especially efficient detecting unexpected genes horizontally acquired from bacterial or archaeal distant genomes, phages, plasmids, and mobile elements. From the initial phases of the gene annotation process, BG7 exploits the massive availability of annotated protein sequences in databases. BG7 predicts ORFs and infers their function based on protein similarity with a wide set of reference proteins, integrating ORF prediction and functional annotation phases in just one step. BG7 is especially tolerant to sequencing errors in start and stop codons, to frameshifts, and to assembly or scaffolding errors. The system is also tolerant to the high level of gene fragmentation which is frequently found in not fully assembled genomes. BG7 current version – which is developed in Java, takes advantage of Amazon Web Services (AWS) cloud computing features, but it can also be run locally in any operating system. BG7 is a fast, automated and scalable system that can cope with the challenge of analyzing the huge amount of genomes that are being sequenced with NGS technologies. Its capabilities and efficiency were demonstrated in the 2011 EHEC Germany outbreak in which BG7 was used to get the first annotations right the next day after the first entero-hemorrhagic E. coli genome sequences were made publicly available. The suitability of BG7 for genome annotation has been proved for Illumina, 454, Ion Torrent, and PacBio sequencing technologies. Besides, thanks to its plasticity, our system could be very easily adapted to work with new technologies in the future. PMID:23185310
Naqvi, Ahmad Abu Turab; Shahbaaz, Mohd; Ahmad, Faizan; Hassan, Md. Imtaiyaz
2015-01-01
Syphilis is a globally occurring venereal disease, and its infection is propagated through sexual contact. The causative agent of syphilis, Treponema pallidum ssp. pallidum, a Gram-negative sphirochaete, is an obligate human parasite. Genome of T. pallidum ssp. pallidum SS14 strain (RefSeq NC_010741.1) encodes 1,027 proteins, of which 444 proteins are known as hypothetical proteins (HPs), i.e., proteins of unknown functions. Here, we performed functional annotation of HPs of T. pallidum ssp. pallidum using various database, domain architecture predictors, protein function annotators and clustering tools. We have analyzed the sequences of 444 HPs of T. pallidum ssp. pallidum and subsequently predicted the function of 207 HPs with a high level of confidence. However, functions of 237 HPs are predicted with less accuracy. We found various enzymes, transporters, binding proteins in the annotated group of HPs that may be possible molecular targets, facilitating for the survival of pathogen. Our comprehensive analysis helps to understand the mechanism of pathogenesis to provide many novel potential therapeutic interventions. PMID:25894582
DynGO: a tool for visualizing and mining of Gene Ontology and its associations
Liu, Hongfang; Hu, Zhang-Zhi; Wu, Cathy H
2005-01-01
Background A large volume of data and information about genes and gene products has been stored in various molecular biology databases. A major challenge for knowledge discovery using these databases is to identify related genes and gene products in disparate databases. The development of Gene Ontology (GO) as a common vocabulary for annotation allows integrated queries across multiple databases and identification of semantically related genes and gene products (i.e., genes and gene products that have similar GO annotations). Meanwhile, dozens of tools have been developed for browsing, mining or editing GO terms, their hierarchical relationships, or their "associated" genes and gene products (i.e., genes and gene products annotated with GO terms). Tools that allow users to directly search and inspect relations among all GO terms and their associated genes and gene products from multiple databases are needed. Results We present a standalone package called DynGO, which provides several advanced functionalities in addition to the standard browsing capability of the official GO browsing tool (AmiGO). DynGO allows users to conduct batch retrieval of GO annotations for a list of genes and gene products, and semantic retrieval of genes and gene products sharing similar GO annotations. The result are shown in an association tree organized according to GO hierarchies and supported with many dynamic display options such as sorting tree nodes or changing orientation of the tree. For GO curators and frequent GO users, DynGO provides fast and convenient access to GO annotation data. DynGO is generally applicable to any data set where the records are annotated with GO terms, as illustrated by two examples. Conclusion We have presented a standalone package DynGO that provides functionalities to search and browse GO and its association databases as well as several additional functions such as batch retrieval and semantic retrieval. The complete documentation and software are freely available for download from the website . PMID:16091147
Semantator: semantic annotator for converting biomedical text to linked data.
Tao, Cui; Song, Dezhao; Sharma, Deepak; Chute, Christopher G
2013-10-01
More than 80% of biomedical data is embedded in plain text. The unstructured nature of these text-based documents makes it challenging to easily browse and query the data of interest in them. One approach to facilitate browsing and querying biomedical text is to convert the plain text to a linked web of data, i.e., converting data originally in free text to structured formats with defined meta-level semantics. In this paper, we introduce Semantator (Semantic Annotator), a semantic-web-based environment for annotating data of interest in biomedical documents, browsing and querying the annotated data, and interactively refining annotation results if needed. Through Semantator, information of interest can be either annotated manually or semi-automatically using plug-in information extraction tools. The annotated results will be stored in RDF and can be queried using the SPARQL query language. In addition, semantic reasoners can be directly applied to the annotated data for consistency checking and knowledge inference. Semantator has been released online and was used by the biomedical ontology community who provided positive feedbacks. Our evaluation results indicated that (1) Semantator can perform the annotation functionalities as designed; (2) Semantator can be adopted in real applications in clinical and transactional research; and (3) the annotated results using Semantator can be easily used in Semantic-web-based reasoning tools for further inference. Copyright © 2013 Elsevier Inc. All rights reserved.
Metabolic activation of sodium nitroprusside to nitric oxide in vascular smooth muscle.
Kowaluk, E A; Seth, P; Fung, H L
1992-09-01
Sodium nitroprusside (SNP) is thought to exert its vasodilating activity, at least in part, by vascular activation to nitric oxide (NO), but the activation mechanism has not been delineated. This study has examined the potential for vascular metabolism of SNP to NO in bovine coronary arterial smooth muscle subcellular fractions using a sensitive and specific redox-chemiluminescence assay for NO. SNP was readily metabolized to NO in subcellular fractions, and the dominant site of metabolism appeared to be located in the membrane fractions. NO-generating activity was significantly enhanced by, but did not absolutely require, the addition of a NADPH-regenerating system, NADPH per se, NADH or cysteine. A correlation analysis of NO-generating activity (in the presence of a NADPH-regenerating system) with marker enzyme activities indicated that the SNP-directed NO-generating activity was primarily membrane-associated. Radiation inactivation target-size analysis revealed that the microsomal SNP-directed NO-generating activity was relatively insensitive to inactivation by radiation exposure, suggesting that the functioning catalytic unit might be quite small. A molecular weight of 5 to 11 kDa was estimated. NO-generating activity could be solubilized from the crude microsomes with 3-[(3-cholamidopropyl)- dimethylammonio]-1-propane sulfonate, and the solubilized extract was subjected to gel filtration chromatography. NO-generating activity was eluted in two peaks: one peak corresponding to an approximate molecular weight of 4 kDa, thus confirming the existence of a small molecular weight NO-generating activity, and a second activity peak corresponding to a molecular weight of 112 to 169 kDa, the functional significance of which is unclear at present.(ABSTRACT TRUNCATED AT 250 WORDS)
Vicchio, Teresa Manuela; Giovinazzo, Salvatore; Certo, Rosaria; Cucinotta, Mariapaola; Micali, Carmelo; Baldari, Sergio; Benvenga, Salvatore; Trimarchi, Francesco; Campennì, Alfredo; Ruggeri, Rosaria Maddalena
2014-07-01
Mutations of the thyrotropin receptor (TSHR) and/or Gαs gene have been found in a number of, but not all, autonomously functioning thyroid nodules (AFTNs). Recently, in a 15-year-old girl with a hyperfunctioning papillary thyroid carcinoma, we found two somatic and germline single nucleotide polymorphisms (SNPs): a SNP of the TSHR gene (exon 7, codon 187) and a SNP of Gαs gene (exon 8, codon 185). The same silent SNP of the TSHR gene had been reported in patients with AFTN or familial non-autoimmune hyperthyroidism. No further data about the prevalence of the two SNPs in AFTNs as well as in the general population are available in the literature. To clarify the possible role of these SNPs in predisposing to AFTN. Germline DNA was extracted from blood leukocytes of 115 patients with AFTNs (43 males and 72 females, aged 31-85 years, mean ± SD = 64 ± 13) and 100 sex-matched healthy individuals from the same geographic area, which is marginally iodine deficient. The genotype distribution of the two SNPs was investigated by restriction fragment length polymorphism-polymerase chain reaction. The prevalence of the two SNPs in our study population was low and not different to that found in healthy individuals: 8 % of patients vs. 9 % of controls were heterozygous for the TSHR SNP and 4 % patients vs. 6 % controls were heterozygous for the Gαs SNP. One patient harbored both SNPs. These results suggest that these two SNPs do not confer susceptibility for the development of AFTN.
Alfred, Tamuno; Ben-Shlomo, Yoav; Cooper, Rachel; Hardy, Rebecca; Cooper, Cyrus; Deary, Ian J; Elliott, Jane; Gunnell, David; Harris, Sarah E; Kivimaki, Mika; Kumari, Meena; Martin, Richard M; Power, Chris; Sayer, Avan Aihie; Starr, John M; Kuh, Diana; Day, Ian N M
2011-06-01
Several age-related traits are associated with shorter telomeres, the structures that cap the end of linear chromosomes. A common polymorphism near the telomere maintenance gene TERT has been associated with several cancers, but relationships with other aging traits such as physical capability have not been reported. As part of the Healthy Ageing across the Life Course (HALCyon) collaborative research programme, men and women aged between 44 and 90 years from nine UK cohorts were genotyped for the single-nucleotide polymorphism (SNP) rs401681. We then investigated relationships between the SNP and 30 age-related phenotypes, including cognitive and physical capability, blood lipid levels and lung function, pooling within-study genotypic effects in meta-analyses. No significant associations were found between the SNP and any of the cognitive performance tests (e.g. pooled beta per T allele for word recall z-score = 0.02, 95% CI: -0.01 to 0.04, P-value = 0.12, n = 18,737), physical performance tests (e.g. pooled beta for grip strength = -0.02, 95% CI: -0.045 to 0.006, P-value = 0.14, n = 11,711), blood pressure, lung function or blood test measures. Similarly, no differences in observations were found when considering follow-up measures of cognitive or physical performance after adjusting for its measure at an earlier assessment. The lack of associations between SNP rs401681 and a wide range of age-related phenotypes investigated in this large multicohort study suggests that while this SNP may be associated with cancer, it is not an important contributor to other markers of aging. © 2011 The Authors. Aging Cell © 2011 Blackwell Publishing Ltd/Anatomical Society of Great Britain and Ireland.
Deviney, Frank A.; Rice, Karen C.; Hornberger, George M.
2006-01-01
Acid rain affects headwater streams by temporarily reducing the acid‐neutralizing capacity (ANC) of the water, a process termed episodic acidification. The increase in acidic components in stream water can have deleterious effects on the aquatic biota. Although acidic deposition is uniform across Shenandoah National Park (SNP) in north central Virginia, the stream water quality response during rain events varies substantially. This response is a function of the catchment's underlying geology and topography. Geologic and topographic data for SNP's 231 catchments are readily available; however, long‐term measurements (tens of years) of ANC and accompanying discharge are not and would be prohibitively expensive to collect. Transfer function time series models were developed to predict hourly ANC from discharge for five SNP catchments with long‐term water‐quality and discharge records. Hourly ANC predictions over short time periods (≤1 week) were averaged, and distributions of the recurrence intervals of annual water‐year minimum ANC values were model‐simulated for periods of 6, 24, 72, and 168 hours. The distributions were extrapolated to the rest of the SNP catchments on the basis of catchment geology and topography. On the basis of the models, large numbers of SNP streams have 6‐ to 168‐hour periods of low‐ANC values, which may stress resident fish populations. Smaller catchments are more vulnerable to episodic acidification than larger catchments underlain by the same bedrock. Catchments with similar topography and size are more vulnerable if underlain by less basaltic/carbonate bedrock. Many catchments are predicted to have successive years of low‐ANC values potentially sufficient to extirpate some species.
Wang, Qinghua; Arighi, Cecilia N; King, Benjamin L; Polson, Shawn W; Vincent, James; Chen, Chuming; Huang, Hongzhan; Kingham, Brewster F; Page, Shallee T; Rendino, Marc Farnum; Thomas, William Kelley; Udwary, Daniel W; Wu, Cathy H
2012-01-01
Recent advances in high-throughput DNA sequencing technologies have equipped biologists with a powerful new set of tools for advancing research goals. The resulting flood of sequence data has made it critically important to train the next generation of scientists to handle the inherent bioinformatic challenges. The North East Bioinformatics Collaborative (NEBC) is undertaking the genome sequencing and annotation of the little skate (Leucoraja erinacea) to promote advancement of bioinformatics infrastructure in our region, with an emphasis on practical education to create a critical mass of informatically savvy life scientists. In support of the Little Skate Genome Project, the NEBC members have developed several annotation workshops and jamborees to provide training in genome sequencing, annotation and analysis. Acting as a nexus for both curation activities and dissemination of project data, a project web portal, SkateBase (http://skatebase.org) has been developed. As a case study to illustrate effective coupling of community annotation with workforce development, we report the results of the Mitochondrial Genome Annotation Jamborees organized to annotate the first completely assembled element of the Little Skate Genome Project, as a culminating experience for participants from our three prior annotation workshops. We are applying the physical/virtual infrastructure and lessons learned from these activities to enhance and streamline the genome annotation workflow, as we look toward our continuing efforts for larger-scale functional and structural community annotation of the L. erinacea genome.
Wang, Qinghua; Arighi, Cecilia N.; King, Benjamin L.; Polson, Shawn W.; Vincent, James; Chen, Chuming; Huang, Hongzhan; Kingham, Brewster F.; Page, Shallee T.; Farnum Rendino, Marc; Thomas, William Kelley; Udwary, Daniel W.; Wu, Cathy H.
2012-01-01
Recent advances in high-throughput DNA sequencing technologies have equipped biologists with a powerful new set of tools for advancing research goals. The resulting flood of sequence data has made it critically important to train the next generation of scientists to handle the inherent bioinformatic challenges. The North East Bioinformatics Collaborative (NEBC) is undertaking the genome sequencing and annotation of the little skate (Leucoraja erinacea) to promote advancement of bioinformatics infrastructure in our region, with an emphasis on practical education to create a critical mass of informatically savvy life scientists. In support of the Little Skate Genome Project, the NEBC members have developed several annotation workshops and jamborees to provide training in genome sequencing, annotation and analysis. Acting as a nexus for both curation activities and dissemination of project data, a project web portal, SkateBase (http://skatebase.org) has been developed. As a case study to illustrate effective coupling of community annotation with workforce development, we report the results of the Mitochondrial Genome Annotation Jamborees organized to annotate the first completely assembled element of the Little Skate Genome Project, as a culminating experience for participants from our three prior annotation workshops. We are applying the physical/virtual infrastructure and lessons learned from these activities to enhance and streamline the genome annotation workflow, as we look toward our continuing efforts for larger-scale functional and structural community annotation of the L. erinacea genome. PMID:22434832
Gene calling and bacterial genome annotation with BG7.
Tobes, Raquel; Pareja-Tobes, Pablo; Manrique, Marina; Pareja-Tobes, Eduardo; Kovach, Evdokim; Alekhin, Alexey; Pareja, Eduardo
2015-01-01
New massive sequencing technologies are providing many bacterial genome sequences from diverse taxa but a refined annotation of these genomes is crucial for obtaining scientific findings and new knowledge. Thus, bacterial genome annotation has emerged as a key point to investigate in bacteria. Any efficient tool designed specifically to annotate bacterial genomes sequenced with massively parallel technologies has to consider the specific features of bacterial genomes (absence of introns and scarcity of nonprotein-coding sequence) and of next-generation sequencing (NGS) technologies (presence of errors and not perfectly assembled genomes). These features make it convenient to focus on coding regions and, hence, on protein sequences that are the elements directly related with biological functions. In this chapter we describe how to annotate bacterial genomes with BG7, an open-source tool based on a protein-centered gene calling/annotation paradigm. BG7 is specifically designed for the annotation of bacterial genomes sequenced with NGS. This tool is sequence error tolerant maintaining their capabilities for the annotation of highly fragmented genomes or for annotating mixed sequences coming from several genomes (as those obtained through metagenomics samples). BG7 has been designed with scalability as a requirement, with a computing infrastructure completely based on cloud computing (Amazon Web Services).
Towards a complete map of the human long non-coding RNA transcriptome.
Uszczynska-Ratajczak, Barbara; Lagarde, Julien; Frankish, Adam; Guigó, Roderic; Johnson, Rory
2018-05-23
Gene maps, or annotations, enable us to navigate the functional landscape of our genome. They are a resource upon which virtually all studies depend, from single-gene to genome-wide scales and from basic molecular biology to medical genetics. Yet present-day annotations suffer from trade-offs between quality and size, with serious but often unappreciated consequences for downstream studies. This is particularly true for long non-coding RNAs (lncRNAs), which are poorly characterized compared to protein-coding genes. Long-read sequencing technologies promise to improve current annotations, paving the way towards a complete annotation of lncRNAs expressed throughout a human lifetime.
Protein Function Prediction: Problems and Pitfalls.
Pearson, William R
2015-09-03
The characterization of new genomes based on their protein sets has been revolutionized by new sequencing technologies, but biologists seeking to exploit new sequence information are often frustrated by the challenges associated with accurately assigning biological functions to newly identified proteins. Here, we highlight some of the challenges in functional inference from sequence similarity. Investigators can improve the accuracy of function prediction by (1) being conservative about the evolutionary distance to a protein of known function; (2) considering the ambiguous meaning of "functional similarity," and (3) being aware of the limitations of annotations in functional databases. Protein function prediction does not offer "one-size-fits-all" solutions. Prediction strategies work better when the idiosyncrasies of function and functional annotation are better understood. Copyright © 2015 John Wiley & Sons, Inc.
Towards an informative mutant phenotype for every bacterial gene
Deutschbauer, Adam; Price, Morgan N.; Wetmore, Kelly M.; ...
2014-08-11
Mutant phenotypes provide strong clues to the functions of the underlying genes and could allow annotation of the millions of sequenced yet uncharacterized bacterial genes. However, it is not known how many genes have a phenotype under laboratory conditions, how many phenotypes are biologically interpretable for predicting gene function, and what experimental conditions are optimal to maximize the number of genes with a phenotype. To address these issues, we measured the mutant fitness of 1,586 genes of the ethanol-producing bacterium Zymomonas mobilis ZM4 across 492 diverse experiments and found statistically significant phenotypes for 89% of all assayed genes. Thus, inmore » Z. mobilis, most genes have a functional consequence under laboratory conditions. We demonstrate that 41% of Z. mobilis genes have both a strong phenotype and a similar fitness pattern (cofitness) to another gene, and are therefore good candidates for functional annotation using mutant fitness. Among 502 poorly characterized Z. mobilis genes, we identified a significant cofitness relationship for 174. For 57 of these genes without a specific functional annotation, we found additional evidence to support the biological significance of these gene-gene associations, and in 33 instances, we were able to predict specific physiological or biochemical roles for the poorly characterized genes. Last, we identified a set of 79 diverse mutant fitness experiments in Z. mobilis that are nearly as biologically informative as the entire set of 492 experiments. Therefore, our work provides a blueprint for the functional annotation of diverse bacteria using mutant fitness.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Deutschbauer, Adam; Price, Morgan N.; Wetmore, Kelly M.
Mutant phenotypes provide strong clues to the functions of the underlying genes and could allow annotation of the millions of sequenced yet uncharacterized bacterial genes. However, it is not known how many genes have a phenotype under laboratory conditions, how many phenotypes are biologically interpretable for predicting gene function, and what experimental conditions are optimal to maximize the number of genes with a phenotype. To address these issues, we measured the mutant fitness of 1,586 genes of the ethanol-producing bacterium Zymomonas mobilis ZM4 across 492 diverse experiments and found statistically significant phenotypes for 89% of all assayed genes. Thus, inmore » Z. mobilis, most genes have a functional consequence under laboratory conditions. We demonstrate that 41% of Z. mobilis genes have both a strong phenotype and a similar fitness pattern (cofitness) to another gene, and are therefore good candidates for functional annotation using mutant fitness. Among 502 poorly characterized Z. mobilis genes, we identified a significant cofitness relationship for 174. For 57 of these genes without a specific functional annotation, we found additional evidence to support the biological significance of these gene-gene associations, and in 33 instances, we were able to predict specific physiological or biochemical roles for the poorly characterized genes. Last, we identified a set of 79 diverse mutant fitness experiments in Z. mobilis that are nearly as biologically informative as the entire set of 492 experiments. Therefore, our work provides a blueprint for the functional annotation of diverse bacteria using mutant fitness.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Giuliani, Sarah E; Frank, Ashley M; Corgliano, Danielle M
Abstract Background: Transporter proteins are one of an organism s primary interfaces with the environment. The expressed set of transporters mediates cellular metabolic capabilities and influences signal transduction pathways and regulatory networks. The functional annotation of most transporters is currently limited to general classification into families. The development of capabilities to map ligands with specific transporters would improve our knowledge of the function of these proteins, improve the annotation of related genomes, and facilitate predictions for their role in cellular responses to environmental changes. Results: To improve the utility of the functional annotation for ABC transporters, we expressed and purifiedmore » the set of solute binding proteins from Rhodopseudomonas palustris and characterized their ligand-binding specificity. Our approach utilized ligand libraries consisting of environmental and cellular metabolic compounds, and fluorescence thermal shift based high throughput ligand binding screens. This process resulted in the identification of specific binding ligands for approximately 64% of the purified and screened proteins. The collection of binding ligands is representative of common functionalities associated with many bacterial organisms as well as specific capabilities linked to the ecological niche occupied by R. palustris. Conclusion: The functional screen identified specific ligands that bound to ABC transporter periplasmic binding subunits from R. palustris. These assignments provide unique insight for the metabolic capabilities of this organism and are consistent with the ecological niche of strain isolation. This functional insight can be used to improve the annotation of related organisms and provides a route to evaluate the evolution of this important and diverse group of transporter proteins.« less
MaizeGDB, the maize model organism database
USDA-ARS?s Scientific Manuscript database
MaizeGDB is the maize research community's database for maize genetic and genomic information. In this seminar I will outline our current endeavors including a full website redesign, the status of maize genome assembly and annotation projects, and work toward genome functional annotation. Mechanis...
2013-01-01
Analyzing and storing data and results from next-generation sequencing (NGS) experiments is a challenging task, hampered by ever-increasing data volumes and frequent updates of analysis methods and tools. Storage and computation have grown beyond the capacity of personal computers and there is a need for suitable e-infrastructures for processing. Here we describe UPPNEX, an implementation of such an infrastructure, tailored to the needs of data storage and analysis of NGS data in Sweden serving various labs and multiple instruments from the major sequencing technology platforms. UPPNEX comprises resources for high-performance computing, large-scale and high-availability storage, an extensive bioinformatics software suite, up-to-date reference genomes and annotations, a support function with system and application experts as well as a web portal and support ticket system. UPPNEX applications are numerous and diverse, and include whole genome-, de novo- and exome sequencing, targeted resequencing, SNP discovery, RNASeq, and methylation analysis. There are over 300 projects that utilize UPPNEX and include large undertakings such as the sequencing of the flycatcher and Norwegian spruce. We describe the strategic decisions made when investing in hardware, setting up maintenance and support, allocating resources, and illustrate major challenges such as managing data growth. We conclude with summarizing our experiences and observations with UPPNEX to date, providing insights into the successful and less successful decisions made. PMID:23800020
Fuzzy measures on the Gene Ontology for gene product similarity.
Popescu, Mihail; Keller, James M; Mitchell, Joyce A
2006-01-01
One of the most important objects in bioinformatics is a gene product (protein or RNA). For many gene products, functional information is summarized in a set of Gene Ontology (GO) annotations. For these genes, it is reasonable to include similarity measures based on the terms found in the GO or other taxonomy. In this paper, we introduce several novel measures for computing the similarity of two gene products annotated with GO terms. The fuzzy measure similarity (FMS) has the advantage that it takes into consideration the context of both complete sets of annotation terms when computing the similarity between two gene products. When the two gene products are not annotated by common taxonomy terms, we propose a method that avoids a zero similarity result. To account for the variations in the annotation reliability, we propose a similarity measure based on the Choquet integral. These similarity measures provide extra tools for the biologist in search of functional information for gene products. The initial testing on a group of 194 sequences representing three proteins families shows a higher correlation of the FMS and Choquet similarities to the BLAST sequence similarities than the traditional similarity measures such as pairwise average or pairwise maximum.
Wang, Dafei; Liu, Yunguo; Tan, Xiaofei; Liu, Hongyu; Zeng, Guangming; Hu, Xinjiang; Jian, Hao; Gu, Yanling
2015-03-01
Cadmium (Cd)-induced growth inhibition is one of the primary factors limiting phytoremediation effect of Boehmeria nivea (L.) Gaud in contaminated soil. Sodium nitroprusside (SNP), a donor of nitric oxide (NO), has been evidenced to alleviate Cd toxicity in many plants. However, as an important mechanism of NO in orchestrating cellular functions, S-nitrosylation is still poorly understood in its relation with Cd tolerance of plants. In this study, higher exogenous NO levels were found to coincide with higher S-nitrosylation level expressed as content of S-nitrosothiols (SNO). The addition of low concentration (100 μM) SNP increased the SNO content, and it simultaneously induced an alleviating effect against Cd toxicity by enhancing the activities of superoxide dismutase (SOD), ascorbate peroxidase (APX), and glutathione reductase (GR) and reduced the accumulation of H2O2 as compared with Cd alone. Application of S-nitrosoglutathione reductase (GSNOR) inhibitors dodecanoic acid (DA) in 100 μM SNP group brought in an extra elevation in S-nitrosylation level and further reinforced the effect of SNP. While the additions of 400 μM SNP and 400 μM SNP + 50 μM DA further elevated the S-nitrosylation level, it markedly weakened the alleviating effect against Cd toxicity as compared with the addition of 100 μM SNP. This phenomenon could be owing to excess consumption of glutathione (GSH) to form SNO under high S-nitrosylation level. Therefore, the present study indicates that S-nitrosylation is involved in the ameliorating effect of SNP against Cd toxicity. This involvement exhibited a concentration-dependent property.
Kim, Byung-Chul; Kim, Youn-Sub; Lee, Jin-Woo; Seo, Jin-Hee; Ji, Eun-Sang; Lee, Hyejung; Park, Yong-Il
2011-01-01
Nitric oxide (NO) is a reactive free radical and a messenger molecule in many physiological functions. However, excessive NO is believed to be a mediator of neurotoxicity. The medicinal plant Coriolus versicolor is known to possess anti-tumor and immune-potentiating activities. In this study, we investigated whether Coriolus versicolor possesses a protective effect against NO donor sodium nitroprusside (SNP)-induced apoptosis in the human neuroblastoma cell line SK-N-MC. We utilized 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide (MTT) assay, flow cytometry, 4,6-diamidino-2-phenylindole (DAPI) staining, terminal deoxynucleotidyl transferase (TdT)-mediated dUTP nick end labeling (TUNEL) assay, DNA fragmentation assay, reverse transcription-polymerase chain reaction (RT-PCR), Western blot analysis, and caspase-3 enzyme activity assay in SK-N-MC cells. MTT assay showed that SNP treatment significantly reduces the viability of cells, and the viabilities of cells pre-treated with the aqueous extract of Coriolus versicolor cultivated in citrus extract (CVEcitrus) was increased. However, aqueous extract of Coriolus versicolor cultivated in synthetic medium (CVEsynthetic) showed no protective effect and aqueous citrus extract (CE) had a little protective effect. The cell treated with SNP exhibited several apoptotic features, while those pre-treated for 1 h with CVEcitrus prior to SNP expose showed reduced apoptotic features. The cells pre-treated for 1 h with CVEcitrus prior to SNP expose inhibited p53 and Bax expressions and caspase-3 enzyme activity up-regulated by SNP. We showed that CVEcitrus exerts a protective effect against SNP-induced apoptosis in SK-N-MC cells. Our study suggests that CVEcitrus has therapeutic value in the treatment of a variety of NO-induced brain diseases. PMID:22110367
Kim, Byung-Chul; Kim, Youn-Sub; Lee, Jin-Woo; Seo, Jin-Hee; Ji, Eun-Sang; Lee, Hyejung; Park, Yong-Il; Kim, Chang-Ju
2011-06-01
Nitric oxide (NO) is a reactive free radical and a messenger molecule in many physiological functions. However, excessive NO is believed to be a mediator of neurotoxicity. The medicinal plant Coriolus versicolor is known to possess anti-tumor and immune-potentiating activities. In this study, we investigated whether Coriolus versicolor possesses a protective effect against NO donor sodium nitroprusside (SNP)-induced apoptosis in the human neuroblastoma cell line SK-N-MC. We utilized 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide (MTT) assay, flow cytometry, 4,6-diamidino-2-phenylindole (DAPI) staining, terminal deoxynucleotidyl transferase (TdT)-mediated dUTP nick end labeling (TUNEL) assay, DNA fragmentation assay, reverse transcription-polymerase chain reaction (RT-PCR), Western blot analysis, and caspase-3 enzyme activity assay in SK-N-MC cells. MTT assay showed that SNP treatment significantly reduces the viability of cells, and the viabilities of cells pre-treated with the aqueous extract of Coriolus versicolor cultivated in citrus extract (CVE(citrus)) was increased. However, aqueous extract of Coriolus versicolor cultivated in synthetic medium (CVE(synthetic)) showed no protective effect and aqueous citrus extract (CE) had a little protective effect. The cell treated with SNP exhibited several apoptotic features, while those pre-treated for 1 h with CVE(citrus) prior to SNP expose showed reduced apoptotic features. The cells pre-treated for 1 h with CVE(citrus) prior to SNP expose inhibited p53 and Bax expressions and caspase-3 enzyme activity up-regulated by SNP. We showed that CVE(citrus) exerts a protective effect against SNP-induced apoptosis in SK-N-MC cells. Our study suggests that CVE(citrus) has therapeutic value in the treatment of a variety of NO-induced brain diseases.
An integrated SNP mining and utilization (ISMU) pipeline for next generation sequencing data.
Azam, Sarwar; Rathore, Abhishek; Shah, Trushar M; Telluri, Mohan; Amindala, BhanuPrakash; Ruperao, Pradeep; Katta, Mohan A V S K; Varshney, Rajeev K
2014-01-01
Open source single nucleotide polymorphism (SNP) discovery pipelines for next generation sequencing data commonly requires working knowledge of command line interface, massive computational resources and expertise which is a daunting task for biologists. Further, the SNP information generated may not be readily used for downstream processes such as genotyping. Hence, a comprehensive pipeline has been developed by integrating several open source next generation sequencing (NGS) tools along with a graphical user interface called Integrated SNP Mining and Utilization (ISMU) for SNP discovery and their utilization by developing genotyping assays. The pipeline features functionalities such as pre-processing of raw data, integration of open source alignment tools (Bowtie2, BWA, Maq, NovoAlign and SOAP2), SNP prediction (SAMtools/SOAPsnp/CNS2snp and CbCC) methods and interfaces for developing genotyping assays. The pipeline outputs a list of high quality SNPs between all pairwise combinations of genotypes analyzed, in addition to the reference genome/sequence. Visualization tools (Tablet and Flapjack) integrated into the pipeline enable inspection of the alignment and errors, if any. The pipeline also provides a confidence score or polymorphism information content value with flanking sequences for identified SNPs in standard format required for developing marker genotyping (KASP and Golden Gate) assays. The pipeline enables users to process a range of NGS datasets such as whole genome re-sequencing, restriction site associated DNA sequencing and transcriptome sequencing data at a fast speed. The pipeline is very useful for plant genetics and breeding community with no computational expertise in order to discover SNPs and utilize in genomics, genetics and breeding studies. The pipeline has been parallelized to process huge datasets of next generation sequencing. It has been developed in Java language and is available at http://hpc.icrisat.cgiar.org/ISMU as a standalone free software.
Vallenet, David; Calteau, Alexandra; Cruveiller, Stéphane; Gachet, Mathieu; Lajus, Aurélie; Josso, Adrien; Mercier, Jonathan; Renaux, Alexandre; Rollin, Johan; Rouy, Zoe; Roche, David; Scarpelli, Claude; Médigue, Claudine
2017-01-04
The annotation of genomes from NGS platforms needs to be automated and fully integrated. However, maintaining consistency and accuracy in genome annotation is a challenging problem because millions of protein database entries are not assigned reliable functions. This shortcoming limits the knowledge that can be extracted from genomes and metabolic models. Launched in 2005, the MicroScope platform (http://www.genoscope.cns.fr/agc/microscope) is an integrative resource that supports systematic and efficient revision of microbial genome annotation, data management and comparative analysis. Effective comparative analysis requires a consistent and complete view of biological data, and therefore, support for reviewing the quality of functional annotation is critical. MicroScope allows users to analyze microbial (meta)genomes together with post-genomic experiment results if any (i.e. transcriptomics, re-sequencing of evolved strains, mutant collections, phenotype data). It combines tools and graphical interfaces to analyze genomes and to perform the expert curation of gene functions in a comparative context. Starting with a short overview of the MicroScope system, this paper focuses on some major improvements of the Web interface, mainly for the submission of genomic data and on original tools and pipelines that have been developed and integrated in the platform: computation of pan-genomes and prediction of biosynthetic gene clusters. Today the resource contains data for more than 6000 microbial genomes, and among the 2700 personal accounts (65% of which are now from foreign countries), 14% of the users are performing expert annotations, on at least a weekly basis, contributing to improve the quality of microbial genome annotations. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Identification of bovine NPC1 gene cSNPs and their effects on body size traits of Qinchuan cattle.
Dang, Yonglong; Li, Mingxun; Yang, Mingjuan; Cao, Xiukai; Lan, Xianyong; Lei, Chuzhao; Zhang, Chunlei; Lin, Qing; Chen, Hong
2014-05-01
NPC1 gene is an important gene closely related to the Niemann-Pick type C (NPC). Mutations in the NPC1 gene tend to cause Niemann-Pick type C, a lysosomal storage disorder. Previous studies have shown that NPC1 protein plays an important role in subcellular lipid transport, homeostasis, platelet function and formation, which are basic metabolic activities in the process of development. In this study, to explore the association between the NPC1 gene variation and body size traits in Qinchuan cattle, we detected four novel coding single nucleotide polymorphisms (cSNPs) in the bovine NPC1 gene, including one missense mutation (SNP1) and three synonymous mutations (SNP2, SNP3 and SNP4). Population genetic analyses of 518 individuals and association correlations between cSNPs and bovine body size traits were conducted in this research. A missense mutation at SNP1 locus was found to be significantly related to the heart girth, hip width and body weight (P<0.01 or P<0.05, 3.5-year-old). Two synonymous mutations at SNP2 and SNP3 loci also showed significant effects on hip width (P<0.05, 3.5-year-old). One synonymous mutation at SNP4 locus showed significant effect on body weight (P<0.05, 2.0-year-old). Combined haplotypes H2H6 and H6H6 showed significant effects on body size traits such as heart girth, hip width, and body weight (3.5-year-old, P<0.01 or P<0.05). This study provides evidence that the NPC1 gene might be involved in the regulation of bovine growth and body development, and may be considered as a candidate gene for marker assisted selection (MAS) in beef cattle breeding industry. Copyright © 2014. Published by Elsevier B.V.
Chiu, Shih-Hau; Chen, Chien-Chi; Yuan, Gwo-Fang; Lin, Thy-Hou
2006-01-01
Background The number of sequences compiled in many genome projects is growing exponentially, but most of them have not been characterized experimentally. An automatic annotation scheme must be in an urgent need to reduce the gap between the amount of new sequences produced and reliable functional annotation. This work proposes rules for automatically classifying the fungus genes. The approach involves elucidating the enzyme classifying rule that is hidden in UniProt protein knowledgebase and then applying it for classification. The association algorithm, Apriori, is utilized to mine the relationship between the enzyme class and significant InterPro entries. The candidate rules are evaluated for their classificatory capacity. Results There were five datasets collected from the Swiss-Prot for establishing the annotation rules. These were treated as the training sets. The TrEMBL entries were treated as the testing set. A correct enzyme classification rate of 70% was obtained for the prokaryote datasets and a similar rate of about 80% was obtained for the eukaryote datasets. The fungus training dataset which lacks an enzyme class description was also used to evaluate the fungus candidate rules. A total of 88 out of 5085 test entries were matched with the fungus rule set. These were otherwise poorly annotated using their functional descriptions. Conclusion The feasibility of using the method presented here to classify enzyme classes based on the enzyme domain rules is evident. The rules may be also employed by the protein annotators in manual annotation or implemented in an automatic annotation flowchart. PMID:16776838
Onel, K B; Huo, D; Hastings, D; Fryer-Biggs, J; Crow, M K; Onel, K
2009-01-01
The p53 tumour suppressor is the central regulator of apoptosis. Previously, the functional TP53 Arg72Pro polymorphism was found to be associated with systemic lupus erythematosus (SLE) in Koreans but not Spaniards. MDM2 is the major negative regulator of p53. An intronic polymorphism in MDM2, the SNP309, attenuates p53 activity and is associated with accelerated tumour development in premenopausal women. Polymorphic variation in MDM2 has never been studied in SLE. The aim of this study is to further assess the contribution of p53-pathway genetic variation to SLE by testing the association of the TP53 Arg72Pro polymorphism and the MDM2 SNP309 with SLE in a well-characterised and ethnically diverse cohort of patients with both childhood- and adult-onset SLE (n = 314). No association was found between the TP53 Arg72Pro polymorphism and SLE in patients of European descent, Asian descent or in African Americans, nor was an association found between the MDM2 SNP309 and SLE in patients of European descent or in African Americans. In addition, there was no correlation between either variant and early-onset disease or nephritis, an index of severe disease. It is concluded that neither the TP53 Arg72Pro polymorphism nor the MDM2 SNP309 contributes significantly to either susceptibility or disease severity in SLE.
The Generalized Higher Criticism for Testing SNP-Set Effects in Genetic Association Studies
Barnett, Ian; Mukherjee, Rajarshi; Lin, Xihong
2017-01-01
It is of substantial interest to study the effects of genes, genetic pathways, and networks on the risk of complex diseases. These genetic constructs each contain multiple SNPs, which are often correlated and function jointly, and might be large in number. However, only a sparse subset of SNPs in a genetic construct is generally associated with the disease of interest. In this article, we propose the generalized higher criticism (GHC) to test for the association between an SNP set and a disease outcome. The higher criticism is a test traditionally used in high-dimensional signal detection settings when marginal test statistics are independent and the number of parameters is very large. However, these assumptions do not always hold in genetic association studies, due to linkage disequilibrium among SNPs and the finite number of SNPs in an SNP set in each genetic construct. The proposed GHC overcomes the limitations of the higher criticism by allowing for arbitrary correlation structures among the SNPs in an SNP-set, while performing accurate analytic p-value calculations for any finite number of SNPs in the SNP-set. We obtain the detection boundary of the GHC test. We compared empirically using simulations the power of the GHC method with existing SNP-set tests over a range of genetic regions with varied correlation structures and signal sparsity. We apply the proposed methods to analyze the CGEM breast cancer genome-wide association study. Supplementary materials for this article are available online. PMID:28736464
Functional sequencing read annotation for high precision microbiome analysis
Zhu, Chengsheng; Miller, Maximilian; Marpaka, Srinayani; Vaysberg, Pavel; Rühlemann, Malte C; Wu, Guojun; Heinsen, Femke-Anouska; Tempel, Marie; Zhao, Liping; Lieb, Wolfgang; Franke, Andre; Bromberg, Yana
2018-01-01
Abstract The vast majority of microorganisms on Earth reside in often-inseparable environment-specific communities—microbiomes. Meta-genomic/-transcriptomic sequencing could reveal the otherwise inaccessible functionality of microbiomes. However, existing analytical approaches focus on attributing sequencing reads to known genes/genomes, often failing to make maximal use of available data. We created faser (functional annotation of sequencing reads), an algorithm that is optimized to map reads to molecular functions encoded by the read-correspondent genes. The mi-faser microbiome analysis pipeline, combining faser with our manually curated reference database of protein functions, accurately annotates microbiome molecular functionality. mi-faser’s minutes-per-microbiome processing speed is significantly faster than that of other methods, allowing for large scale comparisons. Microbiome function vectors can be compared between different conditions to highlight environment-specific and/or time-dependent changes in functionality. Here, we identified previously unseen oil degradation-specific functions in BP oil-spill data, as well as functional signatures of individual-specific gut microbiome responses to a dietary intervention in children with Prader–Willi syndrome. Our method also revealed variability in Crohn's Disease patient microbiomes and clearly distinguished them from those of related healthy individuals. Our analysis highlighted the microbiome role in CD pathogenicity, demonstrating enrichment of patient microbiomes in functions that promote inflammation and that help bacteria survive it. PMID:29194524
Eliciting the Functional Taxonomy from protein annotations and taxa
Falda, Marco; Lavezzo, Enrico; Fontana, Paolo; Bianco, Luca; Berselli, Michele; Formentin, Elide; Toppo, Stefano
2016-01-01
The advances of omics technologies have triggered the production of an enormous volume of data coming from thousands of species. Meanwhile, joint international efforts like the Gene Ontology (GO) consortium have worked to provide functional information for a vast amount of proteins. With these data available, we have developed FunTaxIS, a tool that is the first attempt to infer functional taxonomy (i.e. how functions are distributed over taxa) combining functional and taxonomic information. FunTaxIS is able to define a taxon specific functional space by exploiting annotation frequencies in order to establish if a function can or cannot be used to annotate a certain species. The tool generates constraints between GO terms and taxa and then propagates these relations over the taxonomic tree and the GO graph. Since these constraints nearly cover the whole taxonomy, it is possible to obtain the mapping of a function over the taxonomy. FunTaxIS can be used to make functional comparative analyses among taxa, to detect improper associations between taxa and functions, and to discover how functional knowledge is either distributed or missing. A benchmark test set based on six different model species has been devised to get useful insights on the generated taxonomic rules. PMID:27534507
SNPchiMp: a database to disentangle the SNPchip jungle in bovine livestock.
Nicolazzi, Ezequiel Luis; Picciolini, Matteo; Strozzi, Francesco; Schnabel, Robert David; Lawley, Cindy; Pirani, Ali; Brew, Fiona; Stella, Alessandra
2014-02-11
Currently, six commercial whole-genome SNP chips are available for cattle genotyping, produced by two different genotyping platforms. Technical issues need to be addressed to combine data that originates from the different platforms, or different versions of the same array generated by the manufacturer. For example: i) genome coordinates for SNPs may refer to different genome assemblies; ii) reference genome sequences are updated over time changing the positions, or even removing sequences which contain SNPs; iii) not all commercial SNP ID's are searchable within public databases; iv) SNPs can be coded using different formats and referencing different strands (e.g. A/B or A/C/T/G alleles, referencing forward/reverse, top/bottom or plus/minus strand); v) Due to new information being discovered, higher density chips do not necessarily include all the SNPs present in the lower density chips; and, vi) SNP IDs may not be consistent across chips and platforms. Most researchers and breed associations manage SNP data in real-time and thus require tools to standardise data in a user-friendly manner. Here we present SNPchiMp, a MySQL database linked to an open access web-based interface. Features of this interface include, but are not limited to, the following functions: 1) referencing the SNP mapping information to the latest genome assembly, 2) extraction of information contained in dbSNP for SNPs present in all commercially available bovine chips, and 3) identification of SNPs in common between two or more bovine chips (e.g. for SNP imputation from lower to higher density). In addition, SNPchiMp can retrieve this information on subsets of SNPs, accessing such data either via physical position on a supported assembly, or by a list of SNP IDs, rs or ss identifiers. This tool combines many different sources of information, that otherwise are time consuming to obtain and difficult to integrate. The SNPchiMp not only provides the information in a user-friendly format, but also enables researchers to perform a large number of operations with a few clicks of the mouse. This significantly reduces the time needed to execute the large number of operations required to manage SNP data.
Tiezzi, Francesco; Parker-Gaddis, Kristen L.; Cole, John B.; Clay, John S.; Maltecca, Christian
2015-01-01
Clinical mastitis (CM) is one of the health disorders with large impacts on dairy farming profitability and animal welfare. The objective of this study was to perform a genome-wide association study (GWAS) for CM in first-lactation Holstein. Producer-recorded mastitis event information for 103,585 first-lactation cows were used, together with genotype information on 1,361 bulls from the Illumina BovineSNP50 BeadChip. Single-step genomic-BLUP methodology was used to incorporate genomic data into a threshold-liability model. Association analysis confirmed that CM follows a highly polygenic mode of inheritance. However, 10-adjacent-SNP windows showed that regions on chromosomes 2, 14 and 20 have impacts on genetic variation for CM. Some of the genes located on chromosome 14 (LY6K, LY6D, LYNX1, LYPD2, SLURP1, PSCA) are part of the lymphocyte-antigen-6 complex (LY6) known for its neutrophil regulation function linked to the major histocompatibility complex. Other genes on chromosome 2 were also involved in regulating immune response (IFIH1, LY75, and DPP4), or are themselves regulated in the presence of specific pathogens (ITGB6, NR4A2). Other genes annotated on chromosome 20 are involved in mammary gland metabolism (GHR, OXCT1), antibody production and phagocytosis of bacterial cells (C6, C7, C9, C1QTNF3), tumor suppression (DAB2), involution of mammary epithelium (OSMR) and cytokine regulation (PRLR). DAVID enrichment analysis revealed 5 KEGG pathways. The JAK-STAT signaling pathway (cell proliferation and apoptosis) and the ‘Cytokine-cytokine receptor interaction’ (cytokine and interleukines response to infectious agents) are co-regulated and linked to the ‘ABC transporters’ pathway also found here. Gene network analysis performed using GeneMania revealed a co-expression network where 665 interactions existed among 145 of the genes reported above. Clinical mastitis is a complex trait and the different genes regulating immune response are known to be pathogen-specific. Despite the lack of information in this study, candidate QTL for CM were identified in the US Holstein population. PMID:25658712
Highlighting the Need for Systems-Level Experimental Characterization of Plant Metabolic Enzymes.
Engqvist, Martin K M
2016-01-01
The biology of living organisms is determined by the action and interaction of a large number of individual gene products, each with specific functions. Discovering and annotating the function of gene products is key to our understanding of these organisms. Controlled experiments and bioinformatic predictions both contribute to functional gene annotation. For most species it is difficult to gain an overview of what portion of gene annotations are based on experiments and what portion represent predictions. Here, I survey the current state of experimental knowledge of enzymes and metabolism in Arabidopsis thaliana as well as eleven economically important crops and forestry trees - with a particular focus on reactions involving organic acids in central metabolism. I illustrate the limited availability of experimental data for functional annotation of enzymes in most of these species. Many enzymes involved in metabolism of citrate, malate, fumarate, lactate, and glycolate in crops and forestry trees have not been characterized. Furthermore, enzymes involved in key biosynthetic pathways which shape important traits in crops and forestry trees have not been characterized. I argue for the development of novel high-throughput platforms with which limited functional characterization of gene products can be performed quickly and relatively cheaply. I refer to this approach as systems-level experimental characterization. The data collected from such platforms would form a layer intermediate between bioinformatic gene function predictions and in-depth experimental studies of these functions. Such a data layer would greatly aid in the pursuit of understanding a multiplicity of biological processes in living organisms.
Sikora, Klaudia M; Magee, David A; Berkowicz, Erik W; Berry, Donagh P; Howard, Dawn J; Mullen, Michael P; Evans, Ross D; Machugh, David E; Spillane, Charles
2011-01-07
Genes which are epigenetically regulated via genomic imprinting can be potential targets for artificial selection during animal breeding. Indeed, imprinted loci have been shown to underlie some important quantitative traits in domestic mammals, most notably muscle mass and fat deposition. In this candidate gene study, we have identified novel associations between six validated single nucleotide polymorphisms (SNPs) spanning a 97.6 kb region within the bovine guanine nucleotide-binding protein Gs subunit alpha gene (GNAS) domain on bovine chromosome 13 and genetic merit for a range of performance traits in 848 progeny-tested Holstein-Friesian sires. The mammalian GNAS domain consists of a number of reciprocally-imprinted, alternatively-spliced genes which can play a major role in growth, development and disease in mice and humans. Based on the current annotation of the bovine GNAS domain, four of the SNPs analysed (rs43101491, rs43101493, rs43101485 and rs43101486) were located upstream of the GNAS gene, while one SNP (rs41694646) was located in the second intron of the GNAS gene. The final SNP (rs41694656) was located in the first exon of transcripts encoding the putative bovine neuroendocrine-specific protein NESP55, resulting in an aspartic acid-to-asparagine amino acid substitution at amino acid position 192. SNP genotype-phenotype association analyses indicate that the single intronic GNAS SNP (rs41694646) is associated (P ≤ 0.05) with a range of performance traits including milk yield, milk protein yield, the content of fat and protein in milk, culled cow carcass weight and progeny carcass conformation, measures of animal body size, direct calving difficulty (i.e. difficulty in calving due to the size of the calf) and gestation length. Association (P ≤ 0.01) with direct calving difficulty (i.e. due to calf size) and maternal calving difficulty (i.e. due to the maternal pelvic width size) was also observed at the rs43101491 SNP. Following adjustment for multiple-testing, significant association (q ≤ 0.05) remained between the rs41694646 SNP and four traits (animal stature, body depth, direct calving difficulty and milk yield) only. Notably, the single SNP in the bovine NESP55 gene (rs41694656) was associated (P ≤ 0.01) with somatic cell count--an often-cited indicator of resistance to mastitis and overall health status of the mammary system--and previous studies have demonstrated that the chromosomal region to where the GNAS domain maps underlies an important quantitative trait locus for this trait. This association, however, was not significant after adjustment for multiple testing. The three remaining SNPs assayed were not associated with any of the performance traits analysed in this study. Analysis of all pairwise linkage disequilibrium (r2) values suggests that most allele substitution effects for the assayed SNPs observed are independent. Finally, the polymorphic coding SNP in the putative bovine NESP55 gene was used to test the imprinting status of this gene across a range of foetal bovine tissues. Previous studies in other mammalian species have shown that DNA sequence variation within the imprinted GNAS gene cluster contributes to several physiological and metabolic disorders, including obesity in humans and mice. Similarly, the results presented here indicate an important role for the imprinted GNAS cluster in underlying complex performance traits in cattle such as animal growth, calving, fertility and health. These findings suggest that GNAS domain-associated polymorphisms may serve as important genetic markers for future livestock breeding programs and support previous studies that candidate imprinted loci may act as molecular targets for the genetic improvement of agricultural populations. In addition, we present new evidence that the bovine NESP55 gene is epigenetically regulated as a maternally expressed imprinted gene in placental and intestinal tissues from 8-10 week old bovine foetuses.
2011-01-01
Background Genes which are epigenetically regulated via genomic imprinting can be potential targets for artificial selection during animal breeding. Indeed, imprinted loci have been shown to underlie some important quantitative traits in domestic mammals, most notably muscle mass and fat deposition. In this candidate gene study, we have identified novel associations between six validated single nucleotide polymorphisms (SNPs) spanning a 97.6 kb region within the bovine guanine nucleotide-binding protein Gs subunit alpha gene (GNAS) domain on bovine chromosome 13 and genetic merit for a range of performance traits in 848 progeny-tested Holstein-Friesian sires. The mammalian GNAS domain consists of a number of reciprocally-imprinted, alternatively-spliced genes which can play a major role in growth, development and disease in mice and humans. Based on the current annotation of the bovine GNAS domain, four of the SNPs analysed (rs43101491, rs43101493, rs43101485 and rs43101486) were located upstream of the GNAS gene, while one SNP (rs41694646) was located in the second intron of the GNAS gene. The final SNP (rs41694656) was located in the first exon of transcripts encoding the putative bovine neuroendocrine-specific protein NESP55, resulting in an aspartic acid-to-asparagine amino acid substitution at amino acid position 192. Results SNP genotype-phenotype association analyses indicate that the single intronic GNAS SNP (rs41694646) is associated (P ≤ 0.05) with a range of performance traits including milk yield, milk protein yield, the content of fat and protein in milk, culled cow carcass weight and progeny carcass conformation, measures of animal body size, direct calving difficulty (i.e. difficulty in calving due to the size of the calf) and gestation length. Association (P ≤ 0.01) with direct calving difficulty (i.e. due to calf size) and maternal calving difficulty (i.e. due to the maternal pelvic width size) was also observed at the rs43101491 SNP. Following adjustment for multiple-testing, significant association (q ≤ 0.05) remained between the rs41694646 SNP and four traits (animal stature, body depth, direct calving difficulty and milk yield) only. Notably, the single SNP in the bovine NESP55 gene (rs41694656) was associated (P ≤ 0.01) with somatic cell count--an often-cited indicator of resistance to mastitis and overall health status of the mammary system--and previous studies have demonstrated that the chromosomal region to where the GNAS domain maps underlies an important quantitative trait locus for this trait. This association, however, was not significant after adjustment for multiple testing. The three remaining SNPs assayed were not associated with any of the performance traits analysed in this study. Analysis of all pairwise linkage disequilibrium (r2) values suggests that most allele substitution effects for the assayed SNPs observed are independent. Finally, the polymorphic coding SNP in the putative bovine NESP55 gene was used to test the imprinting status of this gene across a range of foetal bovine tissues. Conclusions Previous studies in other mammalian species have shown that DNA sequence variation within the imprinted GNAS gene cluster contributes to several physiological and metabolic disorders, including obesity in humans and mice. Similarly, the results presented here indicate an important role for the imprinted GNAS cluster in underlying complex performance traits in cattle such as animal growth, calving, fertility and health. These findings suggest that GNAS domain-associated polymorphisms may serve as important genetic markers for future livestock breeding programs and support previous studies that candidate imprinted loci may act as molecular targets for the genetic improvement of agricultural populations. In addition, we present new evidence that the bovine NESP55 gene is epigenetically regulated as a maternally expressed imprinted gene in placental and intestinal tissues from 8-10 week old bovine foetuses. PMID:21214909
High precision multi-genome scale reannotation of enzyme function by EFICAz
Arakaki, Adrian K; Tian, Weidong; Skolnick, Jeffrey
2006-01-01
Background The functional annotation of most genes in newly sequenced genomes is inferred from similarity to previously characterized sequences, an annotation strategy that often leads to erroneous assignments. We have performed a reannotation of 245 genomes using an updated version of EFICAz, a highly precise method for enzyme function prediction. Results Based on our three-field EC number predictions, we have obtained lower-bound estimates for the average enzyme content in Archaea (29%), Bacteria (30%) and Eukarya (18%). Most annotations added in KEGG from 2005 to 2006 agree with EFICAz predictions made in 2005. The coverage of EFICAz predictions is significantly higher than that of KEGG, especially for eukaryotes. Thousands of our novel predictions correspond to hypothetical proteins. We have identified a subset of 64 hypothetical proteins with low sequence identity to EFICAz training enzymes, whose biochemical functions have been recently characterized and find that in 96% (84%) of the cases we correctly identified their three-field (four-field) EC numbers. For two of the 64 hypothetical proteins: PA1167 from Pseudomonas aeruginosa, an alginate lyase (EC 4.2.2.3) and Rv1700 of Mycobacterium tuberculosis H37Rv, an ADP-ribose diphosphatase (EC 3.6.1.13), we have detected annotation lag of more than two years in databases. Two examples are presented where EFICAz predictions act as hypothesis generators for understanding the functional roles of hypothetical proteins: FLJ11151, a human protein overexpressed in cancer that EFICAz identifies as an endopolyphosphatase (EC 3.6.1.10), and MW0119, a protein of Staphylococcus aureus strain MW2 that we propose as candidate virulence factor based on its EFICAz predicted activity, sphingomyelin phosphodiesterase (EC 3.1.4.12). Conclusion Our results suggest that we have generated enzyme function annotations of high precision and recall. These predictions can be mined and correlated with other information sources to generate biologically significant hypotheses and can be useful for comparative genome analysis and automated metabolic pathway reconstruction. PMID:17166279