Sample records for genomic scale analysis

  1. The Genomic HyperBrowser: an analysis web server for genome-scale data

    PubMed Central

    Sandve, Geir K.; Gundersen, Sveinung; Johansen, Morten; Glad, Ingrid K.; Gunathasan, Krishanthi; Holden, Lars; Holden, Marit; Liestøl, Knut; Nygård, Ståle; Nygaard, Vegard; Paulsen, Jonas; Rydbeck, Halfdan; Trengereid, Kai; Clancy, Trevor; Drabløs, Finn; Ferkingstad, Egil; Kalaš, Matúš; Lien, Tonje; Rye, Morten B.; Frigessi, Arnoldo; Hovig, Eivind

    2013-01-01

    The immense increase in availability of genomic scale datasets, such as those provided by the ENCODE and Roadmap Epigenomics projects, presents unprecedented opportunities for individual researchers to pose novel falsifiable biological questions. With this opportunity, however, researchers are faced with the challenge of how to best analyze and interpret their genome-scale datasets. A powerful way of representing genome-scale data is as feature-specific coordinates relative to reference genome assemblies, i.e. as genomic tracks. The Genomic HyperBrowser (http://hyperbrowser.uio.no) is an open-ended web server for the analysis of genomic track data. Through the provision of several highly customizable components for processing and statistical analysis of genomic tracks, the HyperBrowser opens for a range of genomic investigations, related to, e.g., gene regulation, disease association or epigenetic modifications of the genome. PMID:23632163

  2. The Genomic HyperBrowser: an analysis web server for genome-scale data.

    PubMed

    Sandve, Geir K; Gundersen, Sveinung; Johansen, Morten; Glad, Ingrid K; Gunathasan, Krishanthi; Holden, Lars; Holden, Marit; Liestøl, Knut; Nygård, Ståle; Nygaard, Vegard; Paulsen, Jonas; Rydbeck, Halfdan; Trengereid, Kai; Clancy, Trevor; Drabløs, Finn; Ferkingstad, Egil; Kalas, Matús; Lien, Tonje; Rye, Morten B; Frigessi, Arnoldo; Hovig, Eivind

    2013-07-01

    The immense increase in availability of genomic scale datasets, such as those provided by the ENCODE and Roadmap Epigenomics projects, presents unprecedented opportunities for individual researchers to pose novel falsifiable biological questions. With this opportunity, however, researchers are faced with the challenge of how to best analyze and interpret their genome-scale datasets. A powerful way of representing genome-scale data is as feature-specific coordinates relative to reference genome assemblies, i.e. as genomic tracks. The Genomic HyperBrowser (http://hyperbrowser.uio.no) is an open-ended web server for the analysis of genomic track data. Through the provision of several highly customizable components for processing and statistical analysis of genomic tracks, the HyperBrowser opens for a range of genomic investigations, related to, e.g., gene regulation, disease association or epigenetic modifications of the genome.

  3. A De-Novo Genome Analysis Pipeline (DeNoGAP) for large-scale comparative prokaryotic genomics studies.

    PubMed

    Thakur, Shalabh; Guttman, David S

    2016-06-30

    Comparative analysis of whole genome sequence data from closely related prokaryotic species or strains is becoming an increasingly important and accessible approach for addressing both fundamental and applied biological questions. While there are number of excellent tools developed for performing this task, most scale poorly when faced with hundreds of genome sequences, and many require extensive manual curation. We have developed a de-novo genome analysis pipeline (DeNoGAP) for the automated, iterative and high-throughput analysis of data from comparative genomics projects involving hundreds of whole genome sequences. The pipeline is designed to perform reference-assisted and de novo gene prediction, homolog protein family assignment, ortholog prediction, functional annotation, and pan-genome analysis using a range of proven tools and databases. While most existing methods scale quadratically with the number of genomes since they rely on pairwise comparisons among predicted protein sequences, DeNoGAP scales linearly since the homology assignment is based on iteratively refined hidden Markov models. This iterative clustering strategy enables DeNoGAP to handle a very large number of genomes using minimal computational resources. Moreover, the modular structure of the pipeline permits easy updates as new analysis programs become available. DeNoGAP integrates bioinformatics tools and databases for comparative analysis of a large number of genomes. The pipeline offers tools and algorithms for annotation and analysis of completed and draft genome sequences. The pipeline is developed using Perl, BioPerl and SQLite on Ubuntu Linux version 12.04 LTS. Currently, the software package accompanies script for automated installation of necessary external programs on Ubuntu Linux; however, the pipeline should be also compatible with other Linux and Unix systems after necessary external programs are installed. DeNoGAP is freely available at https://sourceforge.net/projects/denogap/ .

  4. Ensembl Genomes 2013: scaling up access to genome-wide data

    USDA-ARS?s Scientific Manuscript database

    Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species. The project exploits and extends technologies for genome annotation, analysis and dissemination, developed in the context of the vertebrate-focused Ensembl project, and provi...

  5. Scalable Parameter Estimation for Genome-Scale Biochemical Reaction Networks

    PubMed Central

    Kaltenbacher, Barbara; Hasenauer, Jan

    2017-01-01

    Mechanistic mathematical modeling of biochemical reaction networks using ordinary differential equation (ODE) models has improved our understanding of small- and medium-scale biological processes. While the same should in principle hold for large- and genome-scale processes, the computational methods for the analysis of ODE models which describe hundreds or thousands of biochemical species and reactions are missing so far. While individual simulations are feasible, the inference of the model parameters from experimental data is computationally too intensive. In this manuscript, we evaluate adjoint sensitivity analysis for parameter estimation in large scale biochemical reaction networks. We present the approach for time-discrete measurement and compare it to state-of-the-art methods used in systems and computational biology. Our comparison reveals a significantly improved computational efficiency and a superior scalability of adjoint sensitivity analysis. The computational complexity is effectively independent of the number of parameters, enabling the analysis of large- and genome-scale models. Our study of a comprehensive kinetic model of ErbB signaling shows that parameter estimation using adjoint sensitivity analysis requires a fraction of the computation time of established methods. The proposed method will facilitate mechanistic modeling of genome-scale cellular processes, as required in the age of omics. PMID:28114351

  6. Fungal Genomics Program

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Grigoriev, Igor

    The JGI Fungal Genomics Program aims to scale up sequencing and analysis of fungal genomes to explore the diversity of fungi important for energy and the environment, and to promote functional studies on a system level. Combining new sequencing technologies and comparative genomics tools, JGI is now leading the world in fungal genome sequencing and analysis. Over 120 sequenced fungal genomes with analytical tools are available via MycoCosm (www.jgi.doe.gov/fungi), a web-portal for fungal biologists. Our model of interacting with user communities, unique among other sequencing centers, helps organize these communities, improves genome annotation and analysis work, and facilitates new larger-scalemore » genomic projects. This resulted in 20 high-profile papers published in 2011 alone and contributing to the Genomics Encyclopedia of Fungi, which targets fungi related to plant health (symbionts, pathogens, and biocontrol agents) and biorefinery processes (cellulose degradation, sugar fermentation, industrial hosts). Our next grand challenges include larger scale exploration of fungal diversity (1000 fungal genomes), developing molecular tools for DOE-relevant model organisms, and analysis of complex systems and metagenomes.« less

  7. Home - The Cancer Genome Atlas - Cancer Genome - TCGA

    Cancer.gov

    The Cancer Genome Atlas (TCGA) is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing.

  8. BiGG: a Biochemical Genetic and Genomic knowledgebase of large scale metabolic reconstructions

    PubMed Central

    2010-01-01

    Background Genome-scale metabolic reconstructions under the Constraint Based Reconstruction and Analysis (COBRA) framework are valuable tools for analyzing the metabolic capabilities of organisms and interpreting experimental data. As the number of such reconstructions and analysis methods increases, there is a greater need for data uniformity and ease of distribution and use. Description We describe BiGG, a knowledgebase of Biochemically, Genetically and Genomically structured genome-scale metabolic network reconstructions. BiGG integrates several published genome-scale metabolic networks into one resource with standard nomenclature which allows components to be compared across different organisms. BiGG can be used to browse model content, visualize metabolic pathway maps, and export SBML files of the models for further analysis by external software packages. Users may follow links from BiGG to several external databases to obtain additional information on genes, proteins, reactions, metabolites and citations of interest. Conclusions BiGG addresses a need in the systems biology community to have access to high quality curated metabolic models and reconstructions. It is freely available for academic use at http://bigg.ucsd.edu. PMID:20426874

  9. Churchill: an ultra-fast, deterministic, highly scalable and balanced parallelization strategy for the discovery of human genetic variation in clinical and population-scale genomics.

    PubMed

    Kelly, Benjamin J; Fitch, James R; Hu, Yangqiu; Corsmeier, Donald J; Zhong, Huachun; Wetzel, Amy N; Nordquist, Russell D; Newsom, David L; White, Peter

    2015-01-20

    While advances in genome sequencing technology make population-scale genomics a possibility, current approaches for analysis of these data rely upon parallelization strategies that have limited scalability, complex implementation and lack reproducibility. Churchill, a balanced regional parallelization strategy, overcomes these challenges, fully automating the multiple steps required to go from raw sequencing reads to variant discovery. Through implementation of novel deterministic parallelization techniques, Churchill allows computationally efficient analysis of a high-depth whole genome sample in less than two hours. The method is highly scalable, enabling full analysis of the 1000 Genomes raw sequence dataset in a week using cloud resources. http://churchill.nchri.org/.

  10. Genome-scale approaches to the epigenetics of common human disease

    PubMed Central

    2011-01-01

    Traditionally, the pathology of human disease has been focused on microscopic examination of affected tissues, chemical and biochemical analysis of biopsy samples, other available samples of convenience, such as blood, and noninvasive or invasive imaging of varying complexity, in order to classify disease and illuminate its mechanistic basis. The molecular age has complemented this armamentarium with gene expression arrays and selective analysis of individual genes. However, we are entering a new era of epigenomic profiling, i.e., genome-scale analysis of cell-heritable nonsequence genetic change, such as DNA methylation. The epigenome offers access to stable measurements of cellular state and to biobanked material for large-scale epidemiological studies. Some of these genome-scale technologies are beginning to be applied to create the new field of epigenetic epidemiology. PMID:19844740

  11. Ensembl Genomes: an integrative resource for genome-scale data from non-vertebrate species.

    PubMed

    Kersey, Paul J; Staines, Daniel M; Lawson, Daniel; Kulesha, Eugene; Derwent, Paul; Humphrey, Jay C; Hughes, Daniel S T; Keenan, Stephan; Kerhornou, Arnaud; Koscielny, Gautier; Langridge, Nicholas; McDowall, Mark D; Megy, Karine; Maheswari, Uma; Nuhn, Michael; Paulini, Michael; Pedro, Helder; Toneva, Iliana; Wilson, Derek; Yates, Andrew; Birney, Ewan

    2012-01-01

    Ensembl Genomes (http://www.ensemblgenomes.org) is an integrative resource for genome-scale data from non-vertebrate species. The project exploits and extends technology (for genome annotation, analysis and dissemination) developed in the context of the (vertebrate-focused) Ensembl project and provides a complementary set of resources for non-vertebrate species through a consistent set of programmatic and interactive interfaces. These provide access to data including reference sequence, gene models, transcriptional data, polymorphisms and comparative analysis. Since its launch in 2009, Ensembl Genomes has undergone rapid expansion, with the goal of providing coverage of all major experimental organisms, and additionally including taxonomic reference points to provide the evolutionary context in which genes can be understood. Against the backdrop of a continuing increase in genome sequencing activities in all parts of the tree of life, we seek to work, wherever possible, with the communities actively generating and using data, and are participants in a growing range of collaborations involved in the annotation and analysis of genomes.

  12. A review of genome-wide approaches to study the genetic basis for spermatogenic defects.

    PubMed

    Aston, Kenneth I; Conrad, Donald F

    2013-01-01

    Rapidly advancing tools for genetic analysis on a genome-wide scale have been instrumental in identifying the genetic bases for many complex diseases. About half of male infertility cases are of unknown etiology in spite of tremendous efforts to characterize the genetic basis for the disorder. Advancing our understanding of the genetic basis for male infertility will require the application of established and emerging genomic tools. This chapter introduces many of the tools available for genetic studies on a genome-wide scale along with principles of study design and data analysis.

  13. The OME Framework for genome-scale systems biology

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Palsson, Bernhard O.; Ebrahim, Ali; Federowicz, Steve

    The life sciences are undergoing continuous and accelerating integration with computational and engineering sciences. The biology that many in the field have been trained on may be hardly recognizable in ten to twenty years. One of the major drivers for this transformation is the blistering pace of advancements in DNA sequencing and synthesis. These advances have resulted in unprecedented amounts of new data, information, and knowledge. Many software tools have been developed to deal with aspects of this transformation and each is sorely needed [1-3]. However, few of these tools have been forced to deal with the full complexity ofmore » genome-scale models along with high throughput genome- scale data. This particular situation represents a unique challenge, as it is simultaneously necessary to deal with the vast breadth of genome-scale models and the dizzying depth of high-throughput datasets. It has been observed time and again that as the pace of data generation continues to accelerate, the pace of analysis significantly lags behind [4]. It is also evident that, given the plethora of databases and software efforts [5-12], it is still a significant challenge to work with genome-scale metabolic models, let alone next-generation whole cell models [13-15]. We work at the forefront of model creation and systems scale data generation [16-18]. The OME Framework was borne out of a practical need to enable genome-scale modeling and data analysis under a unified framework to drive the next generation of genome-scale biological models. Here we present the OME Framework. It exists as a set of Python classes. However, we want to emphasize the importance of the underlying design as an addition to the discussions on specifications of a digital cell. A great deal of work and valuable progress has been made by a number of communities [13, 19-24] towards interchange formats and implementations designed to achieve similar goals. While many software tools exist for handling genome-scale metabolic models or for genome-scale data analysis, no implementations exist that explicitly handle data and models concurrently. The OME Framework structures data in a connected loop with models and the components those models are composed of. This results in the first full, practical implementation of a framework that can enable genome-scale design-build-test. Over the coming years many more software packages will be developed and tools will necessarily change. However, we hope that the underlying designs shared here can help to inform the design of future software.« less

  14. Harnessing Whole Genome Sequencing in Medical Mycology.

    PubMed

    Cuomo, Christina A

    2017-01-01

    Comparative genome sequencing studies of human fungal pathogens enable identification of genes and variants associated with virulence and drug resistance. This review describes current approaches, resources, and advances in applying whole genome sequencing to study clinically important fungal pathogens. Genomes for some important fungal pathogens were only recently assembled, revealing gene family expansions in many species and extreme gene loss in one obligate species. The scale and scope of species sequenced is rapidly expanding, leveraging technological advances to assemble and annotate genomes with higher precision. By using iteratively improved reference assemblies or those generated de novo for new species, recent studies have compared the sequence of isolates representing populations or clinical cohorts. Whole genome approaches provide the resolution necessary for comparison of closely related isolates, for example, in the analysis of outbreaks or sampled across time within a single host. Genomic analysis of fungal pathogens has enabled both basic research and diagnostic studies. The increased scale of sequencing can be applied across populations, and new metagenomic methods allow direct analysis of complex samples.

  15. Phylogenomic Reconstruction of the Oomycete Phylogeny Derived from 37 Genomes

    PubMed Central

    McCarthy, Charley G. P.

    2017-01-01

    ABSTRACT The oomycetes are a class of microscopic, filamentous eukaryotes within the Stramenopiles-Alveolata-Rhizaria (SAR) supergroup which includes ecologically significant animal and plant pathogens, most infamously the causative agent of potato blight Phytophthora infestans. Single-gene and concatenated phylogenetic studies both of individual oomycete genera and of members of the larger class have resulted in conflicting conclusions concerning species phylogenies within the oomycetes, particularly for the large Phytophthora genus. Genome-scale phylogenetic studies have successfully resolved many eukaryotic relationships by using supertree methods, which combine large numbers of potentially disparate trees to determine evolutionary relationships that cannot be inferred from individual phylogenies alone. With a sufficient amount of genomic data now available, we have undertaken the first whole-genome phylogenetic analysis of the oomycetes using data from 37 oomycete species and 6 SAR species. In our analysis, we used established supertree methods to generate phylogenies from 8,355 homologous oomycete and SAR gene families and have complemented those analyses with both phylogenomic network and concatenated supermatrix analyses. Our results show that a genome-scale approach to oomycete phylogeny resolves oomycete classes and individual clades within the problematic Phytophthora genus. Support for the resolution of the inferred relationships between individual Phytophthora clades varies depending on the methodology used. Our analysis represents an important first step in large-scale phylogenomic analysis of the oomycetes. IMPORTANCE The oomycetes are a class of eukaryotes and include ecologically significant animal and plant pathogens. Single-gene and multigene phylogenetic studies of individual oomycete genera and of members of the larger classes have resulted in conflicting conclusions concerning interspecies relationships among these species, particularly for the Phytophthora genus. The onset of next-generation sequencing techniques now means that a wealth of oomycete genomic data is available. For the first time, we have used genome-scale phylogenetic methods to resolve oomycete phylogenetic relationships. We used supertree methods to generate single-gene and multigene species phylogenies. Overall, our supertree analyses utilized phylogenetic data from 8,355 oomycete gene families. We have also complemented our analyses with superalignment phylogenies derived from 131 single-copy ubiquitous gene families. Our results show that a genome-scale approach to oomycete phylogeny resolves oomycete classes and clades. Our analysis represents an important first step in large-scale phylogenomic analysis of the oomycetes. PMID:28435885

  16. iAK692: A genome-scale metabolic model of Spirulina platensis C1

    PubMed Central

    2012-01-01

    Background Spirulina (Arthrospira) platensis is a well-known filamentous cyanobacterium used in the production of many industrial products, including high value compounds, healthy food supplements, animal feeds, pharmaceuticals and cosmetics, for example. It has been increasingly studied around the world for scientific purposes, especially for its genome, biology, physiology, and also for the analysis of its small-scale metabolic network. However, the overall description of the metabolic and biotechnological capabilities of S. platensis requires the development of a whole cellular metabolism model. Recently, the S. platensis C1 (Arthrospira sp. PCC9438) genome sequence has become available, allowing systems-level studies of this commercial cyanobacterium. Results In this work, we present the genome-scale metabolic network analysis of S. platensis C1, iAK692, its topological properties, and its metabolic capabilities and functions. The network was reconstructed from the S. platensis C1 annotated genomic sequence using Pathway Tools software to generate a preliminary network. Then, manual curation was performed based on a collective knowledge base and a combination of genomic, biochemical, and physiological information. The genome-scale metabolic model consists of 692 genes, 837 metabolites, and 875 reactions. We validated iAK692 by conducting fermentation experiments and simulating the model under autotrophic, heterotrophic, and mixotrophic growth conditions using COBRA toolbox. The model predictions under these growth conditions were consistent with the experimental results. The iAK692 model was further used to predict the unique active reactions and essential genes for each growth condition. Additionally, the metabolic states of iAK692 during autotrophic and mixotrophic growths were described by phenotypic phase plane (PhPP) analysis. Conclusions This study proposes the first genome-scale model of S. platensis C1, iAK692, which is a predictive metabolic platform for a global understanding of physiological behaviors and metabolic engineering. This platform could accelerate the integrative analysis of various “-omics” data, leading to strain improvement towards a diverse range of desired industrial products from Spirulina. PMID:22703714

  17. iAK692: a genome-scale metabolic model of Spirulina platensis C1.

    PubMed

    Klanchui, Amornpan; Khannapho, Chiraphan; Phodee, Atchara; Cheevadhanarak, Supapon; Meechai, Asawin

    2012-06-15

    Spirulina (Arthrospira) platensis is a well-known filamentous cyanobacterium used in the production of many industrial products, including high value compounds, healthy food supplements, animal feeds, pharmaceuticals and cosmetics, for example. It has been increasingly studied around the world for scientific purposes, especially for its genome, biology, physiology, and also for the analysis of its small-scale metabolic network. However, the overall description of the metabolic and biotechnological capabilities of S. platensis requires the development of a whole cellular metabolism model. Recently, the S. platensis C1 (Arthrospira sp. PCC9438) genome sequence has become available, allowing systems-level studies of this commercial cyanobacterium. In this work, we present the genome-scale metabolic network analysis of S. platensis C1, iAK692, its topological properties, and its metabolic capabilities and functions. The network was reconstructed from the S. platensis C1 annotated genomic sequence using Pathway Tools software to generate a preliminary network. Then, manual curation was performed based on a collective knowledge base and a combination of genomic, biochemical, and physiological information. The genome-scale metabolic model consists of 692 genes, 837 metabolites, and 875 reactions. We validated iAK692 by conducting fermentation experiments and simulating the model under autotrophic, heterotrophic, and mixotrophic growth conditions using COBRA toolbox. The model predictions under these growth conditions were consistent with the experimental results. The iAK692 model was further used to predict the unique active reactions and essential genes for each growth condition. Additionally, the metabolic states of iAK692 during autotrophic and mixotrophic growths were described by phenotypic phase plane (PhPP) analysis. This study proposes the first genome-scale model of S. platensis C1, iAK692, which is a predictive metabolic platform for a global understanding of physiological behaviors and metabolic engineering. This platform could accelerate the integrative analysis of various "-omics" data, leading to strain improvement towards a diverse range of desired industrial products from Spirulina.

  18. GIGGLE: a search engine for large-scale integrated genome analysis.

    PubMed

    Layer, Ryan M; Pedersen, Brent S; DiSera, Tonya; Marth, Gabor T; Gertz, Jason; Quinlan, Aaron R

    2018-02-01

    GIGGLE is a genomics search engine that identifies and ranks the significance of genomic loci shared between query features and thousands of genome interval files. GIGGLE (https://github.com/ryanlayer/giggle) scales to billions of intervals and is over three orders of magnitude faster than existing methods. Its speed extends the accessibility and utility of resources such as ENCODE, Roadmap Epigenomics, and GTEx by facilitating data integration and hypothesis generation.

  19. GIGGLE: a search engine for large-scale integrated genome analysis

    PubMed Central

    Layer, Ryan M; Pedersen, Brent S; DiSera, Tonya; Marth, Gabor T; Gertz, Jason; Quinlan, Aaron R

    2018-01-01

    GIGGLE is a genomics search engine that identifies and ranks the significance of genomic loci shared between query features and thousands of genome interval files. GIGGLE (https://github.com/ryanlayer/giggle) scales to billions of intervals and is over three orders of magnitude faster than existing methods. Its speed extends the accessibility and utility of resources such as ENCODE, Roadmap Epigenomics, and GTEx by facilitating data integration and hypothesis generation. PMID:29309061

  20. Genome-Wide Fine-Scale Recombination Rate Variation in Drosophila melanogaster

    PubMed Central

    Song, Yun S.

    2012-01-01

    Estimating fine-scale recombination maps of Drosophila from population genomic data is a challenging problem, in particular because of the high background recombination rate. In this paper, a new computational method is developed to address this challenge. Through an extensive simulation study, it is demonstrated that the method allows more accurate inference, and exhibits greater robustness to the effects of natural selection and noise, compared to a well-used previous method developed for studying fine-scale recombination rate variation in the human genome. As an application, a genome-wide analysis of genetic variation data is performed for two Drosophila melanogaster populations, one from North America (Raleigh, USA) and the other from Africa (Gikongoro, Rwanda). It is shown that fine-scale recombination rate variation is widespread throughout the D. melanogaster genome, across all chromosomes and in both populations. At the fine-scale, a conservative, systematic search for evidence of recombination hotspots suggests the existence of a handful of putative hotspots each with at least a tenfold increase in intensity over the background rate. A wavelet analysis is carried out to compare the estimated recombination maps in the two populations and to quantify the extent to which recombination rates are conserved. In general, similarity is observed at very broad scales, but substantial differences are seen at fine scales. The average recombination rate of the X chromosome appears to be higher than that of the autosomes in both populations, and this pattern is much more pronounced in the African population than the North American population. The correlation between various genomic features—including recombination rates, diversity, divergence, GC content, gene content, and sequence quality—is examined using the wavelet analysis, and it is shown that the most notable difference between D. melanogaster and humans is in the correlation between recombination and diversity. PMID:23284288

  1. Translational bioinformatics in the cloud: an affordable alternative

    PubMed Central

    2010-01-01

    With the continued exponential expansion of publicly available genomic data and access to low-cost, high-throughput molecular technologies for profiling patient populations, computational technologies and informatics are becoming vital considerations in genomic medicine. Although cloud computing technology is being heralded as a key enabling technology for the future of genomic research, available case studies are limited to applications in the domain of high-throughput sequence data analysis. The goal of this study was to evaluate the computational and economic characteristics of cloud computing in performing a large-scale data integration and analysis representative of research problems in genomic medicine. We find that the cloud-based analysis compares favorably in both performance and cost in comparison to a local computational cluster, suggesting that cloud computing technologies might be a viable resource for facilitating large-scale translational research in genomic medicine. PMID:20691073

  2. A case study for cloud based high throughput analysis of NGS data using the globus genomics system

    PubMed Central

    Bhuvaneshwar, Krithika; Sulakhe, Dinanath; Gauba, Robinder; Rodriguez, Alex; Madduri, Ravi; Dave, Utpal; Lacinski, Lukasz; Foster, Ian; Gusev, Yuriy; Madhavan, Subha

    2014-01-01

    Next generation sequencing (NGS) technologies produce massive amounts of data requiring a powerful computational infrastructure, high quality bioinformatics software, and skilled personnel to operate the tools. We present a case study of a practical solution to this data management and analysis challenge that simplifies terabyte scale data handling and provides advanced tools for NGS data analysis. These capabilities are implemented using the “Globus Genomics” system, which is an enhanced Galaxy workflow system made available as a service that offers users the capability to process and transfer data easily, reliably and quickly to address end-to-endNGS analysis requirements. The Globus Genomics system is built on Amazon 's cloud computing infrastructure. The system takes advantage of elastic scaling of compute resources to run multiple workflows in parallel and it also helps meet the scale-out analysis needs of modern translational genomics research. PMID:26925205

  3. Major soybean maturity gene haplotypes revealed by SNPViz analysis of 72 sequenced soybean genomes

    USDA-ARS?s Scientific Manuscript database

    In this Genomics Era, vast amounts of next generation sequencing data have become publicly-available for multiple genomes across hundreds of species. Analysis of these large-scale datasets can become cumbersome, especially when comparing nucleotide polymorphisms across many samples within a dataset...

  4. [Genome-scale sequence data processing and epigenetic analysis of DNA methylation].

    PubMed

    Wang, Ting-Zhang; Shan, Gao; Xu, Jian-Hong; Xue, Qing-Zhong

    2013-06-01

    A new approach recently developed for detecting cytosine DNA methylation (mC) and analyzing the genome-scale DNA methylation profiling, is called BS-Seq which is based on bisulfite conversion of genomic DNA combined with next-generation sequencing. The method can not only provide an insight into the difference of genome-scale DNA methylation among different organisms, but also reveal the conservation of DNA methylation in all contexts and nucleotide preference for different genomic regions, including genes, exons, and repetitive DNA sequences. It will be helpful to under-stand the epigenetic impacts of cytosine DNA methylation on the regulation of gene expression and maintaining silence of repetitive sequences, such as transposable elements. In this paper, we introduce the preprocessing steps of DNA methylation data, by which cytosine (C) and guanine (G) in the reference sequence are transferred to thymine (T) and adenine (A), and cytosine in reads is transferred to thymine, respectively. We also comprehensively review the main content of the DNA methylation analysis on the genomic scale: (1) the cytosine methylation under the context of different sequences; (2) the distribution of genomic methylcytosine; (3) DNA methylation context and the preference for the nucleotides; (4) DNA- protein interaction sites of DNA methylation; (5) degree of methylation of cytosine in the different structural elements of genes. DNA methylation analysis technique provides a powerful tool for the epigenome study in human and other species, and genes and environment interaction, and founds the theoretical basis for further development of disease diagnostics and therapeutics in human.

  5. BactoGeNIE: A large-scale comparative genome visualization for big displays

    DOE PAGES

    Aurisano, Jillian; Reda, Khairi; Johnson, Andrew; ...

    2015-08-13

    The volume of complete bacterial genome sequence data available to comparative genomics researchers is rapidly increasing. However, visualizations in comparative genomics--which aim to enable analysis tasks across collections of genomes--suffer from visual scalability issues. While large, multi-tiled and high-resolution displays have the potential to address scalability issues, new approaches are needed to take advantage of such environments, in order to enable the effective visual analysis of large genomics datasets. In this paper, we present Bacterial Gene Neighborhood Investigation Environment, or BactoGeNIE, a novel and visually scalable design for comparative gene neighborhood analysis on large display environments. We evaluate BactoGeNIE throughmore » a case study on close to 700 draft Escherichia coli genomes, and present lessons learned from our design process. In conclusion, BactoGeNIE accommodates comparative tasks over substantially larger collections of neighborhoods than existing tools and explicitly addresses visual scalability. Given current trends in data generation, scalable designs of this type may inform visualization design for large-scale comparative research problems in genomics.« less

  6. BactoGeNIE: a large-scale comparative genome visualization for big displays

    PubMed Central

    2015-01-01

    Background The volume of complete bacterial genome sequence data available to comparative genomics researchers is rapidly increasing. However, visualizations in comparative genomics--which aim to enable analysis tasks across collections of genomes--suffer from visual scalability issues. While large, multi-tiled and high-resolution displays have the potential to address scalability issues, new approaches are needed to take advantage of such environments, in order to enable the effective visual analysis of large genomics datasets. Results In this paper, we present Bacterial Gene Neighborhood Investigation Environment, or BactoGeNIE, a novel and visually scalable design for comparative gene neighborhood analysis on large display environments. We evaluate BactoGeNIE through a case study on close to 700 draft Escherichia coli genomes, and present lessons learned from our design process. Conclusions BactoGeNIE accommodates comparative tasks over substantially larger collections of neighborhoods than existing tools and explicitly addresses visual scalability. Given current trends in data generation, scalable designs of this type may inform visualization design for large-scale comparative research problems in genomics. PMID:26329021

  7. BactoGeNIE: A large-scale comparative genome visualization for big displays

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Aurisano, Jillian; Reda, Khairi; Johnson, Andrew

    The volume of complete bacterial genome sequence data available to comparative genomics researchers is rapidly increasing. However, visualizations in comparative genomics--which aim to enable analysis tasks across collections of genomes--suffer from visual scalability issues. While large, multi-tiled and high-resolution displays have the potential to address scalability issues, new approaches are needed to take advantage of such environments, in order to enable the effective visual analysis of large genomics datasets. In this paper, we present Bacterial Gene Neighborhood Investigation Environment, or BactoGeNIE, a novel and visually scalable design for comparative gene neighborhood analysis on large display environments. We evaluate BactoGeNIE throughmore » a case study on close to 700 draft Escherichia coli genomes, and present lessons learned from our design process. In conclusion, BactoGeNIE accommodates comparative tasks over substantially larger collections of neighborhoods than existing tools and explicitly addresses visual scalability. Given current trends in data generation, scalable designs of this type may inform visualization design for large-scale comparative research problems in genomics.« less

  8. The Use of Weighted Graphs for Large-Scale Genome Analysis

    PubMed Central

    Zhou, Fang; Toivonen, Hannu; King, Ross D.

    2014-01-01

    There is an acute need for better tools to extract knowledge from the growing flood of sequence data. For example, thousands of complete genomes have been sequenced, and their metabolic networks inferred. Such data should enable a better understanding of evolution. However, most existing network analysis methods are based on pair-wise comparisons, and these do not scale to thousands of genomes. Here we propose the use of weighted graphs as a data structure to enable large-scale phylogenetic analysis of networks. We have developed three types of weighted graph for enzymes: taxonomic (these summarize phylogenetic importance), isoenzymatic (these summarize enzymatic variety/redundancy), and sequence-similarity (these summarize sequence conservation); and we applied these types of weighted graph to survey prokaryotic metabolism. To demonstrate the utility of this approach we have compared and contrasted the large-scale evolution of metabolism in Archaea and Eubacteria. Our results provide evidence for limits to the contingency of evolution. PMID:24619061

  9. About TCGA - TCGA

    Cancer.gov

    Find out about The Cancer Genome Atlas (TCGA) is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing.

  10. Theobroma cacao: A genetically integrated physical map and genome-scale comparative synteny analysis

    USDA-ARS?s Scientific Manuscript database

    A comprehensive integrated genomic framework is considered a centerpiece of genomic research. In collaboration with the USDA-ARS (SHRS) and Mars Inc., the Clemson University Genomics Institute (CUGI) has developed a genetically anchored physical map of the T. cacao genome. Three BAC libraries contai...

  11. Ensembl Genomes 2013: scaling up access to genome-wide data.

    PubMed

    Kersey, Paul Julian; Allen, James E; Christensen, Mikkel; Davis, Paul; Falin, Lee J; Grabmueller, Christoph; Hughes, Daniel Seth Toney; Humphrey, Jay; Kerhornou, Arnaud; Khobova, Julia; Langridge, Nicholas; McDowall, Mark D; Maheswari, Uma; Maslen, Gareth; Nuhn, Michael; Ong, Chuang Kee; Paulini, Michael; Pedro, Helder; Toneva, Iliana; Tuli, Mary Ann; Walts, Brandon; Williams, Gareth; Wilson, Derek; Youens-Clark, Ken; Monaco, Marcela K; Stein, Joshua; Wei, Xuehong; Ware, Doreen; Bolser, Daniel M; Howe, Kevin Lee; Kulesha, Eugene; Lawson, Daniel; Staines, Daniel Michael

    2014-01-01

    Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species. The project exploits and extends technologies for genome annotation, analysis and dissemination, developed in the context of the vertebrate-focused Ensembl project, and provides a complementary set of resources for non-vertebrate species through a consistent set of programmatic and interactive interfaces. These provide access to data including reference sequence, gene models, transcriptional data, polymorphisms and comparative analysis. This article provides an update to the previous publications about the resource, with a focus on recent developments. These include the addition of important new genomes (and related data sets) including crop plants, vectors of human disease and eukaryotic pathogens. In addition, the resource has scaled up its representation of bacterial genomes, and now includes the genomes of over 9000 bacteria. Specific extensions to the web and programmatic interfaces have been developed to support users in navigating these large data sets. Looking forward, analytic tools to allow targeted selection of data for visualization and download are likely to become increasingly important in future as the number of available genomes increases within all domains of life, and some of the challenges faced in representing bacterial data are likely to become commonplace for eukaryotes in future.

  12. Globus | Informatics Technology for Cancer Research (ITCR)

    Cancer.gov

    Globus software services provide secure cancer research data transfer, synchronization, and sharing in distributed environments at large scale. These services can be integrated into applications and research data gateways, leveraging Globus identity management, single sign-on, search, and authorization capabilities. Globus Genomics integrates Globus with the Galaxy genomics workflow engine and Amazon Web Services to enable cancer genomics analysis that can elastically scale compute resources with demand.

  13. Using relational databases for improved sequence similarity searching and large-scale genomic analyses.

    PubMed

    Mackey, Aaron J; Pearson, William R

    2004-10-01

    Relational databases are designed to integrate diverse types of information and manage large sets of search results, greatly simplifying genome-scale analyses. Relational databases are essential for management and analysis of large-scale sequence analyses, and can also be used to improve the statistical significance of similarity searches by focusing on subsets of sequence libraries most likely to contain homologs. This unit describes using relational databases to improve the efficiency of sequence similarity searching and to demonstrate various large-scale genomic analyses of homology-related data. This unit describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. These include basic use of the database to generate a novel sequence library subset, how to extend and use seqdb_demo for the storage of sequence similarity search results and making use of various kinds of stored search results to address aspects of comparative genomic analysis.

  14. Guidelines for Genome-Scale Analysis of Biological Rhythms.

    PubMed

    Hughes, Michael E; Abruzzi, Katherine C; Allada, Ravi; Anafi, Ron; Arpat, Alaaddin Bulak; Asher, Gad; Baldi, Pierre; de Bekker, Charissa; Bell-Pedersen, Deborah; Blau, Justin; Brown, Steve; Ceriani, M Fernanda; Chen, Zheng; Chiu, Joanna C; Cox, Juergen; Crowell, Alexander M; DeBruyne, Jason P; Dijk, Derk-Jan; DiTacchio, Luciano; Doyle, Francis J; Duffield, Giles E; Dunlap, Jay C; Eckel-Mahan, Kristin; Esser, Karyn A; FitzGerald, Garret A; Forger, Daniel B; Francey, Lauren J; Fu, Ying-Hui; Gachon, Frédéric; Gatfield, David; de Goede, Paul; Golden, Susan S; Green, Carla; Harer, John; Harmer, Stacey; Haspel, Jeff; Hastings, Michael H; Herzel, Hanspeter; Herzog, Erik D; Hoffmann, Christy; Hong, Christian; Hughey, Jacob J; Hurley, Jennifer M; de la Iglesia, Horacio O; Johnson, Carl; Kay, Steve A; Koike, Nobuya; Kornacker, Karl; Kramer, Achim; Lamia, Katja; Leise, Tanya; Lewis, Scott A; Li, Jiajia; Li, Xiaodong; Liu, Andrew C; Loros, Jennifer J; Martino, Tami A; Menet, Jerome S; Merrow, Martha; Millar, Andrew J; Mockler, Todd; Naef, Felix; Nagoshi, Emi; Nitabach, Michael N; Olmedo, Maria; Nusinow, Dmitri A; Ptáček, Louis J; Rand, David; Reddy, Akhilesh B; Robles, Maria S; Roenneberg, Till; Rosbash, Michael; Ruben, Marc D; Rund, Samuel S C; Sancar, Aziz; Sassone-Corsi, Paolo; Sehgal, Amita; Sherrill-Mix, Scott; Skene, Debra J; Storch, Kai-Florian; Takahashi, Joseph S; Ueda, Hiroki R; Wang, Han; Weitz, Charles; Westermark, Pål O; Wijnen, Herman; Xu, Ying; Wu, Gang; Yoo, Seung-Hee; Young, Michael; Zhang, Eric Erquan; Zielinski, Tomasz; Hogenesch, John B

    2017-10-01

    Genome biology approaches have made enormous contributions to our understanding of biological rhythms, particularly in identifying outputs of the clock, including RNAs, proteins, and metabolites, whose abundance oscillates throughout the day. These methods hold significant promise for future discovery, particularly when combined with computational modeling. However, genome-scale experiments are costly and laborious, yielding "big data" that are conceptually and statistically difficult to analyze. There is no obvious consensus regarding design or analysis. Here we discuss the relevant technical considerations to generate reproducible, statistically sound, and broadly useful genome-scale data. Rather than suggest a set of rigid rules, we aim to codify principles by which investigators, reviewers, and readers of the primary literature can evaluate the suitability of different experimental designs for measuring different aspects of biological rhythms. We introduce CircaInSilico, a web-based application for generating synthetic genome biology data to benchmark statistical methods for studying biological rhythms. Finally, we discuss several unmet analytical needs, including applications to clinical medicine, and suggest productive avenues to address them.

  15. Guidelines for Genome-Scale Analysis of Biological Rhythms

    PubMed Central

    Hughes, Michael E.; Abruzzi, Katherine C.; Allada, Ravi; Anafi, Ron; Arpat, Alaaddin Bulak; Asher, Gad; Baldi, Pierre; de Bekker, Charissa; Bell-Pedersen, Deborah; Blau, Justin; Brown, Steve; Ceriani, M. Fernanda; Chen, Zheng; Chiu, Joanna C.; Cox, Juergen; Crowell, Alexander M.; DeBruyne, Jason P.; Dijk, Derk-Jan; DiTacchio, Luciano; Doyle, Francis J.; Duffield, Giles E.; Dunlap, Jay C.; Eckel-Mahan, Kristin; Esser, Karyn A.; FitzGerald, Garret A.; Forger, Daniel B.; Francey, Lauren J.; Fu, Ying-Hui; Gachon, Frédéric; Gatfield, David; de Goede, Paul; Golden, Susan S.; Green, Carla; Harer, John; Harmer, Stacey; Haspel, Jeff; Hastings, Michael H.; Herzel, Hanspeter; Herzog, Erik D.; Hoffmann, Christy; Hong, Christian; Hughey, Jacob J.; Hurley, Jennifer M.; de la Iglesia, Horacio O.; Johnson, Carl; Kay, Steve A.; Koike, Nobuya; Kornacker, Karl; Kramer, Achim; Lamia, Katja; Leise, Tanya; Lewis, Scott A.; Li, Jiajia; Li, Xiaodong; Liu, Andrew C.; Loros, Jennifer J.; Martino, Tami A.; Menet, Jerome S.; Merrow, Martha; Millar, Andrew J.; Mockler, Todd; Naef, Felix; Nagoshi, Emi; Nitabach, Michael N.; Olmedo, Maria; Nusinow, Dmitri A.; Ptáček, Louis J.; Rand, David; Reddy, Akhilesh B.; Robles, Maria S.; Roenneberg, Till; Rosbash, Michael; Ruben, Marc D.; Rund, Samuel S.C.; Sancar, Aziz; Sassone-Corsi, Paolo; Sehgal, Amita; Sherrill-Mix, Scott; Skene, Debra J.; Storch, Kai-Florian; Takahashi, Joseph S.; Ueda, Hiroki R.; Wang, Han; Weitz, Charles; Westermark, Pål O.; Wijnen, Herman; Xu, Ying; Wu, Gang; Yoo, Seung-Hee; Young, Michael; Zhang, Eric Erquan; Zielinski, Tomasz; Hogenesch, John B.

    2017-01-01

    Genome biology approaches have made enormous contributions to our understanding of biological rhythms, particularly in identifying outputs of the clock, including RNAs, proteins, and metabolites, whose abundance oscillates throughout the day. These methods hold significant promise for future discovery, particularly when combined with computational modeling. However, genome-scale experiments are costly and laborious, yielding “big data” that are conceptually and statistically difficult to analyze. There is no obvious consensus regarding design or analysis. Here we discuss the relevant technical considerations to generate reproducible, statistically sound, and broadly useful genome-scale data. Rather than suggest a set of rigid rules, we aim to codify principles by which investigators, reviewers, and readers of the primary literature can evaluate the suitability of different experimental designs for measuring different aspects of biological rhythms. We introduce CircaInSilico, a web-based application for generating synthetic genome biology data to benchmark statistical methods for studying biological rhythms. Finally, we discuss several unmet analytical needs, including applications to clinical medicine, and suggest productive avenues to address them. PMID:29098954

  16. TCGA4U: A Web-Based Genomic Analysis Platform To Explore And Mine TCGA Genomic Data For Translational Research.

    PubMed

    Huang, Zhenzhen; Duan, Huilong; Li, Haomin

    2015-01-01

    Large-scale human cancer genomics projects, such as TCGA, generated large genomics data for further study. Exploring and mining these data to obtain meaningful analysis results can help researchers find potential genomics alterations that intervene the development and metastasis of tumors. We developed a web-based gene analysis platform, named TCGA4U, which used statistics methods and models to help translational investigators explore, mine and visualize human cancer genomic characteristic information from the TCGA datasets. Furthermore, through Gene Ontology (GO) annotation and clinical data integration, the genomic data were transformed into biological process, molecular function, cellular component and survival curves to help researchers identify potential driver genes. Clinical researchers without expertise in data analysis will benefit from such a user-friendly genomic analysis platform.

  17. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray)

    Treesearch

    G.A. Tuskan; S. DiFazio; S. Jansson; J. Bohlmann; I. Grigoriev; U. Hellsten; N. Putnam; S. Ralph; S. Rombauts; A. Salamov; J. Schein; L. Sterck; A. Aerts; R.R. Bhalerao; R.P. Bhalerao; D. Blaudez; W. Boerjan; A. Brun; A. Brunner; V. Busov; M. Campbell; J. Carlson; M. Chalot; J. Chapman; G.-L. Chen; D. Cooper; P.M. Coutinho; J. Couturier; S. Covert; Q. Cronk; R. Cunningham; J. Davis; S. Degroeve; A. Dejardin; C. dePamphilis; J. Detter; B. Dirks; U. Dubchak; S. Duplessis; J. Ehlting; B. Ellis; K. Gendler; D. Goodstein; M. Gribskov; J. Grimwood; A. Groover; L. Gunter; B. Hamberger; B. Heinze; Y. Helariutta; B. Henrissat; D. Holligan; R. Holt; W. Huang; N. Islam-Faridi; S. Jones; M. Jones-Rhoades; R. Jorgensen; C. Joshi; J. Kangasjarvi; J. Karlsson; C. Kelleher; R. Kirkpatrick; M. Kirst; A. Kohler; U. Kalluri; F. Larimer; J. Leebens-Mack; J.-C. Leple; P. Locascio; Y. Lou; S. Lucas; F. Martin; B. Montanini; C. Napoli; D.R. Nelson; C. Nelson; K. Nieminen; O. Nilsson; V. Pereda; G. Peter; R. Philippe; G. Pilate; A. Poliakov; J. Razumovskaya; P. Richardson; C. Rinaldi; K. Ritland; P. Rouze; D. Ryaboy; J. Schumtz; J. Schrader; B. Segerman; H. Shin; A. Siddiqui; F. Sterky; A. Terry; C.-J. Tsai; E. Uberbacher; P. Unneberg; J. Vahala; K. Wall; S. Wessler; G. Yang; T. Yin; C. Douglas; M. Marra; G. Sandberg; Y. Van de Peer; D. Rokhsar

    2006-01-01

    We report the draft genome of the black cottonwood tree, Populus trichocarpa. Integration of shotgun sequence assembly with genetic mapping enabled chromosome-scale reconstruction of the genome. More than 45,000 putative protein-coding genes were identified. Analysis of the assembled genome revealed a whole-genome duplication event; about 8000 pairs...

  18. Newborn Sequencing in Genomic Medicine and Public Health

    PubMed Central

    Agrawal, Pankaj B.; Bailey, Donald B.; Beggs, Alan H.; Brenner, Steven E.; Brower, Amy M.; Cakici, Julie A.; Ceyhan-Birsoy, Ozge; Chan, Kee; Chen, Flavia; Currier, Robert J.; Dukhovny, Dmitry; Green, Robert C.; Harris-Wai, Julie; Holm, Ingrid A.; Iglesias, Brenda; Joseph, Galen; Kingsmore, Stephen F.; Koenig, Barbara A.; Kwok, Pui-Yan; Lantos, John; Leeder, Steven J.; Lewis, Megan A.; McGuire, Amy L.; Milko, Laura V.; Mooney, Sean D.; Parad, Richard B.; Pereira, Stacey; Petrikin, Joshua; Powell, Bradford C.; Powell, Cynthia M.; Puck, Jennifer M.; Rehm, Heidi L.; Risch, Neil; Roche, Myra; Shieh, Joseph T.; Veeraraghavan, Narayanan; Watson, Michael S.; Willig, Laurel; Yu, Timothy W.; Urv, Tiina; Wise, Anastasia L.

    2017-01-01

    The rapid development of genomic sequencing technologies has decreased the cost of genetic analysis to the extent that it seems plausible that genome-scale sequencing could have widespread availability in pediatric care. Genomic sequencing provides a powerful diagnostic modality for patients who manifest symptoms of monogenic disease and an opportunity to detect health conditions before their development. However, many technical, clinical, ethical, and societal challenges should be addressed before such technology is widely deployed in pediatric practice. This article provides an overview of the Newborn Sequencing in Genomic Medicine and Public Health Consortium, which is investigating the application of genome-scale sequencing in newborns for both diagnosis and screening. PMID:28096516

  19. Tomato functional genomics database (TFGD): a comprehensive collection and analysis package for tomato functional genomics

    USDA-ARS?s Scientific Manuscript database

    Tomato Functional Genomics Database (TFGD; http://ted.bti.cornell.edu) provides a comprehensive systems biology resource to store, mine, analyze, visualize and integrate large-scale tomato functional genomics datasets. The database is expanded from the previously described Tomato Expression Database...

  20. GPU Accelerated Browser for Neuroimaging Genomics.

    PubMed

    Zigon, Bob; Li, Huang; Yao, Xiaohui; Fang, Shiaofen; Hasan, Mohammad Al; Yan, Jingwen; Moore, Jason H; Saykin, Andrew J; Shen, Li

    2018-04-25

    Neuroimaging genomics is an emerging field that provides exciting opportunities to understand the genetic basis of brain structure and function. The unprecedented scale and complexity of the imaging and genomics data, however, have presented critical computational bottlenecks. In this work we present our initial efforts towards building an interactive visual exploratory system for mining big data in neuroimaging genomics. A GPU accelerated browsing tool for neuroimaging genomics is created that implements the ANOVA algorithm for single nucleotide polymorphism (SNP) based analysis and the VEGAS algorithm for gene-based analysis, and executes them at interactive rates. The ANOVA algorithm is 110 times faster than the 4-core OpenMP version, while the VEGAS algorithm is 375 times faster than its 4-core OpenMP counter part. This approach lays a solid foundation for researchers to address the challenges of mining large-scale imaging genomics datasets via interactive visual exploration.

  1. BiGG Models: A platform for integrating, standardizing and sharing genome-scale models

    PubMed Central

    King, Zachary A.; Lu, Justin; Dräger, Andreas; Miller, Philip; Federowicz, Stephen; Lerman, Joshua A.; Ebrahim, Ali; Palsson, Bernhard O.; Lewis, Nathan E.

    2016-01-01

    Genome-scale metabolic models are mathematically-structured knowledge bases that can be used to predict metabolic pathway usage and growth phenotypes. Furthermore, they can generate and test hypotheses when integrated with experimental data. To maximize the value of these models, centralized repositories of high-quality models must be established, models must adhere to established standards and model components must be linked to relevant databases. Tools for model visualization further enhance their utility. To meet these needs, we present BiGG Models (http://bigg.ucsd.edu), a completely redesigned Biochemical, Genetic and Genomic knowledge base. BiGG Models contains more than 75 high-quality, manually-curated genome-scale metabolic models. On the website, users can browse, search and visualize models. BiGG Models connects genome-scale models to genome annotations and external databases. Reaction and metabolite identifiers have been standardized across models to conform to community standards and enable rapid comparison across models. Furthermore, BiGG Models provides a comprehensive application programming interface for accessing BiGG Models with modeling and analysis tools. As a resource for highly curated, standardized and accessible models of metabolism, BiGG Models will facilitate diverse systems biology studies and support knowledge-based analysis of diverse experimental data. PMID:26476456

  2. Pattern Analysis and Decision Support for Cancer through Clinico-Genomic Profiles

    NASA Astrophysics Data System (ADS)

    Exarchos, Themis P.; Giannakeas, Nikolaos; Goletsis, Yorgos; Papaloukas, Costas; Fotiadis, Dimitrios I.

    Advances in genome technology are playing a growing role in medicine and healthcare. With the development of new technologies and opportunities for large-scale analysis of the genome, genomic data have a clear impact on medicine. Cancer prognostics and therapeutics are among the first major test cases for genomic medicine, given that all types of cancer are related with genomic instability. In this paper we present a novel system for pattern analysis and decision support in cancer. The system integrates clinical data from electronic health records and genomic data. Pattern analysis and data mining methods are applied to these integrated data and the discovered knowledge is used for cancer decision support. Through this integration, conclusions can be drawn for early diagnosis, staging and cancer treatment.

  3. The Dockstore: enabling modular, community-focused sharing of Docker-based genomics tools and workflows

    PubMed Central

    O'Connor, Brian D.; Yuen, Denis; Chung, Vincent; Duncan, Andrew G.; Liu, Xiang Kun; Patricia, Janice; Paten, Benedict; Stein, Lincoln; Ferretti, Vincent

    2017-01-01

    As genomic datasets continue to grow, the feasibility of downloading data to a local organization and running analysis on a traditional compute environment is becoming increasingly problematic. Current large-scale projects, such as the ICGC PanCancer Analysis of Whole Genomes (PCAWG), the Data Platform for the U.S. Precision Medicine Initiative, and the NIH Big Data to Knowledge Center for Translational Genomics, are using cloud-based infrastructure to both host and perform analysis across large data sets. In PCAWG, over 5,800 whole human genomes were aligned and variant called across 14 cloud and HPC environments; the processed data was then made available on the cloud for further analysis and sharing. If run locally, an operation at this scale would have monopolized a typical academic data centre for many months, and would have presented major challenges for data storage and distribution. However, this scale is increasingly typical for genomics projects and necessitates a rethink of how analytical tools are packaged and moved to the data. For PCAWG, we embraced the use of highly portable Docker images for encapsulating and sharing complex alignment and variant calling workflows across highly variable environments. While successful, this endeavor revealed a limitation in Docker containers, namely the lack of a standardized way to describe and execute the tools encapsulated inside the container. As a result, we created the Dockstore ( https://dockstore.org), a project that brings together Docker images with standardized, machine-readable ways of describing and running the tools contained within. This service greatly improves the sharing and reuse of genomics tools and promotes interoperability with similar projects through emerging web service standards developed by the Global Alliance for Genomics and Health (GA4GH). PMID:28344774

  4. The Dockstore: enabling modular, community-focused sharing of Docker-based genomics tools and workflows.

    PubMed

    O'Connor, Brian D; Yuen, Denis; Chung, Vincent; Duncan, Andrew G; Liu, Xiang Kun; Patricia, Janice; Paten, Benedict; Stein, Lincoln; Ferretti, Vincent

    2017-01-01

    As genomic datasets continue to grow, the feasibility of downloading data to a local organization and running analysis on a traditional compute environment is becoming increasingly problematic. Current large-scale projects, such as the ICGC PanCancer Analysis of Whole Genomes (PCAWG), the Data Platform for the U.S. Precision Medicine Initiative, and the NIH Big Data to Knowledge Center for Translational Genomics, are using cloud-based infrastructure to both host and perform analysis across large data sets. In PCAWG, over 5,800 whole human genomes were aligned and variant called across 14 cloud and HPC environments; the processed data was then made available on the cloud for further analysis and sharing. If run locally, an operation at this scale would have monopolized a typical academic data centre for many months, and would have presented major challenges for data storage and distribution. However, this scale is increasingly typical for genomics projects and necessitates a rethink of how analytical tools are packaged and moved to the data. For PCAWG, we embraced the use of highly portable Docker images for encapsulating and sharing complex alignment and variant calling workflows across highly variable environments. While successful, this endeavor revealed a limitation in Docker containers, namely the lack of a standardized way to describe and execute the tools encapsulated inside the container. As a result, we created the Dockstore ( https://dockstore.org), a project that brings together Docker images with standardized, machine-readable ways of describing and running the tools contained within. This service greatly improves the sharing and reuse of genomics tools and promotes interoperability with similar projects through emerging web service standards developed by the Global Alliance for Genomics and Health (GA4GH).

  5. Extreme-Scale De Novo Genome Assembly

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Georganas, Evangelos; Hofmeyr, Steven; Egan, Rob

    De novo whole genome assembly reconstructs genomic sequence from short, overlapping, and potentially erroneous DNA segments and is one of the most important computations in modern genomics. This work presents HipMER, a high-quality end-to-end de novo assembler designed for extreme scale analysis, via efficient parallelization of the Meraculous code. Genome assembly software has many components, each of which stresses different components of a computer system. This chapter explains the computational challenges involved in each step of the HipMer pipeline, the key distributed data structures, and communication costs in detail. We present performance results of assembling the human genome and themore » large hexaploid wheat genome on large supercomputers up to tens of thousands of cores.« less

  6. Newborn Sequencing in Genomic Medicine and Public Health.

    PubMed

    Berg, Jonathan S; Agrawal, Pankaj B; Bailey, Donald B; Beggs, Alan H; Brenner, Steven E; Brower, Amy M; Cakici, Julie A; Ceyhan-Birsoy, Ozge; Chan, Kee; Chen, Flavia; Currier, Robert J; Dukhovny, Dmitry; Green, Robert C; Harris-Wai, Julie; Holm, Ingrid A; Iglesias, Brenda; Joseph, Galen; Kingsmore, Stephen F; Koenig, Barbara A; Kwok, Pui-Yan; Lantos, John; Leeder, Steven J; Lewis, Megan A; McGuire, Amy L; Milko, Laura V; Mooney, Sean D; Parad, Richard B; Pereira, Stacey; Petrikin, Joshua; Powell, Bradford C; Powell, Cynthia M; Puck, Jennifer M; Rehm, Heidi L; Risch, Neil; Roche, Myra; Shieh, Joseph T; Veeraraghavan, Narayanan; Watson, Michael S; Willig, Laurel; Yu, Timothy W; Urv, Tiina; Wise, Anastasia L

    2017-02-01

    The rapid development of genomic sequencing technologies has decreased the cost of genetic analysis to the extent that it seems plausible that genome-scale sequencing could have widespread availability in pediatric care. Genomic sequencing provides a powerful diagnostic modality for patients who manifest symptoms of monogenic disease and an opportunity to detect health conditions before their development. However, many technical, clinical, ethical, and societal challenges should be addressed before such technology is widely deployed in pediatric practice. This article provides an overview of the Newborn Sequencing in Genomic Medicine and Public Health Consortium, which is investigating the application of genome-scale sequencing in newborns for both diagnosis and screening. Copyright © 2017 by the American Academy of Pediatrics.

  7. Scalable and cost-effective NGS genotyping in the cloud.

    PubMed

    Souilmi, Yassine; Lancaster, Alex K; Jung, Jae-Yoon; Rizzo, Ettore; Hawkins, Jared B; Powles, Ryan; Amzazi, Saaïd; Ghazal, Hassan; Tonellato, Peter J; Wall, Dennis P

    2015-10-15

    While next-generation sequencing (NGS) costs have plummeted in recent years, cost and complexity of computation remain substantial barriers to the use of NGS in routine clinical care. The clinical potential of NGS will not be realized until robust and routine whole genome sequencing data can be accurately rendered to medically actionable reports within a time window of hours and at scales of economy in the 10's of dollars. We take a step towards addressing this challenge, by using COSMOS, a cloud-enabled workflow management system, to develop GenomeKey, an NGS whole genome analysis workflow. COSMOS implements complex workflows making optimal use of high-performance compute clusters. Here we show that the Amazon Web Service (AWS) implementation of GenomeKey via COSMOS provides a fast, scalable, and cost-effective analysis of both public benchmarking and large-scale heterogeneous clinical NGS datasets. Our systematic benchmarking reveals important new insights and considerations to produce clinical turn-around of whole genome analysis optimization and workflow management including strategic batching of individual genomes and efficient cluster resource configuration.

  8. Large-Scale Genomic Analysis of Codon Usage in Dengue Virus and Evaluation of Its Phylogenetic Dependence

    PubMed Central

    Lara-Ramírez, Edgar E.; Salazar, Ma Isabel; López-López, María de Jesús; Salas-Benito, Juan Santiago; Sánchez-Varela, Alejandro

    2014-01-01

    The increasing number of dengue virus (DENV) genome sequences available allows identifying the contributing factors to DENV evolution. In the present study, the codon usage in serotypes 1–4 (DENV1–4) has been explored for 3047 sequenced genomes using different statistics methods. The correlation analysis of total GC content (GC) with GC content at the three nucleotide positions of codons (GC1, GC2, and GC3) as well as the effective number of codons (ENC, ENCp) versus GC3 plots revealed mutational bias and purifying selection pressures as the major forces influencing the codon usage, but with distinct pressure on specific nucleotide position in the codon. The correspondence analysis (CA) and clustering analysis on relative synonymous codon usage (RSCU) within each serotype showed similar clustering patterns to the phylogenetic analysis of nucleotide sequences for DENV1–4. These clustering patterns are strongly related to the virus geographic origin. The phylogenetic dependence analysis also suggests that stabilizing selection acts on the codon usage bias. Our analysis of a large scale reveals new feature on DENV genomic evolution. PMID:25136631

  9. Genomic structural differences between cattle and river buffalo identified through a combination and genomic and transcriptomic analysis

    USDA-ARS?s Scientific Manuscript database

    Water buffalo (Bubalus bubalis L.) is an important livestock species worldwide. Like many other livestock species, water buffalo lacks high quality and continuous reference genome assembly required for fine-scale comparative genomics studies. In this work, we present a dataset, which characterizes g...

  10. Elucidating the triplicated ancestral genome structure of radish based on chromosome-level comparison with the Brassica genomes.

    PubMed

    Jeong, Young-Min; Kim, Namshin; Ahn, Byung Ohg; Oh, Mijin; Chung, Won-Hyong; Chung, Hee; Jeong, Seongmun; Lim, Ki-Byung; Hwang, Yoon-Jung; Kim, Goon-Bo; Baek, Seunghoon; Choi, Sang-Bong; Hyung, Dae-Jin; Lee, Seung-Won; Sohn, Seong-Han; Kwon, Soo-Jin; Jin, Mina; Seol, Young-Joo; Chae, Won Byoung; Choi, Keun Jin; Park, Beom-Seok; Yu, Hee-Ju; Mun, Jeong-Hwan

    2016-07-01

    This study presents a chromosome-scale draft genome sequence of radish that is assembled into nine chromosomal pseudomolecules. A comprehensive comparative genome analysis with the Brassica genomes provides genomic evidences on the evolution of the mesohexaploid radish genome. Radish (Raphanus sativus L.) is an agronomically important root vegetable crop and its origin and phylogenetic position in the tribe Brassiceae is controversial. Here we present a comprehensive analysis of the radish genome based on the chromosome sequences of R. sativus cv. WK10039. The radish genome was sequenced and assembled into 426.2 Mb spanning >98 % of the gene space, of which 344.0 Mb were integrated into nine chromosome pseudomolecules. Approximately 36 % of the genome was repetitive sequences and 46,514 protein-coding genes were predicted and annotated. Comparative mapping of the tPCK-like ancestral genome revealed that the radish genome has intermediate characteristics between the Brassica A/C and B genomes in the triplicated segments, suggesting an internal origin from the genus Brassica. The evolutionary characteristics shared between radish and other Brassica species provided genomic evidences that the current form of nine chromosomes in radish was rearranged from the chromosomes of hexaploid progenitor. Overall, this study provides a chromosome-scale draft genome sequence of radish as well as novel insight into evolution of the mesohexaploid genomes in the tribe Brassiceae.

  11. Genomic insights into the Acidobacteria reveal strategies for their success in terrestrial environments

    PubMed Central

    Trojan, Daniela; Roux, Simon; Herbold, Craig; Rattei, Thomas; Woebken, Dagmar

    2018-01-01

    Summary Members of the phylum Acidobacteria are abundant and ubiquitous across soils. We performed a large‐scale comparative genome analysis spanning subdivisions 1, 3, 4, 6, 8 and 23 (n = 24) with the goal to identify features to help explain their prevalence in soils and understand their ecophysiology. Our analysis revealed that bacteriophage integration events along with transposable and mobile elements influenced the structure and plasticity of these genomes. Low‐ and high‐affinity respiratory oxygen reductases were detected in multiple genomes, suggesting the capacity for growing across different oxygen gradients. Among many genomes, the capacity to use a diverse collection of carbohydrates, as well as inorganic and organic nitrogen sources (such as via extracellular peptidases), was detected – both advantageous traits in environments with fluctuating nutrient environments. We also identified multiple soil acidobacteria with the potential to scavenge atmospheric concentrations of H2, now encompassing mesophilic soil strains within the subdivision 1 and 3, in addition to a previously identified thermophilic strain in subdivision 4. This large‐scale acidobacteria genome analysis reveal traits that provide genomic, physiological and metabolic versatility, presumably allowing flexibility and versatility in the challenging and fluctuating soil environment. PMID:29327410

  12. Efficient population-scale variant analysis and prioritization with VAPr.

    PubMed

    Birmingham, Amanda; Mark, Adam M; Mazzaferro, Carlo; Xu, Guorong; Fisch, Kathleen M

    2018-04-06

    With the growing availability of population-scale whole-exome and whole-genome sequencing, demand for reproducible, scalable variant analysis has spread within genomic research communities. To address this need, we introduce the Python package VAPr (Variant Analysis and Prioritization). VAPr leverages existing annotation tools ANNOVAR and MyVariant.info with MongoDB-based flexible storage and filtering functionality. It offers biologists and bioinformatics generalists easy-to-use and scalable analysis and prioritization of genomic variants from large cohort studies. VAPr is developed in Python and is available for free use and extension under the MIT License. An install package is available on PyPi at https://pypi.python.org/pypi/VAPr, while source code and extensive documentation are on GitHub at https://github.com/ucsd-ccbb/VAPr. kfisch@ucsd.edu.

  13. Quantitative phenotyping via deep barcode sequencing.

    PubMed

    Smith, Andrew M; Heisler, Lawrence E; Mellor, Joseph; Kaper, Fiona; Thompson, Michael J; Chee, Mark; Roth, Frederick P; Giaever, Guri; Nislow, Corey

    2009-10-01

    Next-generation DNA sequencing technologies have revolutionized diverse genomics applications, including de novo genome sequencing, SNP detection, chromatin immunoprecipitation, and transcriptome analysis. Here we apply deep sequencing to genome-scale fitness profiling to evaluate yeast strain collections in parallel. This method, Barcode analysis by Sequencing, or "Bar-seq," outperforms the current benchmark barcode microarray assay in terms of both dynamic range and throughput. When applied to a complex chemogenomic assay, Bar-seq quantitatively identifies drug targets, with performance superior to the benchmark microarray assay. We also show that Bar-seq is well-suited for a multiplex format. We completely re-sequenced and re-annotated the yeast deletion collection using deep sequencing, found that approximately 20% of the barcodes and common priming sequences varied from expectation, and used this revised list of barcode sequences to improve data quality. Together, this new assay and analysis routine provide a deep-sequencing-based toolkit for identifying gene-environment interactions on a genome-wide scale.

  14. Launching genomics into the cloud: deployment of Mercury, a next generation sequence analysis pipeline.

    PubMed

    Reid, Jeffrey G; Carroll, Andrew; Veeraraghavan, Narayanan; Dahdouli, Mahmoud; Sundquist, Andreas; English, Adam; Bainbridge, Matthew; White, Simon; Salerno, William; Buhay, Christian; Yu, Fuli; Muzny, Donna; Daly, Richard; Duyk, Geoff; Gibbs, Richard A; Boerwinkle, Eric

    2014-01-29

    Massively parallel DNA sequencing generates staggering amounts of data. Decreasing cost, increasing throughput, and improved annotation have expanded the diversity of genomics applications in research and clinical practice. This expanding scale creates analytical challenges: accommodating peak compute demand, coordinating secure access for multiple analysts, and sharing validated tools and results. To address these challenges, we have developed the Mercury analysis pipeline and deployed it in local hardware and the Amazon Web Services cloud via the DNAnexus platform. Mercury is an automated, flexible, and extensible analysis workflow that provides accurate and reproducible genomic results at scales ranging from individuals to large cohorts. By taking advantage of cloud computing and with Mercury implemented on the DNAnexus platform, we have demonstrated a powerful combination of a robust and fully validated software pipeline and a scalable computational resource that, to date, we have applied to more than 10,000 whole genome and whole exome samples.

  15. Genome scale engineering techniques for metabolic engineering.

    PubMed

    Liu, Rongming; Bassalo, Marcelo C; Zeitoun, Ramsey I; Gill, Ryan T

    2015-11-01

    Metabolic engineering has expanded from a focus on designs requiring a small number of genetic modifications to increasingly complex designs driven by advances in genome-scale engineering technologies. Metabolic engineering has been generally defined by the use of iterative cycles of rational genome modifications, strain analysis and characterization, and a synthesis step that fuels additional hypothesis generation. This cycle mirrors the Design-Build-Test-Learn cycle followed throughout various engineering fields that has recently become a defining aspect of synthetic biology. This review will attempt to summarize recent genome-scale design, build, test, and learn technologies and relate their use to a range of metabolic engineering applications. Copyright © 2015 International Metabolic Engineering Society. Published by Elsevier Inc. All rights reserved.

  16. SWARM : a scientific workflow for supporting Bayesian approaches to improve metabolic models.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shi, X.; Stevens, R.; Mathematics and Computer Science

    2008-01-01

    With the exponential growth of complete genome sequences, the analysis of these sequences is becoming a powerful approach to build genome-scale metabolic models. These models can be used to study individual molecular components and their relationships, and eventually study cells as systems. However, constructing genome-scale metabolic models manually is time-consuming and labor-intensive. This property of manual model-building process causes the fact that much fewer genome-scale metabolic models are available comparing to hundreds of genome sequences available. To tackle this problem, we design SWARM, a scientific workflow that can be utilized to improve genome-scale metabolic models in high-throughput fashion. SWARM dealsmore » with a range of issues including the integration of data across distributed resources, data format conversions, data update, and data provenance. Putting altogether, SWARM streamlines the whole modeling process that includes extracting data from various resources, deriving training datasets to train a set of predictors and applying Bayesian techniques to assemble the predictors, inferring on the ensemble of predictors to insert missing data, and eventually improving draft metabolic networks automatically. By the enhancement of metabolic model construction, SWARM enables scientists to generate many genome-scale metabolic models within a short period of time and with less effort.« less

  17. Genome-wide analysis of tandem repeats in plants and green algae

    Treesearch

    Zhixin Zhao; Cheng Guo; Sreeskandarajan Sutharzan; Pei Li; Craig Echt; Jie Zhang; Chun Liang

    2014-01-01

    Tandem repeats (TRs) extensively exist in the genomes of prokaryotes and eukaryotes. Based on the sequenced genomes and gene annotations of 31 plant and algal species in Phytozome version 8.0 (http://www.phytozome.net/), we examined TRs in a genome-wide scale, characterized their distributions and motif features, and explored their putative biological functions. Among...

  18. A high-quality carrot genome assembly provides new insights into carotenoid accumulation and asterid genome evolution

    USDA-ARS?s Scientific Manuscript database

    We report a chromosome-scale assembly and analysis of the Daucus carota genome, an important source of provitamin A in the human diet and the first sequenced genome among members of the Euasterid II clade. We characterized two new polyploidization events, both occurring after the divergence of carro...

  19. Comparative sequence analysis of Sordaria macrospora and Neurospora crassa as a means to improve genome annotation.

    PubMed

    Nowrousian, Minou; Würtz, Christian; Pöggeler, Stefanie; Kück, Ulrich

    2004-03-01

    One of the most challenging parts of large scale sequencing projects is the identification of functional elements encoded in a genome. Recently, studies of genomes of up to six different Saccharomyces species have demonstrated that a comparative analysis of genome sequences from closely related species is a powerful approach to identify open reading frames and other functional regions within genomes [Science 301 (2003) 71, Nature 423 (2003) 241]. Here, we present a comparison of selected sequences from Sordaria macrospora to their corresponding Neurospora crassa orthologous regions. Our analysis indicates that due to the high degree of sequence similarity and conservation of overall genomic organization, S. macrospora sequence information can be used to simplify the annotation of the N. crassa genome.

  20. Fungal Genomics for Energy and Environment

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Grigoriev, Igor V.

    2013-03-11

    Genomes of fungi relevant to energy and environment are in focus of the Fungal Genomic Program at the US Department of Energy Joint Genome Institute (JGI). One of its projects, the Genomics Encyclopedia of Fungi, targets fungi related to plant health (symbionts, pathogens, and biocontrol agents) and biorefinery processes (cellulose degradation, sugar fermentation, industrial hosts) by means of genome sequencing and analysis. New chapters of the Encyclopedia can be opened with user proposals to the JGI Community Sequencing Program (CSP). Another JGI project, the 1000 fungal genomes, explores fungal diversity on genome level at scale and is open for usersmore » to nominate new species for sequencing. Over 200 fungal genomes have been sequenced by JGI to date and released through MycoCosm (www.jgi.doe.gov/fungi), a fungal web-portal, which integrates sequence and functional data with genome analysis tools for user community. Sequence analysis supported by functional genomics leads to developing parts list for complex systems ranging from ecosystems of biofuel crops to biorefineries. Recent examples of such parts suggested by comparative genomics and functional analysis in these areas are presented here.« less

  1. Toward the automated generation of genome-scale metabolic networks in the SEED.

    PubMed

    DeJongh, Matthew; Formsma, Kevin; Boillot, Paul; Gould, John; Rycenga, Matthew; Best, Aaron

    2007-04-26

    Current methods for the automated generation of genome-scale metabolic networks focus on genome annotation and preliminary biochemical reaction network assembly, but do not adequately address the process of identifying and filling gaps in the reaction network, and verifying that the network is suitable for systems level analysis. Thus, current methods are only sufficient for generating draft-quality networks, and refinement of the reaction network is still largely a manual, labor-intensive process. We have developed a method for generating genome-scale metabolic networks that produces substantially complete reaction networks, suitable for systems level analysis. Our method partitions the reaction space of central and intermediary metabolism into discrete, interconnected components that can be assembled and verified in isolation from each other, and then integrated and verified at the level of their interconnectivity. We have developed a database of components that are common across organisms, and have created tools for automatically assembling appropriate components for a particular organism based on the metabolic pathways encoded in the organism's genome. This focuses manual efforts on that portion of an organism's metabolism that is not yet represented in the database. We have demonstrated the efficacy of our method by reverse-engineering and automatically regenerating the reaction network from a published genome-scale metabolic model for Staphylococcus aureus. Additionally, we have verified that our method capitalizes on the database of common reaction network components created for S. aureus, by using these components to generate substantially complete reconstructions of the reaction networks from three other published metabolic models (Escherichia coli, Helicobacter pylori, and Lactococcus lactis). We have implemented our tools and database within the SEED, an open-source software environment for comparative genome annotation and analysis. Our method sets the stage for the automated generation of substantially complete metabolic networks for over 400 complete genome sequences currently in the SEED. With each genome that is processed using our tools, the database of common components grows to cover more of the diversity of metabolic pathways. This increases the likelihood that components of reaction networks for subsequently processed genomes can be retrieved from the database, rather than assembled and verified manually.

  2. Initial sequencing and comparative analysis of the mouse genome

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Waterston, Robert H.; Lindblad-Toh, Kerstin; Birney, Ewan

    2002-12-15

    The sequence of the mouse genome is a key informational tool for understanding the contents of the human genome and a key experimental tool for biomedical research. Here, we report the results of an international collaboration to produce a high-quality draft sequence of the mouse genome. We also present an initial comparative analysis of the mouse and human genomes, describing some of the insights that can be gleaned from the two sequences. We discuss topics including the analysis of the evolutionary forces shaping the size, structure and sequence of the genomes; the conservation of large-scale synteny across most of themore » genomes; the much lower extent of sequence orthology covering less than half of the genomes; the proportions of the genomes under selection; the number of protein-coding genes; the expansion of gene families related to reproduction and immunity; the evolution of proteins; and the identification of intraspecies polymorphism.« less

  3. TARGET Publication Guidelines | Office of Cancer Genomics

    Cancer.gov

    Like other NCI large-scale genomics initiatives, TARGET is a community resource project and data are made available rapidly after validation for use by other researchers. To act in accord with the Fort Lauderdale principles and support the continued prompt public release of large-scale genomic data prior to publication, researchers who plan to prepare manuscripts containing descriptions of TARGET pediatric cancer data that would be of comparable scope to an initial TARGET disease-specific comprehensive, global analysis publication, and journal editors who receive such manuscripts, are

  4. Comparative analysis and visualization of multiple collinear genomes

    PubMed Central

    2012-01-01

    Background Genome browsers are a common tool used by biologists to visualize genomic features including genes, polymorphisms, and many others. However, existing genome browsers and visualization tools are not well-suited to perform meaningful comparative analysis among a large number of genomes. With the increasing quantity and availability of genomic data, there is an increased burden to provide useful visualization and analysis tools for comparison of multiple collinear genomes such as the large panels of model organisms which are the basis for much of the current genetic research. Results We have developed a novel web-based tool for visualizing and analyzing multiple collinear genomes. Our tool illustrates genome-sequence similarity through a mosaic of intervals representing local phylogeny, subspecific origin, and haplotype identity. Comparative analysis is facilitated through reordering and clustering of tracks, which can vary throughout the genome. In addition, we provide local phylogenetic trees as an alternate visualization to assess local variations. Conclusions Unlike previous genome browsers and viewers, ours allows for simultaneous and comparative analysis. Our browser provides intuitive selection and interactive navigation about features of interest. Dynamic visualizations adjust to scale and data content making analysis at variable resolutions and of multiple data sets more informative. We demonstrate our genome browser for an extensive set of genomic data sets composed of almost 200 distinct mouse laboratory strains. PMID:22536897

  5. Fueling the Future with Fungal Genomes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Grigoriev, Igor V.

    2014-10-27

    Genomes of fungi relevant to energy and environment are in focus of the JGI Fungal Genomic Program. One of its projects, the Genomics Encyclopedia of Fungi, targets fungi related to plant health (symbionts and pathogens) and biorefinery processes (cellulose degradation and sugar fermentation) by means of genome sequencing and analysis. New chapters of the Encyclopedia can be opened with user proposals to the JGI Community Science Program (CSP). Another JGI project, the 1000 fungal genomes, explores fungal diversity on genome level at scale and is open for users to nominate new species for sequencing. Over 400 fungal genomes have beenmore » sequenced by JGI to date and released through MycoCosm (www.jgi.doe.gov/fungi), a fungal web-portal, which integrates sequence and functional data with genome analysis tools for user community. Sequence analysis supported by functional genomics will lead to developing parts list for complex systems ranging from ecosystems of biofuel crops to biorefineries. Recent examples of such ‘parts’ suggested by comparative genomics and functional analysis in these areas are presented here.« less

  6. Genome-wide computational prediction and analysis of core promoter elements across plant monocots and dicots

    USDA-ARS?s Scientific Manuscript database

    Transcription initiation, essential to gene expression regulation, involves recruitment of basal transcription factors to the core promoter elements (CPEs). The distribution of currently known CPEs across plant genomes is largely unknown. This is the first large scale genome-wide report on the compu...

  7. FULL-GENOME ANALYSIS OF ALTERNATIVE SPLICING IN MOUSE LIVER AFTER HEPATOTOXICANT EXPOSURE

    EPA Science Inventory

    Alternative splicing plays a role in determining gene function and protein diversity. We have employed whole genome exon profiling using Affymetrix Mouse Exon 1.0 ST arrays to understand the significance of alternative splicing on a genome-wide scale in response to multiple toxic...

  8. BIOMONITORING THE TOXICOGENOMIC RESPONSE TO ENDOCRINE DISRUPTING CHEMICALS IN HUMANS, LABORATORY SPECIES AND WILDLIFE

    EPA Science Inventory

    With the advent of sequence information for entire eukaryotic genomes, it is now possible to analyze gene expression on a genomic scale. The primary tool for genomic analysis of gene expression is the gene microarray. We have used commercially available and custom cDNA microarray...

  9. BiGG Models: A platform for integrating, standardizing and sharing genome-scale models

    DOE PAGES

    King, Zachary A.; Lu, Justin; Drager, Andreas; ...

    2015-10-17

    In this study, genome-scale metabolic models are mathematically structured knowledge bases that can be used to predict metabolic pathway usage and growth phenotypes. Furthermore, they can generate and test hypotheses when integrated with experimental data. To maximize the value of these models, centralized repositories of high-quality models must be established, models must adhere to established standards and model components must be linked to relevant databases. Tools for model visualization further enhance their utility. To meet these needs, we present BiGG Models (http://bigg.ucsd.edu), a completely redesigned Biochemical, Genetic and Genomic knowledge base. BiGG Models contains more than 75 high-quality, manually-curated genome-scalemore » metabolic models. On the website, users can browse, search and visualize models. BiGG Models connects genome-scale models to genome annotations and external databases. Reaction and metabolite identifiers have been standardized across models to conform to community standards and enable rapid comparison across models. Furthermore, BiGG Models provides a comprehensive application programming interface for accessing BiGG Models with modeling and analysis tools. As a resource for highly curated, standardized and accessible models of metabolism, BiGG Models will facilitate diverse systems biology studies and support knowledge-based analysis of diverse experimental data.« less

  10. ScreenBEAM: a novel meta-analysis algorithm for functional genomics screens via Bayesian hierarchical modeling | Office of Cancer Genomics

    Cancer.gov

    Functional genomics (FG) screens, using RNAi or CRISPR technology, have become a standard tool for systematic, genome-wide loss-of-function studies for therapeutic target discovery. As in many large-scale assays, however, off-target effects, variable reagents' potency and experimental noise must be accounted for appropriately control for false positives.

  11. Investigating host-pathogen behavior and their interaction using genome-scale metabolic network models.

    PubMed

    Sadhukhan, Priyanka P; Raghunathan, Anu

    2014-01-01

    Genome Scale Metabolic Modeling methods represent one way to compute whole cell function starting from the genome sequence of an organism and contribute towards understanding and predicting the genotype-phenotype relationship. About 80 models spanning all the kingdoms of life from archaea to eukaryotes have been built till date and used to interrogate cell phenotype under varying conditions. These models have been used to not only understand the flux distribution in evolutionary conserved pathways like glycolysis and the Krebs cycle but also in applications ranging from value added product formation in Escherichia coli to predicting inborn errors of Homo sapiens metabolism. This chapter describes a protocol that delineates the process of genome scale metabolic modeling for analysing host-pathogen behavior and interaction using flux balance analysis (FBA). The steps discussed in the process include (1) reconstruction of a metabolic network from the genome sequence, (2) its representation in a precise mathematical framework, (3) its translation to a model, and (4) the analysis using linear algebra and optimization. The methods for biological interpretations of computed cell phenotypes in the context of individual host and pathogen models and their integration are also discussed.

  12. Age distribution of human gene families shows significant roles of both large- and small-scale duplications in vertebrate evolution.

    PubMed

    Gu, Xun; Wang, Yufeng; Gu, Jianying

    2002-06-01

    The classical (two-round) hypothesis of vertebrate genome duplication proposes two successive whole-genome duplication(s) (polyploidizations) predating the origin of fishes, a view now being seriously challenged. As the debate largely concerns the relative merits of the 'big-bang mode' theory (large-scale duplication) and the 'continuous mode' theory (constant creation by small-scale duplications), we tested whether a significant proportion of paralogous genes in the contemporary human genome was indeed generated in the early stage of vertebrate evolution. After an extensive search of major databases, we dated 1,739 gene duplication events from the phylogenetic analysis of 749 vertebrate gene families. We found a pattern characterized by two waves (I, II) and an ancient component. Wave I represents a recent gene family expansion by tandem or segmental duplications, whereas wave II, a rapid paralogous gene increase in the early stage of vertebrate evolution, supports the idea of genome duplication(s) (the big-bang mode). Further analysis indicated that large- and small-scale gene duplications both make a significant contribution during the early stage of vertebrate evolution to build the current hierarchy of the human proteome.

  13. Genome-scale metabolic modeling of Mucor circinelloides and comparative analysis with other oleaginous species.

    PubMed

    Vongsangnak, Wanwipa; Klanchui, Amornpan; Tawornsamretkit, Iyarest; Tatiyaborwornchai, Witthawin; Laoteng, Kobkul; Meechai, Asawin

    2016-06-01

    We present a novel genome-scale metabolic model iWV1213 of Mucor circinelloides, which is an oleaginous fungus for industrial applications. The model contains 1213 genes, 1413 metabolites and 1326 metabolic reactions across different compartments. We demonstrate that iWV1213 is able to accurately predict the growth rates of M. circinelloides on various nutrient sources and culture conditions using Flux Balance Analysis and Phenotypic Phase Plane analysis. Comparative analysis of three oleaginous genome-scale models, including M. circinelloides (iWV1213), Mortierella alpina (iCY1106) and Yarrowia lipolytica (iYL619_PCP) revealed that iWV1213 possesses a higher number of genes involved in carbohydrate, amino acid, and lipid metabolisms that might contribute to its versatility in nutrient utilization. Moreover, the identification of unique and common active reactions among the Zygomycetes oleaginous models using Flux Variability Analysis unveiled a set of gene/enzyme candidates as metabolic engineering targets for cellular improvement. Thus, iWV1213 offers a powerful metabolic engineering tool for multi-level omics analysis, enabling strain optimization as a cell factory platform of lipid-based production. Copyright © 2016 Elsevier B.V. All rights reserved.

  14. Genome wide analysis of flowering time trait in multiple environments via high-throughput genotyping technique in Brassica napus L.

    PubMed

    Li, Lun; Long, Yan; Zhang, Libin; Dalton-Morgan, Jessica; Batley, Jacqueline; Yu, Longjiang; Meng, Jinling; Li, Maoteng

    2015-01-01

    The prediction of the flowering time (FT) trait in Brassica napus based on genome-wide markers and the detection of underlying genetic factors is important not only for oilseed producers around the world but also for the other crop industry in the rotation system in China. In previous studies the low density and mixture of biomarkers used obstructed genomic selection in B. napus and comprehensive mapping of FT related loci. In this study, a high-density genome-wide SNP set was genotyped from a double-haploid population of B. napus. We first performed genomic prediction of FT traits in B. napus using SNPs across the genome under ten environments of three geographic regions via eight existing genomic predictive models. The results showed that all the models achieved comparably high accuracies, verifying the feasibility of genomic prediction in B. napus. Next, we performed a large-scale mapping of FT related loci among three regions, and found 437 associated SNPs, some of which represented known FT genes, such as AP1 and PHYE. The genes tagged by the associated SNPs were enriched in biological processes involved in the formation of flowers. Epistasis analysis showed that significant interactions were found between detected loci, even among some known FT related genes. All the results showed that our large scale and high-density genotype data are of great practical and scientific values for B. napus. To our best knowledge, this is the first evaluation of genomic selection models in B. napus based on a high-density SNP dataset and large-scale mapping of FT loci.

  15. Quantitative phenotyping via deep barcode sequencing

    PubMed Central

    Smith, Andrew M.; Heisler, Lawrence E.; Mellor, Joseph; Kaper, Fiona; Thompson, Michael J.; Chee, Mark; Roth, Frederick P.; Giaever, Guri; Nislow, Corey

    2009-01-01

    Next-generation DNA sequencing technologies have revolutionized diverse genomics applications, including de novo genome sequencing, SNP detection, chromatin immunoprecipitation, and transcriptome analysis. Here we apply deep sequencing to genome-scale fitness profiling to evaluate yeast strain collections in parallel. This method, Barcode analysis by Sequencing, or “Bar-seq,” outperforms the current benchmark barcode microarray assay in terms of both dynamic range and throughput. When applied to a complex chemogenomic assay, Bar-seq quantitatively identifies drug targets, with performance superior to the benchmark microarray assay. We also show that Bar-seq is well-suited for a multiplex format. We completely re-sequenced and re-annotated the yeast deletion collection using deep sequencing, found that ∼20% of the barcodes and common priming sequences varied from expectation, and used this revised list of barcode sequences to improve data quality. Together, this new assay and analysis routine provide a deep-sequencing-based toolkit for identifying gene–environment interactions on a genome-wide scale. PMID:19622793

  16. Launching genomics into the cloud: deployment of Mercury, a next generation sequence analysis pipeline

    PubMed Central

    2014-01-01

    Background Massively parallel DNA sequencing generates staggering amounts of data. Decreasing cost, increasing throughput, and improved annotation have expanded the diversity of genomics applications in research and clinical practice. This expanding scale creates analytical challenges: accommodating peak compute demand, coordinating secure access for multiple analysts, and sharing validated tools and results. Results To address these challenges, we have developed the Mercury analysis pipeline and deployed it in local hardware and the Amazon Web Services cloud via the DNAnexus platform. Mercury is an automated, flexible, and extensible analysis workflow that provides accurate and reproducible genomic results at scales ranging from individuals to large cohorts. Conclusions By taking advantage of cloud computing and with Mercury implemented on the DNAnexus platform, we have demonstrated a powerful combination of a robust and fully validated software pipeline and a scalable computational resource that, to date, we have applied to more than 10,000 whole genome and whole exome samples. PMID:24475911

  17. Global MLST of Salmonella Typhi Revisited in Post-genomic Era: Genetic Conservation, Population Structure, and Comparative Genomics of Rare Sequence Types.

    PubMed

    Yap, Kien-Pong; Ho, Wing S; Gan, Han M; Chai, Lay C; Thong, Kwai L

    2016-01-01

    Typhoid fever, caused by Salmonella enterica serovar Typhi, remains an important public health burden in Southeast Asia and other endemic countries. Various genotyping methods have been applied to study the genetic variations of this human-restricted pathogen. Multilocus sequence typing (MLST) is one of the widely accepted methods, and recently, there is a growing interest in the re-application of MLST in the post-genomic era. In this study, we provide the global MLST distribution of S. Typhi utilizing both publicly available 1,826 S. Typhi genome sequences in addition to performing conventional MLST on S. Typhi strains isolated from various endemic regions spanning over a century. Our global MLST analysis confirms the predominance of two sequence types (ST1 and ST2) co-existing in the endemic regions. Interestingly, S. Typhi strains with ST8 are currently confined within the African continent. Comparative genomic analyses of ST8 and other rare STs with genomes of ST1/ST2 revealed unique mutations in important virulence genes such as flhB, sipC, and tviD that may explain the variations that differentiate between seemingly successful (widespread) and unsuccessful (poor dissemination) S. Typhi populations. Large scale whole-genome phylogeny demonstrated evidence of phylogeographical structuring and showed that ST8 may have diverged from the earlier ancestral population of ST1 and ST2, which later lost some of its fitness advantages, leading to poor worldwide dissemination. In response to the unprecedented increase in genomic data, this study demonstrates and highlights the utility of large-scale genome-based MLST as a quick and effective approach to narrow the scope of in-depth comparative genomic analysis and consequently provide new insights into the fine scale of pathogen evolution and population structure.

  18. AlgaGEM – a genome-scale metabolic reconstruction of algae based on the Chlamydomonas reinhardtii genome

    PubMed Central

    2011-01-01

    Background Microalgae have the potential to deliver biofuels without the associated competition for land resources. In order to realise the rates and titres necessary for commercial production, however, system-level metabolic engineering will be required. Genome scale metabolic reconstructions have revolutionized microbial metabolic engineering and are used routinely for in silico analysis and design. While genome scale metabolic reconstructions have been developed for many prokaryotes and model eukaryotes, the application to less well characterized eukaryotes such as algae is challenging not at least due to a lack of compartmentalization data. Results We have developed a genome-scale metabolic network model (named AlgaGEM) covering the metabolism for a compartmentalized algae cell based on the Chlamydomonas reinhardtii genome. AlgaGEM is a comprehensive literature-based genome scale metabolic reconstruction that accounts for the functions of 866 unique ORFs, 1862 metabolites, 2249 gene-enzyme-reaction-association entries, and 1725 unique reactions. The reconstruction was compartmentalized into the cytoplasm, mitochondrion, plastid and microbody using available data for algae complemented with compartmentalisation data for Arabidopsis thaliana. AlgaGEM describes a functional primary metabolism of Chlamydomonas and significantly predicts distinct algal behaviours such as the catabolism or secretion rather than recycling of phosphoglycolate in photorespiration. AlgaGEM was validated through the simulation of growth and algae metabolic functions inferred from literature. Using efficient resource utilisation as the optimality criterion, AlgaGEM predicted observed metabolic effects under autotrophic, heterotrophic and mixotrophic conditions. AlgaGEM predicts increased hydrogen production when cyclic electron flow is disrupted as seen in a high producing mutant derived from mutational studies. The model also predicted the physiological pathway for H2 production and identified new targets to further improve H2 yield. Conclusions AlgaGEM is a viable and comprehensive framework for in silico functional analysis and can be used to derive new, non-trivial hypotheses for exploring this metabolically versatile organism. Flux balance analysis can be used to identify bottlenecks and new targets to metabolically engineer microalgae for production of biofuels. PMID:22369158

  19. mySyntenyPortal: an application package to construct websites for synteny block analysis.

    PubMed

    Lee, Jongin; Lee, Daehwan; Sim, Mikang; Kwon, Daehong; Kim, Juyeon; Ko, Younhee; Kim, Jaebum

    2018-06-05

    Advances in sequencing technologies have facilitated large-scale comparative genomics based on whole genome sequencing. Constructing and investigating conserved genomic regions among multiple species (called synteny blocks) are essential in the comparative genomics. However, they require significant amounts of computational resources and time in addition to bioinformatics skills. Many web interfaces have been developed to make such tasks easier. However, these web interfaces cannot be customized for users who want to use their own set of genome sequences or definition of synteny blocks. To resolve this limitation, we present mySyntenyPortal, a stand-alone application package to construct websites for synteny block analyses by using users' own genome data. mySyntenyPortal provides both command line and web-based interfaces to build and manage websites for large-scale comparative genomic analyses. The websites can be also easily published and accessed by other users. To demonstrate the usability of mySyntenyPortal, we present an example study for building websites to compare genomes of three mammalian species (human, mouse, and cow) and show how they can be easily utilized to identify potential genes affected by genome rearrangements. mySyntenyPortal will contribute for extended comparative genomic analyses based on large-scale whole genome sequences by providing unique functionality to support the easy creation of interactive websites for synteny block analyses from user's own genome data.

  20. Using the Saccharomyces Genome Database (SGD) for analysis of genomic information

    PubMed Central

    Skrzypek, Marek S.; Hirschman, Jodi

    2011-01-01

    Analysis of genomic data requires access to software tools that place the sequence-derived information in the context of biology. The Saccharomyces Genome Database (SGD) integrates functional information about budding yeast genes and their products with a set of analysis tools that facilitate exploring their biological details. This unit describes how the various types of functional data available at SGD can be searched, retrieved, and analyzed. Starting with the guided tour of the SGD Home page and Locus Summary page, this unit highlights how to retrieve data using YeastMine, how to visualize genomic information with GBrowse, how to explore gene expression patterns with SPELL, and how to use Gene Ontology tools to characterize large-scale datasets. PMID:21901739

  1. Genomic Diversity and Evolution of the Lyssaviruses

    PubMed Central

    Delmas, Olivier; Holmes, Edward C.; Talbi, Chiraz; Larrous, Florence; Dacheux, Laurent; Bouchier, Christiane; Bourhy, Hervé

    2008-01-01

    Lyssaviruses are RNA viruses with single-strand, negative-sense genomes responsible for rabies-like diseases in mammals. To date, genomic and evolutionary studies have most often utilized partial genome sequences, particularly of the nucleoprotein and glycoprotein genes, with little consideration of genome-scale evolution. Herein, we report the first genomic and evolutionary analysis using complete genome sequences of all recognised lyssavirus genotypes, including 14 new complete genomes of field isolates from 6 genotypes and one genotype that is completely sequenced for the first time. In doing so we significantly increase the extent of genome sequence data available for these important viruses. Our analysis of these genome sequence data reveals that all lyssaviruses have the same genomic organization. A phylogenetic analysis reveals strong geographical structuring, with the greatest genetic diversity in Africa, and an independent origin for the two known genotypes that infect European bats. We also suggest that multiple genotypes may exist within the diversity of viruses currently classified as ‘Lagos Bat’. In sum, we show that rigorous phylogenetic techniques based on full length genome sequence provide the best discriminatory power for genotype classification within the lyssaviruses. PMID:18446239

  2. Company profile: Complete Genomics Inc.

    PubMed

    Reid, Clifford

    2011-02-01

    Complete Genomics Inc. is a life sciences company that focuses on complete human genome sequencing. It is taking a completely different approach to DNA sequencing than other companies in the industry. Rather than building a general-purpose platform for sequencing all organisms and all applications, it has focused on a single application - complete human genome sequencing. The company's Complete Genomics Analysis Platform (CGA™ Platform) comprises an integrated package of biochemistry, instrumentation and software that sequences human genomes at the highest quality, lowest cost and largest scale available. Complete Genomics offers a turnkey service that enables customers to outsource their human genome sequencing to the company's genome sequencing center in Mountain View, CA, USA. Customers send in their DNA samples, the company does all the library preparation, DNA sequencing, assembly and variant analysis, and customers receive research-ready data that they can use for biological discovery.

  3. Genome-scale metabolic network of Cordyceps militaris useful for comparative analysis of entomopathogenic fungi.

    PubMed

    Vongsangnak, Wanwipa; Raethong, Nachon; Mujchariyakul, Warasinee; Nguyen, Nam Ninh; Leong, Hon Wai; Laoteng, Kobkul

    2017-08-30

    The first genome-scale metabolic network of Cordyceps militaris (iWV1170) was constructed representing its whole metabolisms, which consisted of 894 metabolites and 1,267 metabolic reactions across five compartments, including the plasma membrane, cytoplasm, mitochondria, peroxisome and extracellular space. The iWV1170 could be exploited to explain its phenotypes of growth ability, cordycepin and other metabolites production on various substrates. A high number of genes encoding extracellular enzymes for degradation of complex carbohydrates, lipids and proteins were existed in C. militaris genome. By comparative genome-scale analysis, the adenine metabolic pathway towards putative cordycepin biosynthesis was reconstructed, indicating their evolutionary relationships across eleven species of entomopathogenic fungi. The overall metabolic routes involved in the putative cordycepin biosynthesis were also identified in C. militaris, including central carbon metabolism, amino acid metabolism (glycine, l-glutamine and l-aspartate) and nucleotide metabolism (adenosine and adenine). Interestingly, a lack of the sequence coding for ribonucleotide reductase inhibitor was observed in C. militaris that might contribute to its over-production of cordycepin. Copyright © 2017. Published by Elsevier B.V.

  4. CPTAC researchers report first large-scale integrated proteomic and genomic analysis of a human cancer | Office of Cancer Clinical Proteomics Research

    Cancer.gov

    Investigators from the National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium (CPTAC) who comprehensively analyzed 95 human colorectal tumor samples, have determined how gene alterations identified in previous analyses of the same samples are expressed at the protein level. The integration of proteomic and genomic data, or proteogenomics, provides a more comprehensive view of the biological features that drive cancer than genomic analysis alone and may help identify the most important targets for cancer detection and intervention.

  5. Techniques for Large-Scale Bacterial Genome Manipulation and Characterization of the Mutants with Respect to In Silico Metabolic Reconstructions.

    PubMed

    diCenzo, George C; Finan, Turlough M

    2018-01-01

    The rate at which all genes within a bacterial genome can be identified far exceeds the ability to characterize these genes. To assist in associating genes with cellular functions, a large-scale bacterial genome deletion approach can be employed to rapidly screen tens to thousands of genes for desired phenotypes. Here, we provide a detailed protocol for the generation of deletions of large segments of bacterial genomes that relies on the activity of a site-specific recombinase. In this procedure, two recombinase recognition target sequences are introduced into known positions of a bacterial genome through single cross-over plasmid integration. Subsequent expression of the site-specific recombinase mediates recombination between the two target sequences, resulting in the excision of the intervening region and its loss from the genome. We further illustrate how this deletion system can be readily adapted to function as a large-scale in vivo cloning procedure, in which the region excised from the genome is captured as a replicative plasmid. We next provide a procedure for the metabolic analysis of bacterial large-scale genome deletion mutants using the Biolog Phenotype MicroArray™ system. Finally, a pipeline is described, and a sample Matlab script is provided, for the integration of the obtained data with a draft metabolic reconstruction for the refinement of the reactions and gene-protein-reaction relationships in a metabolic reconstruction.

  6. Cloud computing for genomic data analysis and collaboration.

    PubMed

    Langmead, Ben; Nellore, Abhinav

    2018-04-01

    Next-generation sequencing has made major strides in the past decade. Studies based on large sequencing data sets are growing in number, and public archives for raw sequencing data have been doubling in size every 18 months. Leveraging these data requires researchers to use large-scale computational resources. Cloud computing, a model whereby users rent computers and storage from large data centres, is a solution that is gaining traction in genomics research. Here, we describe how cloud computing is used in genomics for research and large-scale collaborations, and argue that its elasticity, reproducibility and privacy features make it ideally suited for the large-scale reanalysis of publicly available archived data, including privacy-protected data.

  7. Consistency Analysis of Genome-Scale Models of Bacterial Metabolism: A Metamodel Approach

    PubMed Central

    Ponce-de-Leon, Miguel; Calle-Espinosa, Jorge; Peretó, Juli; Montero, Francisco

    2015-01-01

    Genome-scale metabolic models usually contain inconsistencies that manifest as blocked reactions and gap metabolites. With the purpose to detect recurrent inconsistencies in metabolic models, a large-scale analysis was performed using a previously published dataset of 130 genome-scale models. The results showed that a large number of reactions (~22%) are blocked in all the models where they are present. To unravel the nature of such inconsistencies a metamodel was construed by joining the 130 models in a single network. This metamodel was manually curated using the unconnected modules approach, and then, it was used as a reference network to perform a gap-filling on each individual genome-scale model. Finally, a set of 36 models that had not been considered during the construction of the metamodel was used, as a proof of concept, to extend the metamodel with new biochemical information, and to assess its impact on gap-filling results. The analysis performed on the metamodel allowed to conclude: 1) the recurrent inconsistencies found in the models were already present in the metabolic database used during the reconstructions process; 2) the presence of inconsistencies in a metabolic database can be propagated to the reconstructed models; 3) there are reactions not manifested as blocked which are active as a consequence of some classes of artifacts, and; 4) the results of an automatic gap-filling are highly dependent on the consistency and completeness of the metamodel or metabolic database used as the reference network. In conclusion the consistency analysis should be applied to metabolic databases in order to detect and fill gaps as well as to detect and remove artifacts and redundant information. PMID:26629901

  8. A case study for cloud based high throughput analysis of NGS data using the globus genomics system

    DOE PAGES

    Bhuvaneshwar, Krithika; Sulakhe, Dinanath; Gauba, Robinder; ...

    2015-01-01

    Next generation sequencing (NGS) technologies produce massive amounts of data requiring a powerful computational infrastructure, high quality bioinformatics software, and skilled personnel to operate the tools. We present a case study of a practical solution to this data management and analysis challenge that simplifies terabyte scale data handling and provides advanced tools for NGS data analysis. These capabilities are implemented using the “Globus Genomics” system, which is an enhanced Galaxy workflow system made available as a service that offers users the capability to process and transfer data easily, reliably and quickly to address end-to-end NGS analysis requirements. The Globus Genomicsmore » system is built on Amazon's cloud computing infrastructure. The system takes advantage of elastic scaling of compute resources to run multiple workflows in parallel and it also helps meet the scale-out analysis needs of modern translational genomics research.« less

  9. A Chromosome-Scale Assembly of the Bactrocera cucurbitae Genome Provides Insight to the Genetic Basis of white pupae

    PubMed Central

    Sim, Sheina B.; Geib, Scott M.

    2017-01-01

    Genetic sexing strains (GSS) used in sterile insect technique (SIT) programs are textbook examples of how classical Mendelian genetics can be directly implemented in the management of agricultural insect pests. Although the foundation of traditionally developed GSS are single locus, autosomal recessive traits, their genetic basis are largely unknown. With the advent of modern genomic techniques, the genetic basis of sexing traits in GSS can now be further investigated. This study is the first of its kind to integrate traditional genetic techniques with emerging genomics to characterize a GSS using the tephritid fruit fly pest Bactrocera cucurbitae as a model. These techniques include whole-genome sequencing, the development of a mapping population and linkage map, and quantitative trait analysis. The experiment designed to map the genetic sexing trait in B. cucurbitae, white pupae (wp), also enabled the generation of a chromosome-scale genome assembly by integrating the linkage map with the assembly. Quantitative trait loci analysis revealed SNP loci near position 42 MB on chromosome 3 to be tightly linked to wp. Gene annotation and synteny analysis show a near perfect relationship between chromosomes in B. cucurbitae and Muller elements A–E in Drosophila melanogaster. This chromosome-scale genome assembly is complete, has high contiguity, was generated using a minimal input DNA, and will be used to further characterize the genetic mechanisms underlying wp. Knowledge of the genetic basis of genetic sexing traits can be used to improve SIT in this species and expand it to other economically important Diptera. PMID:28450369

  10. Large-scale chromosome folding versus genomic DNA sequences: A discrete double Fourier transform technique.

    PubMed

    Chechetkin, V R; Lobzin, V V

    2017-08-07

    Using state-of-the-art techniques combining imaging methods and high-throughput genomic mapping tools leaded to the significant progress in detailing chromosome architecture of various organisms. However, a gap still remains between the rapidly growing structural data on the chromosome folding and the large-scale genome organization. Could a part of information on the chromosome folding be obtained directly from underlying genomic DNA sequences abundantly stored in the databanks? To answer this question, we developed an original discrete double Fourier transform (DDFT). DDFT serves for the detection of large-scale genome regularities associated with domains/units at the different levels of hierarchical chromosome folding. The method is versatile and can be applied to both genomic DNA sequences and corresponding physico-chemical parameters such as base-pairing free energy. The latter characteristic is closely related to the replication and transcription and can also be used for the assessment of temperature or supercoiling effects on the chromosome folding. We tested the method on the genome of E. coli K-12 and found good correspondence with the annotated domains/units established experimentally. As a brief illustration of further abilities of DDFT, the study of large-scale genome organization for bacteriophage PHIX174 and bacterium Caulobacter crescentus was also added. The combined experimental, modeling, and bioinformatic DDFT analysis should yield more complete knowledge on the chromosome architecture and genome organization. Copyright © 2017 Elsevier Ltd. All rights reserved.

  11. Ensembl Genomes 2016: more genomes, more complexity.

    PubMed

    Kersey, Paul Julian; Allen, James E; Armean, Irina; Boddu, Sanjay; Bolt, Bruce J; Carvalho-Silva, Denise; Christensen, Mikkel; Davis, Paul; Falin, Lee J; Grabmueller, Christoph; Humphrey, Jay; Kerhornou, Arnaud; Khobova, Julia; Aranganathan, Naveen K; Langridge, Nicholas; Lowy, Ernesto; McDowall, Mark D; Maheswari, Uma; Nuhn, Michael; Ong, Chuang Kee; Overduin, Bert; Paulini, Michael; Pedro, Helder; Perry, Emily; Spudich, Giulietta; Tapanari, Electra; Walts, Brandon; Williams, Gareth; Tello-Ruiz, Marcela; Stein, Joshua; Wei, Sharon; Ware, Doreen; Bolser, Daniel M; Howe, Kevin L; Kulesha, Eugene; Lawson, Daniel; Maslen, Gareth; Staines, Daniel M

    2016-01-04

    Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species, complementing the resources for vertebrate genomics developed in the context of the Ensembl project (http://www.ensembl.org). Together, the two resources provide a consistent set of programmatic and interactive interfaces to a rich range of data including reference sequence, gene models, transcriptional data, genetic variation and comparative analysis. This paper provides an update to the previous publications about the resource, with a focus on recent developments. These include the development of new analyses and views to represent polyploid genomes (of which bread wheat is the primary exemplar); and the continued up-scaling of the resource, which now includes over 23 000 bacterial genomes, 400 fungal genomes and 100 protist genomes, in addition to 55 genomes from invertebrate metazoa and 39 genomes from plants. This dramatic increase in the number of included genomes is one part of a broader effort to automate the integration of archival data (genome sequence, but also associated RNA sequence data and variant calls) within the context of reference genomes and make it available through the Ensembl user interfaces. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  12. Ensembl Genomes 2016: more genomes, more complexity

    PubMed Central

    Kersey, Paul Julian; Allen, James E.; Armean, Irina; Boddu, Sanjay; Bolt, Bruce J.; Carvalho-Silva, Denise; Christensen, Mikkel; Davis, Paul; Falin, Lee J.; Grabmueller, Christoph; Humphrey, Jay; Kerhornou, Arnaud; Khobova, Julia; Aranganathan, Naveen K.; Langridge, Nicholas; Lowy, Ernesto; McDowall, Mark D.; Maheswari, Uma; Nuhn, Michael; Ong, Chuang Kee; Overduin, Bert; Paulini, Michael; Pedro, Helder; Perry, Emily; Spudich, Giulietta; Tapanari, Electra; Walts, Brandon; Williams, Gareth; Tello–Ruiz, Marcela; Stein, Joshua; Wei, Sharon; Ware, Doreen; Bolser, Daniel M.; Howe, Kevin L.; Kulesha, Eugene; Lawson, Daniel; Maslen, Gareth; Staines, Daniel M.

    2016-01-01

    Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species, complementing the resources for vertebrate genomics developed in the context of the Ensembl project (http://www.ensembl.org). Together, the two resources provide a consistent set of programmatic and interactive interfaces to a rich range of data including reference sequence, gene models, transcriptional data, genetic variation and comparative analysis. This paper provides an update to the previous publications about the resource, with a focus on recent developments. These include the development of new analyses and views to represent polyploid genomes (of which bread wheat is the primary exemplar); and the continued up-scaling of the resource, which now includes over 23 000 bacterial genomes, 400 fungal genomes and 100 protist genomes, in addition to 55 genomes from invertebrate metazoa and 39 genomes from plants. This dramatic increase in the number of included genomes is one part of a broader effort to automate the integration of archival data (genome sequence, but also associated RNA sequence data and variant calls) within the context of reference genomes and make it available through the Ensembl user interfaces. PMID:26578574

  13. Genome-wide association analysis based on multiple imputation with low-depth GBS data: application to biofuel traits in reed canarygrass

    USDA-ARS?s Scientific Manuscript database

    Genotyping-by-sequencing allows for large-scale genetic analyses in plant species with no reference genome, creating the challenge of sound inference in the presence of uncertain genotypes. Here we report an imputation-based genome-wide association study (GWAS) in reed canarygrass (Phalaris arundina...

  14. Gene Expression Analysis: Teaching Students to Do 30,000 Experiments at Once with Microarray

    ERIC Educational Resources Information Center

    Carvalho, Felicia I.; Johns, Christopher; Gillespie, Marc E.

    2012-01-01

    Genome scale experiments routinely produce large data sets that require computational analysis, yet there are few student-based labs that illustrate the design and execution of these experiments. In order for students to understand and participate in the genomic world, teaching labs must be available where students generate and analyze large data…

  15. Genome-scale CRISPR-Cas9 Knockout and Transcriptional Activation Screening

    PubMed Central

    Joung, Julia; Konermann, Silvana; Gootenberg, Jonathan S.; Abudayyeh, Omar O.; Platt, Randall J.; Brigham, Mark D.; Sanjana, Neville E.; Zhang, Feng

    2017-01-01

    Forward genetic screens are powerful tools for the unbiased discovery and functional characterization of specific genetic elements associated with a phenotype of interest. Recently, the RNA-guided endonuclease Cas9 from the microbial CRISPR (clustered regularly interspaced short palindromic repeats) immune system has been adapted for genome-scale screening by combining Cas9 with pooled guide RNA libraries. Here we describe a protocol for genome-scale knockout and transcriptional activation screening using the CRISPR-Cas9 system. Custom- or ready-made guide RNA libraries are constructed and packaged into lentiviral vectors for delivery into cells for screening. As each screen is unique, we provide guidelines for determining screening parameters and maintaining sufficient coverage. To validate candidate genes identified from the screen, we further describe strategies for confirming the screening phenotype as well as genetic perturbation through analysis of indel rate and transcriptional activation. Beginning with library design, a genome-scale screen can be completed in 9–15 weeks followed by 4–5 weeks of validation. PMID:28333914

  16. Genome-scale CRISPR-Cas9 knockout and transcriptional activation screening.

    PubMed

    Joung, Julia; Konermann, Silvana; Gootenberg, Jonathan S; Abudayyeh, Omar O; Platt, Randall J; Brigham, Mark D; Sanjana, Neville E; Zhang, Feng

    2017-04-01

    Forward genetic screens are powerful tools for the unbiased discovery and functional characterization of specific genetic elements associated with a phenotype of interest. Recently, the RNA-guided endonuclease Cas9 from the microbial CRISPR (clustered regularly interspaced short palindromic repeats) immune system has been adapted for genome-scale screening by combining Cas9 with pooled guide RNA libraries. Here we describe a protocol for genome-scale knockout and transcriptional activation screening using the CRISPR-Cas9 system. Custom- or ready-made guide RNA libraries are constructed and packaged into lentiviral vectors for delivery into cells for screening. As each screen is unique, we provide guidelines for determining screening parameters and maintaining sufficient coverage. To validate candidate genes identified by the screen, we further describe strategies for confirming the screening phenotype, as well as genetic perturbation, through analysis of indel rate and transcriptional activation. Beginning with library design, a genome-scale screen can be completed in 9-15 weeks, followed by 4-5 weeks of validation.

  17. Enhanced annotations and features for comparing thousands of Pseudomonas genomes in the Pseudomonas genome database.

    PubMed

    Winsor, Geoffrey L; Griffiths, Emma J; Lo, Raymond; Dhillon, Bhavjinder K; Shay, Julie A; Brinkman, Fiona S L

    2016-01-04

    The Pseudomonas Genome Database (http://www.pseudomonas.com) is well known for the application of community-based annotation approaches for producing a high-quality Pseudomonas aeruginosa PAO1 genome annotation, and facilitating whole-genome comparative analyses with other Pseudomonas strains. To aid analysis of potentially thousands of complete and draft genome assemblies, this database and analysis platform was upgraded to integrate curated genome annotations and isolate metadata with enhanced tools for larger scale comparative analysis and visualization. Manually curated gene annotations are supplemented with improved computational analyses that help identify putative drug targets and vaccine candidates or assist with evolutionary studies by identifying orthologs, pathogen-associated genes and genomic islands. The database schema has been updated to integrate isolate metadata that will facilitate more powerful analysis of genomes across datasets in the future. We continue to place an emphasis on providing high-quality updates to gene annotations through regular review of the scientific literature and using community-based approaches including a major new Pseudomonas community initiative for the assignment of high-quality gene ontology terms to genes. As we further expand from thousands of genomes, we plan to provide enhancements that will aid data visualization and analysis arising from whole-genome comparative studies including more pan-genome and population-based approaches. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  18. The next generation of melanocyte data: Genetic, epigenetic, and transcriptional resource datasets and analysis tools.

    PubMed

    Loftus, Stacie K

    2018-05-01

    The number of melanocyte- and melanoma-derived next generation sequence genome-scale datasets have rapidly expanded over the past several years. This resource guide provides a summary of publicly available sources of melanocyte cell derived whole genome, exome, mRNA and miRNA transcriptome, chromatin accessibility and epigenetic datasets. Also highlighted are bioinformatic resources and tools for visualization and data queries which allow researchers a genome-scale view of the melanocyte. Published 2018. This article is a U.S. Government work and is in the public domain in the USA.

  19. A multi-objective constraint-based approach for modeling genome-scale microbial ecosystems.

    PubMed

    Budinich, Marko; Bourdon, Jérémie; Larhlimi, Abdelhalim; Eveillard, Damien

    2017-01-01

    Interplay within microbial communities impacts ecosystems on several scales, and elucidation of the consequent effects is a difficult task in ecology. In particular, the integration of genome-scale data within quantitative models of microbial ecosystems remains elusive. This study advocates the use of constraint-based modeling to build predictive models from recent high-resolution -omics datasets. Following recent studies that have demonstrated the accuracy of constraint-based models (CBMs) for simulating single-strain metabolic networks, we sought to study microbial ecosystems as a combination of single-strain metabolic networks that exchange nutrients. This study presents two multi-objective extensions of CBMs for modeling communities: multi-objective flux balance analysis (MO-FBA) and multi-objective flux variability analysis (MO-FVA). Both methods were applied to a hot spring mat model ecosystem. As a result, multiple trade-offs between nutrients and growth rates, as well as thermodynamically favorable relative abundances at community level, were emphasized. We expect this approach to be used for integrating genomic information in microbial ecosystems. Following models will provide insights about behaviors (including diversity) that take place at the ecosystem scale.

  20. Single cell Hi-C reveals cell-to-cell variability in chromosome structure

    PubMed Central

    Schoenfelder, Stefan; Yaffe, Eitan; Dean, Wendy; Laue, Ernest D.; Tanay, Amos; Fraser, Peter

    2013-01-01

    Large-scale chromosome structure and spatial nuclear arrangement have been linked to control of gene expression and DNA replication and repair. Genomic techniques based on chromosome conformation capture assess contacts for millions of loci simultaneously, but do so by averaging chromosome conformations from millions of nuclei. Here we introduce single cell Hi-C, combined with genome-wide statistical analysis and structural modeling of single copy X chromosomes, to show that individual chromosomes maintain domain organisation at the megabase scale, but show variable cell-to-cell chromosome territory structures at larger scales. Despite this structural stochasticity, localisation of active gene domains to boundaries of territories is a hallmark of chromosomal conformation. Single cell Hi-C data bridge current gaps between genomics and microscopy studies of chromosomes, demonstrating how modular organisation underlies dynamic chromosome structure, and how this structure is probabilistically linked with genome activity patterns. PMID:24067610

  1. GWAMA: software for genome-wide association meta-analysis.

    PubMed

    Mägi, Reedik; Morris, Andrew P

    2010-05-28

    Despite the recent success of genome-wide association studies in identifying novel loci contributing effects to complex human traits, such as type 2 diabetes and obesity, much of the genetic component of variation in these phenotypes remains unexplained. One way to improving power to detect further novel loci is through meta-analysis of studies from the same population, increasing the sample size over any individual study. Although statistical software analysis packages incorporate routines for meta-analysis, they are ill equipped to meet the challenges of the scale and complexity of data generated in genome-wide association studies. We have developed flexible, open-source software for the meta-analysis of genome-wide association studies. The software incorporates a variety of error trapping facilities, and provides a range of meta-analysis summary statistics. The software is distributed with scripts that allow simple formatting of files containing the results of each association study and generate graphical summaries of genome-wide meta-analysis results. The GWAMA (Genome-Wide Association Meta-Analysis) software has been developed to perform meta-analysis of summary statistics generated from genome-wide association studies of dichotomous phenotypes or quantitative traits. Software with source files, documentation and example data files are freely available online at http://www.well.ox.ac.uk/GWAMA.

  2. Biofilm Formation Mechanisms of Pseudomonas aeruginosa Predicted via Genome-Scale Kinetic Models of Bacterial Metabolism

    DTIC Science & Technology

    2016-03-15

    mutants hisC1 (PA4447), hisD (PA4448), hutH (PA5098), and PA0006. We predicted that uro - canate was depleted in these high biofilm-producing mutants and...Lam DK, Fleming L, Lo R, Whiteside MD, Yu NY, et al. PseudomonasGenome Database: improved comparative analysis and population genomics capability for

  3. Family-specific scaling laws in bacterial genomes.

    PubMed

    De Lazzari, Eleonora; Grilli, Jacopo; Maslov, Sergei; Cosentino Lagomarsino, Marco

    2017-07-27

    Among several quantitative invariants found in evolutionary genomics, one of the most striking is the scaling of the overall abundance of proteins, or protein domains, sharing a specific functional annotation across genomes of given size. The size of these functional categories change, on average, as power-laws in the total number of protein-coding genes. Here, we show that such regularities are not restricted to the overall behavior of high-level functional categories, but also exist systematically at the level of single evolutionary families of protein domains. Specifically, the number of proteins within each family follows family-specific scaling laws with genome size. Functionally similar sets of families tend to follow similar scaling laws, but this is not always the case. To understand this systematically, we provide a comprehensive classification of families based on their scaling properties. Additionally, we develop a quantitative score for the heterogeneity of the scaling of families belonging to a given category or predefined group. Under the common reasonable assumption that selection is driven solely or mainly by biological function, these findings point to fine-tuned and interdependent functional roles of specific protein domains, beyond our current functional annotations. This analysis provides a deeper view on the links between evolutionary expansion of protein families and the functional constraints shaping the gene repertoire of bacterial genomes. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  4. Factors affecting reproducibility between genome-scale siRNA-based screens

    PubMed Central

    Barrows, Nicholas J.; Le Sommer, Caroline; Garcia-Blanco, Mariano A.; Pearson, James L.

    2011-01-01

    RNA interference-based screening is a powerful new genomic technology which addresses gene function en masse. To evaluate factors influencing hit list composition and reproducibility, we performed two identically designed small interfering RNA (siRNA)-based, whole genome screens for host factors supporting yellow fever virus infection. These screens represent two separate experiments completed five months apart and allow the direct assessment of the reproducibility of a given siRNA technology when performed in the same environment. Candidate hit lists generated by sum rank, median absolute deviation, z-score, and strictly standardized mean difference were compared within and between whole genome screens. Application of these analysis methodologies within a single screening dataset using a fixed threshold equivalent to a p-value ≤ 0.001 resulted in hit lists ranging from 82 to 1,140 members and highlighted the tremendous impact analysis methodology has on hit list composition. Intra- and inter-screen reproducibility was significantly influenced by the analysis methodology and ranged from 32% to 99%. This study also highlighted the power of testing at least two independent siRNAs for each gene product in primary screens. To facilitate validation we conclude by suggesting methods to reduce false discovery at the primary screening stage. In this study we present the first comprehensive comparison of multiple analysis strategies, and demonstrate the impact of the analysis methodology on the composition of the “hit list”. Therefore, we propose that the entire dataset derived from functional genome-scale screens, especially if publicly funded, should be made available as is done with data derived from gene expression and genome-wide association studies. PMID:20625183

  5. Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr.

    PubMed

    Privé, Florian; Aschard, Hugues; Ziyatdinov, Andrey; Blum, Michael G B

    2017-03-30

    Genome-wide datasets produced for association studies have dramatically increased in size over the past few years, with modern datasets commonly including millions of variants measured in dozens of thousands of individuals. This increase in data size is a major challenge severely slowing down genomic analyses, leading to some software becoming obsolete and researchers having limited access to diverse analysis tools. Here we present two R packages, bigstatsr and bigsnpr, allowing for the analysis of large scale genomic data to be performed within R. To address large data size, the packages use memory-mapping for accessing data matrices stored on disk instead of in RAM. To perform data pre-processing and data analysis, the packages integrate most of the tools that are commonly used, either through transparent system calls to existing software, or through updated or improved implementation of existing methods. In particular, the packages implement fast and accurate computations of principal component analysis and association studies, functions to remove SNPs in linkage disequilibrium and algorithms to learn polygenic risk scores on millions of SNPs. We illustrate applications of the two R packages by analyzing a case-control genomic dataset for celiac disease, performing an association study and computing Polygenic Risk Scores. Finally, we demonstrate the scalability of the R packages by analyzing a simulated genome-wide dataset including 500,000 individuals and 1 million markers on a single desktop computer. https://privefl.github.io/bigstatsr/ & https://privefl.github.io/bigsnpr/. florian.prive@univ-grenoble-alpes.fr & michael.blum@univ-grenoble-alpes.fr. Supplementary materials are available at Bioinformatics online.

  6. Genome-based microbial ecology of anammox granules in a full-scale wastewater treatment system.

    PubMed

    Speth, Daan R; In 't Zandt, Michiel H; Guerrero-Cruz, Simon; Dutilh, Bas E; Jetten, Mike S M

    2016-03-31

    Partial-nitritation anammox (PNA) is a novel wastewater treatment procedure for energy-efficient ammonium removal. Here we use genome-resolved metagenomics to build a genome-based ecological model of the microbial community in a full-scale PNA reactor. Sludge from the bioreactor examined here is used to seed reactors in wastewater treatment plants around the world; however, the role of most of its microbial community in ammonium removal remains unknown. Our analysis yielded 23 near-complete draft genomes that together represent the majority of the microbial community. We assign these genomes to distinct anaerobic and aerobic microbial communities. In the aerobic community, nitrifying organisms and heterotrophs predominate. In the anaerobic community, widespread potential for partial denitrification suggests a nitrite loop increases treatment efficiency. Of our genomes, 19 have no previously cultivated or sequenced close relatives and six belong to bacterial phyla without any cultivated members, including the most complete Omnitrophica (formerly OP3) genome to date.

  7. Genome-based microbial ecology of anammox granules in a full-scale wastewater treatment system

    PubMed Central

    Speth, Daan R.; in 't Zandt, Michiel H.; Guerrero-Cruz, Simon; Dutilh, Bas E.; Jetten, Mike S. M.

    2016-01-01

    Partial-nitritation anammox (PNA) is a novel wastewater treatment procedure for energy-efficient ammonium removal. Here we use genome-resolved metagenomics to build a genome-based ecological model of the microbial community in a full-scale PNA reactor. Sludge from the bioreactor examined here is used to seed reactors in wastewater treatment plants around the world; however, the role of most of its microbial community in ammonium removal remains unknown. Our analysis yielded 23 near-complete draft genomes that together represent the majority of the microbial community. We assign these genomes to distinct anaerobic and aerobic microbial communities. In the aerobic community, nitrifying organisms and heterotrophs predominate. In the anaerobic community, widespread potential for partial denitrification suggests a nitrite loop increases treatment efficiency. Of our genomes, 19 have no previously cultivated or sequenced close relatives and six belong to bacterial phyla without any cultivated members, including the most complete Omnitrophica (formerly OP3) genome to date. PMID:27029554

  8. Metabolic Network Modeling of Microbial Communities

    PubMed Central

    Biggs, Matthew B.; Medlock, Gregory L.; Kolling, Glynis L.

    2015-01-01

    Genome-scale metabolic network reconstructions and constraint-based analysis are powerful methods that have the potential to make functional predictions about microbial communities. Current use of genome-scale metabolic networks to characterize the metabolic functions of microbial communities includes species compartmentalization, separating species-level and community-level objectives, dynamic analysis, the “enzyme-soup” approach, multi-scale modeling, and others. There are many challenges inherent to the field, including a need for tools that accurately assign high-level omics signals to individual community members, new automated reconstruction methods that rival manual curation, and novel algorithms for integrating omics data and engineering communities. As technologies and modeling frameworks improve, we expect that there will be proportional advances in the fields of ecology, health science, and microbial community engineering. PMID:26109480

  9. Genome-scale analysis of positionally relocated genes

    PubMed Central

    Bhutkar, Arjun; Russo, Susan M.; Smith, Temple F.; Gelbart, William M.

    2007-01-01

    During evolution, genome reorganization includes large-scale events such as inversions, translocations, and segmental or even whole-genome duplications, as well as fine-scale events such as the relocation of individual genes. This latter category, which we will refer to as positionally relocated genes (PRGs), is the subject of this report. Assessment of the magnitude of such PRGs and of possible contributing mechanisms is aided by a comparative analysis of related genomes, where conserved chromosomal organization can aid in identifying genes that have acquired a new location in a lineage of these genomes. Here we utilize two methods to comprehensively identify relocated protein-coding genes in the recently sequenced genomes of 12 species of genus Drosophila. We use exceptions to the general rule of maintenance of chromosome arm (Muller element) association for most Drosophila genes to identify one major class of PRGs. We also identify a partially overlapping set of PRGs among “embedded genes,” located within the extents of other surrounding genes. We provide evidence that PRG movements have at least two different origins: Some events occur via retrotransposition of processed RNAs and others via a DNA-based transposition mechanism. Overall, we identify several hundred PRGs that arose within a lineage of the genus Drosophila phylogeny and provide suggestive evidence that a few thousand such events have occurred within the radiation of the insect order Diptera, thereby illustrating the magnitude of the contribution of PRG movement to chromosomal reorganization during evolution. PMID:17989252

  10. Outlier analysis of functional genomic profiles enriches for oncology targets and enables precision medicine.

    PubMed

    Zhu, Zhou; Ihle, Nathan T; Rejto, Paul A; Zarrinkar, Patrick P

    2016-06-13

    Genome-scale functional genomic screens across large cell line panels provide a rich resource for discovering tumor vulnerabilities that can lead to the next generation of targeted therapies. Their data analysis typically has focused on identifying genes whose knockdown enhances response in various pre-defined genetic contexts, which are limited by biological complexities as well as the incompleteness of our knowledge. We thus introduce a complementary data mining strategy to identify genes with exceptional sensitivity in subsets, or outlier groups, of cell lines, allowing an unbiased analysis without any a priori assumption about the underlying biology of dependency. Genes with outlier features are strongly and specifically enriched with those known to be associated with cancer and relevant biological processes, despite no a priori knowledge being used to drive the analysis. Identification of exceptional responders (outliers) may not lead only to new candidates for therapeutic intervention, but also tumor indications and response biomarkers for companion precision medicine strategies. Several tumor suppressors have an outlier sensitivity pattern, supporting and generalizing the notion that tumor suppressors can play context-dependent oncogenic roles. The novel application of outlier analysis described here demonstrates a systematic and data-driven analytical strategy to decipher large-scale functional genomic data for oncology target and precision medicine discoveries.

  11. Genome-scale metabolic model for Lactococcus lactis MG1363 and its application to the analysis of flavor formation.

    PubMed

    Flahaut, Nicolas A L; Wiersma, Anne; van de Bunt, Bert; Martens, Dirk E; Schaap, Peter J; Sijtsma, Lolke; Dos Santos, Vitor A Martins; de Vos, Willem M

    2013-10-01

    Lactococcus lactis subsp. cremoris MG1363 is a paradigm strain for lactococci used in industrial dairy fermentations. However, despite of its importance for process development, no genome-scale metabolic model has been reported thus far. Moreover, current models for other lactococci only focus on growth and sugar degradation. A metabolic model that includes nitrogen metabolism and flavor-forming pathways is instrumental for the understanding and designing new industrial applications of these lactic acid bacteria. A genome-scale, constraint-based model of the metabolism and transport in L. lactis MG1363, accounting for 518 genes, 754 reactions, and 650 metabolites, was developed and experimentally validated. Fifty-nine reactions are directly or indirectly involved in flavor formation. Flux Balance Analysis and Flux Variability Analysis were used to investigate flux distributions within the whole metabolic network. Anaerobic carbon-limited continuous cultures were used for estimating the energetic parameters. A thorough model-driven analysis showing a highly flexible nitrogen metabolism, e.g., branched-chain amino acid catabolism which coupled with the redox balance, is pivotal for the prediction of the formation of different flavor compounds. Furthermore, the model predicted the formation of volatile sulfur compounds as a result of the fermentation. These products were subsequently identified in the experimental fermentations carried out. Thus, the genome-scale metabolic model couples the carbon and nitrogen metabolism in L. lactis MG1363 with complete known catabolic pathways leading to flavor formation. The model provided valuable insights into the metabolic networks underlying flavor formation and has the potential to contribute to new developments in dairy industries and cheese-flavor research.

  12. Identification and Characterization of Genomic Amplifications in Ovarian Serous Carcinoma

    DTIC Science & Technology

    2009-07-01

    oncogenes, Rsf1 and Notch3, which were up-regulated in both genomic DNA and transcript levels in ovarian cancer. In a large- scale FISH analysis, Rsf1...associated with worse disease outcome, suggesting that Rsf1 could be potentially used as a prognostic marker in the future (Appendix #1). For the...over- expressed in a recurrent carcinoma. Although the follow-up study in a larger- scale sample size did not demonstrate clear amplification in NAC1

  13. Cloud-Scale Genomic Signals Processing for Robust Large-Scale Cancer Genomic Microarray Data Analysis.

    PubMed

    Harvey, Benjamin Simeon; Ji, Soo-Yeon

    2017-01-01

    As microarray data available to scientists continues to increase in size and complexity, it has become overwhelmingly important to find multiple ways to bring forth oncological inference to the bioinformatics community through the analysis of large-scale cancer genomic (LSCG) DNA and mRNA microarray data that is useful to scientists. Though there have been many attempts to elucidate the issue of bringing forth biological interpretation by means of wavelet preprocessing and classification, there has not been a research effort that focuses on a cloud-scale distributed parallel (CSDP) separable 1-D wavelet decomposition technique for denoising through differential expression thresholding and classification of LSCG microarray data. This research presents a novel methodology that utilizes a CSDP separable 1-D method for wavelet-based transformation in order to initialize a threshold which will retain significantly expressed genes through the denoising process for robust classification of cancer patients. Additionally, the overall study was implemented and encompassed within CSDP environment. The utilization of cloud computing and wavelet-based thresholding for denoising was used for the classification of samples within the Global Cancer Map, Cancer Cell Line Encyclopedia, and The Cancer Genome Atlas. The results proved that separable 1-D parallel distributed wavelet denoising in the cloud and differential expression thresholding increased the computational performance and enabled the generation of higher quality LSCG microarray datasets, which led to more accurate classification results.

  14. Genome scale transcriptomics of baculovirus-insect interactions.

    PubMed

    Nguyen, Quan; Nielsen, Lars K; Reid, Steven

    2013-11-12

    Baculovirus-insect cell technologies are applied in the production of complex proteins, veterinary and human vaccines, gene delivery vectors' and biopesticides. Better understanding of how baculoviruses and insect cells interact would facilitate baculovirus-based production. While complete genomic sequences are available for over 58 baculovirus species, little insect genomic information is known. The release of the Bombyx mori and Plutella xylostella genomes, the accumulation of EST sequences for several Lepidopteran species, and especially the availability of two genome-scale analysis tools, namely oligonucleotide microarrays and next generation sequencing (NGS), have facilitated expression studies to generate a rich picture of insect gene responses to baculovirus infections. This review presents current knowledge on the interaction dynamics of the baculovirus-insect system' which is relatively well studied in relation to nucleocapsid transportation, apoptosis, and heat shock responses, but is still poorly understood regarding responses involved in pro-survival pathways, DNA damage pathways, protein degradation, translation, signaling pathways, RNAi pathways, and importantly metabolic pathways for energy, nucleotide and amino acid production. We discuss how the two genome-scale transcriptomic tools can be applied for studying such pathways and suggest that proteomics and metabolomics can produce complementary findings to transcriptomic studies.

  15. Joint scaling laws in functional and evolutionary categories in prokaryotic genomes

    PubMed Central

    Grilli, J.; Bassetti, B.; Maslov, S.; Cosentino Lagomarsino, M.

    2012-01-01

    We propose and study a class-expansion/innovation/loss model of genome evolution taking into account biological roles of genes and their constituent domains. In our model, numbers of genes in different functional categories are coupled to each other. For example, an increase in the number of metabolic enzymes in a genome is usually accompanied by addition of new transcription factors regulating these enzymes. Such coupling can be thought of as a proportional ‘recipe’ for genome composition of the type ‘a spoonful of sugar for each egg yolk’. The model jointly reproduces two known empirical laws: the distribution of family sizes and the non-linear scaling of the number of genes in certain functional categories (e.g. transcription factors) with genome size. In addition, it allows us to derive a novel relation between the exponents characterizing these two scaling laws, establishing a direct quantitative connection between evolutionary and functional categories. It predicts that functional categories that grow faster-than-linearly with genome size to be characterized by flatter-than-average family size distributions. This relation is confirmed by our bioinformatics analysis of prokaryotic genomes. This proves that the joint quantitative trends of functional and evolutionary classes can be understood in terms of evolutionary growth with proportional recipes. PMID:21937509

  16. Evaluation of Targeted Sequencing for Transcriptional Analysis of Archival Formalin-Fixed Paraffin-Embedded (FFPE) Samples

    EPA Science Inventory

    Next-generation sequencing provides unprecedented access to genomic information in archival FFPE tissue samples. However, costs and technical challenges related to RNA isolation and enrichment limit use of whole-genome RNA-sequencing for large-scale studies of FFPE specimens. Rec...

  17. Microarray Data Processing Techniques for Genome-Scale Network Inference from Large Public Repositories.

    PubMed

    Chockalingam, Sriram; Aluru, Maneesha; Aluru, Srinivas

    2016-09-19

    Pre-processing of microarray data is a well-studied problem. Furthermore, all popular platforms come with their own recommended best practices for differential analysis of genes. However, for genome-scale network inference using microarray data collected from large public repositories, these methods filter out a considerable number of genes. This is primarily due to the effects of aggregating a diverse array of experiments with different technical and biological scenarios. Here we introduce a pre-processing pipeline suitable for inferring genome-scale gene networks from large microarray datasets. We show that partitioning of the available microarray datasets according to biological relevance into tissue- and process-specific categories significantly extends the limits of downstream network construction. We demonstrate the effectiveness of our pre-processing pipeline by inferring genome-scale networks for the model plant Arabidopsis thaliana using two different construction methods and a collection of 11,760 Affymetrix ATH1 microarray chips. Our pre-processing pipeline and the datasets used in this paper are made available at http://alurulab.cc.gatech.edu/microarray-pp.

  18. Evaluation of a Genome-Scale In Silico Metabolic Model for Geobacter metallireducens Using Proteomic Data from a Field Biostimulation Experiment

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Fang, Yilin; Wilkins, Michael J.; Yabusaki, Steven B.

    2012-12-12

    Biomass and shotgun global proteomics data that reflected relative protein abundances from samples collected during the 2008 experiment at the U.S. Department of Energy Integrated Field-Scale Subsurface Research Challenge site in Rifle, Colorado, provided an unprecedented opportunity to validate a genome-scale metabolic model of Geobacter metallireducens and assess its performance with respect to prediction of metal reduction, biomass yield, and growth rate under dynamic field conditions. Reconstructed from annotated genomic sequence, biochemical, and physiological data, the constraint-based in silico model of G. metallireducens relates an annotated genome sequence to the physiological functions with 697 reactions controlled by 747 enzyme-coding genes.more » Proteomic analysis showed that 180 of the 637 G. metallireducens proteins detected during the 2008 experiment were associated with specific metabolic reactions in the in silico model. When the field-calibrated Fe(III) terminal electron acceptor process reaction in a reactive transport model for the field experiments was replaced with the genome-scale model, the model predicted that the largest metabolic fluxes through the in silico model reactions generally correspond to the highest abundances of proteins that catalyze those reactions. Central metabolism predicted by the model agrees well with protein abundance profiles inferred from proteomic analysis. Model discrepancies with the proteomic data, such as the relatively low fluxes through amino acid transport and metabolism, revealed pathways or flux constraints in the in silico model that could be updated to more accurately predict metabolic processes that occur in the subsurface environment.« less

  19. Dating and functional characterization of duplicated genes in the apple (Malus domestica Borkh.) by analyzing EST data.

    PubMed

    Sanzol, Javier

    2010-05-14

    Gene duplication is central to genome evolution. In plants, genes can be duplicated through small-scale events and large-scale duplications often involving polyploidy. The apple belongs to the subtribe Pyrinae (Rosaceae), a diverse lineage that originated via allopolyploidization. Both small-scale duplications and polyploidy may have been important mechanisms shaping the genome of this species. This study evaluates the gene duplication and polyploidy history of the apple by characterizing duplicated genes in this species using EST data. Overall, 68% of the apple genes were clustered into families with a mean copy-number of 4.6. Analysis of the age distribution of gene duplications supported a continuous mode of small-scale duplications, plus two episodes of large-scale duplicates of vastly different ages. The youngest was consistent with the polyploid origin of the Pyrinae 37-48 MYBP, whereas the older may be related to gamma-triplication; an ancient hexapolyploidization previously characterized in the four sequenced eurosid genomes and basal to the eurosid-asterid divergence. Duplicated genes were studied for functional diversification with an emphasis on young paralogs; those originated during or after the formation of the Pyrinae lineage. Unequal assignment of single-copy genes and gene families to Gene Ontology categories suggested functional bias in the pattern of gene retention of paralogs. Young paralogs related to signal transduction, metabolism, and energy pathways have been preferentially retained. Non-random retention of duplicated genes seems to have mediated the expansion of gene families, some of which may have substantially increased their members after the origin of the Pyrinae. The joint analysis of over-duplicated functional categories and phylogenies, allowed evaluation of the role of both polyploidy and small-scale duplications during this process. Finally, gene expression analysis indicated that 82% of duplicated genes, including 80% of young paralogs, showed uncorrelated expression profiles, suggesting extensive subfunctionalization and a role of gene duplication in the acquisition of novel patterns of gene expression. This study reports a genome-wide analysis of the mode of gene duplication in the apple, and provides evidence for its role in genome functional diversification by characterising three major processes: selective retention of paralogs, amplification of gene families, and changes in gene expression.

  20. Dichlorvos Exposure Results in Large Scale Disruption of Energy Metabolism in the Liver of the Zebra Fish, Danio Rerio

    DTIC Science & Technology

    2015-10-24

    zebrafish reference genome sequence and its relationship to the human genome . Nature. 2013;496(7446):498–503. 21. Linney E, Upchurch L, Donerly S. Zebrafish...To obtain a broader understanding of the effects of dichlorvos on liver metabolism, we per- formed a genome -wide analysis of gene expression in the ...condition) for whole genome transcript ana- lysis, and fixed another set of fish for histological evaluation (n = 5/condition). We determined the target

  1. MultiMetEval: Comparative and Multi-Objective Analysis of Genome-Scale Metabolic Models

    PubMed Central

    Gevorgyan, Albert; Kierzek, Andrzej M.; Breitling, Rainer; Takano, Eriko

    2012-01-01

    Comparative metabolic modelling is emerging as a novel field, supported by the development of reliable and standardized approaches for constructing genome-scale metabolic models in high throughput. New software solutions are needed to allow efficient comparative analysis of multiple models in the context of multiple cellular objectives. Here, we present the user-friendly software framework Multi-Metabolic Evaluator (MultiMetEval), built upon SurreyFBA, which allows the user to compose collections of metabolic models that together can be subjected to flux balance analysis. Additionally, MultiMetEval implements functionalities for multi-objective analysis by calculating the Pareto front between two cellular objectives. Using a previously generated dataset of 38 actinobacterial genome-scale metabolic models, we show how these approaches can lead to exciting novel insights. Firstly, after incorporating several pathways for the biosynthesis of natural products into each of these models, comparative flux balance analysis predicted that species like Streptomyces that harbour the highest diversity of secondary metabolite biosynthetic gene clusters in their genomes do not necessarily have the metabolic network topology most suitable for compound overproduction. Secondly, multi-objective analysis of biomass production and natural product biosynthesis in these actinobacteria shows that the well-studied occurrence of discrete metabolic switches during the change of cellular objectives is inherent to their metabolic network architecture. Comparative and multi-objective modelling can lead to insights that could not be obtained by normal flux balance analyses. MultiMetEval provides a powerful platform that makes these analyses straightforward for biologists. Sources and binaries of MultiMetEval are freely available from https://github.com/PiotrZakrzewski/MetEval/downloads. PMID:23272111

  2. Rainbow: a tool for large-scale whole-genome sequencing data analysis using cloud computing.

    PubMed

    Zhao, Shanrong; Prenger, Kurt; Smith, Lance; Messina, Thomas; Fan, Hongtao; Jaeger, Edward; Stephens, Susan

    2013-06-27

    Technical improvements have decreased sequencing costs and, as a result, the size and number of genomic datasets have increased rapidly. Because of the lower cost, large amounts of sequence data are now being produced by small to midsize research groups. Crossbow is a software tool that can detect single nucleotide polymorphisms (SNPs) in whole-genome sequencing (WGS) data from a single subject; however, Crossbow has a number of limitations when applied to multiple subjects from large-scale WGS projects. The data storage and CPU resources that are required for large-scale whole genome sequencing data analyses are too large for many core facilities and individual laboratories to provide. To help meet these challenges, we have developed Rainbow, a cloud-based software package that can assist in the automation of large-scale WGS data analyses. Here, we evaluated the performance of Rainbow by analyzing 44 different whole-genome-sequenced subjects. Rainbow has the capacity to process genomic data from more than 500 subjects in two weeks using cloud computing provided by the Amazon Web Service. The time includes the import and export of the data using Amazon Import/Export service. The average cost of processing a single sample in the cloud was less than 120 US dollars. Compared with Crossbow, the main improvements incorporated into Rainbow include the ability: (1) to handle BAM as well as FASTQ input files; (2) to split large sequence files for better load balance downstream; (3) to log the running metrics in data processing and monitoring multiple Amazon Elastic Compute Cloud (EC2) instances; and (4) to merge SOAPsnp outputs for multiple individuals into a single file to facilitate downstream genome-wide association studies. Rainbow is a scalable, cost-effective, and open-source tool for large-scale WGS data analysis. For human WGS data sequenced by either the Illumina HiSeq 2000 or HiSeq 2500 platforms, Rainbow can be used straight out of the box. Rainbow is available for third-party implementation and use, and can be downloaded from http://s3.amazonaws.com/jnj_rainbow/index.html.

  3. Rainbow: a tool for large-scale whole-genome sequencing data analysis using cloud computing

    PubMed Central

    2013-01-01

    Background Technical improvements have decreased sequencing costs and, as a result, the size and number of genomic datasets have increased rapidly. Because of the lower cost, large amounts of sequence data are now being produced by small to midsize research groups. Crossbow is a software tool that can detect single nucleotide polymorphisms (SNPs) in whole-genome sequencing (WGS) data from a single subject; however, Crossbow has a number of limitations when applied to multiple subjects from large-scale WGS projects. The data storage and CPU resources that are required for large-scale whole genome sequencing data analyses are too large for many core facilities and individual laboratories to provide. To help meet these challenges, we have developed Rainbow, a cloud-based software package that can assist in the automation of large-scale WGS data analyses. Results Here, we evaluated the performance of Rainbow by analyzing 44 different whole-genome-sequenced subjects. Rainbow has the capacity to process genomic data from more than 500 subjects in two weeks using cloud computing provided by the Amazon Web Service. The time includes the import and export of the data using Amazon Import/Export service. The average cost of processing a single sample in the cloud was less than 120 US dollars. Compared with Crossbow, the main improvements incorporated into Rainbow include the ability: (1) to handle BAM as well as FASTQ input files; (2) to split large sequence files for better load balance downstream; (3) to log the running metrics in data processing and monitoring multiple Amazon Elastic Compute Cloud (EC2) instances; and (4) to merge SOAPsnp outputs for multiple individuals into a single file to facilitate downstream genome-wide association studies. Conclusions Rainbow is a scalable, cost-effective, and open-source tool for large-scale WGS data analysis. For human WGS data sequenced by either the Illumina HiSeq 2000 or HiSeq 2500 platforms, Rainbow can be used straight out of the box. Rainbow is available for third-party implementation and use, and can be downloaded from http://s3.amazonaws.com/jnj_rainbow/index.html. PMID:23802613

  4. Construction of a plant-transformation-competent BIBAC library and genome sequence analysis of polyploid Upland cotton (Gossypium hirsutum L.)

    PubMed Central

    2013-01-01

    Background Cotton, one of the world’s leading crops, is important to the world’s textile and energy industries, and is a model species for studies of plant polyploidization, cellulose biosynthesis and cell wall biogenesis. Here, we report the construction of a plant-transformation-competent binary bacterial artificial chromosome (BIBAC) library and comparative genome sequence analysis of polyploid Upland cotton (Gossypium hirsutum L.) with one of its diploid putative progenitor species, G. raimondii Ulbr. Results We constructed the cotton BIBAC library in a vector competent for high-molecular-weight DNA transformation in different plant species through either Agrobacterium or particle bombardment. The library contains 76,800 clones with an average insert size of 135 kb, providing an approximate 99% probability of obtaining at least one positive clone from the library using a single-copy probe. The quality and utility of the library were verified by identifying BIBACs containing genes important for fiber development, fiber cellulose biosynthesis, seed fatty acid metabolism, cotton-nematode interaction, and bacterial blight resistance. In order to gain an insight into the Upland cotton genome and its relationship with G. raimondii, we sequenced nearly 10,000 BIBAC ends (BESs) randomly selected from the library, generating approximately one BES for every 250 kb along the Upland cotton genome. The retroelement Gypsy/DIRS1 family predominates in the Upland cotton genome, accounting for over 77% of all transposable elements. From the BESs, we identified 1,269 simple sequence repeats (SSRs), of which 1,006 were new, thus providing additional markers for cotton genome research. Surprisingly, comparative sequence analysis showed that Upland cotton is much more diverged from G. raimondii at the genomic sequence level than expected. There seems to be no significant difference between the relationships of the Upland cotton D- and A-subgenomes with the G. raimondii genome, even though G. raimondii contains a D genome (D5). Conclusions The library represents the first BIBAC library in cotton and related species, thus providing tools useful for integrative physical mapping, large-scale genome sequencing and large-scale functional analysis of the Upland cotton genome. Comparative sequence analysis provides insights into the Upland cotton genome, and a possible mechanism underlying the divergence and evolution of polyploid Upland cotton from its diploid putative progenitor species, G. raimondii. PMID:23537070

  5. redGEM: Systematic reduction and analysis of genome-scale metabolic reconstructions for development of consistent core metabolic models

    PubMed Central

    Ataman, Meric

    2017-01-01

    Genome-scale metabolic reconstructions have proven to be valuable resources in enhancing our understanding of metabolic networks as they encapsulate all known metabolic capabilities of the organisms from genes to proteins to their functions. However the complexity of these large metabolic networks often hinders their utility in various practical applications. Although reduced models are commonly used for modeling and in integrating experimental data, they are often inconsistent across different studies and laboratories due to different criteria and detail, which can compromise transferability of the findings and also integration of experimental data from different groups. In this study, we have developed a systematic semi-automatic approach to reduce genome-scale models into core models in a consistent and logical manner focusing on the central metabolism or subsystems of interest. The method minimizes the loss of information using an approach that combines graph-based search and optimization methods. The resulting core models are shown to be able to capture key properties of the genome-scale models and preserve consistency in terms of biomass and by-product yields, flux and concentration variability and gene essentiality. The development of these “consistently-reduced” models will help to clarify and facilitate integration of different experimental data to draw new understanding that can be directly extendable to genome-scale models. PMID:28727725

  6. APPLICATION OF DNA MICROARRAYS TO REPRODUCTIVE TOXICOLOGY AND THE DEVELOPMENT OF A TESTIS ARRAY

    EPA Science Inventory

    With the advent of sequence information for entire mammalian genomes, it is now possible to analyze gene expression and gene polymorphisms on a genomic scale. The primary tool for analysis of gene expression is the DNA microarray. We have used commercially available cDNA micro...

  7. A genome-wide association study platform built on iPlant cyber-infrastructure

    USDA-ARS?s Scientific Manuscript database

    We demonstrated a flexible Genome-Wide Association (GWA) Study (GWAS) platform built upon the iPlant Collaborative Cyber-infrastructure. The platform supports big data management, sharing, and large scale study of both genotype and phenotype data on clusters. End users can add their own analysis too...

  8. Genomic analysis of organismal complexity in the multicellular green alga Volvox carteri

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Prochnik, Simon E.; Umen, James; Nedelcu, Aurora

    2010-07-01

    Analysis of the Volvox carteri genome reveals that this green alga's increased organismal complexity and multicellularity are associated with modifications in protein families shared with its unicellular ancestor, and not with large-scale innovations in protein coding capacity. The multicellular green alga Volvox carteri and its morphologically diverse close relatives (the volvocine algae) are uniquely suited for investigating the evolution of multicellularity and development. We sequenced the 138 Mb genome of V. carteri and compared its {approx}14,500 predicted proteins to those of its unicellular relative, Chlamydomonas reinhardtii. Despite fundamental differences in organismal complexity and life history, the two species have similarmore » protein-coding potentials, and few species-specific protein-coding gene predictions. Interestingly, volvocine algal-specific proteins are enriched in Volvox, including those associated with an expanded and highly compartmentalized extracellular matrix. Our analysis shows that increases in organismal complexity can be associated with modifications of lineage-specific proteins rather than large-scale invention of protein-coding capacity.« less

  9. Functional precision medicine identifies novel druggable targets and therapeutic options in head and neck cancer. | Office of Cancer Genomics

    Cancer.gov

    Purpose: Head and neck squamous cell carcinoma (HNSCC) is the sixth most common cancer worldwide with high mortality and a lack of targeted therapies. To identify and prioritize druggable targets, we performed genome analysis together with genome-scale siRNA and oncology drug profiling using low passage tumor cells derived from a patient with a treatmentresistant HPV-negative HNSCC.

  10. Inference of Ancestral Recombination Graphs through Topological Data Analysis

    PubMed Central

    Cámara, Pablo G.; Levine, Arnold J.; Rabadán, Raúl

    2016-01-01

    The recent explosion of genomic data has underscored the need for interpretable and comprehensive analyses that can capture complex phylogenetic relationships within and across species. Recombination, reassortment and horizontal gene transfer constitute examples of pervasive biological phenomena that cannot be captured by tree-like representations. Starting from hundreds of genomes, we are interested in the reconstruction of potential evolutionary histories leading to the observed data. Ancestral recombination graphs represent potential histories that explicitly accommodate recombination and mutation events across orthologous genomes. However, they are computationally costly to reconstruct, usually being infeasible for more than few tens of genomes. Recently, Topological Data Analysis (TDA) methods have been proposed as robust and scalable methods that can capture the genetic scale and frequency of recombination. We build upon previous TDA developments for detecting and quantifying recombination, and present a novel framework that can be applied to hundreds of genomes and can be interpreted in terms of minimal histories of mutation and recombination events, quantifying the scales and identifying the genomic locations of recombinations. We implement this framework in a software package, called TARGet, and apply it to several examples, including small migration between different populations, human recombination, and horizontal evolution in finches inhabiting the Galápagos Islands. PMID:27532298

  11. LOLAweb: a containerized web server for interactive genomic locus overlap enrichment analysis.

    PubMed

    Nagraj, V P; Magee, Neal E; Sheffield, Nathan C

    2018-06-06

    The past few years have seen an explosion of interest in understanding the role of regulatory DNA. This interest has driven large-scale production of functional genomics data and analytical methods. One popular analysis is to test for enrichment of overlaps between a query set of genomic regions and a database of region sets. In this way, new genomic data can be easily connected to annotations from external data sources. Here, we present an interactive interface for enrichment analysis of genomic locus overlaps using a web server called LOLAweb. LOLAweb accepts a set of genomic ranges from the user and tests it for enrichment against a database of region sets. LOLAweb renders results in an R Shiny application to provide interactive visualization features, enabling users to filter, sort, and explore enrichment results dynamically. LOLAweb is built and deployed in a Linux container, making it scalable to many concurrent users on our servers and also enabling users to download and run LOLAweb locally.

  12. Fine-scale human genetic structure in Western France.

    PubMed

    Karakachoff, Matilde; Duforet-Frebourg, Nicolas; Simonet, Floriane; Le Scouarnec, Solena; Pellen, Nadine; Lecointe, Simon; Charpentier, Eric; Gros, Françoise; Cauchi, Stéphane; Froguel, Philippe; Copin, Nane; Le Tourneau, Thierry; Probst, Vincent; Le Marec, Hervé; Molinaro, Sabrina; Balkau, Beverley; Redon, Richard; Schott, Jean-Jacques; Blum, Michael Gb; Dina, Christian

    2015-06-01

    The difficulties arising from association analysis with rare variants underline the importance of suitable reference population cohorts, which integrate detailed spatial information. We analyzed a sample of 1684 individuals from Western France, who were genotyped at genome-wide level, from two cohorts D.E.S.I.R and CavsGen. We found that fine-scale population structure occurs at the scale of Western France, with distinct admixture proportions for individuals originating from the Brittany Region and the Vendée Department. Genetic differentiation increases with distance at a high rate in these two parts of Northwestern France and linkage disequilibrium is higher in Brittany suggesting a lower effective population size. When looking for genomic regions informative about Breton origin, we found two prominent associated regions that include the lactase region and the HLA complex. For both the lactase and the HLA regions, there is a low differentiation between Bretons and Irish, and this is also found at the genome-wide level. At a more refined scale, and within the Pays de la Loire Region, we also found evidence of fine-scale population structure, although principal component analysis showed that individuals from different departments cannot be confidently discriminated. Because of the evidence for fine-scale genetic structure in Western France, we anticipate that rare and geographically localized variants will be identified in future full-sequence analyses.

  13. Comparative genome-scale modelling of Staphylococcus aureus strains identifies strain-specific metabolic capabilities linked to pathogenicity

    PubMed Central

    Bosi, Emanuele; Monk, Jonathan M.; Aziz, Ramy K.; Fondi, Marco; Nizet, Victor; Palsson, Bernhard Ø.

    2016-01-01

    Staphylococcus aureus is a preeminent bacterial pathogen capable of colonizing diverse ecological niches within its human host. We describe here the pangenome of S. aureus based on analysis of genome sequences from 64 strains of S. aureus spanning a range of ecological niches, host types, and antibiotic resistance profiles. Based on this set, S. aureus is expected to have an open pangenome composed of 7,411 genes and a core genome composed of 1,441 genes. Metabolism was highly conserved in this core genome; however, differences were identified in amino acid and nucleotide biosynthesis pathways between the strains. Genome-scale models (GEMs) of metabolism were constructed for the 64 strains of S. aureus. These GEMs enabled a systems approach to characterizing the core metabolic and panmetabolic capabilities of the S. aureus species. All models were predicted to be auxotrophic for the vitamins niacin (vitamin B3) and thiamin (vitamin B1), whereas strain-specific auxotrophies were predicted for riboflavin (vitamin B2), guanosine, leucine, methionine, and cysteine, among others. GEMs were used to systematically analyze growth capabilities in more than 300 different growth-supporting environments. The results identified metabolic capabilities linked to pathogenic traits and virulence acquisitions. Such traits can be used to differentiate strains responsible for mild vs. severe infections and preference for hosts (e.g., animals vs. humans). Genome-scale analysis of multiple strains of a species can thus be used to identify metabolic determinants of virulence and increase our understanding of why certain strains of this deadly pathogen have spread rapidly throughout the world. PMID:27286824

  14. Savant Genome Browser 2: visualization and analysis for population-scale genomics.

    PubMed

    Fiume, Marc; Smith, Eric J M; Brook, Andrew; Strbenac, Dario; Turner, Brian; Mezlini, Aziz M; Robinson, Mark D; Wodak, Shoshana J; Brudno, Michael

    2012-07-01

    High-throughput sequencing (HTS) technologies are providing an unprecedented capacity for data generation, and there is a corresponding need for efficient data exploration and analysis capabilities. Although most existing tools for HTS data analysis are developed for either automated (e.g. genotyping) or visualization (e.g. genome browsing) purposes, such tools are most powerful when combined. For example, integration of visualization and computation allows users to iteratively refine their analyses by updating computational parameters within the visual framework in real-time. Here we introduce the second version of the Savant Genome Browser, a standalone program for visual and computational analysis of HTS data. Savant substantially improves upon its predecessor and existing tools by introducing innovative visualization modes and navigation interfaces for several genomic datatypes, and synergizing visual and automated analyses in a way that is powerful yet easy even for non-expert users. We also present a number of plugins that were developed by the Savant Community, which demonstrate the power of integrating visual and automated analyses using Savant. The Savant Genome Browser is freely available (open source) at www.savantbrowser.com.

  15. Savant Genome Browser 2: visualization and analysis for population-scale genomics

    PubMed Central

    Smith, Eric J. M.; Brook, Andrew; Strbenac, Dario; Turner, Brian; Mezlini, Aziz M.; Robinson, Mark D.; Wodak, Shoshana J.; Brudno, Michael

    2012-01-01

    High-throughput sequencing (HTS) technologies are providing an unprecedented capacity for data generation, and there is a corresponding need for efficient data exploration and analysis capabilities. Although most existing tools for HTS data analysis are developed for either automated (e.g. genotyping) or visualization (e.g. genome browsing) purposes, such tools are most powerful when combined. For example, integration of visualization and computation allows users to iteratively refine their analyses by updating computational parameters within the visual framework in real-time. Here we introduce the second version of the Savant Genome Browser, a standalone program for visual and computational analysis of HTS data. Savant substantially improves upon its predecessor and existing tools by introducing innovative visualization modes and navigation interfaces for several genomic datatypes, and synergizing visual and automated analyses in a way that is powerful yet easy even for non-expert users. We also present a number of plugins that were developed by the Savant Community, which demonstrate the power of integrating visual and automated analyses using Savant. The Savant Genome Browser is freely available (open source) at www.savantbrowser.com. PMID:22638571

  16. PGen: large-scale genomic variations analysis workflow and browser in SoyKB.

    PubMed

    Liu, Yang; Khan, Saad M; Wang, Juexin; Rynge, Mats; Zhang, Yuanxun; Zeng, Shuai; Chen, Shiyuan; Maldonado Dos Santos, Joao V; Valliyodan, Babu; Calyam, Prasad P; Merchant, Nirav; Nguyen, Henry T; Xu, Dong; Joshi, Trupti

    2016-10-06

    With the advances in next-generation sequencing (NGS) technology and significant reductions in sequencing costs, it is now possible to sequence large collections of germplasm in crops for detecting genome-scale genetic variations and to apply the knowledge towards improvements in traits. To efficiently facilitate large-scale NGS resequencing data analysis of genomic variations, we have developed "PGen", an integrated and optimized workflow using the Extreme Science and Engineering Discovery Environment (XSEDE) high-performance computing (HPC) virtual system, iPlant cloud data storage resources and Pegasus workflow management system (Pegasus-WMS). The workflow allows users to identify single nucleotide polymorphisms (SNPs) and insertion-deletions (indels), perform SNP annotations and conduct copy number variation analyses on multiple resequencing datasets in a user-friendly and seamless way. We have developed both a Linux version in GitHub ( https://github.com/pegasus-isi/PGen-GenomicVariations-Workflow ) and a web-based implementation of the PGen workflow integrated within the Soybean Knowledge Base (SoyKB), ( http://soykb.org/Pegasus/index.php ). Using PGen, we identified 10,218,140 single-nucleotide polymorphisms (SNPs) and 1,398,982 indels from analysis of 106 soybean lines sequenced at 15X coverage. 297,245 non-synonymous SNPs and 3330 copy number variation (CNV) regions were identified from this analysis. SNPs identified using PGen from additional soybean resequencing projects adding to 500+ soybean germplasm lines in total have been integrated. These SNPs are being utilized for trait improvement using genotype to phenotype prediction approaches developed in-house. In order to browse and access NGS data easily, we have also developed an NGS resequencing data browser ( http://soykb.org/NGS_Resequence/NGS_index.php ) within SoyKB to provide easy access to SNP and downstream analysis results for soybean researchers. PGen workflow has been optimized for the most efficient analysis of soybean data using thorough testing and validation. This research serves as an example of best practices for development of genomics data analysis workflows by integrating remote HPC resources and efficient data management with ease of use for biological users. PGen workflow can also be easily customized for analysis of data in other species.

  17. CGAT: a model for immersive personalized training in computational genomics

    PubMed Central

    Sims, David; Ponting, Chris P.

    2016-01-01

    How should the next generation of genomics scientists be trained while simultaneously pursuing high quality and diverse research? CGAT, the Computational Genomics Analysis and Training programme, was set up in 2010 by the UK Medical Research Council to complement its investment in next-generation sequencing capacity. CGAT was conceived around the twin goals of training future leaders in genome biology and medicine, and providing much needed capacity to UK science for analysing genome scale data sets. Here we outline the training programme employed by CGAT and describe how it dovetails with collaborative research projects to launch scientists on the road towards independent research careers in genomics. PMID:25981124

  18. The FLEXGene repository: exploiting the fruits of the genome projects by creating a needed resource to face the challenges of the post-genomic era.

    PubMed

    Brizuela, Leonardo; Richardson, Aaron; Marsischky, Gerald; Labaer, Joshua

    2002-01-01

    Thanks to the results of the multiple completed and ongoing genome sequencing projects and to the newly available recombination-based cloning techniques, it is now possible to build gene repositories with no precedent in their composition, formatting, and potential. This new type of gene repository is necessary to address the challenges imposed by the post-genomic era, i.e., experimentation on a genome-wide scale. We are building the FLEXGene (Full Length EXpression-ready) repository. This unique resource will contain clones representing the complete ORFeome of different organisms, including Homo sapiens as well as several pathogens and model organisms. It will consist of a comprehensive, characterized (sequence-verified), and arrayed gene repository. This resource will allow full exploitation of the genomic information by enabling genome-wide scale experimentation at the level of functional/phenotypic assays as well as at the level of protein expression, purification, and analysis. Here we describe the rationale and construction of this resource and focus on the data obtained from the Saccharomyces cerevisiae project.

  19. Phylogenomics of nonavian reptiles and the structure of the ancestral amniote genome

    PubMed Central

    Shedlock, Andrew M.; Botka, Christopher W.; Zhao, Shaying; Shetty, Jyoti; Zhang, Tingting; Liu, Jun S.; Deschavanne, Patrick J.; Edwards, Scott V.

    2007-01-01

    We report results of a megabase-scale phylogenomic analysis of the Reptilia, the sister group of mammals. Large-scale end-sequence scanning of genomic clones of a turtle, alligator, and lizard reveals diverse, mammal-like landscapes of retroelements and simple sequence repeats (SSRs) not found in the chicken. Several global genomic traits, including distinctive phylogenetic lineages of CR1-like long interspersed elements (LINEs) and a paucity of A-T rich SSRs, characterize turtles and archosaur genomes, whereas higher frequencies of tandem repeats and a lower global GC content reveal mammal-like features in Anolis. Nonavian reptile genomes also possess a high frequency of diverse and novel 50-bp unit tandem duplications not found in chicken or mammals. The frequency distributions of ≈65,000 8-mer oligonucleotides suggest that rates of DNA-word frequency change are an order of magnitude slower in reptiles than in mammals. These results suggest a diverse array of interspersed and SSRs in the common ancestor of amniotes and a genomic conservatism and gradual loss of retroelements in reptiles that culminated in the minimalist chicken genome. PMID:17307883

  20. 2012 U.S. Department of Energy: Joint Genome Institute: Progress Report

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gilbert, David

    2013-01-01

    The mission of the U.S. Department of Energy Joint Genome Institute (DOE JGI) is to serve the diverse scientific community as a user facility, enabling the application of large-scale genomics and analysis of plants, microbes, and communities of microbes to address the DOE mission goals in bioenergy and the environment. The DOE JGI's sequencing efforts fall under the Eukaryote Super Program, which includes the Plant and Fungal Genomics Programs; and the Prokaryote Super Program, which includes the Microbial Genomics and Metagenomics Programs. In 2012, several projects made news for their contributions to energy and environment research.

  1. Population-Sequencing as a Biomarker of Burkholderia mallei and Burkholderia pseudomallei Evolution through Microbial Forensic Analysis.

    PubMed

    Jakupciak, John P; Wells, Jeffrey M; Karalus, Richard J; Pawlowski, David R; Lin, Jeffrey S; Feldman, Andrew B

    2013-01-01

    Large-scale genomics projects are identifying biomarkers to detect human disease. B. pseudomallei and B. mallei are two closely related select agents that cause melioidosis and glanders. Accurate characterization of metagenomic samples is dependent on accurate measurements of genetic variation between isolates with resolution down to strain level. Often single biomarker sensitivity is augmented by use of multiple or panels of biomarkers. In parallel with single biomarker validation, advances in DNA sequencing enable analysis of entire genomes in a single run: population-sequencing. Potentially, direct sequencing could be used to analyze an entire genome to serve as the biomarker for genome identification. However, genome variation and population diversity complicate use of direct sequencing, as well as differences caused by sample preparation protocols including sequencing artifacts and mistakes. As part of a Department of Homeland Security program in bacterial forensics, we examined how to implement whole genome sequencing (WGS) analysis as a judicially defensible forensic method for attributing microbial sample relatedness; and also to determine the strengths and limitations of whole genome sequence analysis in a forensics context. Herein, we demonstrate use of sequencing to provide genetic characterization of populations: direct sequencing of populations.

  2. Population-Sequencing as a Biomarker of Burkholderia mallei and Burkholderia pseudomallei Evolution through Microbial Forensic Analysis

    PubMed Central

    Jakupciak, John P.; Wells, Jeffrey M.; Karalus, Richard J.; Pawlowski, David R.; Lin, Jeffrey S.; Feldman, Andrew B.

    2013-01-01

    Large-scale genomics projects are identifying biomarkers to detect human disease. B. pseudomallei and B. mallei are two closely related select agents that cause melioidosis and glanders. Accurate characterization of metagenomic samples is dependent on accurate measurements of genetic variation between isolates with resolution down to strain level. Often single biomarker sensitivity is augmented by use of multiple or panels of biomarkers. In parallel with single biomarker validation, advances in DNA sequencing enable analysis of entire genomes in a single run: population-sequencing. Potentially, direct sequencing could be used to analyze an entire genome to serve as the biomarker for genome identification. However, genome variation and population diversity complicate use of direct sequencing, as well as differences caused by sample preparation protocols including sequencing artifacts and mistakes. As part of a Department of Homeland Security program in bacterial forensics, we examined how to implement whole genome sequencing (WGS) analysis as a judicially defensible forensic method for attributing microbial sample relatedness; and also to determine the strengths and limitations of whole genome sequence analysis in a forensics context. Herein, we demonstrate use of sequencing to provide genetic characterization of populations: direct sequencing of populations. PMID:24455204

  3. The RAVEN Toolbox and Its Use for Generating a Genome-scale Metabolic Model for Penicillium chrysogenum

    PubMed Central

    Agren, Rasmus; Liu, Liming; Shoaie, Saeed; Vongsangnak, Wanwipa; Nookaew, Intawat; Nielsen, Jens

    2013-01-01

    We present the RAVEN (Reconstruction, Analysis and Visualization of Metabolic Networks) Toolbox: a software suite that allows for semi-automated reconstruction of genome-scale models. It makes use of published models and/or the KEGG database, coupled with extensive gap-filling and quality control features. The software suite also contains methods for visualizing simulation results and omics data, as well as a range of methods for performing simulations and analyzing the results. The software is a useful tool for system-wide data analysis in a metabolic context and for streamlined reconstruction of metabolic networks based on protein homology. The RAVEN Toolbox workflow was applied in order to reconstruct a genome-scale metabolic model for the important microbial cell factory Penicillium chrysogenum Wisconsin54-1255. The model was validated in a bibliomic study of in total 440 references, and it comprises 1471 unique biochemical reactions and 1006 ORFs. It was then used to study the roles of ATP and NADPH in the biosynthesis of penicillin, and to identify potential metabolic engineering targets for maximization of penicillin production. PMID:23555215

  4. Using SQL Databases for Sequence Similarity Searching and Analysis.

    PubMed

    Pearson, William R; Mackey, Aaron J

    2017-09-13

    Relational databases can integrate diverse types of information and manage large sets of similarity search results, greatly simplifying genome-scale analyses. By focusing on taxonomic subsets of sequences, relational databases can reduce the size and redundancy of sequence libraries and improve the statistical significance of homologs. In addition, by loading similarity search results into a relational database, it becomes possible to explore and summarize the relationships between all of the proteins in an organism and those in other biological kingdoms. This unit describes how to use relational databases to improve the efficiency of sequence similarity searching and demonstrates various large-scale genomic analyses of homology-related data. It also describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. The unit also introduces search_demo, a database that stores sequence similarity search results. The search_demo database is then used to explore the evolutionary relationships between E. coli proteins and proteins in other organisms in a large-scale comparative genomic analysis. © 2017 by John Wiley & Sons, Inc. Copyright © 2017 John Wiley & Sons, Inc.

  5. An efficient and scalable analysis framework for variant extraction and refinement from population-scale DNA sequence data.

    PubMed

    Jun, Goo; Wing, Mary Kate; Abecasis, Gonçalo R; Kang, Hyun Min

    2015-06-01

    The analysis of next-generation sequencing data is computationally and statistically challenging because of the massive volume of data and imperfect data quality. We present GotCloud, a pipeline for efficiently detecting and genotyping high-quality variants from large-scale sequencing data. GotCloud automates sequence alignment, sample-level quality control, variant calling, filtering of likely artifacts using machine-learning techniques, and genotype refinement using haplotype information. The pipeline can process thousands of samples in parallel and requires less computational resources than current alternatives. Experiments with whole-genome and exome-targeted sequence data generated by the 1000 Genomes Project show that the pipeline provides effective filtering against false positive variants and high power to detect true variants. Our pipeline has already contributed to variant detection and genotyping in several large-scale sequencing projects, including the 1000 Genomes Project and the NHLBI Exome Sequencing Project. We hope it will now prove useful to many medical sequencing studies. © 2015 Jun et al.; Published by Cold Spring Harbor Laboratory Press.

  6. Selected Insights from Application of Whole Genome Sequencing for Outbreak Investigations

    PubMed Central

    Le, Vien Thi Minh; Diep, Binh An

    2014-01-01

    Purpose of review The advent of high-throughput whole genome sequencing has the potential to revolutionize the conduct of outbreak investigation. Because of its ultimate pathogen strain resolution, whole genome sequencing could augment traditional epidemiologic investigations of infectious disease outbreaks. Recent findings The combination of whole genome sequencing and intensive epidemiologic analysis provided new insights on the sources and transmission dynamics of large-scale epidemics caused by Escherichia coli and Vibrio cholerae, nosocomial outbreaks caused by methicillin-resistant Staphylococcus aureus, Klebsiella pneumonia, and Mycobacterium abscessus, community-centered outbreaks caused by Mycobacterium tuberculosis, and natural disaster-associated outbreak caused by environmentally acquired molds. Summary When combined with traditional epidemiologic investigation, whole genome sequencing has proven useful for elucidating sources and transmission dynamics of disease outbreaks. Development of a fully automated bioinformatics pipeline for analysis of whole genome sequence data is much needed to make this powerful tool more widely accessible. PMID:23856896

  7. A genome-scale metabolic reconstruction of Pseudomonas putida KT2440: iJN746 as a cell factory.

    PubMed

    Nogales, Juan; Palsson, Bernhard Ø; Thiele, Ines

    2008-09-16

    Pseudomonas putida is the best studied pollutant degradative bacteria and is harnessed by industrial biotechnology to synthesize fine chemicals. Since the publication of P. putida KT2440's genome, some in silico analyses of its metabolic and biotechnology capacities have been published. However, global understanding of the capabilities of P. putida KT2440 requires the construction of a metabolic model that enables the integration of classical experimental data along with genomic and high-throughput data. The constraint-based reconstruction and analysis (COBRA) approach has been successfully used to build and analyze in silico genome-scale metabolic reconstructions. We present a genome-scale reconstruction of P. putida KT2440's metabolism, iJN746, which was constructed based on genomic, biochemical, and physiological information. This manually-curated reconstruction accounts for 746 genes, 950 reactions, and 911 metabolites. iJN746 captures biotechnologically relevant pathways, including polyhydroxyalkanoate synthesis and catabolic pathways of aromatic compounds (e.g., toluene, benzoate, phenylacetate, nicotinate), not described in other metabolic reconstructions or biochemical databases. The predictive potential of iJN746 was validated using experimental data including growth performance and gene deletion studies. Furthermore, in silico growth on toluene was found to be oxygen-limited, suggesting the existence of oxygen-efficient pathways not yet annotated in P. putida's genome. Moreover, we evaluated the production efficiency of polyhydroxyalkanoates from various carbon sources and found fatty acids as the most prominent candidates, as expected. Here we presented the first genome-scale reconstruction of P. putida, a biotechnologically interesting all-surrounder. Taken together, this work illustrates the utility of iJN746 as i) a knowledge-base, ii) a discovery tool, and iii) an engineering platform to explore P. putida's potential in bioremediation and bioplastic production.

  8. A genome scale metabolic network for rice and accompanying analysis of tryptophan, auxin and serotonin biosynthesis regulation under biotic stress

    USDA-ARS?s Scientific Manuscript database

    Functional annotations of large plant genome projects mostly provide information on gene function and gene families based on the presence of protein domains and gene homology, but not necessarily in association with gene expression or metabolic and regulatory networks. These additional annotations a...

  9. Genomic Footprints of Selective Sweeps from Metabolic Resistance to Pyrethroids in African Malaria Vectors Are Driven by Scale up of Insecticide-Based Vector Control.

    PubMed

    Barnes, Kayla G; Weedall, Gareth D; Ndula, Miranda; Irving, Helen; Mzihalowa, Themba; Hemingway, Janet; Wondji, Charles S

    2017-02-01

    Insecticide resistance in mosquito populations threatens recent successes in malaria prevention. Elucidating patterns of genetic structure in malaria vectors to predict the speed and direction of the spread of resistance is essential to get ahead of the 'resistance curve' and to avert a public health catastrophe. Here, applying a combination of microsatellite analysis, whole genome sequencing and targeted sequencing of a resistance locus, we elucidated the continent-wide population structure of a major African malaria vector, Anopheles funestus. We identified a major selective sweep in a genomic region controlling cytochrome P450-based metabolic resistance conferring high resistance to pyrethroids. This selective sweep occurred since 2002, likely as a direct consequence of scaled up vector control as revealed by whole genome and fine-scale sequencing of pre- and post-intervention populations. Fine-scaled analysis of the pyrethroid resistance locus revealed that a resistance-associated allele of the cytochrome P450 monooxygenase CYP6P9a has swept through southern Africa to near fixation, in contrast to high polymorphism levels before interventions, conferring high levels of pyrethroid resistance linked to control failure. Population structure analysis revealed a barrier to gene flow between southern Africa and other areas, which may prevent or slow the spread of the southern mechanism of pyrethroid resistance to other regions. By identifying a genetic signature of pyrethroid-based interventions, we have demonstrated the intense selective pressure that control interventions exert on mosquito populations. If this level of selection and spread of resistance continues unabated, our ability to control malaria with current interventions will be compromised.

  10. GIANT 2.0: genome-scale integrated analysis of gene networks in tissues.

    PubMed

    Wong, Aaron K; Krishnan, Arjun; Troyanskaya, Olga G

    2018-05-25

    GIANT2 (Genome-wide Integrated Analysis of gene Networks in Tissues) is an interactive web server that enables biomedical researchers to analyze their proteins and pathways of interest and generate hypotheses in the context of genome-scale functional maps of human tissues. The precise actions of genes are frequently dependent on their tissue context, yet direct assay of tissue-specific protein function and interactions remains infeasible in many normal human tissues and cell-types. With GIANT2, researchers can explore predicted tissue-specific functional roles of genes and reveal changes in those roles across tissues, all through interactive multi-network visualizations and analyses. Additionally, the NetWAS approach available through the server uses tissue-specific/cell-type networks predicted by GIANT2 to re-prioritize statistical associations from GWAS studies and identify disease-associated genes. GIANT2 predicts tissue-specific interactions by integrating diverse functional genomics data from now over 61 400 experiments for 283 diverse tissues and cell-types. GIANT2 does not require any registration or installation and is freely available for use at http://giant-v2.princeton.edu.

  11. CoCoNUT: an efficient system for the comparison and analysis of genomes

    PubMed Central

    2008-01-01

    Background Comparative genomics is the analysis and comparison of genomes from different species. This area of research is driven by the large number of sequenced genomes and heavily relies on efficient algorithms and software to perform pairwise and multiple genome comparisons. Results Most of the software tools available are tailored for one specific task. In contrast, we have developed a novel system CoCoNUT (Computational Comparative geNomics Utility Toolkit) that allows solving several different tasks in a unified framework: (1) finding regions of high similarity among multiple genomic sequences and aligning them, (2) comparing two draft or multi-chromosomal genomes, (3) locating large segmental duplications in large genomic sequences, and (4) mapping cDNA/EST to genomic sequences. Conclusion CoCoNUT is competitive with other software tools w.r.t. the quality of the results. The use of state of the art algorithms and data structures allows CoCoNUT to solve comparative genomics tasks more efficiently than previous tools. With the improved user interface (including an interactive visualization component), CoCoNUT provides a unified, versatile, and easy-to-use software tool for large scale studies in comparative genomics. PMID:19014477

  12. Framework for network modularization and Bayesian network analysis to investigate the perturbed metabolic network

    PubMed Central

    2011-01-01

    Background Genome-scale metabolic network models have contributed to elucidating biological phenomena, and predicting gene targets to engineer for biotechnological applications. With their increasing importance, their precise network characterization has also been crucial for better understanding of the cellular physiology. Results We herein introduce a framework for network modularization and Bayesian network analysis (FMB) to investigate organism’s metabolism under perturbation. FMB reveals direction of influences among metabolic modules, in which reactions with similar or positively correlated flux variation patterns are clustered, in response to specific perturbation using metabolic flux data. With metabolic flux data calculated by constraints-based flux analysis under both control and perturbation conditions, FMB, in essence, reveals the effects of specific perturbations on the biological system through network modularization and Bayesian network analysis at metabolic modular level. As a demonstration, this framework was applied to the genetically perturbed Escherichia coli metabolism, which is a lpdA gene knockout mutant, using its genome-scale metabolic network model. Conclusions After all, it provides alternative scenarios of metabolic flux distributions in response to the perturbation, which are complementary to the data obtained from conventionally available genome-wide high-throughput techniques or metabolic flux analysis. PMID:22784571

  13. Framework for network modularization and Bayesian network analysis to investigate the perturbed metabolic network.

    PubMed

    Kim, Hyun Uk; Kim, Tae Yong; Lee, Sang Yup

    2011-01-01

    Genome-scale metabolic network models have contributed to elucidating biological phenomena, and predicting gene targets to engineer for biotechnological applications. With their increasing importance, their precise network characterization has also been crucial for better understanding of the cellular physiology. We herein introduce a framework for network modularization and Bayesian network analysis (FMB) to investigate organism's metabolism under perturbation. FMB reveals direction of influences among metabolic modules, in which reactions with similar or positively correlated flux variation patterns are clustered, in response to specific perturbation using metabolic flux data. With metabolic flux data calculated by constraints-based flux analysis under both control and perturbation conditions, FMB, in essence, reveals the effects of specific perturbations on the biological system through network modularization and Bayesian network analysis at metabolic modular level. As a demonstration, this framework was applied to the genetically perturbed Escherichia coli metabolism, which is a lpdA gene knockout mutant, using its genome-scale metabolic network model. After all, it provides alternative scenarios of metabolic flux distributions in response to the perturbation, which are complementary to the data obtained from conventionally available genome-wide high-throughput techniques or metabolic flux analysis.

  14. Genomic paradigms for food-borne enteric pathogen analysis at the USFDA: case studies highlighting method utility, integration and resolution.

    PubMed

    Elkins, C A; Kotewicz, M L; Jackson, S A; Lacher, D W; Abu-Ali, G S; Patel, I R

    2013-01-01

    Modern risk control and food safety practices involving food-borne bacterial pathogens are benefiting from new genomic technologies for rapid, yet highly specific, strain characterisations. Within the United States Food and Drug Administration (USFDA) Center for Food Safety and Applied Nutrition (CFSAN), optical genome mapping and DNA microarray genotyping have been used for several years to quickly assess genomic architecture and gene content, respectively, for outbreak strain subtyping and to enhance retrospective trace-back analyses. The application and relative utility of each method varies with outbreak scenario and the suspect pathogen, with comparative analytical power enhanced by database scale and depth. Integration of these two technologies allows high-resolution scrutiny of the genomic landscapes of enteric food-borne pathogens with notable examples including Shiga toxin-producing Escherichia coli (STEC) and Salmonella enterica serovars from a variety of food commodities. Moreover, the recent application of whole genome sequencing technologies to food-borne pathogen outbreaks and surveillance has enhanced resolution to the single nucleotide scale. This new wealth of sequence data will support more refined next-generation custom microarray designs, targeted re-sequencing and "genomic signature recognition" approaches involving a combination of genes and single nucleotide polymorphism detection to distil strain-specific fingerprinting to a minimised scale. This paper examines the utility of microarrays and optical mapping in analysing outbreaks, reviews best practices and the limits of these technologies for pathogen differentiation, and it considers future integration with whole genome sequencing efforts.

  15. A fast boosting-based screening method for large-scale association study in complex traits with genetic heterogeneity.

    PubMed

    Wang, Lu-Yong; Fasulo, D

    2006-01-01

    Genome-wide association study for complex diseases will generate massive amount of single nucleotide polymorphisms (SNPs) data. Univariate statistical test (i.e. Fisher exact test) was used to single out non-associated SNPs. However, the disease-susceptible SNPs may have little marginal effects in population and are unlikely to retain after the univariate tests. Also, model-based methods are impractical for large-scale dataset. Moreover, genetic heterogeneity makes the traditional methods harder to identify the genetic causes of diseases. A more recent random forest method provides a more robust method for screening the SNPs in thousands scale. However, for more large-scale data, i.e., Affymetrix Human Mapping 100K GeneChip data, a faster screening method is required to screening SNPs in whole-genome large scale association analysis with genetic heterogeneity. We propose a boosting-based method for rapid screening in large-scale analysis of complex traits in the presence of genetic heterogeneity. It provides a relatively fast and fairly good tool for screening and limiting the candidate SNPs for further more complex computational modeling task.

  16. Moving image analysis to the cloud: A case study with a genome-scale tomographic study

    NASA Astrophysics Data System (ADS)

    Mader, Kevin; Stampanoni, Marco

    2016-01-01

    Over the last decade, the time required to measure a terabyte of microscopic imaging data has gone from years to minutes. This shift has moved many of the challenges away from experimental design and measurement to scalable storage, organization, and analysis. As many scientists and scientific institutions lack training and competencies in these areas, major bottlenecks have arisen and led to substantial delays and gaps between measurement, understanding, and dissemination. We present in this paper a framework for analyzing large 3D datasets using cloud-based computational and storage resources. We demonstrate its applicability by showing the setup and costs associated with the analysis of a genome-scale study of bone microstructure. We then evaluate the relative advantages and disadvantages associated with local versus cloud infrastructures.

  17. Refined genetic maps reveal sexual dimorphism in human meiotic recombination at multiple scales

    NASA Astrophysics Data System (ADS)

    Bhérer, Claude; Campbell, Christopher L.; Auton, Adam

    2017-04-01

    In humans, males have lower recombination rates than females over the majority of the genome, but the opposite is usually true near the telomeres. These broad-scale differences have been known for decades, yet little is known about differences at the fine scale. By combining data sets, we have collected recombination events from over 100,000 meioses and have constructed sex-specific genetic maps at a previously unachievable resolution. Here we show that, although a substantial fraction of the genome shows some degree of sexually dimorphic recombination, the vast majority of hotspots are shared between the sexes, with only a small number of putative sex-specific hotspots. Wavelet analysis indicates that most of the differences can be attributed to the fine scale, and that variation in rate between the sexes can mostly be explained by differences in hotspot magnitude, rather than location. Nonetheless, known recombination-associated genomic features, such as THE1B repeat elements, show systematic differences between the sexes.

  18. CGAT: a model for immersive personalized training in computational genomics.

    PubMed

    Sims, David; Ponting, Chris P; Heger, Andreas

    2016-01-01

    How should the next generation of genomics scientists be trained while simultaneously pursuing high quality and diverse research? CGAT, the Computational Genomics Analysis and Training programme, was set up in 2010 by the UK Medical Research Council to complement its investment in next-generation sequencing capacity. CGAT was conceived around the twin goals of training future leaders in genome biology and medicine, and providing much needed capacity to UK science for analysing genome scale data sets. Here we outline the training programme employed by CGAT and describe how it dovetails with collaborative research projects to launch scientists on the road towards independent research careers in genomics. © The Author 2015. Published by Oxford University Press.

  19. The layout of a bacterial genome.

    PubMed

    Képès, François; Jester, Brian C; Lepage, Thibaut; Rafiei, Nafiseh; Rosu, Bianca; Junier, Ivan

    2012-07-16

    Recently the mismatch between our newly acquired capacity to synthetize DNA at genome scale, and our low capacity to design ab initio a functional genome has become conspicuous. This essay gathers a variety of constraints that globally shape natural genomes, with a focus on eubacteria. These constraints originate from chromosome replication (leading/lagging strand asymmetry; gene dosage gradient from origin to terminus; collisions with the transcription complexes), from biased codon usage, from noise control in gene expression, and from genome layout for co-functional genes. On the basis of this analysis, lessons are drawn for full genome design. Copyright © 2012 Federation of European Biochemical Societies. Published by Elsevier B.V. All rights reserved.

  20. A New Method for Rapid Screening of End-Point PCR Products: Application to Single Genome Amplified HIV and SIV Envelope Amplicons

    PubMed Central

    Houzet, Laurent; Deleage, Claire; Satie, Anne-Pascale; Merlande, Laetitia; Mahe, Dominique; Dejucq-Rainsford, Nathalie

    2015-01-01

    PCR is the most widely applied technique for large scale screening of bacterial clones, mouse genotypes, virus genomes etc. A drawback of large PCR screening is that amplicon analysis is usually performed using gel electrophoresis, a step that is very labor intensive, tedious and chemical waste generating. Single genome amplification (SGA) is used to characterize the diversity and evolutionary dynamics of virus populations within infected hosts. SGA is based on the isolation of single template molecule using limiting dilution followed by nested PCR amplification and requires the analysis of hundreds of reactions per sample, making large scale SGA studies very challenging. Here we present a novel approach entitled Long Amplicon Melt Profiling (LAMP) based on the analysis of the melting profile of the PCR reactions using SYBR Green and/or EvaGreen fluorescent dyes. The LAMP method represents an attractive alternative to gel electrophoresis and enables the quick discrimination of positive reactions. We validate LAMP for SIV and HIV env-SGA, in 96- and 384-well plate formats. Because the melt profiling allows the screening of several thousands of PCR reactions in a cost-effective, rapid and robust way, we believe it will greatly facilitate any large scale PCR screening. PMID:26053379

  1. A protocol for generating a high-quality genome-scale metabolic reconstruction.

    PubMed

    Thiele, Ines; Palsson, Bernhard Ø

    2010-01-01

    Network reconstructions are a common denominator in systems biology. Bottom-up metabolic network reconstructions have been developed over the last 10 years. These reconstructions represent structured knowledge bases that abstract pertinent information on the biochemical transformations taking place within specific target organisms. The conversion of a reconstruction into a mathematical format facilitates a myriad of computational biological studies, including evaluation of network content, hypothesis testing and generation, analysis of phenotypic characteristics and metabolic engineering. To date, genome-scale metabolic reconstructions for more than 30 organisms have been published and this number is expected to increase rapidly. However, these reconstructions differ in quality and coverage that may minimize their predictive potential and use as knowledge bases. Here we present a comprehensive protocol describing each step necessary to build a high-quality genome-scale metabolic reconstruction, as well as the common trials and tribulations. Therefore, this protocol provides a helpful manual for all stages of the reconstruction process.

  2. A protocol for generating a high-quality genome-scale metabolic reconstruction

    PubMed Central

    Thiele, Ines; Palsson, Bernhard Ø.

    2011-01-01

    Network reconstructions are a common denominator in systems biology. Bottom-up metabolic network reconstructions have developed over the past 10 years. These reconstructions represent structured knowledge-bases that abstract pertinent information on the biochemical transformations taking place within specific target organisms. The conversion of a reconstruction into a mathematical format facilitates myriad computational biological studies including evaluation of network content, hypothesis testing and generation, analysis of phenotypic characteristics, and metabolic engineering. To date, genome-scale metabolic reconstructions for more than 30 organisms have been published and this number is expected to increase rapidly. However, these reconstructions differ in quality and coverage that may minimize their predictive potential and use as knowledge-bases. Here, we present a comprehensive protocol describing each step necessary to build a high-quality genome-scale metabolic reconstruction as well as common trials and tribulations. Therefore, this protocol provides a helpful manual for all stages of the reconstruction process. PMID:20057383

  3. Computing patient data in the cloud: practical and legal considerations for genetics and genomics research in Europe and internationally.

    PubMed

    Molnár-Gábor, Fruzsina; Lueck, Rupert; Yakneen, Sergei; Korbel, Jan O

    2017-06-20

    Biomedical research is becoming increasingly large-scale and international. Cloud computing enables the comprehensive integration of genomic and clinical data, and the global sharing and collaborative processing of these data within a flexibly scalable infrastructure. Clouds offer novel research opportunities in genomics, as they facilitate cohort studies to be carried out at unprecedented scale, and they enable computer processing with superior pace and throughput, allowing researchers to address questions that could not be addressed by studies using limited cohorts. A well-developed example of such research is the Pan-Cancer Analysis of Whole Genomes project, which involves the analysis of petabyte-scale genomic datasets from research centers in different locations or countries and different jurisdictions. Aside from the tremendous opportunities, there are also concerns regarding the utilization of clouds; these concerns pertain to perceived limitations in data security and protection, and the need for due consideration of the rights of patient donors and research participants. Furthermore, the increased outsourcing of information technology impedes the ability of researchers to act within the realm of existing local regulations owing to fundamental differences in the understanding of the right to data protection in various legal systems. In this Opinion article, we address the current opportunities and limitations of cloud computing and highlight the responsible use of federated and hybrid clouds that are set up between public and private partners as an adequate solution for genetics and genomics research in Europe, and under certain conditions between Europe and international partners. This approach could represent a sensible middle ground between fragmented individual solutions and a "one-size-fits-all" approach.

  4. 13C metabolic flux analysis at a genome-scale.

    PubMed

    Gopalakrishnan, Saratram; Maranas, Costas D

    2015-11-01

    Metabolic models used in 13C metabolic flux analysis generally include a limited number of reactions primarily from central metabolism. They typically omit degradation pathways, complete cofactor balances, and atom transition contributions for reactions outside central metabolism. This study addresses the impact on prediction fidelity of scaling-up mapping models to a genome-scale. The core mapping model employed in this study accounts for (75 reactions and 65 metabolites) primarily from central metabolism. The genome-scale metabolic mapping model (GSMM) (697 reaction and 595 metabolites) is constructed using as a basis the iAF1260 model upon eliminating reactions guaranteed not to carry flux based on growth and fermentation data for a minimal glucose growth medium. Labeling data for 17 amino acid fragments obtained from cells fed with glucose labeled at the second carbon was used to obtain fluxes and ranges. Metabolic fluxes and confidence intervals are estimated, for both core and genome-scale mapping models, by minimizing the sum of square of differences between predicted and experimentally measured labeling patterns using the EMU decomposition algorithm. Overall, we find that both topology and estimated values of the metabolic fluxes remain largely consistent between core and GSM model. Stepping up to a genome-scale mapping model leads to wider flux inference ranges for 20 key reactions present in the core model. The glycolysis flux range doubles due to the possibility of active gluconeogenesis, the TCA flux range expanded by 80% due to the availability of a bypass through arginine consistent with labeling data, and the transhydrogenase reaction flux was essentially unresolved due to the presence of as many as five routes for the inter-conversion of NADPH to NADH afforded by the genome-scale model. By globally accounting for ATP demands in the GSMM model the unused ATP decreased drastically with the lower bound matching the maintenance ATP requirement. A non-zero flux for the arginine degradation pathway was identified to meet biomass precursor demands as detailed in the iAF1260 model. Inferred ranges for 81% of the reactions in the genome-scale metabolic (GSM) model varied less than one-tenth of the basis glucose uptake rate (95% confidence test). This is because as many as 411 reactions in the GSM are growth coupled meaning that the single measurement of biomass formation rate locks the reaction flux values. This implies that accurate biomass formation rate and composition are critical for resolving metabolic fluxes away from central metabolism and suggests the importance of biomass composition (re)assessment under different genetic and environmental backgrounds. In addition, the loss of information associated with mapping fluxes from MFA on a core model to a GSM model is quantified. Copyright © 2015 International Metabolic Engineering Society. Published by Elsevier Inc. All rights reserved.

  5. A randomized approach to speed up the analysis of large-scale read-count data in the application of CNV detection.

    PubMed

    Wang, WeiBo; Sun, Wei; Wang, Wei; Szatkiewicz, Jin

    2018-03-01

    The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. Typically, the genome is divided into tiling windows and windowed read-count data is generated for the entire genome from which genomic signals are detected (e.g. copy number changes in DNA-seq, enrichment peaks in ChIP-seq). For accurate analysis of read-count data, many state-of-the-art statistical methods use generalized linear models (GLM) coupled with the negative-binomial (NB) distribution by leveraging its ability for simultaneous bias correction and signal detection. However, although statistically powerful, the GLM+NB method has a quadratic computational complexity and therefore suffers from slow running time when applied to large-scale windowed read-count data. In this study, we aimed to speed up substantially the GLM+NB method by using a randomized algorithm and we demonstrate here the utility of our approach in the application of detecting copy number variants (CNVs) using a real example. We propose an efficient estimator, the randomized GLM+NB coefficients estimator (RGE), for speeding up the GLM+NB method. RGE samples the read-count data and solves the estimation problem on a smaller scale. We first theoretically validated the consistency and the variance properties of RGE. We then applied RGE to GENSENG, a GLM+NB based method for detecting CNVs. We named the resulting method as "R-GENSENG". Based on extensive evaluation using both simulated and empirical data, we concluded that R-GENSENG is ten times faster than the original GENSENG while maintaining GENSENG's accuracy in CNV detection. Our results suggest that RGE strategy developed here could be applied to other GLM+NB based read-count analyses, i.e. ChIP-seq data analysis, to substantially improve their computational efficiency while preserving the analytic power.

  6. Integrated Approach to Reconstruction of Microbial Regulatory Networks

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Rodionov, Dmitry A; Novichkov, Pavel S

    2013-11-04

    This project had the goal(s) of development of integrated bioinformatics platform for genome-scale inference and visualization of transcriptional regulatory networks (TRNs) in bacterial genomes. The work was done in Sanford-Burnham Medical Research Institute (SBMRI, P.I. D.A. Rodionov) and Lawrence Berkeley National Laboratory (LBNL, co-P.I. P.S. Novichkov). The developed computational resources include: (1) RegPredict web-platform for TRN inference and regulon reconstruction in microbial genomes, and (2) RegPrecise database for collection, visualization and comparative analysis of transcriptional regulons reconstructed by comparative genomics. These analytical resources were selected as key components in the DOE Systems Biology KnowledgeBase (SBKB). The high-quality data accumulated inmore » RegPrecise will provide essential datasets of reference regulons in diverse microbes to enable automatic reconstruction of draft TRNs in newly sequenced genomes. We outline our progress toward the three aims of this grant proposal, which were: Develop integrated platform for genome-scale regulon reconstruction; Infer regulatory annotations in several groups of bacteria and building of reference collections of microbial regulons; and Develop KnowledgeBase on microbial transcriptional regulation.« less

  7. Prokaryote genome fluidity: toward a system approach of the mobilome.

    PubMed

    Toussaint, Ariane; Chandler, Mick

    2012-01-01

    The importance of horizontal/lateral gene transfer (LGT) in shaping the genomes of prokaryotic organisms has been recognized in recent years as a result of analysis of the increasing number of available genome sequences. LGT is largely due to the transfer and recombination activities of mobile genetic elements (MGEs). Bacterial and archaeal genomes are mosaics of vertically and horizontally transmitted DNA segments. This generates reticulate relationships between members of the prokaryotic world that are better represented by networks than by "classical" phylogenetic trees. In this review we summarize the nature and activities of MGEs, and the problems that presently limit their analysis on a large scale. We propose routes to improve their annotation in the flow of genomic and metagenomic sequences that currently exist and those that become available. We describe network analysis of evolutionary relationships among some MGE categories and sketch out possible developments of this type of approach to get more insight into the role of the mobilome in bacterial adaptation and evolution.

  8. Bioinformatics and genomic analysis of transposable elements in eukaryotic genomes.

    PubMed

    Janicki, Mateusz; Rooke, Rebecca; Yang, Guojun

    2011-08-01

    A major portion of most eukaryotic genomes are transposable elements (TEs). During evolution, TEs have introduced profound changes to genome size, structure, and function. As integral parts of genomes, the dynamic presence of TEs will continue to be a major force in reshaping genomes. Early computational analyses of TEs in genome sequences focused on filtering out "junk" sequences to facilitate gene annotation. When the high abundance and diversity of TEs in eukaryotic genomes were recognized, these early efforts transformed into the systematic genome-wide categorization and classification of TEs. The availability of genomic sequence data reversed the classical genetic approaches to discovering new TE families and superfamilies. Curated TE databases and their accurate annotation of genome sequences in turn facilitated the studies on TEs in a number of frontiers including: (1) TE-mediated changes of genome size and structure, (2) the influence of TEs on genome and gene functions, (3) TE regulation by host, (4) the evolution of TEs and their population dynamics, and (5) genomic scale studies of TE activity. Bioinformatics and genomic approaches have become an integral part of large-scale studies on TEs to extract information with pure in silico analyses or to assist wet lab experimental studies. The current revolution in genome sequencing technology facilitates further progress in the existing frontiers of research and emergence of new initiatives. The rapid generation of large-sequence datasets at record low costs on a routine basis is challenging the computing industry on storage capacity and manipulation speed and the bioinformatics community for improvement in algorithms and their implementations.

  9. Population genomics reveals structure at the individual, host-tree scale and persistence of genotypic variants of the undomesticated yeast Saccharomyces paradoxus in a natural woodland.

    PubMed

    Xia, Wenjing; Nielly-Thibault, Lou; Charron, Guillaume; Landry, Christian R; Kasimer, Dahlia; Anderson, James B; Kohn, Linda M

    2017-02-01

    Genetic diversity in experimental, domesticated and wild populations of the related yeasts, Saccharomyces cerevisiae and Saccharomyces paradoxus, has been well described at the global scale. We investigated the population genomics of a local population on a small spatial scale to address two main questions. First, is there genomic variation in a S. paradoxus population at a spatial scale spanning centimetres (microsites) to tens of metres? Second, does the distribution of genomic variants persist over time? Our sample consisted of 42 S. paradoxus strains from 2014 and 43 strains from 2015 collected from the same 72 microsites around four host trees (Quercus rubra and Quercus alba) within 1 km 2 in a mixed hardwood forest in southern Ontario. Six additional S. paradoxus strains recovered from adjacent maple and beech trees in 2015 are also included in the sample. Whole-genome sequencing and genomic SNP analysis revealed five differentiated groups (clades) within the sampled area. The signal of persistence of genotypes in their microsites from 2014 to 2015 was highly significant. Isolates from the same tree tended to be more related than strains from different trees, with limited evidence of dispersal between trees. In growth assays, one genotype had a significantly longer lag phase than the other strains. Our results indicate that different clades coexist at fine spatial scale and that population structure persists over at least a one-year interval in these wild yeasts, suggesting the efficacy of yearly sampling to follow longer term genetic dynamics in future studies. © 2016 John Wiley & Sons Ltd.

  10. WormBase ParaSite - a comprehensive resource for helminth genomics.

    PubMed

    Howe, Kevin L; Bolt, Bruce J; Shafie, Myriam; Kersey, Paul; Berriman, Matthew

    2017-07-01

    The number of publicly available parasitic worm genome sequences has increased dramatically in the past three years, and research interest in helminth functional genomics is now quickly gathering pace in response to the foundation that has been laid by these collective efforts. A systematic approach to the organisation, curation, analysis and presentation of these data is clearly vital for maximising the utility of these data to researchers. We have developed a portal called WormBase ParaSite (http://parasite.wormbase.org) for interrogating helminth genomes on a large scale. Data from over 100 nematode and platyhelminth species are integrated, adding value by way of systematic and consistent functional annotation (e.g. protein domains and Gene Ontology terms), gene expression analysis (e.g. alignment of life-stage specific transcriptome data sets), and comparative analysis (e.g. orthologues and paralogues). We provide several ways of exploring the data, including genome browsers, genome and gene summary pages, text search, sequence search, a query wizard, bulk downloads, and programmatic interfaces. In this review, we provide an overview of the back-end infrastructure and analysis behind WormBase ParaSite, and the displays and tools available to users for interrogating helminth genomic data. Copyright © 2016 The Authors. Published by Elsevier B.V. All rights reserved.

  11. Network-assisted target identification for haploinsufficiency and homozygous profiling screens

    PubMed Central

    Wang, Sheng

    2017-01-01

    Chemical genomic screens have recently emerged as a systematic approach to drug discovery on a genome-wide scale. Drug target identification and elucidation of the mechanism of action (MoA) of hits from these noisy high-throughput screens remain difficult. Here, we present GIT (Genetic Interaction Network-Assisted Target Identification), a network analysis method for drug target identification in haploinsufficiency profiling (HIP) and homozygous profiling (HOP) screens. With the drug-induced phenotypic fitness defect of the deletion of a gene, GIT also incorporates the fitness defects of the gene’s neighbors in the genetic interaction network. On three genome-scale yeast chemical genomic screens, GIT substantially outperforms previous scoring methods on target identification on HIP and HOP assays, respectively. Finally, we showed that by combining HIP and HOP assays, GIT further boosts target identification and reveals potential drug’s mechanism of action. PMID:28574983

  12. Optimizing eukaryotic cell hosts for protein production through systems biotechnology and genome-scale modeling.

    PubMed

    Gutierrez, Jahir M; Lewis, Nathan E

    2015-07-01

    Eukaryotic cell lines, including Chinese hamster ovary cells, yeast, and insect cells, are invaluable hosts for the production of many recombinant proteins. With the advent of genomic resources, one can now leverage genome-scale computational modeling of cellular pathways to rationally engineer eukaryotic host cells. Genome-scale models of metabolism include all known biochemical reactions occurring in a specific cell. By describing these mathematically and using tools such as flux balance analysis, the models can simulate cell physiology and provide targets for cell engineering that could lead to enhanced cell viability, titer, and productivity. Here we review examples in which metabolic models in eukaryotic cell cultures have been used to rationally select targets for genetic modification, improve cellular metabolic capabilities, design media supplementation, and interpret high-throughput omics data. As more comprehensive models of metabolism and other cellular processes are developed for eukaryotic cell culture, these will enable further exciting developments in cell line engineering, thus accelerating recombinant protein production and biotechnology in the years to come. Copyright © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  13. Resources for Genetic and Genomic Analysis of Emerging Pathogen Acinetobacter baumannii

    PubMed Central

    Ramage, Elizabeth; Weiss, Eli J.; Radey, Matthew; Hayden, Hillary S.; Held, Kiara G.; Huse, Holly K.; Zurawski, Daniel V.; Brittnacher, Mitchell J.; Manoil, Colin

    2015-01-01

    ABSTRACT Acinetobacter baumannii is a Gram-negative bacterial pathogen notorious for causing serious nosocomial infections that resist antibiotic therapy. Research to identify factors responsible for the pathogen's success has been limited by the resources available for genome-scale experimental studies. This report describes the development of several such resources for A. baumannii strain AB5075, a recently characterized wound isolate that is multidrug resistant and displays robust virulence in animal models. We report the completion and annotation of the genome sequence, the construction of a comprehensive ordered transposon mutant library, the extension of high-coverage transposon mutant pool sequencing (Tn-seq) to the strain, and the identification of the genes essential for growth on nutrient-rich agar. These resources should facilitate large-scale genetic analysis of virulence, resistance, and other clinically relevant traits that make A. baumannii a formidable public health threat. IMPORTANCE Acinetobacter baumannii is one of six bacterial pathogens primarily responsible for antibiotic-resistant infections that have become the scourge of health care facilities worldwide. Eliminating such infections requires a deeper understanding of the factors that enable the pathogen to persist in hospital environments, establish infections, and resist antibiotics. We present a set of resources that should accelerate genome-scale genetic characterization of these traits for a reference isolate of A. baumannii that is highly virulent and representative of current outbreak strains. PMID:25845845

  14. CPTAC | Office of Cancer Clinical Proteomics Research

    Cancer.gov

    The National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics.

  15. Survey of protein–DNA interactions in Aspergillus oryzae on a genomic scale

    PubMed Central

    Wang, Chao; Lv, Yangyong; Wang, Bin; Yin, Chao; Lin, Ying; Pan, Li

    2015-01-01

    The genome-scale delineation of in vivo protein–DNA interactions is key to understanding genome function. Only ∼5% of transcription factors (TFs) in the Aspergillus genus have been identified using traditional methods. Although the Aspergillus oryzae genome contains >600 TFs, knowledge of the in vivo genome-wide TF-binding sites (TFBSs) in aspergilli remains limited because of the lack of high-quality antibodies. We investigated the landscape of in vivo protein–DNA interactions across the A. oryzae genome through coupling the DNase I digestion of intact nuclei with massively parallel sequencing and the analysis of cleavage patterns in protein–DNA interactions at single-nucleotide resolution. The resulting map identified overrepresented de novo TF-binding motifs from genomic footprints, and provided the detailed chromatin remodeling patterns and the distribution of digital footprints near transcription start sites. The TFBSs of 19 known Aspergillus TFs were also identified based on DNase I digestion data surrounding potential binding sites in conjunction with TF binding specificity information. We observed that the cleavage patterns of TFBSs were dependent on the orientation of TF motifs and independent of strand orientation, consistent with the DNA shape features of binding motifs with flanking sequences. PMID:25883143

  16. Genome-wide association study to identify common variants associated with brachial circumference: a meta-analysis of 14 cohorts.

    PubMed

    Boraska, Vesna; Day-Williams, Aaron; Franklin, Christopher S; Elliott, Katherine S; Panoutsopoulou, Kalliope; Tachmazidou, Ioanna; Albrecht, Eva; Bandinelli, Stefania; Beilin, Lawrence J; Bochud, Murielle; Cadby, Gemma; Ernst, Florian; Evans, David M; Hayward, Caroline; Hicks, Andrew A; Huffman, Jennifer; Huth, Cornelia; James, Alan L; Klopp, Norman; Kolcic, Ivana; Kutalik, Zoltán; Lawlor, Debbie A; Musk, Arthur W; Pehlic, Marina; Pennell, Craig E; Perry, John R B; Peters, Annette; Polasek, Ozren; St Pourcain, Beate; Ring, Susan M; Salvi, Erika; Schipf, Sabine; Staessen, Jan A; Teumer, Alexander; Timpson, Nicholas; Vitart, Veronique; Warrington, Nicole M; Yaghootkar, Hanieh; Zemunik, Tatijana; Zgaga, Lina; An, Ping; Anttila, Verneri; Borecki, Ingrid B; Holmen, Jostein; Ntalla, Ioanna; Palotie, Aarno; Pietiläinen, Kirsi H; Wedenoja, Juho; Winsvold, Bendik S; Dedoussis, George V; Kaprio, Jaakko; Province, Michael A; Zwart, John-Anker; Burnier, Michel; Campbell, Harry; Cusi, Daniele; Smith, George Davey; Frayling, Timothy M; Gieger, Christian; Palmer, Lyle J; Pramstaller, Peter P; Rudan, Igor; Völzke, Henry; Wichmann, H-Erich; Wright, Alan F; Zeggini, Eleftheria

    2012-01-01

    Brachial circumference (BC), also known as upper arm or mid arm circumference, can be used as an indicator of muscle mass and fat tissue, which are distributed differently in men and women. Analysis of anthropometric measures of peripheral fat distribution such as BC could help in understanding the complex pathophysiology behind overweight and obesity. The purpose of this study is to identify genetic variants associated with BC through a large-scale genome-wide association scan (GWAS) meta-analysis. We used fixed-effects meta-analysis to synthesise summary results across 14 GWAS discovery and 4 replication cohorts comprising overall 22,376 individuals (12,031 women and 10,345 men) of European ancestry. Individual analyses were carried out for men, women, and combined across sexes using linear regression and an additive genetic model: adjusted for age and adjusted for age and BMI. We prioritised signals for follow-up in two-stages. We did not detect any signals reaching genome-wide significance. The FTO rs9939609 SNP showed nominal evidence for association (p<0.05) in the age-adjusted strata for men and across both sexes. In this first GWAS meta-analysis for BC to date, we have not identified any genome-wide significant signals and do not observe robust association of previously established obesity loci with BC. Large-scale collaborations will be necessary to achieve higher power to detect loci underlying BC.

  17. Genome-Wide Association Study to Identify Common Variants Associated with Brachial Circumference: A Meta-Analysis of 14 Cohorts

    PubMed Central

    Boraska, Vesna; Day-Williams, Aaron; Franklin, Christopher S.; Elliott, Katherine S.; Panoutsopoulou, Kalliope; Tachmazidou, Ioanna; Albrecht, Eva; Bandinelli, Stefania; Beilin, Lawrence J.; Bochud, Murielle; Cadby, Gemma; Ernst, Florian; Evans, David M.; Hayward, Caroline; Hicks, Andrew A.; Huffman, Jennifer; Huth, Cornelia; James, Alan L.; Klopp, Norman; Kolcic, Ivana; Kutalik, Zoltán; Lawlor, Debbie A.; Musk, Arthur W.; Pehlic, Marina; Pennell, Craig E.; Perry, John R. B.; Peters, Annette; Polasek, Ozren; Pourcain, Beate St; Ring, Susan M.; Salvi, Erika; Schipf, Sabine; Staessen, Jan A.; Teumer, Alexander; Timpson, Nicholas; Vitart, Veronique; Warrington, Nicole M.; Yaghootkar, Hanieh; Zemunik, Tatijana; Zgaga, Lina; An, Ping; Anttila, Verneri; Borecki, Ingrid B.; Holmen, Jostein; Ntalla, Ioanna; Palotie, Aarno; Pietiläinen, Kirsi H.; Wedenoja, Juho; Winsvold, Bendik S.; Dedoussis, George V.; Kaprio, Jaakko; Province, Michael A.; Zwart, John-Anker; Burnier, Michel; Campbell, Harry; Cusi, Daniele; Davey Smith, George; Frayling, Timothy M.; Gieger, Christian; Palmer, Lyle J.; Pramstaller, Peter P.; Rudan, Igor; Völzke, Henry; Wichmann, H. -Erich; Wright, Alan F.; Zeggini, Eleftheria

    2012-01-01

    Brachial circumference (BC), also known as upper arm or mid arm circumference, can be used as an indicator of muscle mass and fat tissue, which are distributed differently in men and women. Analysis of anthropometric measures of peripheral fat distribution such as BC could help in understanding the complex pathophysiology behind overweight and obesity. The purpose of this study is to identify genetic variants associated with BC through a large-scale genome-wide association scan (GWAS) meta-analysis. We used fixed-effects meta-analysis to synthesise summary results across 14 GWAS discovery and 4 replication cohorts comprising overall 22,376 individuals (12,031 women and 10,345 men) of European ancestry. Individual analyses were carried out for men, women, and combined across sexes using linear regression and an additive genetic model: adjusted for age and adjusted for age and BMI. We prioritised signals for follow-up in two-stages. We did not detect any signals reaching genome-wide significance. The FTO rs9939609 SNP showed nominal evidence for association (p<0.05) in the age-adjusted strata for men and across both sexes. In this first GWAS meta-analysis for BC to date, we have not identified any genome-wide significant signals and do not observe robust association of previously established obesity loci with BC. Large-scale collaborations will be necessary to achieve higher power to detect loci underlying BC. PMID:22479309

  18. Accurate evaluation and analysis of functional genomics data and methods

    PubMed Central

    Greene, Casey S.; Troyanskaya, Olga G.

    2016-01-01

    The development of technology capable of inexpensively performing large-scale measurements of biological systems has generated a wealth of data. Integrative analysis of these data holds the promise of uncovering gene function, regulation, and, in the longer run, understanding complex disease. However, their analysis has proved very challenging, as it is difficult to quickly and effectively assess the relevance and accuracy of these data for individual biological questions. Here, we identify biases that present challenges for the assessment of functional genomics data and methods. We then discuss evaluation methods that, taken together, begin to address these issues. We also argue that the funding of systematic data-driven experiments and of high-quality curation efforts will further improve evaluation metrics so that they more-accurately assess functional genomics data and methods. Such metrics will allow researchers in the field of functional genomics to continue to answer important biological questions in a data-driven manner. PMID:22268703

  19. How do students react to analyzing their own genomes in a whole-genome sequencing course?: outcomes of a longitudinal cohort study.

    PubMed

    Sanderson, Saskia C; Linderman, Michael D; Zinberg, Randi; Bashir, Ali; Kasarskis, Andrew; Zweig, Micol; Suckiel, Sabrina; Shah, Hardik; Mahajan, Milind; Diaz, George A; Schadt, Eric E

    2015-11-01

    Health-care professionals need to be trained to work with whole-genome sequencing (WGS) in their practice. Our aim was to explore how students responded to a novel genome analysis course that included the option to analyze their own genomes. This was an observational cohort study. Questionnaires were administered before (T3) and after the genome analysis course (T4), as well as 6 months later (T5). In-depth interviews were conducted at T5. All students (n = 19) opted to analyze their own genomes. At T5, 12 of 15 students stated that analyzing their own genomes had been useful. Ten reported they had applied their knowledge in the workplace. Technical WGS knowledge increased (mean of 63.8% at T3, mean of 72.5% at T4; P = 0.005). In-depth interviews suggested that analyzing their own genomes may increase students' motivation to learn and their understanding of the patient experience. Most (but not all) of the students reported low levels of WGS results-related distress and low levels of regret about their decision to analyze their own genomes. Giving students the option of analyzing their own genomes may increase motivation to learn, but some students may experience personal WGS results-related distress and regret. Additional evidence is required before considering incorporating optional personal genome analysis into medical education on a large scale.

  20. Re-annotation, improved large-scale assembly and establishment of a catalogue of noncoding loci for the genome of the model brown alga Ectocarpus.

    PubMed

    Cormier, Alexandre; Avia, Komlan; Sterck, Lieven; Derrien, Thomas; Wucher, Valentin; Andres, Gwendoline; Monsoor, Misharl; Godfroy, Olivier; Lipinska, Agnieszka; Perrineau, Marie-Mathilde; Van De Peer, Yves; Hitte, Christophe; Corre, Erwan; Coelho, Susana M; Cock, J Mark

    2017-04-01

    The genome of the filamentous brown alga Ectocarpus was the first to be completely sequenced from within the brown algal group and has served as a key reference genome both for this lineage and for the stramenopiles. We present a complete structural and functional reannotation of the Ectocarpus genome. The large-scale assembly of the Ectocarpus genome was significantly improved and genome-wide gene re-annotation using extensive RNA-seq data improved the structure of 11 108 existing protein-coding genes and added 2030 new loci. A genome-wide analysis of splicing isoforms identified an average of 1.6 transcripts per locus. A large number of previously undescribed noncoding genes were identified and annotated, including 717 loci that produce long noncoding RNAs. Conservation of lncRNAs between Ectocarpus and another brown alga, the kelp Saccharina japonica, suggests that at least a proportion of these loci serve a function. Finally, a large collection of single nucleotide polymorphism-based markers was developed for genetic analyses. These resources are available through an updated and improved genome database. This study significantly improves the utility of the Ectocarpus genome as a high-quality reference for the study of many important aspects of brown algal biology and as a reference for genomic analyses across the stramenopiles. © 2016 The Authors. New Phytologist © 2016 New Phytologist Trust.

  1. WEbcoli: an interactive and asynchronous web application for in silico design and analysis of genome-scale E.coli model.

    PubMed

    Jung, Tae-Sung; Yeo, Hock Chuan; Reddy, Satty G; Cho, Wan-Sup; Lee, Dong-Yup

    2009-11-01

    WEbcoli is a WEb application for in silico designing, analyzing and engineering Escherichia coli metabolism. It is devised and implemented using advanced web technologies, thereby leading to enhanced usability and dynamic web accessibility. As a main feature, the WEbcoli system provides a user-friendly rich web interface, allowing users to virtually design and synthesize mutant strains derived from the genome-scale wild-type E.coli model and to customize pathways of interest through a graph editor. In addition, constraints-based flux analysis can be conducted for quantifying metabolic fluxes and charactering the physiological and metabolic states under various genetic and/or environmental conditions. WEbcoli is freely accessible at http://webcoli.org. cheld@nus.edu.sg.

  2. Sybil--efficient constraint-based modelling in R.

    PubMed

    Gelius-Dietrich, Gabriel; Desouki, Abdelmoneim Amer; Fritzemeier, Claus Jonathan; Lercher, Martin J

    2013-11-13

    Constraint-based analyses of metabolic networks are widely used to simulate the properties of genome-scale metabolic networks. Publicly available implementations tend to be slow, impeding large scale analyses such as the genome-wide computation of pairwise gene knock-outs, or the automated search for model improvements. Furthermore, available implementations cannot easily be extended or adapted by users. Here, we present sybil, an open source software library for constraint-based analyses in R; R is a free, platform-independent environment for statistical computing and graphics that is widely used in bioinformatics. Among other functions, sybil currently provides efficient methods for flux-balance analysis (FBA), MOMA, and ROOM that are about ten times faster than previous implementations when calculating the effect of whole-genome single gene deletions in silico on a complete E. coli metabolic model. Due to the object-oriented architecture of sybil, users can easily build analysis pipelines in R or even implement their own constraint-based algorithms. Based on its highly efficient communication with different mathematical optimisation programs, sybil facilitates the exploration of high-dimensional optimisation problems on small time scales. Sybil and all its dependencies are open source. Sybil and its documentation are available for download from the comprehensive R archive network (CRAN).

  3. IMG-ABC: A Knowledge Base To Fuel Discovery of Biosynthetic Gene Clusters and Novel Secondary Metabolites.

    PubMed

    Hadjithomas, Michalis; Chen, I-Min Amy; Chu, Ken; Ratner, Anna; Palaniappan, Krishna; Szeto, Ernest; Huang, Jinghua; Reddy, T B K; Cimermančič, Peter; Fischbach, Michael A; Ivanova, Natalia N; Markowitz, Victor M; Kyrpides, Nikos C; Pati, Amrita

    2015-07-14

    In the discovery of secondary metabolites, analysis of sequence data is a promising exploration path that remains largely underutilized due to the lack of computational platforms that enable such a systematic approach on a large scale. In this work, we present IMG-ABC (https://img.jgi.doe.gov/abc), an atlas of biosynthetic gene clusters within the Integrated Microbial Genomes (IMG) system, which is aimed at harnessing the power of "big" genomic data for discovering small molecules. IMG-ABC relies on IMG's comprehensive integrated structural and functional genomic data for the analysis of biosynthetic gene clusters (BCs) and associated secondary metabolites (SMs). SMs and BCs serve as the two main classes of objects in IMG-ABC, each with a rich collection of attributes. A unique feature of IMG-ABC is the incorporation of both experimentally validated and computationally predicted BCs in genomes as well as metagenomes, thus identifying BCs in uncultured populations and rare taxa. We demonstrate the strength of IMG-ABC's focused integrated analysis tools in enabling the exploration of microbial secondary metabolism on a global scale, through the discovery of phenazine-producing clusters for the first time in Alphaproteobacteria. IMG-ABC strives to fill the long-existent void of resources for computational exploration of the secondary metabolism universe; its underlying scalable framework enables traversal of uncovered phylogenetic and chemical structure space, serving as a doorway to a new era in the discovery of novel molecules. IMG-ABC is the largest publicly available database of predicted and experimental biosynthetic gene clusters and the secondary metabolites they produce. The system also includes powerful search and analysis tools that are integrated with IMG's extensive genomic/metagenomic data and analysis tool kits. As new research on biosynthetic gene clusters and secondary metabolites is published and more genomes are sequenced, IMG-ABC will continue to expand, with the goal of becoming an essential component of any bioinformatic exploration of the secondary metabolism world. Copyright © 2015 Hadjithomas et al.

  4. Comparative genomic analysis of clinical and environmental strains provides insight into the pathogenicity and evolution of Vibrio parahaemolyticus.

    PubMed

    Li, Lei; Wong, Hin-chung; Nong, Wenyan; Cheung, Man Kit; Law, Patrick Tik Wan; Kam, Kai Man; Kwan, Hoi Shan

    2014-12-18

    Vibrio parahaemolyticus is a Gram-negative halophilic bacterium. Infections with the bacterium could become systemic and can be life-threatening to immunocompromised individuals. Genome sequences of a few clinical isolates of V. parahaemolyticus are currently available, but the genome dynamics across the species and virulence potential of environmental strains on a genome-scale have not been described before. Here we present genome sequences of four V. parahaemolyticus clinical strains from stool samples of patients and five environmental strains in Hong Kong. Phylogenomics analysis based on single nucleotide polymorphisms revealed a clear distinction between the clinical and environmental isolates. A new gene cluster belonging to the biofilm associated proteins of V. parahaemolyticus was found in clincial strains. In addition, a novel small genomic island frequently found among clinical isolates was reported. A few environmental strains were found harboring virulence genes and prophage elements, indicating their virulence potential. A unique biphenyl degradation pathway was also reported. A database for V. parahaemolyticus (http://kwanlab.bio.cuhk.edu.hk/vp) was constructed here as a platform to access and analyze genome sequences and annotations of the bacterium. We have performed a comparative genomics analysis of clinical and environmental strains of V. parahaemolyticus. Our analyses could facilitate understanding of the phylogenetic diversity and niche adaptation of this bacterium.

  5. Short and long-term genome stability analysis of prokaryotic genomes.

    PubMed

    Brilli, Matteo; Liò, Pietro; Lacroix, Vincent; Sagot, Marie-France

    2013-05-08

    Gene organization dynamics is actively studied because it provides useful evolutionary information, makes functional annotation easier and often enables to characterize pathogens. There is therefore a strong interest in understanding the variability of this trait and the possible correlations with life-style. Two kinds of events affect genome organization: on one hand translocations and recombinations change the relative position of genes shared by two genomes (i.e. the backbone gene order); on the other, insertions and deletions leave the backbone gene order unchanged but they alter the gene neighborhoods by breaking the syntenic regions. A complete picture about genome organization evolution therefore requires to account for both kinds of events. We developed an approach where we model chromosomes as graphs on which we compute different stability estimators; we consider genome rearrangements as well as the effect of gene insertions and deletions. In a first part of the paper, we fit a measure of backbone gene order conservation (hereinafter called backbone stability) against phylogenetic distance for over 3000 genome comparisons, improving existing models for the divergence in time of backbone stability. Intra- and inter-specific comparisons were treated separately to focus on different time-scales. The use of multiple genomes of a same species allowed to identify genomes with diverging gene order with respect to their conspecific. The inter-species analysis indicates that pathogens are more often unstable with respect to non-pathogens. In a second part of the text, we show that in pathogens, gene content dynamics (insertions and deletions) have a much more dramatic effect on genome organization stability than backbone rearrangements. In this work, we studied genome organization divergence taking into account the contribution of both genome order rearrangements and genome content dynamics. By studying species with multiple sequenced genomes available, we were able to explore genome organization stability at different time-scales and to find significant differences for pathogen and non-pathogen species. The output of our framework also allows to identify the conserved gene clusters and/or partial occurrences thereof, making possible to explore how gene clusters assembled during evolution.

  6. Insights from Human/Mouse genome comparisons

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pennacchio, Len A.

    2003-03-30

    Large-scale public genomic sequencing efforts have provided a wealth of vertebrate sequence data poised to provide insights into mammalian biology. These include deep genomic sequence coverage of human, mouse, rat, zebrafish, and two pufferfish (Fugu rubripes and Tetraodon nigroviridis) (Aparicio et al. 2002; Lander et al. 2001; Venter et al. 2001; Waterston et al. 2002). In addition, a high-priority has been placed on determining the genomic sequence of chimpanzee, dog, cow, frog, and chicken (Boguski 2002). While only recently available, whole genome sequence data have provided the unique opportunity to globally compare complete genome contents. Furthermore, the shared evolutionary ancestrymore » of vertebrate species has allowed the development of comparative genomic approaches to identify ancient conserved sequences with functionality. Accordingly, this review focuses on the initial comparison of available mammalian genomes and describes various insights derived from such analysis.« less

  7. Genomic correlates of recombination rate and its variability across eight recombination maps in the western honey bee (Apis mellifera L.).

    PubMed

    Ross, Caitlin R; DeFelice, Dominick S; Hunt, Greg J; Ihle, Kate E; Amdam, Gro V; Rueppell, Olav

    2015-02-21

    Meiotic recombination has traditionally been explained based on the structural requirement to stabilize homologous chromosome pairs to ensure their proper meiotic segregation. Competing hypotheses seek to explain the emerging findings of significant heterogeneity in recombination rates within and between genomes, but intraspecific comparisons of genome-wide recombination patterns are rare. The honey bee (Apis mellifera) exhibits the highest rate of genomic recombination among multicellular animals with about five cross-over events per chromatid. Here, we present a comparative analysis of recombination rates across eight genetic linkage maps of the honey bee genome to investigate which genomic sequence features are correlated with recombination rate and with its variation across the eight data sets, ranging in average marker spacing ranging from 1 Mbp to 120 kbp. Overall, we found that GC content explained best the variation in local recombination rate along chromosomes at the analyzed 100 kbp scale. In contrast, variation among the different maps was correlated to the abundance of microsatellites and several specific tri- and tetra-nucleotides. The combined evidence from eight medium-scale recombination maps of the honey bee genome suggests that recombination rate variation in this highly recombining genome might be due to the DNA configuration instead of distinct sequence motifs. However, more fine-scale analyses are needed. The empirical basis of eight differing genetic maps allowed for robust conclusions about the correlates of the local recombination rates and enabled the study of the relation between DNA features and variability in local recombination rates, which is particularly relevant in the honey bee genome with its exceptionally high recombination rate.

  8. Hi-C-constrained physical models of human chromosomes recover functionally-related properties of genome organization

    NASA Astrophysics Data System (ADS)

    di Stefano, Marco; Paulsen, Jonas; Lien, Tonje G.; Hovig, Eivind; Micheletti, Cristian

    2016-10-01

    Combining genome-wide structural models with phenomenological data is at the forefront of efforts to understand the organizational principles regulating the human genome. Here, we use chromosome-chromosome contact data as knowledge-based constraints for large-scale three-dimensional models of the human diploid genome. The resulting models remain minimally entangled and acquire several functional features that are observed in vivo and that were never used as input for the model. We find, for instance, that gene-rich, active regions are drawn towards the nuclear center, while gene poor and lamina associated domains are pushed to the periphery. These and other properties persist upon adding local contact constraints, suggesting their compatibility with non-local constraints for the genome organization. The results show that suitable combinations of data analysis and physical modelling can expose the unexpectedly rich functionally-related properties implicit in chromosome-chromosome contact data. Specific directions are suggested for further developments based on combining experimental data analysis and genomic structural modelling.

  9. Hi-C-constrained physical models of human chromosomes recover functionally-related properties of genome organization.

    PubMed

    Di Stefano, Marco; Paulsen, Jonas; Lien, Tonje G; Hovig, Eivind; Micheletti, Cristian

    2016-10-27

    Combining genome-wide structural models with phenomenological data is at the forefront of efforts to understand the organizational principles regulating the human genome. Here, we use chromosome-chromosome contact data as knowledge-based constraints for large-scale three-dimensional models of the human diploid genome. The resulting models remain minimally entangled and acquire several functional features that are observed in vivo and that were never used as input for the model. We find, for instance, that gene-rich, active regions are drawn towards the nuclear center, while gene poor and lamina associated domains are pushed to the periphery. These and other properties persist upon adding local contact constraints, suggesting their compatibility with non-local constraints for the genome organization. The results show that suitable combinations of data analysis and physical modelling can expose the unexpectedly rich functionally-related properties implicit in chromosome-chromosome contact data. Specific directions are suggested for further developments based on combining experimental data analysis and genomic structural modelling.

  10. G2S: a web-service for annotating genomic variants on 3D protein structures.

    PubMed

    Wang, Juexin; Sheridan, Robert; Sumer, S Onur; Schultz, Nikolaus; Xu, Dong; Gao, Jianjiong

    2018-06-01

    Accurately mapping and annotating genomic locations on 3D protein structures is a key step in structure-based analysis of genomic variants detected by recent large-scale sequencing efforts. There are several mapping resources currently available, but none of them provides a web API (Application Programming Interface) that supports programmatic access. We present G2S, a real-time web API that provides automated mapping of genomic variants on 3D protein structures. G2S can align genomic locations of variants, protein locations, or protein sequences to protein structures and retrieve the mapped residues from structures. G2S API uses REST-inspired design and it can be used by various clients such as web browsers, command terminals, programming languages and other bioinformatics tools for bringing 3D structures into genomic variant analysis. The webserver and source codes are freely available at https://g2s.genomenexus.org. g2s@genomenexus.org. Supplementary data are available at Bioinformatics online.

  11. From genes to genomes: a new paradigm for studying fungal pathogenesis in Magnaporthe oryzae.

    PubMed

    Xu, Jin-Rong; Zhao, Xinhua; Dean, Ralph A

    2007-01-01

    Magnaporthe oryzae is the most destructive fungal pathogen of rice worldwide and because of its amenability to classical and molecular genetic manipulation, availability of a genome sequence, and other resources it has emerged as a leading model system to study host-pathogen interactions. This chapter reviews recent progress toward elucidation of the molecular basis of infection-related morphogenesis, host penetration, invasive growth, and host-pathogen interactions. Related information on genome analysis and genomic studies of plant infection processes is summarized under specific topics where appropriate. Particular emphasis is placed on the role of MAP kinase and cAMP signal transduction pathways and unique features in the genome such as repetitive sequences and expanded gene families. Emerging developments in functional genome analysis through large-scale insertional mutagenesis and gene expression profiling are detailed. The chapter concludes with new prospects in the area of systems biology, such as protein expression profiling, and highlighting remaining crucial information needed to fully appreciate host-pathogen interactions.

  12. Defining the transcriptome assembly and its use for genome dynamics and transcriptome profiling studies in pigeonpea (Cajanus cajan L.)

    USDA-ARS?s Scientific Manuscript database

    This study reports generation of large-scale genomic resources for pigeonpea, a so-called ‘orphan crop species’ of the semi-arid tropic regions. Roche FLX/454 sequencing was carried out on a normalized cDNA pool prepared from 31 tissues produced 494,353 short transcript reads (STRs). Cluster analysi...

  13. pico-PLAZA, a genome database of microbial photosynthetic eukaryotes.

    PubMed

    Vandepoele, Klaas; Van Bel, Michiel; Richard, Guilhem; Van Landeghem, Sofie; Verhelst, Bram; Moreau, Hervé; Van de Peer, Yves; Grimsley, Nigel; Piganeau, Gwenael

    2013-08-01

    With the advent of next generation genome sequencing, the number of sequenced algal genomes and transcriptomes is rapidly growing. Although a few genome portals exist to browse individual genome sequences, exploring complete genome information from multiple species for the analysis of user-defined sequences or gene lists remains a major challenge. pico-PLAZA is a web-based resource (http://bioinformatics.psb.ugent.be/pico-plaza/) for algal genomics that combines different data types with intuitive tools to explore genomic diversity, perform integrative evolutionary sequence analysis and study gene functions. Apart from homologous gene families, multiple sequence alignments, phylogenetic trees, Gene Ontology, InterPro and text-mining functional annotations, different interactive viewers are available to study genome organization using gene collinearity and synteny information. Different search functions, documentation pages, export functions and an extensive glossary are available to guide non-expert scientists. To illustrate the versatility of the platform, different case studies are presented demonstrating how pico-PLAZA can be used to functionally characterize large-scale EST/RNA-Seq data sets and to perform environmental genomics. Functional enrichments analysis of 16 Phaeodactylum tricornutum transcriptome libraries offers a molecular view on diatom adaptation to different environments of ecological relevance. Furthermore, we show how complementary genomic data sources can easily be combined to identify marker genes to study the diversity and distribution of algal species, for example in metagenomes, or to quantify intraspecific diversity from environmental strains. © 2013 John Wiley & Sons Ltd and Society for Applied Microbiology.

  14. InSilico DB genomic datasets hub: an efficient starting point for analyzing genome-wide studies in GenePattern, Integrative Genomics Viewer, and R/Bioconductor.

    PubMed

    Coletta, Alain; Molter, Colin; Duqué, Robin; Steenhoff, David; Taminau, Jonatan; de Schaetzen, Virginie; Meganck, Stijn; Lazar, Cosmin; Venet, David; Detours, Vincent; Nowé, Ann; Bersini, Hugues; Weiss Solís, David Y

    2012-11-18

    Genomics datasets are increasingly useful for gaining biomedical insights, with adoption in the clinic underway. However, multiple hurdles related to data management stand in the way of their efficient large-scale utilization. The solution proposed is a web-based data storage hub. Having clear focus, flexibility and adaptability, InSilico DB seamlessly connects genomics dataset repositories to state-of-the-art and free GUI and command-line data analysis tools. The InSilico DB platform is a powerful collaborative environment, with advanced capabilities for biocuration, dataset sharing, and dataset subsetting and combination. InSilico DB is available from https://insilicodb.org.

  15. Use of an uncertainty analysis for genome-scale models as a prediction tool for microbial growth processes in subsurface environments.

    PubMed

    Klier, Christine

    2012-03-06

    The integration of genome-scale, constraint-based models of microbial cell function into simulations of contaminant transport and fate in complex groundwater systems is a promising approach to help characterize the metabolic activities of microorganisms in natural environments. In constraint-based modeling, the specific uptake flux rates of external metabolites are usually determined by Michaelis-Menten kinetic theory. However, extensive data sets based on experimentally measured values are not always available. In this study, a genome-scale model of Pseudomonas putida was used to study the key issue of uncertainty arising from the parametrization of the influx of two growth-limiting substrates: oxygen and toluene. The results showed that simulated growth rates are highly sensitive to substrate affinity constants and that uncertainties in specific substrate uptake rates have a significant influence on the variability of simulated microbial growth. Michaelis-Menten kinetic theory does not, therefore, seem to be appropriate for descriptions of substrate uptake processes in the genome-scale model of P. putida. Microbial growth rates of P. putida in subsurface environments can only be accurately predicted if the processes of complex substrate transport and microbial uptake regulation are sufficiently understood in natural environments and if data-driven uptake flux constraints can be applied.

  16. Landscape genomics in Atlantic salmon (Salmo salar): searching for gene-environment interactions driving local adaptation.

    PubMed

    Vincent, Bourret; Dionne, Mélanie; Kent, Matthew P; Lien, Sigbjørn; Bernatchez, Louis

    2013-12-01

    A growing number of studies are examining the factors driving historical and contemporary evolution in wild populations. By combining surveys of genomic variation with a comprehensive assessment of environmental parameters, such studies can increase our understanding of the genomic and geographical extent of local adaptation in wild populations. We used a large-scale landscape genomics approach to examine adaptive and neutral differentiation across 54 North American populations of Atlantic salmon representing seven previously defined genetically distinct regional groups. Over 5500 genome-wide single nucleotide polymorphisms were genotyped in 641 individuals and 28 bulk assays of 25 pooled individuals each. Genome scans, linkage map, and 49 environmental variables were combined to conduct an innovative landscape genomic analysis. Our results provide valuable insight into the links between environmental variation and both neutral and potentially adaptive genetic divergence. In particular, we identified markers potentially under divergent selection, as well as associated selective environmental factors and biological functions with the observed adaptive divergence. Multivariate landscape genetic analysis revealed strong associations of both genetic and environmental structures. We found an enrichment of growth-related functions among outlier markers. Climate (temperature-precipitation) and geological characteristics were significantly associated with both potentially adaptive and neutral genetic divergence and should be considered as candidate loci involved in adaptation at the regional scale in Atlantic salmon. Hence, this study significantly contributes to the improvement of tools used in modern conservation and management schemes of Atlantic salmon wild populations. © 2013 The Author(s). Evolution © 2013 The Society for the Study of Evolution.

  17. The genome sequence of E. coli W (ATCC 9637): comparative genome analysis and an improved genome-scale reconstruction of E. coli

    PubMed Central

    2011-01-01

    Background Escherichia coli is a model prokaryote, an important pathogen, and a key organism for industrial biotechnology. E. coli W (ATCC 9637), one of four strains designated as safe for laboratory purposes, has not been sequenced. E. coli W is a fast-growing strain and is the only safe strain that can utilize sucrose as a carbon source. Lifecycle analysis has demonstrated that sucrose from sugarcane is a preferred carbon source for industrial bioprocesses. Results We have sequenced and annotated the genome of E. coli W. The chromosome is 4,900,968 bp and encodes 4,764 ORFs. Two plasmids, pRK1 (102,536 bp) and pRK2 (5,360 bp), are also present. W has unique features relative to other sequenced laboratory strains (K-12, B and Crooks): it has a larger genome and belongs to phylogroup B1 rather than A. W also grows on a much broader range of carbon sources than does K-12. A genome-scale reconstruction was developed and validated in order to interrogate metabolic properties. Conclusions The genome of W is more similar to commensal and pathogenic B1 strains than phylogroup A strains, and therefore has greater utility for comparative analyses with these strains. W should therefore be the strain of choice, or 'type strain' for group B1 comparative analyses. The genome annotation and tools created here are expected to allow further utilization and development of E. coli W as an industrial organism for sucrose-based bioprocesses. Refinements in our E. coli metabolic reconstruction allow it to more accurately define E. coli metabolism relative to previous models. PMID:21208457

  18. Recent advances in ChIP-seq analysis: from quality management to whole-genome annotation.

    PubMed

    Nakato, Ryuichiro; Shirahige, Katsuhiko

    2017-03-01

    Chromatin immunoprecipitation followed by sequencing (ChIP-seq) analysis can detect protein/DNA-binding and histone-modification sites across an entire genome. Recent advances in sequencing technologies and analyses enable us to compare hundreds of samples simultaneously; such large-scale analysis has potential to reveal the high-dimensional interrelationship level for regulatory elements and annotate novel functional genomic regions de novo. Because many experimental considerations are relevant to the choice of a method in a ChIP-seq analysis, the overall design and quality management of the experiment are of critical importance. This review offers guiding principles of computation and sample preparation for ChIP-seq analyses, highlighting the validity and limitations of the state-of-the-art procedures at each step. We also discuss the latest challenges of single-cell analysis that will encourage a new era in this field. © The Author 2016. Published by Oxford University Press.

  19. Molecular inversion probe assay.

    PubMed

    Absalan, Farnaz; Ronaghi, Mostafa

    2007-01-01

    We have described molecular inversion probe technologies for large-scale genetic analyses. This technique provides a comprehensive and powerful tool for the analysis of genetic variation and enables affordable, large-scale studies that will help uncover the genetic basis of complex disease and explain the individual variation in response to therapeutics. Major applications of the molecular inversion probes (MIP) technologies include targeted genotyping from focused regions to whole-genome studies, and allele quantification of genomic rearrangements. The MIP technology (used in the HapMap project) provides an efficient, scalable, and affordable way to score polymorphisms in case/control populations for genetic studies. The MIP technology provides the highest commercially available multiplexing levels and assay conversion rates for targeted genotyping. This enables more informative, genome-wide studies with either the functional (direct detection) approach or the indirect detection approach.

  20. GAP Final Technical Report 12-14-04

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Andrew J. Bordner, PhD, Senior Research Scientist

    2004-12-14

    The Genomics Annotation Platform (GAP) was designed to develop new tools for high throughput functional annotation and characterization of protein sequences and structures resulting from genomics and structural proteomics, benchmarking and application of those tools. Furthermore, this platform integrated the genomic scale sequence and structural analysis and prediction tools with the advanced structure prediction and bioinformatics environment of ICM. The development of GAP was primarily oriented towards the annotation of new biomolecular structures using both structural and sequence data. Even though the amount of protein X-ray crystal data is growing exponentially, the volume of sequence data is growing even moremore » rapidly. This trend was exploited by leveraging the wealth of sequence data to provide functional annotation for protein structures. The additional information provided by GAP is expected to assist the majority of the commercial users of ICM, who are involved in drug discovery, in identifying promising drug targets as well in devising strategies for the rational design of therapeutics directed at the protein of interest. The GAP also provided valuable tools for biochemistry education, and structural genomics centers. In addition, GAP incorporates many novel prediction and analysis methods not available in other molecular modeling packages. This development led to signing the first Molsoft agreement in the structural genomics annotation area with the University of oxford Structural Genomics Center. This commercial agreement validated the Molsoft efforts under the GAP project and provided the basis for further development of the large scale functional annotation platform.« less

  1. NCBI prokaryotic genome annotation pipeline.

    PubMed

    Tatusova, Tatiana; DiCuccio, Michael; Badretdin, Azat; Chetvernin, Vyacheslav; Nawrocki, Eric P; Zaslavsky, Leonid; Lomsadze, Alexandre; Pruitt, Kim D; Borodovsky, Mark; Ostell, James

    2016-08-19

    Recent technological advances have opened unprecedented opportunities for large-scale sequencing and analysis of populations of pathogenic species in disease outbreaks, as well as for large-scale diversity studies aimed at expanding our knowledge across the whole domain of prokaryotes. To meet the challenge of timely interpretation of structure, function and meaning of this vast genetic information, a comprehensive approach to automatic genome annotation is critically needed. In collaboration with Georgia Tech, NCBI has developed a new approach to genome annotation that combines alignment based methods with methods of predicting protein-coding and RNA genes and other functional elements directly from sequence. A new gene finding tool, GeneMarkS+, uses the combined evidence of protein and RNA placement by homology as an initial map of annotation to generate and modify ab initio gene predictions across the whole genome. Thus, the new NCBI's Prokaryotic Genome Annotation Pipeline (PGAP) relies more on sequence similarity when confident comparative data are available, while it relies more on statistical predictions in the absence of external evidence. The pipeline provides a framework for generation and analysis of annotation on the full breadth of prokaryotic taxonomy. For additional information on PGAP see https://www.ncbi.nlm.nih.gov/genome/annotation_prok/ and the NCBI Handbook, https://www.ncbi.nlm.nih.gov/books/NBK174280/. Published by Oxford University Press on behalf of Nucleic Acids Research 2016. This work is written by (a) US Government employee(s) and is in the public domain in the US.

  2. Understanding the direction of evolution in Burkholderia glumae through comparative genomics.

    PubMed

    Lee, Hyun-Hee; Park, Jungwook; Kim, Jinnyun; Park, Inmyoung; Seo, Young-Su

    2016-02-01

    Members of the genus Burkholderia occupy remarkably diverse niches, with genome sizes ranging from ~3.75 to 11.29 Mbp. The genome of Burkholderia glumae ranges in size from ~5.81 to 7.89 Mbp. Unlike other plant pathogenic bacteria, B. glumae can infect a wide range of monocot and dicot plants. Comparative genome analysis of B. glumae strains can provide insight into genome variation as well as differential features of whole metabolism or pathways between multiple strains of B. glumae infecting the same host. Comparative analysis of complete genomes among B. glumae BGR1, B. glumae LMG 2196, and B. glumae PG1 revealed the largest departmentalization of genes onto separate replicons in B. glumae BGR1 and considerable downsizing of the genome in B. glumae LMG 2196. In addition, the presence of large-scale evolutionary events such as rearrangement and inversion and the development of highly specialized systems were found to be related to virulence-associated features in the three B. glumae strains. This connection may explain why this bacterium broadens its host range and reinforces its interaction with hosts.

  3. Transposon Insertions, Structural Variations, and SNPs Contribute to the Evolution of the Melon Genome.

    PubMed

    Sanseverino, Walter; Hénaff, Elizabeth; Vives, Cristina; Pinosio, Sara; Burgos-Paz, William; Morgante, Michele; Ramos-Onsins, Sebastián E; Garcia-Mas, Jordi; Casacuberta, Josep Maria

    2015-10-01

    The availability of extensive databases of crop genome sequences should allow analysis of crop variability at an unprecedented scale, which should have an important impact in plant breeding. However, up to now the analysis of genetic variability at the whole-genome scale has been mainly restricted to single nucleotide polymorphisms (SNPs). This is a strong limitation as structural variation (SV) and transposon insertion polymorphisms are frequent in plant species and have had an important mutational role in crop domestication and breeding. Here, we present the first comprehensive analysis of melon genetic diversity, which includes a detailed analysis of SNPs, SV, and transposon insertion polymorphisms. The variability found among seven melon varieties representing the species diversity and including wild accessions and highly breed lines, is relatively high due in part to the marked divergence of some lineages. The diversity is distributed nonuniformly across the genome, being lower at the extremes of the chromosomes and higher in the pericentromeric regions, which is compatible with the effect of purifying selection and recombination forces over functional regions. Additionally, this variability is greatly reduced among elite varieties, probably due to selection during breeding. We have found some chromosomal regions showing a high differentiation of the elite varieties versus the rest, which could be considered as strongly selected candidate regions. Our data also suggest that transposons and SV may be at the origin of an important fraction of the variability in melon, which highlights the importance of analyzing all types of genetic variability to understand crop genome evolution. © The Author 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

  4. Large-scale genomic analyses reveal the population structure and evolutionary trends of Streptococcus agalactiae strains in Brazilian fish farms.

    PubMed

    Barony, Gustavo M; Tavares, Guilherme C; Pereira, Felipe L; Carvalho, Alex F; Dorella, Fernanda A; Leal, Carlos A G; Figueiredo, Henrique C P

    2017-10-19

    Streptococcus agalactiae is a major pathogen and a hindrance on tilapia farming worldwide. The aims of this work were to analyze the genomic evolution of Brazilian strains of S. agalactiae and to establish spatial and temporal relations between strains isolated from different outbreaks of streptococcosis. A total of 39 strains were obtained from outbreaks and their whole genomes were sequenced and annotated for comparative analysis of multilocus sequence typing, genomic similarity and whole genome multilocus sequence typing (wgMLST). The Brazilian strains presented two sequence types, including a newly described ST, and a non-typeable lineage. The use of wgMLST could differentiate each strain in a single clone and was used to establish temporal and geographical correlations among strains. Bayesian phylogenomic analysis suggests that the studied Brazilian population was co-introduced in the country with their host, approximately 60 years ago. Brazilian strains of S. agalactiae were shown to be heterogeneous in their genome sequences and were distributed in different regions of the country according to their genotype, which allowed the use of wgMLST analysis to track each outbreak event individually.

  5. Research Guidelines in the Era of Large-scale Collaborations: An Analysis of Genome-wide Association Study Consortia

    PubMed Central

    Austin, Melissa A.; Hair, Marilyn S.; Fullerton, Stephanie M.

    2012-01-01

    Scientific research has shifted from studies conducted by single investigators to the creation of large consortia. Genetic epidemiologists, for example, now collaborate extensively for genome-wide association studies (GWAS). The effect has been a stream of confirmed disease-gene associations. However, effects on human subjects oversight, data-sharing, publication and authorship practices, research organization and productivity, and intellectual property remain to be examined. The aim of this analysis was to identify all research consortia that had published the results of a GWAS analysis since 2005, characterize them, determine which have publicly accessible guidelines for research practices, and summarize the policies in these guidelines. A review of the National Human Genome Research Institute’s Catalog of Published Genome-Wide Association Studies identified 55 GWAS consortia as of April 1, 2011. These consortia were comprised of individual investigators, research centers, studies, or other consortia and studied 48 different diseases or traits. Only 14 (25%) were found to have publicly accessible research guidelines on consortia websites. The available guidelines provide information on organization, governance, and research protocols; half address institutional review board approval. Details of publication, authorship, data-sharing, and intellectual property vary considerably. Wider access to consortia guidelines is needed to establish appropriate research standards with broad applicability to emerging forms of large-scale collaboration. PMID:22491085

  6. Comprehensive phylogeny of ray-finned fishes (Actinopterygii) based on transcriptomic and genomic data.

    PubMed

    Hughes, Lily C; Ortí, Guillermo; Huang, Yu; Sun, Ying; Baldwin, Carole C; Thompson, Andrew W; Arcila, Dahiana; Betancur-R, Ricardo; Li, Chenhong; Becker, Leandro; Bellora, Nicolás; Zhao, Xiaomeng; Li, Xiaofeng; Wang, Min; Fang, Chao; Xie, Bing; Zhou, Zhuocheng; Huang, Hai; Chen, Songlin; Venkatesh, Byrappa; Shi, Qiong

    2018-05-14

    Our understanding of phylogenetic relationships among bony fishes has been transformed by analysis of a small number of genes, but uncertainty remains around critical nodes. Genome-scale inferences so far have sampled a limited number of taxa and genes. Here we leveraged 144 genomes and 159 transcriptomes to investigate fish evolution with an unparalleled scale of data: >0.5 Mb from 1,105 orthologous exon sequences from 303 species, representing 66 out of 72 ray-finned fish orders. We apply phylogenetic tests designed to trace the effect of whole-genome duplication events on gene trees and find paralogy-free loci using a bioinformatics approach. Genome-wide data support the structure of the fish phylogeny, and hypothesis-testing procedures appropriate for phylogenomic datasets using explicit gene genealogy interrogation settle some long-standing uncertainties, such as the branching order at the base of the teleosts and among early euteleosts, and the sister lineage to the acanthomorph and percomorph radiations. Comprehensive fossil calibrations date the origin of all major fish lineages before the end of the Cretaceous.

  7. EUPAN enables pan-genome studies of a large number of eukaryotic genomes.

    PubMed

    Hu, Zhiqiang; Sun, Chen; Lu, Kuang-Chen; Chu, Xixia; Zhao, Yue; Lu, Jinyuan; Shi, Jianxin; Wei, Chaochun

    2017-08-01

    Pan-genome analyses are routinely carried out for bacteria to interpret the within-species gene presence/absence variations (PAVs). However, pan-genome analyses are rare for eukaryotes due to the large sizes and higher complexities of their genomes. Here we proposed EUPAN, a eukaryotic pan-genome analysis toolkit, enabling automatic large-scale eukaryotic pan-genome analyses and detection of gene PAVs at a relatively low sequencing depth. In the previous studies, we demonstrated the effectiveness and high accuracy of EUPAN in the pan-genome analysis of 453 rice genomes, in which we also revealed widespread gene PAVs among individual rice genomes. Moreover, EUPAN can be directly applied to the current re-sequencing projects primarily focusing on single nucleotide polymorphisms. EUPAN is implemented in Perl, R and C ++. It is supported under Linux and preferred for a computer cluster with LSF and SLURM job scheduling system. EUPAN together with its standard operating procedure (SOP) is freely available for non-commercial use (CC BY-NC 4.0) at http://cgm.sjtu.edu.cn/eupan/index.html . ccwei@sjtu.edu.cn or jianxin.shi@sjtu.edu.cn. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

  8. Microbial genomic island discovery, visualization and analysis.

    PubMed

    Bertelli, Claire; Tilley, Keith E; Brinkman, Fiona S L

    2018-06-03

    Horizontal gene transfer (also called lateral gene transfer) is a major mechanism for microbial genome evolution, enabling rapid adaptation and survival in specific niches. Genomic islands (GIs), commonly defined as clusters of bacterial or archaeal genes of probable horizontal origin, are of particular medical, environmental and/or industrial interest, as they disproportionately encode virulence factors and some antimicrobial resistance genes and may harbor entire metabolic pathways that confer a specific adaptation (solvent resistance, symbiosis properties, etc). As large-scale analyses of microbial genomes increases, such as for genomic epidemiology investigations of infectious disease outbreaks in public health, there is increased appreciation of the need to accurately predict and track GIs. Over the past decade, numerous computational tools have been developed to tackle the challenges inherent in accurate GI prediction. We review here the main types of GI prediction methods and discuss their advantages and limitations for a routine analysis of microbial genomes in this era of rapid whole-genome sequencing. An assessment is provided of 20 GI prediction software methods that use sequence-composition bias to identify the GIs, using a reference GI data set from 104 genomes obtained using an independent comparative genomics approach. Finally, we present guidelines to assist researchers in effectively identifying these key genomic regions.

  9. Successful application of FTA Classic Card technology and use of bacteriophage phi29 DNA polymerase for large-scale field sampling and cloning of complete maize streak virus genomes.

    PubMed

    Owor, Betty E; Shepherd, Dionne N; Taylor, Nigel J; Edema, Richard; Monjane, Adérito L; Thomson, Jennifer A; Martin, Darren P; Varsani, Arvind

    2007-03-01

    Leaf samples from 155 maize streak virus (MSV)-infected maize plants were collected from 155 farmers' fields in 23 districts in Uganda in May/June 2005 by leaf-pressing infected samples onto FTA Classic Cards. Viral DNA was successfully extracted from cards stored at room temperature for 9 months. The diversity of 127 MSV isolates was analysed by PCR-generated RFLPs. Six representative isolates having different RFLP patterns and causing either severe, moderate or mild disease symptoms, were chosen for amplification from FTA cards by bacteriophage phi29 DNA polymerase using the TempliPhi system. Full-length genomes were inserted into a cloning vector using a unique restriction enzyme site, and sequenced. The 1.3-kb PCR product amplified directly from FTA-eluted DNA and used for RFLP analysis was also cloned and sequenced. Comparison of cloned whole genome sequences with those of the original PCR products indicated that the correct virus genome had been cloned and that no errors were introduced by the phi29 polymerase. This is the first successful large-scale application of FTA card technology to the field, and illustrates the ease with which large numbers of infected samples can be collected and stored for downstream molecular applications such as diversity analysis and cloning of potentially new virus genomes.

  10. Incidental and clinically actionable genetic variants in 1005 whole exomes and genomes from Qatar.

    PubMed

    Jain, Abhinav; Gandhi, Shrey; Koshy, Remya; Scaria, Vinod

    2018-03-20

    Incidental findings in genomic data have been studied in great detail in the recent years, especially from population-scale data sets. However, little is known about the frequency of such findings in ethnic groups, specifically the Middle East, which were not previously covered in global sequencing studies. The availability of whole exome and genome data sets for a highly consanguineous Arab population from Qatar motivated us to explore the incidental findings in this population-scale data. The sequence data of 1005 Qatari individuals were systematically analyzed for incidental genetic variants in the 59 genes suggested by the American College of Medical Genetics and Genomics. We identified four genetic variants which were pathogenic or likely pathogenic. These variants occurred in six individuals, suggesting a frequency of 0.59% in the population, much lesser than that previously reported from European and African populations. Our analysis identified a variant in RYR1 gene associated with Malignant Hyperthermia that has significantly higher frequency in the population compared to global frequencies. Evaluation of the allele frequencies of these variants suggested enrichment in sub-populations, especially in individuals of Sub-Saharan African ancestry. The present study thereby provides the information on pathogenicity and frequency, which could aid in genomic medicine. To the best of our knowledge, this is the first comprehensive analysis of incidental genetic findings in any Arab population and suggests ethnic differences in incidental findings.

  11. Genome-scale estimate of the metabolic turnover of E. Coli from the energy balance analysis

    NASA Astrophysics Data System (ADS)

    De Martino, D.

    2016-02-01

    In this article the notion of metabolic turnover is revisited in the light of recent results of out-of-equilibrium thermodynamics. By means of Monte Carlo methods we perform an exact sampling of the enzymatic fluxes in a genome scale metabolic network of E. Coli in stationary growth conditions from which we infer the metabolites turnover times. However the latter are inferred from net fluxes, and we argue that this approximation is not valid for enzymes working nearby thermodynamic equilibrium. We recalculate turnover times from total fluxes by performing an energy balance analysis of the network and recurring to the fluctuation theorem. We find in many cases values one of order of magnitude lower, implying a faster picture of intermediate metabolism.

  12. A genome-scale Escherichia coli kinetic metabolic model k-ecoli457 satisfying flux data for multiple mutant strains

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Khodayari, Ali; Maranas, Costas D.

    Kinetic models of metabolism at a genome scale that faithfully recapitulate the effect of multiple genetic interventions would be transformative in our ability to reliably design novel overproducing microbial strains. Here, we introduce k-ecoli457, a genome-scale kinetic model of Escherichia coli metabolism that satisfies fluxomic data for wild-type and 25 mutant strains under different substrates and growth conditions. The k-ecoli457 model contains 457 model reactions, 337 metabolites and 295 substrate-level regulatory interactions. Parameterization is carried out using a genetic algorithm by simultaneously imposing all available fluxomic data (about 30 measured fluxes per mutant). Furthermore, the Pearson correlation coefficient between experimentalmore » data and predicted product yields for 320 engineered strains spanning 24 product metabolites is 0.84. This is substantially higher than that using flux balance analysis, minimization of metabolic adjustment or maximization of product yield exhibiting systematic errors with correlation coefficients of, respectively, 0.18, 0.37 and 0.47.« less

  13. A genome-scale Escherichia coli kinetic metabolic model k-ecoli457 satisfying flux data for multiple mutant strains

    DOE PAGES

    Khodayari, Ali; Maranas, Costas D.

    2016-12-20

    Kinetic models of metabolism at a genome scale that faithfully recapitulate the effect of multiple genetic interventions would be transformative in our ability to reliably design novel overproducing microbial strains. Here, we introduce k-ecoli457, a genome-scale kinetic model of Escherichia coli metabolism that satisfies fluxomic data for wild-type and 25 mutant strains under different substrates and growth conditions. The k-ecoli457 model contains 457 model reactions, 337 metabolites and 295 substrate-level regulatory interactions. Parameterization is carried out using a genetic algorithm by simultaneously imposing all available fluxomic data (about 30 measured fluxes per mutant). Furthermore, the Pearson correlation coefficient between experimentalmore » data and predicted product yields for 320 engineered strains spanning 24 product metabolites is 0.84. This is substantially higher than that using flux balance analysis, minimization of metabolic adjustment or maximization of product yield exhibiting systematic errors with correlation coefficients of, respectively, 0.18, 0.37 and 0.47.« less

  14. The Systems Biology Markup Language (SBML) Level 3 Package: Flux Balance Constraints.

    PubMed

    Olivier, Brett G; Bergmann, Frank T

    2015-09-04

    Constraint-based modeling is a well established modelling methodology used to analyze and study biological networks on both a medium and genome scale. Due to their large size, genome scale models are typically analysed using constraint-based optimization techniques. One widely used method is Flux Balance Analysis (FBA) which, for example, requires a modelling description to include: the definition of a stoichiometric matrix, an objective function and bounds on the values that fluxes can obtain at steady state. The Flux Balance Constraints (FBC) Package extends SBML Level 3 and provides a standardized format for the encoding, exchange and annotation of constraint-based models. It includes support for modelling concepts such as objective functions, flux bounds and model component annotation that facilitates reaction balancing. The FBC package establishes a base level for the unambiguous exchange of genome-scale, constraint-based models, that can be built upon by the community to meet future needs (e. g. by extending it to cover dynamic FBC models).

  15. The Systems Biology Markup Language (SBML) Level 3 Package: Flux Balance Constraints.

    PubMed

    Olivier, Brett G; Bergmann, Frank T

    2015-06-01

    Constraint-based modeling is a well established modelling methodology used to analyze and study biological networks on both a medium and genome scale. Due to their large size, genome scale models are typically analysed using constraint-based optimization techniques. One widely used method is Flux Balance Analysis (FBA) which, for example, requires a modelling description to include: the definition of a stoichiometric matrix, an objective function and bounds on the values that fluxes can obtain at steady state. The Flux Balance Constraints (FBC) Package extends SBML Level 3 and provides a standardized format for the encoding, exchange and annotation of constraint-based models. It includes support for modelling concepts such as objective functions, flux bounds and model component annotation that facilitates reaction balancing. The FBC package establishes a base level for the unambiguous exchange of genome-scale, constraint-based models, that can be built upon by the community to meet future needs (e. g. by extending it to cover dynamic FBC models).

  16. PATRIC: the Comprehensive Bacterial Bioinformatics Resource with a Focus on Human Pathogenic Species ▿ ‡ #

    PubMed Central

    Gillespie, Joseph J.; Wattam, Alice R.; Cammer, Stephen A.; Gabbard, Joseph L.; Shukla, Maulik P.; Dalay, Oral; Driscoll, Timothy; Hix, Deborah; Mane, Shrinivasrao P.; Mao, Chunhong; Nordberg, Eric K.; Scott, Mark; Schulman, Julie R.; Snyder, Eric E.; Sullivan, Daniel E.; Wang, Chunxia; Warren, Andrew; Williams, Kelly P.; Xue, Tian; Seung Yoo, Hyun; Zhang, Chengdong; Zhang, Yan; Will, Rebecca; Kenyon, Ronald W.; Sobral, Bruno W.

    2011-01-01

    Funded by the National Institute of Allergy and Infectious Diseases, the Pathosystems Resource Integration Center (PATRIC) is a genomics-centric relational database and bioinformatics resource designed to assist scientists in infectious-disease research. Specifically, PATRIC provides scientists with (i) a comprehensive bacterial genomics database, (ii) a plethora of associated data relevant to genomic analysis, and (iii) an extensive suite of computational tools and platforms for bioinformatics analysis. While the primary aim of PATRIC is to advance the knowledge underlying the biology of human pathogens, all publicly available genome-scale data for bacteria are compiled and continually updated, thereby enabling comparative analyses to reveal the basis for differences between infectious free-living and commensal species. Herein we summarize the major features available at PATRIC, dividing the resources into two major categories: (i) organisms, genomes, and comparative genomics and (ii) recurrent integration of community-derived associated data. Additionally, we present two experimental designs typical of bacterial genomics research and report on the execution of both projects using only PATRIC data and tools. These applications encompass a broad range of the data and analysis tools available, illustrating practical uses of PATRIC for the biologist. Finally, a summary of PATRIC's outreach activities, collaborative endeavors, and future research directions is provided. PMID:21896772

  17. Explaining human uniqueness: genome interactions with environment, behaviour and culture.

    PubMed

    Varki, Ajit; Geschwind, Daniel H; Eichler, Evan E

    2008-10-01

    What makes us human? Specialists in each discipline respond through the lens of their own expertise. In fact, 'anthropogeny' (explaining the origin of humans) requires a transdisciplinary approach that eschews such barriers. Here we take a genomic and genetic perspective towards molecular variation, explore systems analysis of gene expression and discuss an organ-systems approach. Rejecting any 'genes versus environment' dichotomy, we then consider genome interactions with environment, behaviour and culture, finally speculating that aspects of human uniqueness arose because of a primate evolutionary trend towards increasing and irreversible dependence on learned behaviours and culture - perhaps relaxing allowable thresholds for large-scale genomic diversity.

  18. Explaining human uniqueness: genome interactions with environment, behaviour and culture

    PubMed Central

    Varki, Ajit; Geschwind, Daniel H.; Eichler, Evan E.

    2009-01-01

    What makes us human? Specialists in each discipline respond through the lens of their own expertise. In fact, ‘anthropogeny’ (explaining the origin of humans) requires a transdisciplinary approach that eschews such barriers. Here we take a genomic and genetic perspective towards molecular variation, explore systems analysis of gene expression and discuss an organ-systems approach. Rejecting any ‘genes versus environment’ dichotomy, we then consider genome interactions with environment, behaviour and culture, finally speculating that aspects of human uniqueness arose because of a primate evolutionary trend towards increasing and irreversible dependence on learned behaviours and culture — perhaps relaxing allowable thresholds for large-scale genomic diversity. PMID:18802414

  19. Ethical considerations of research policy for personal genome analysis: the approach of the Genome Science Project in Japan.

    PubMed

    Minari, Jusaku; Shirai, Tetsuya; Kato, Kazuto

    2014-12-01

    As evidenced by high-throughput sequencers, genomic technologies have recently undergone radical advances. These technologies enable comprehensive sequencing of personal genomes considerably more efficiently and less expensively than heretofore. These developments present a challenge to the conventional framework of biomedical ethics; under these changing circumstances, each research project has to develop a pragmatic research policy. Based on the experience with a new large-scale project-the Genome Science Project-this article presents a novel approach to conducting a specific policy for personal genome research in the Japanese context. In creating an original informed-consent form template for the project, we present a two-tiered process: making the draft of the template following an analysis of national and international policies; refining the draft template in conjunction with genome project researchers for practical application. Through practical use of the template, we have gained valuable experience in addressing challenges in the ethical review process, such as the importance of sharing details of the latest developments in genomics with members of research ethics committees. We discuss certain limitations of the conventional concept of informed consent and its governance system and suggest the potential of an alternative process using information technology.

  20. A genomic scale map of genetic diversity in Trypanosoma cruzi

    PubMed Central

    2012-01-01

    Background Trypanosoma cruzi, the causal agent of Chagas Disease, affects more than 16 million people in Latin America. The clinical outcome of the disease results from a complex interplay between environmental factors and the genetic background of both the human host and the parasite. However, knowledge of the genetic diversity of the parasite, is currently limited to a number of highly studied loci. The availability of a number of genomes from different evolutionary lineages of T. cruzi provides an unprecedented opportunity to look at the genetic diversity of the parasite at a genomic scale. Results Using a bioinformatic strategy, we have clustered T. cruzi sequence data available in the public domain and obtained multiple sequence alignments in which one or two alleles from the reference CL-Brener were included. These data covers 4 major evolutionary lineages (DTUs): TcI, TcII, TcIII, and the hybrid TcVI. Using these set of alignments we have identified 288,957 high quality single nucleotide polymorphisms and 1,480 indels. In a reduced re-sequencing study we were able to validate ~ 97% of high-quality SNPs identified in 47 loci. Analysis of how these changes affect encoded protein products showed a 0.77 ratio of synonymous to non-synonymous changes in the T. cruzi genome. We observed 113 changes that introduce or remove a stop codon, some causing significant functional changes, and a number of tri-allelic and tetra-allelic SNPs that could be exploited in strain typing assays. Based on an analysis of the observed nucleotide diversity we show that the T. cruzi genome contains a core set of genes that are under apparent purifying selection. Interestingly, orthologs of known druggable targets show statistically significant lower nucleotide diversity values. Conclusions This study provides the first look at the genetic diversity of T. cruzi at a genomic scale. The analysis covers an estimated ~ 60% of the genetic diversity present in the population, providing an essential resource for future studies on the development of new drugs and diagnostics, for Chagas Disease. These data is available through the TcSNP database (http://snps.tcruzi.org). PMID:23270511

  1. In Depth Characterization of Repetitive DNA in 23 Plant Genomes Reveals Sources of Genome Size Variation in the Legume Tribe Fabeae.

    PubMed

    Macas, Jiří; Novák, Petr; Pellicer, Jaume; Čížková, Jana; Koblížková, Andrea; Neumann, Pavel; Fuková, Iva; Doležel, Jaroslav; Kelly, Laura J; Leitch, Ilia J

    2015-01-01

    The differential accumulation and elimination of repetitive DNA are key drivers of genome size variation in flowering plants, yet there have been few studies which have analysed how different types of repeats in related species contribute to genome size evolution within a phylogenetic context. This question is addressed here by conducting large-scale comparative analysis of repeats in 23 species from four genera of the monophyletic legume tribe Fabeae, representing a 7.6-fold variation in genome size. Phylogenetic analysis and genome size reconstruction revealed that this diversity arose from genome size expansions and contractions in different lineages during the evolution of Fabeae. Employing a combination of low-pass genome sequencing with novel bioinformatic approaches resulted in identification and quantification of repeats making up 55-83% of the investigated genomes. In turn, this enabled an analysis of how each major repeat type contributed to the genome size variation encountered. Differential accumulation of repetitive DNA was found to account for 85% of the genome size differences between the species, and most (57%) of this variation was found to be driven by a single lineage of Ty3/gypsy LTR-retrotransposons, the Ogre elements. Although the amounts of several other lineages of LTR-retrotransposons and the total amount of satellite DNA were also positively correlated with genome size, their contributions to genome size variation were much smaller (up to 6%). Repeat analysis within a phylogenetic framework also revealed profound differences in the extent of sequence conservation between different repeat types across Fabeae. In addition to these findings, the study has provided a proof of concept for the approach combining recent developments in sequencing and bioinformatics to perform comparative analyses of repetitive DNAs in a large number of non-model species without the need to assemble their genomes.

  2. Predicting Survival within the Lung Cancer Histopathological Hierarchy Using a Multi-Scale Genomic Model of Development

    PubMed Central

    Liu, Hongye; Kho, Alvin T; Kohane, Isaac S; Sun, Yao

    2006-01-01

    Background The histopathologic heterogeneity of lung cancer remains a significant confounding factor in its diagnosis and prognosis—spurring numerous recent efforts to find a molecular classification of the disease that has clinical relevance. Methods and Findings Molecular profiles of tumors from 186 patients representing four different lung cancer subtypes (and 17 normal lung tissue samples) were compared with a mouse lung development model using principal component analysis in both temporal and genomic domains. An algorithm for the classification of lung cancers using a multi-scale developmental framework was developed. Kaplan–Meier survival analysis was conducted for lung adenocarcinoma patient subgroups identified via their developmental association. We found multi-scale genomic similarities between four human lung cancer subtypes and the developing mouse lung that are prognostically meaningful. Significant association was observed between the localization of human lung cancer cases along the principal mouse lung development trajectory and the corresponding patient survival rate at three distinct levels of classical histopathologic resolution: among different lung cancer subtypes, among patients within the adenocarcinoma subtype, and within the stage I adenocarcinoma subclass. The earlier the genomic association between a human tumor profile and the mouse lung development sequence, the poorer the patient's prognosis. Furthermore, decomposing this principal lung development trajectory identified a gene set that was significantly enriched for pyrimidine metabolism and cell-adhesion functions specific to lung development and oncogenesis. Conclusions From a multi-scale disease modeling perspective, the molecular dynamics of murine lung development provide an effective framework that is not only data driven but also informed by the biology of development for elucidating the mechanisms of human lung cancer biology and its clinical outcome. PMID:16800721

  3. Functional RNA structures throughout the Hepatitis C Virus genome.

    PubMed

    Adams, Rebecca L; Pirakitikulr, Nathan; Pyle, Anna Marie

    2017-06-01

    The single-stranded Hepatitis C Virus (HCV) genome adopts a set of elaborate RNA structures that are involved in every stage of the viral lifecycle. Recent advances in chemical probing, sequencing, and structural biology have facilitated analysis of RNA folding on a genome-wide scale, revealing novel structures and networks of interactions. These studies have underscored the active role played by RNA in every function of HCV and they open the door to new types of RNA-targeted therapeutics. Copyright © 2017 Elsevier B.V. All rights reserved.

  4. ChIP-Seq Analysis for Identifying Genome-Wide Histone Modifications Associated with Stress-Responsive Genes in Plants.

    PubMed

    Li, Guosheng; Jagadeeswaran, Guru; Mort, Andrew; Sunkar, Ramanjulu

    2017-01-01

    Histone modifications represent the crux of epigenetic gene regulation essential for most biological processes including abiotic stress responses in plants. Thus, identification of histone modifications at the genome-scale can provide clues for how some genes are 'turned-on' while some others are "turned-off" in response to stress. This chapter details a step-by-step protocol for identifying genome-wide histone modifications associated with stress-responsive gene regulation using chromatin immunoprecipitation (ChIP) followed by sequencing of the DNA (ChIP-seq).

  5. Complex multifractal nature in Mycobacterium tuberculosis genome

    PubMed Central

    Mandal, Saurav; Roychowdhury, Tanmoy; Chirom, Keilash; Bhattacharya, Alok; Brojen Singh, R. K.

    2017-01-01

    The mutifractal and long range correlation (C(r)) properties of strings, such as nucleotide sequence can be a useful parameter for identification of underlying patterns and variations. In this study C(r) and multifractal singularity function f(α) have been used to study variations in the genomes of a pathogenic bacteria Mycobacterium tuberculosis. Genomic sequences of M. tuberculosis isolates displayed significant variations in C(r) and f(α) reflecting inherent differences in sequences among isolates. M. tuberculosis isolates can be categorised into different subgroups based on sensitivity to drugs, these are DS (drug sensitive isolates), MDR (multi-drug resistant isolates) and XDR (extremely drug resistant isolates). C(r) follows significantly different scaling rules in different subgroups of isolates, but all the isolates follow one parameter scaling law. The richness in complexity of each subgroup can be quantified by the measures of multifractal parameters displaying a pattern in which XDR isolates have highest value and lowest for drug sensitive isolates. Therefore C(r) and multifractal functions can be useful parameters for analysis of genomic sequences. PMID:28440326

  6. Fine-scale population structure and the era of next-generation sequencing.

    PubMed

    Henn, Brenna M; Gravel, Simon; Moreno-Estrada, Andres; Acevedo-Acevedo, Suehelay; Bustamante, Carlos D

    2010-10-15

    Fine-scale population structure characterizes most continents and is especially pronounced in non-cosmopolitan populations. Roughly half of the world's population remains non-cosmopolitan and even populations within cities often assort along ethnic and linguistic categories. Barriers to random mating can be ecologically extreme, such as the Sahara Desert, or cultural, such as the Indian caste system. In either case, subpopulations accumulate genetic differences if the barrier is maintained over multiple generations. Genome-wide polymorphism data, initially with only a few hundred autosomal microsatellites, have clearly established differences in allele frequency not only among continental regions, but also within continents and within countries. We review recent evidence from the analysis of genome-wide polymorphism data for genetic boundaries delineating human population structure and the main demographic and genomic processes shaping variation, and discuss the implications of population structure for the distribution and discovery of disease-causing genetic variants, in the light of the imminent availability of sequencing data for a multitude of diverse human genomes.

  7. Resources for Functional Genomics Studies in Drosophila melanogaster

    PubMed Central

    Mohr, Stephanie E.; Hu, Yanhui; Kim, Kevin; Housden, Benjamin E.; Perrimon, Norbert

    2014-01-01

    Drosophila melanogaster has become a system of choice for functional genomic studies. Many resources, including online databases and software tools, are now available to support design or identification of relevant fly stocks and reagents or analysis and mining of existing functional genomic, transcriptomic, proteomic, etc. datasets. These include large community collections of fly stocks and plasmid clones, “meta” information sites like FlyBase and FlyMine, and an increasing number of more specialized reagents, databases, and online tools. Here, we introduce key resources useful to plan large-scale functional genomics studies in Drosophila and to analyze, integrate, and mine the results of those studies in ways that facilitate identification of highest-confidence results and generation of new hypotheses. We also discuss ways in which existing resources can be used and might be improved and suggest a few areas of future development that would further support large- and small-scale studies in Drosophila and facilitate use of Drosophila information by the research community more generally. PMID:24653003

  8. Complex multifractal nature in Mycobacterium tuberculosis genome

    NASA Astrophysics Data System (ADS)

    Mandal, Saurav; Roychowdhury, Tanmoy; Chirom, Keilash; Bhattacharya, Alok; Brojen Singh, R. K.

    2017-04-01

    The mutifractal and long range correlation (C(r)) properties of strings, such as nucleotide sequence can be a useful parameter for identification of underlying patterns and variations. In this study C(r) and multifractal singularity function f(α) have been used to study variations in the genomes of a pathogenic bacteria Mycobacterium tuberculosis. Genomic sequences of M. tuberculosis isolates displayed significant variations in C(r) and f(α) reflecting inherent differences in sequences among isolates. M. tuberculosis isolates can be categorised into different subgroups based on sensitivity to drugs, these are DS (drug sensitive isolates), MDR (multi-drug resistant isolates) and XDR (extremely drug resistant isolates). C(r) follows significantly different scaling rules in different subgroups of isolates, but all the isolates follow one parameter scaling law. The richness in complexity of each subgroup can be quantified by the measures of multifractal parameters displaying a pattern in which XDR isolates have highest value and lowest for drug sensitive isolates. Therefore C(r) and multifractal functions can be useful parameters for analysis of genomic sequences.

  9. Whole-genome analyses resolve early branches in the tree of life of modern birds

    PubMed Central

    Jarvis, Erich D.; Mirarab, Siavash; Aberer, Andre J.; Li, Bo; Houde, Peter; Li, Cai; Ho, Simon Y. W.; Faircloth, Brant C.; Nabholz, Benoit; Howard, Jason T.; Suh, Alexander; Weber, Claudia C.; da Fonseca, Rute R.; Li, Jianwen; Zhang, Fang; Li, Hui; Zhou, Long; Narula, Nitish; Liu, Liang; Ganapathy, Ganesh; Boussau, Bastien; Bayzid, Md. Shamsuzzoha; Zavidovych, Volodymyr; Subramanian, Sankar; Gabaldón, Toni; Capella-Gutiérrez, Salvador; Huerta-Cepas, Jaime; Rekepalli, Bhanu; Munch, Kasper; Schierup, Mikkel; Lindow, Bent; Warren, Wesley C.; Ray, David; Green, Richard E.; Bruford, Michael W.; Zhan, Xiangjiang; Dixon, Andrew; Li, Shengbin; Li, Ning; Huang, Yinhua; Derryberry, Elizabeth P.; Bertelsen, Mads Frost; Sheldon, Frederick H.; Brumfield, Robb T.; Mello, Claudio V.; Lovell, Peter V.; Wirthlin, Morgan; Schneider, Maria Paula Cruz; Prosdocimi, Francisco; Samaniego, José Alfredo; Velazquez, Amhed Missael Vargas; Alfaro-Núñez, Alonzo; Campos, Paula F.; Petersen, Bent; Sicheritz-Ponten, Thomas; Pas, An; Bailey, Tom; Scofield, Paul; Bunce, Michael; Lambert, David M.; Zhou, Qi; Perelman, Polina; Driskell, Amy C.; Shapiro, Beth; Xiong, Zijun; Zeng, Yongli; Liu, Shiping; Li, Zhenyu; Liu, Binghang; Wu, Kui; Xiao, Jin; Yinqi, Xiong; Zheng, Qiuemei; Zhang, Yong; Yang, Huanming; Wang, Jian; Smeds, Linnea; Rheindt, Frank E.; Braun, Michael; Fjeldsa, Jon; Orlando, Ludovic; Barker, F. Keith; Jønsson, Knud Andreas; Johnson, Warren; Koepfli, Klaus-Peter; O’Brien, Stephen; Haussler, David; Ryder, Oliver A.; Rahbek, Carsten; Willerslev, Eske; Graves, Gary R.; Glenn, Travis C.; McCormack, John; Burt, Dave; Ellegren, Hans; Alström, Per; Edwards, Scott V.; Stamatakis, Alexandros; Mindell, David P.; Cracraft, Joel; Braun, Edward L.; Warnow, Tandy; Jun, Wang; Gilbert, M. Thomas P.; Zhang, Guojie

    2015-01-01

    To better determine the history of modern birds, we performed a genome-scale phylogenetic analysis of 48 species representing all orders of Neoaves using phylogenomic methods created to handle genome-scale data. We recovered a highly resolved tree that confirms previously controversial sister or close relationships. We identified the first divergence in Neoaves, two groups we named Passerea and Columbea, representing independent lineages of diverse and convergently evolved land and water bird species. Among Passerea, we infer the common ancestor of core landbirds to have been an apex predator and confirm independent gains of vocal learning. Among Columbea, we identify pigeons and flamingoes as belonging to sister clades. Even with whole genomes, some of the earliest branches in Neoaves proved challenging to resolve, which was best explained by massive protein-coding sequence convergence and high levels of incomplete lineage sorting that occurred during a rapid radiation after the Cretaceous-Paleogene mass extinction event about 66 million years ago. PMID:25504713

  10. Whole-genome analyses resolve early branches in the tree of life of modern birds.

    PubMed

    Jarvis, Erich D; Mirarab, Siavash; Aberer, Andre J; Li, Bo; Houde, Peter; Li, Cai; Ho, Simon Y W; Faircloth, Brant C; Nabholz, Benoit; Howard, Jason T; Suh, Alexander; Weber, Claudia C; da Fonseca, Rute R; Li, Jianwen; Zhang, Fang; Li, Hui; Zhou, Long; Narula, Nitish; Liu, Liang; Ganapathy, Ganesh; Boussau, Bastien; Bayzid, Md Shamsuzzoha; Zavidovych, Volodymyr; Subramanian, Sankar; Gabaldón, Toni; Capella-Gutiérrez, Salvador; Huerta-Cepas, Jaime; Rekepalli, Bhanu; Munch, Kasper; Schierup, Mikkel; Lindow, Bent; Warren, Wesley C; Ray, David; Green, Richard E; Bruford, Michael W; Zhan, Xiangjiang; Dixon, Andrew; Li, Shengbin; Li, Ning; Huang, Yinhua; Derryberry, Elizabeth P; Bertelsen, Mads Frost; Sheldon, Frederick H; Brumfield, Robb T; Mello, Claudio V; Lovell, Peter V; Wirthlin, Morgan; Schneider, Maria Paula Cruz; Prosdocimi, Francisco; Samaniego, José Alfredo; Vargas Velazquez, Amhed Missael; Alfaro-Núñez, Alonzo; Campos, Paula F; Petersen, Bent; Sicheritz-Ponten, Thomas; Pas, An; Bailey, Tom; Scofield, Paul; Bunce, Michael; Lambert, David M; Zhou, Qi; Perelman, Polina; Driskell, Amy C; Shapiro, Beth; Xiong, Zijun; Zeng, Yongli; Liu, Shiping; Li, Zhenyu; Liu, Binghang; Wu, Kui; Xiao, Jin; Yinqi, Xiong; Zheng, Qiuemei; Zhang, Yong; Yang, Huanming; Wang, Jian; Smeds, Linnea; Rheindt, Frank E; Braun, Michael; Fjeldsa, Jon; Orlando, Ludovic; Barker, F Keith; Jønsson, Knud Andreas; Johnson, Warren; Koepfli, Klaus-Peter; O'Brien, Stephen; Haussler, David; Ryder, Oliver A; Rahbek, Carsten; Willerslev, Eske; Graves, Gary R; Glenn, Travis C; McCormack, John; Burt, Dave; Ellegren, Hans; Alström, Per; Edwards, Scott V; Stamatakis, Alexandros; Mindell, David P; Cracraft, Joel; Braun, Edward L; Warnow, Tandy; Jun, Wang; Gilbert, M Thomas P; Zhang, Guojie

    2014-12-12

    To better determine the history of modern birds, we performed a genome-scale phylogenetic analysis of 48 species representing all orders of Neoaves using phylogenomic methods created to handle genome-scale data. We recovered a highly resolved tree that confirms previously controversial sister or close relationships. We identified the first divergence in Neoaves, two groups we named Passerea and Columbea, representing independent lineages of diverse and convergently evolved land and water bird species. Among Passerea, we infer the common ancestor of core landbirds to have been an apex predator and confirm independent gains of vocal learning. Among Columbea, we identify pigeons and flamingoes as belonging to sister clades. Even with whole genomes, some of the earliest branches in Neoaves proved challenging to resolve, which was best explained by massive protein-coding sequence convergence and high levels of incomplete lineage sorting that occurred during a rapid radiation after the Cretaceous-Paleogene mass extinction event about 66 million years ago. Copyright © 2014, American Association for the Advancement of Science.

  11. Pore-scale simulation of microbial growth using a genome-scale metabolic model: Implications for Darcy-scale reactive transport

    NASA Astrophysics Data System (ADS)

    Tartakovsky, G. D.; Tartakovsky, A. M.; Scheibe, T. D.; Fang, Y.; Mahadevan, R.; Lovley, D. R.

    2013-09-01

    Recent advances in microbiology have enabled the quantitative simulation of microbial metabolism and growth based on genome-scale characterization of metabolic pathways and fluxes. We have incorporated a genome-scale metabolic model of the iron-reducing bacteria Geobacter sulfurreducens into a pore-scale simulation of microbial growth based on coupling of iron reduction to oxidation of a soluble electron donor (acetate). In our model, fluid flow and solute transport is governed by a combination of the Navier-Stokes and advection-diffusion-reaction equations. Microbial growth occurs only on the surface of soil grains where solid-phase mineral iron oxides are available. Mass fluxes of chemical species associated with microbial growth are described by the genome-scale microbial model, implemented using a constraint-based metabolic model, and provide the Robin-type boundary condition for the advection-diffusion equation at soil grain surfaces. Conventional models of microbially-mediated subsurface reactions use a lumped reaction model that does not consider individual microbial reaction pathways, and describe reactions rates using empirically-derived rate formulations such as the Monod-type kinetics. We have used our pore-scale model to explore the relationship between genome-scale metabolic models and Monod-type formulations, and to assess the manifestation of pore-scale variability (microenvironments) in terms of apparent Darcy-scale microbial reaction rates. The genome-scale model predicted lower biomass yield, and different stoichiometry for iron consumption, in comparison to prior Monod formulations based on energetics considerations. We were able to fit an equivalent Monod model, by modifying the reaction stoichiometry and biomass yield coefficient, that could effectively match results of the genome-scale simulation of microbial behaviors under excess nutrient conditions, but predictions of the fitted Monod model deviated from those of the genome-scale model under conditions in which one or more nutrients were limiting. The fitted Monod kinetic model was also applied at the Darcy scale; that is, to simulate average reaction processes at the scale of the entire pore-scale model domain. As we expected, even under excess nutrient conditions for which the Monod and genome-scale models predicted equal reaction rates at the pore scale, the Monod model over-predicted the rates of biomass growth and iron and acetate utilization when applied at the Darcy scale. This discrepancy is caused by an inherent assumption of perfect mixing over the Darcy-scale domain, which is clearly violated in the pore-scale models. These results help to explain the need to modify the flux constraint parameters in order to match observations in previous applications of the genome-scale model at larger scales. These results also motivate further investigation of quantitative multi-scale relationships between fundamental behavior at the pore scale (where genome-scale models are appropriately applied) and observed behavior at larger scales (where predictions of reactive transport phenomena are needed).

  12. Pore-scale simulation of microbial growth using a genome-scale metabolic model: Implications for Darcy-scale reactive transport

    NASA Astrophysics Data System (ADS)

    Scheibe, T. D.; Tartakovsky, G.; Tartakovsky, A. M.; Fang, Y.; Mahadevan, R.; Lovley, D. R.

    2012-12-01

    Recent advances in microbiology have enabled the quantitative simulation of microbial metabolism and growth based on genome-scale characterization of metabolic pathways and fluxes. We have incorporated a genome-scale metabolic model of the iron-reducing bacteria Geobacter sulfurreducens into a pore-scale simulation of microbial growth based on coupling of iron reduction to oxidation of a soluble electron donor (acetate). In our model, fluid flow and solute transport is governed by a combination of the Navier-Stokes and advection-diffusion-reaction equations. Microbial growth occurs only on the surface of soil grains where solid-phase mineral iron oxides are available. Mass fluxes of chemical species associated with microbial growth are described by the genome-scale microbial model, implemented using a constraint-based metabolic model, and provide the Robin-type boundary condition for the advection-diffusion equation at soil grain surfaces. Conventional models of microbially-mediated subsurface reactions use a lumped reaction model that does not consider individual microbial reaction pathways, and describe reactions rates using empirically-derived rate formulations such as the Monod-type kinetics. We have used our pore-scale model to explore the relationship between genome-scale metabolic models and Monod-type formulations, and to assess the manifestation of pore-scale variability (microenvironments) in terms of apparent Darcy-scale microbial reaction rates. The genome-scale model predicted lower biomass yield, and different stoichiometry for iron consumption, in comparison to prior Monod formulations based on energetics considerations. We were able to fit an equivalent Monod model, by modifying the reaction stoichiometry and biomass yield coefficient, that could effectively match results of the genome-scale simulation of microbial behaviors under excess nutrient conditions, but predictions of the fitted Monod model deviated from those of the genome-scale model under conditions in which one or more nutrients were limiting. The fitted Monod kinetic model was also applied at the Darcy scale; that is, to simulate average reaction processes at the scale of the entire pore-scale model domain. As we expected, even under excess nutrient conditions for which the Monod and genome-scale models predicted equal reaction rates at the pore scale, the Monod model over-predicted the rates of biomass growth and iron and acetate utilization when applied at the Darcy scale. This discrepancy is caused by an inherent assumption of perfect mixing over the Darcy-scale domain, which is clearly violated in the pore-scale models. These results help to explain the need to modify the flux constraint parameters in order to match observations in previous applications of the genome-scale model at larger scales. These results also motivate further investigation of quantitative multi-scale relationships between fundamental behavior at the pore scale (where genome-scale models are appropriately applied) and observed behavior at larger scales (where predictions of reactive transport phenomena are needed).

  13. Pore-scale simulation of microbial growth using a genome-scale metabolic model: Implications for Darcy-scale reactive transport

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tartakovsky, Guzel D.; Tartakovsky, Alexandre M.; Scheibe, Timothy D.

    2013-09-07

    Recent advances in microbiology have enabled the quantitative simulation of microbial metabolism and growth based on genome-scale characterization of metabolic pathways and fluxes. We have incorporated a genome-scale metabolic model of the iron-reducing bacteria Geobacter sulfurreducens into a pore-scale simulation of microbial growth based on coupling of iron reduction to oxidation of a soluble electron donor (acetate). In our model, fluid flow and solute transport is governed by a combination of the Navier-Stokes and advection-diffusion-reaction equations. Microbial growth occurs only on the surface of soil grains where solid-phase mineral iron oxides are available. Mass fluxes of chemical species associated withmore » microbial growth are described by the genome-scale microbial model, implemented using a constraint-based metabolic model, and provide the Robin-type boundary condition for the advection-diffusion equation at soil grain surfaces. Conventional models of microbially-mediated subsurface reactions use a lumped reaction model that does not consider individual microbial reaction pathways, and describe reactions rates using empirically-derived rate formulations such as the Monod-type kinetics. We have used our pore-scale model to explore the relationship between genome-scale metabolic models and Monod-type formulations, and to assess the manifestation of pore-scale variability (microenvironments) in terms of apparent Darcy-scale microbial reaction rates. The genome-scale model predicted lower biomass yield, and different stoichiometry for iron consumption, in comparisonto prior Monod formulations based on energetics considerations. We were able to fit an equivalent Monod model, by modifying the reaction stoichiometry and biomass yield coefficient, that could effectively match results of the genome-scale simulation of microbial behaviors under excess nutrient conditions, but predictions of the fitted Monod model deviated from those of the genome-scale model under conditions in which one or more nutrients were limiting. The fitted Monod kinetic model was also applied at the Darcy scale; that is, to simulate average reaction processes at the scale of the entire pore-scale model domain. As we expected, even under excess nutrient conditions for which the Monod and genome-scale models predicted equal reaction rates at the pore scale, the Monod model over-predicted the rates of biomass growth and iron and acetate utilization when applied at the Darcy scale. This discrepancy is caused by an inherent assumption of perfect mixing over the Darcy-scale domain, which is clearly violated in the pore-scale models. These results help to explain the need to modify the flux constraint parameters in order to match observations in previous applications of the genome-scale model at larger scales. These results also motivate further investigation of quantitative multi-scale relationships between fundamental behavior at the pore scale (where genome-scale models are appropriately applied) and observed behavior at larger scales (where predictions of reactive transport phenomena are needed).« less

  14. Regulatory sequence analysis tools.

    PubMed

    van Helden, Jacques

    2003-07-01

    The web resource Regulatory Sequence Analysis Tools (RSAT) (http://rsat.ulb.ac.be/rsat) offers a collection of software tools dedicated to the prediction of regulatory sites in non-coding DNA sequences. These tools include sequence retrieval, pattern discovery, pattern matching, genome-scale pattern matching, feature-map drawing, random sequence generation and other utilities. Alternative formats are supported for the representation of regulatory motifs (strings or position-specific scoring matrices) and several algorithms are proposed for pattern discovery. RSAT currently holds >100 fully sequenced genomes and these data are regularly updated from GenBank.

  15. Automated multiplex genome-scale engineering in yeast

    PubMed Central

    Si, Tong; Chao, Ran; Min, Yuhao; Wu, Yuying; Ren, Wen; Zhao, Huimin

    2017-01-01

    Genome-scale engineering is indispensable in understanding and engineering microorganisms, but the current tools are mainly limited to bacterial systems. Here we report an automated platform for multiplex genome-scale engineering in Saccharomyces cerevisiae, an important eukaryotic model and widely used microbial cell factory. Standardized genetic parts encoding overexpression and knockdown mutations of >90% yeast genes are created in a single step from a full-length cDNA library. With the aid of CRISPR-Cas, these genetic parts are iteratively integrated into the repetitive genomic sequences in a modular manner using robotic automation. This system allows functional mapping and multiplex optimization on a genome scale for diverse phenotypes including cellulase expression, isobutanol production, glycerol utilization and acetic acid tolerance, and may greatly accelerate future genome-scale engineering endeavours in yeast. PMID:28469255

  16. SEED Servers: High-Performance Access to the SEED Genomes, Annotations, and Metabolic Models

    PubMed Central

    Aziz, Ramy K.; Devoid, Scott; Disz, Terrence; Edwards, Robert A.; Henry, Christopher S.; Olsen, Gary J.; Olson, Robert; Overbeek, Ross; Parrello, Bruce; Pusch, Gordon D.; Stevens, Rick L.; Vonstein, Veronika; Xia, Fangfang

    2012-01-01

    The remarkable advance in sequencing technology and the rising interest in medical and environmental microbiology, biotechnology, and synthetic biology resulted in a deluge of published microbial genomes. Yet, genome annotation, comparison, and modeling remain a major bottleneck to the translation of sequence information into biological knowledge, hence computational analysis tools are continuously being developed for rapid genome annotation and interpretation. Among the earliest, most comprehensive resources for prokaryotic genome analysis, the SEED project, initiated in 2003 as an integration of genomic data and analysis tools, now contains >5,000 complete genomes, a constantly updated set of curated annotations embodied in a large and growing collection of encoded subsystems, a derived set of protein families, and hundreds of genome-scale metabolic models. Until recently, however, maintaining current copies of the SEED code and data at remote locations has been a pressing issue. To allow high-performance remote access to the SEED database, we developed the SEED Servers (http://www.theseed.org/servers): four network-based servers intended to expose the data in the underlying relational database, support basic annotation services, offer programmatic access to the capabilities of the RAST annotation server, and provide access to a growing collection of metabolic models that support flux balance analysis. The SEED servers offer open access to regularly updated data, the ability to annotate prokaryotic genomes, the ability to create metabolic reconstructions and detailed models of metabolism, and access to hundreds of existing metabolic models. This work offers and supports a framework upon which other groups can build independent research efforts. Large integrations of genomic data represent one of the major intellectual resources driving research in biology, and programmatic access to the SEED data will provide significant utility to a broad collection of potential users. PMID:23110173

  17. Metagenomic analysis reveals a green sulfur bacterium as a potential coral symbiont.

    PubMed

    Cai, Lin; Zhou, Guowei; Tian, Ren-Mao; Tong, Haoya; Zhang, Weipeng; Sun, Jin; Ding, Wei; Wong, Yue Him; Xie, James Y; Qiu, Jian-Wen; Liu, Sheng; Huang, Hui; Qian, Pei-Yuan

    2017-08-24

    Coral reefs are ecologically significant habitats. Coral-algal symbiosis confers ecological success on coral reefs and coral-microbial symbiosis is also vital to coral reefs. However, current understanding of coral-microbial symbiosis on a genomic scale is largely unknown. Here we report a potential microbial symbiont in corals revealed by metagenomics-based genomic study. Microbial cells in coral were enriched for metagenomic analysis and a high-quality draft genome of "Candidatus Prosthecochloris korallensis" was recovered by metagenome assembly and genome binning. Phylogenetic analysis shows "Ca. P. korallensis" belongs to the Prosthecochloris clade and is clustered with two Prosthecochloris clones derived from Caribbean corals. Genomic analysis reveals "Ca. P. korallensis" has potentially important ecological functions including anoxygenic photosynthesis, carbon fixation via the reductive tricarboxylic acid (rTCA) cycle, nitrogen fixation, and sulfur oxidization. Core metabolic pathway analysis suggests "Ca. P. korallensis" is a green sulfur bacterium capable of photoautotrophy or mixotrophy. Potential host-microbial interaction reveals a symbiotic relationship: "Ca. P. korallensis" might provide organic and nitrogenous nutrients to its host and detoxify sulfide for the host; the host might provide "Ca. P. korallensis" with an anaerobic environment for survival, carbon dioxide and acetate for growth, and hydrogen sulfide as an electron donor for photosynthesis.

  18. BioNano genome mapping of individual chromosomes supports physical mapping and sequence assembly in complex plant genomes.

    PubMed

    Staňková, Helena; Hastie, Alex R; Chan, Saki; Vrána, Jan; Tulpová, Zuzana; Kubaláková, Marie; Visendi, Paul; Hayashi, Satomi; Luo, Mingcheng; Batley, Jacqueline; Edwards, David; Doležel, Jaroslav; Šimková, Hana

    2016-07-01

    The assembly of a reference genome sequence of bread wheat is challenging due to its specific features such as the genome size of 17 Gbp, polyploid nature and prevalence of repetitive sequences. BAC-by-BAC sequencing based on chromosomal physical maps, adopted by the International Wheat Genome Sequencing Consortium as the key strategy, reduces problems caused by the genome complexity and polyploidy, but the repeat content still hampers the sequence assembly. Availability of a high-resolution genomic map to guide sequence scaffolding and validate physical map and sequence assemblies would be highly beneficial to obtaining an accurate and complete genome sequence. Here, we chose the short arm of chromosome 7D (7DS) as a model to demonstrate for the first time that it is possible to couple chromosome flow sorting with genome mapping in nanochannel arrays and create a de novo genome map of a wheat chromosome. We constructed a high-resolution chromosome map composed of 371 contigs with an N50 of 1.3 Mb. Long DNA molecules achieved by our approach facilitated chromosome-scale analysis of repetitive sequences and revealed a ~800-kb array of tandem repeats intractable to current DNA sequencing technologies. Anchoring 7DS sequence assemblies obtained by clone-by-clone sequencing to the 7DS genome map provided a valuable tool to improve the BAC-contig physical map and validate sequence assembly on a chromosome-arm scale. Our results indicate that creating genome maps for the whole wheat genome in a chromosome-by-chromosome manner is feasible and that they will be an affordable tool to support the production of improved pseudomolecules. © 2016 The Authors. Plant Biotechnology Journal published by Society for Experimental Biology and The Association of Applied Biologists and John Wiley & Sons Ltd.

  19. A metabolite-centric view on flux distributions in genome-scale metabolic models

    PubMed Central

    2013-01-01

    Background Genome-scale metabolic models are important tools in systems biology. They permit the in-silico prediction of cellular phenotypes via mathematical optimisation procedures, most importantly flux balance analysis. Current studies on metabolic models mostly consider reaction fluxes in isolation. Based on a recently proposed metabolite-centric approach, we here describe a set of methods that enable the analysis and interpretation of flux distributions in an integrated metabolite-centric view. We demonstrate how this framework can be used for the refinement of genome-scale metabolic models. Results We applied the metabolite-centric view developed here to the most recent metabolic reconstruction of Escherichia coli. By compiling the balance sheets of a small number of currency metabolites, we were able to fully characterise the energy metabolism as predicted by the model and to identify a possibility for model refinement in NADPH metabolism. Selected branch points were examined in detail in order to demonstrate how a metabolite-centric view allows identifying functional roles of metabolites. Fructose 6-phosphate aldolase and the sedoheptulose bisphosphate bypass were identified as enzymatic reactions that can carry high fluxes in the model but are unlikely to exhibit significant activity in vivo. Performing a metabolite essentiality analysis, unconstrained import and export of iron ions could be identified as potentially problematic for the quality of model predictions. Conclusions The system-wide analysis of split ratios and branch points allows a much deeper insight into the metabolic network than reaction-centric analyses. Extending an earlier metabolite-centric approach, the methods introduced here establish an integrated metabolite-centric framework for the interpretation of flux distributions in genome-scale metabolic networks that can complement the classical reaction-centric framework. Analysing fluxes and their metabolic context simultaneously opens the door to systems biological interpretations that are not apparent from isolated reaction fluxes. Particularly powerful demonstrations of this are the analyses of the complete metabolic contexts of energy metabolism and the folate-dependent one-carbon pool presented in this work. Finally, a metabolite-centric view on flux distributions can guide the refinement of metabolic reconstructions for specific growth scenarios. PMID:23587327

  20. Large-scale gene function analysis with the PANTHER classification system.

    PubMed

    Mi, Huaiyu; Muruganujan, Anushya; Casagrande, John T; Thomas, Paul D

    2013-08-01

    The PANTHER (protein annotation through evolutionary relationship) classification system (http://www.pantherdb.org/) is a comprehensive system that combines gene function, ontology, pathways and statistical analysis tools that enable biologists to analyze large-scale, genome-wide data from sequencing, proteomics or gene expression experiments. The system is built with 82 complete genomes organized into gene families and subfamilies, and their evolutionary relationships are captured in phylogenetic trees, multiple sequence alignments and statistical models (hidden Markov models or HMMs). Genes are classified according to their function in several different ways: families and subfamilies are annotated with ontology terms (Gene Ontology (GO) and PANTHER protein class), and sequences are assigned to PANTHER pathways. The PANTHER website includes a suite of tools that enable users to browse and query gene functions, and to analyze large-scale experimental data with a number of statistical tests. It is widely used by bench scientists, bioinformaticians, computer scientists and systems biologists. In the 2013 release of PANTHER (v.8.0), in addition to an update of the data content, we redesigned the website interface to improve both user experience and the system's analytical capability. This protocol provides a detailed description of how to analyze genome-wide experimental data with the PANTHER classification system.

  1. Reverse engineering and analysis of large genome-scale gene networks

    PubMed Central

    Aluru, Maneesha; Zola, Jaroslaw; Nettleton, Dan; Aluru, Srinivas

    2013-01-01

    Reverse engineering the whole-genome networks of complex multicellular organisms continues to remain a challenge. While simpler models easily scale to large number of genes and gene expression datasets, more accurate models are compute intensive limiting their scale of applicability. To enable fast and accurate reconstruction of large networks, we developed Tool for Inferring Network of Genes (TINGe), a parallel mutual information (MI)-based program. The novel features of our approach include: (i) B-spline-based formulation for linear-time computation of MI, (ii) a novel algorithm for direct permutation testing and (iii) development of parallel algorithms to reduce run-time and facilitate construction of large networks. We assess the quality of our method by comparison with ARACNe (Algorithm for the Reconstruction of Accurate Cellular Networks) and GeneNet and demonstrate its unique capability by reverse engineering the whole-genome network of Arabidopsis thaliana from 3137 Affymetrix ATH1 GeneChips in just 9 min on a 1024-core cluster. We further report on the development of a new software Gene Network Analyzer (GeNA) for extracting context-specific subnetworks from a given set of seed genes. Using TINGe and GeNA, we performed analysis of 241 Arabidopsis AraCyc 8.0 pathways, and the results are made available through the web. PMID:23042249

  2. Genomic analysis reveals the biotechnological and industrial potential of levan producing halophilic extremophile, Halomonas smyrnensis AAD6T.

    PubMed

    Diken, Elif; Ozer, Tugba; Arikan, Muzaffer; Emrence, Zeliha; Oner, Ebru Toksoy; Ustek, Duran; Arga, Kazim Yalcin

    2015-01-01

    Halomonas smyrnensis AAD6T is a gram negative, aerobic, and moderately halophilic bacterium, and is known to produce high levels of levan with many potential uses in foods, feeds, cosmetics, pharmaceutical and chemical industries due to its outstanding properties. Here, the whole-genome analysis was performed to gain more insight about the biological mechanisms, and the whole-genome organization of the bacterium. Industrially crucial genes, including the levansucrase, were detected and the genome-scale metabolic model of H. smyrnensis AAD6T was reconstructed. The bacterium was found to have many potential applications in biotechnology not only being a levan producer, but also because of its capacity to produce Pel exopolysaccharide, polyhydroxyalkanoates, and osmoprotectants. The genomic information presented here will not only provide additional information to enhance our understanding of the genetic and metabolic network of halophilic bacteria, but also accelerate the research on systematical design of engineering strategies for biotechnology applications.

  3. High-Throughput Live-Cell Microscopy Analysis of Association Between Chromosome Domains and the Nucleolus in S. cerevisiae.

    PubMed

    Wang, Renjie; Normand, Christophe; Gadal, Olivier

    2016-01-01

    Spatial organization of the genome has important impacts on all aspects of chromosome biology, including transcription, replication, and DNA repair. Frequent interactions of some chromosome domains with specific nuclear compartments, such as the nucleolus, are now well documented using genome-scale methods. However, direct measurement of distance and interaction frequency between loci requires microscopic observation of specific genomic domains and the nucleolus, followed by image analysis to allow quantification. The fluorescent repressor operator system (FROS) is an invaluable method to fluorescently tag DNA sequences and investigate chromosome position and dynamics in living cells. This chapter describes a combination of methods to define motion and region of confinement of a locus relative to the nucleolus in cell's nucleus, from fluorescence acquisition to automated image analysis using two dedicated pipelines.

  4. Pathway Analysis and Omics Data Visualization Using Pathway Genome Databases: FragariaCyc, a Case Study.

    PubMed

    Naithani, Sushma; Jaiswal, Pankaj

    2017-01-01

    The species-specific plant Pathway Genome Databases (PGDBs) based on the BioCyc platform provide a conceptual model of the cellular metabolic network of an organism. Such frameworks allow analysis of the genome-scale expression data to understand changes in the overall metabolisms of an organism (or organs, tissues, and cells) in response to various extrinsic (e.g. developmental and differentiation) and/or extrinsic signals (e.g. pathogens and abiotic stresses) from the surrounding environment. Using FragariaCyc, a pathway database for the diploid strawberry Fragaria vesca, we show (1) the basic navigation across a PGDB; (2) a case study of pathway comparison across plant species; and (3) an example of RNA-Seq data analysis using Omics Viewer tool. The protocols described here generally apply to other Pathway Tools-based PGDBs.

  5. ENGINES: exploring single nucleotide variation in entire human genomes.

    PubMed

    Amigo, Jorge; Salas, Antonio; Phillips, Christopher

    2011-04-19

    Next generation ultra-sequencing technologies are starting to produce extensive quantities of data from entire human genome or exome sequences, and therefore new software is needed to present and analyse this vast amount of information. The 1000 Genomes project has recently released raw data for 629 complete genomes representing several human populations through their Phase I interim analysis and, although there are certain public tools available that allow exploration of these genomes, to date there is no tool that permits comprehensive population analysis of the variation catalogued by such data. We have developed a genetic variant site explorer able to retrieve data for Single Nucleotide Variation (SNVs), population by population, from entire genomes without compromising future scalability and agility. ENGINES (ENtire Genome INterface for Exploring SNVs) uses data from the 1000 Genomes Phase I to demonstrate its capacity to handle large amounts of genetic variation (>7.3 billion genotypes and 28 million SNVs), as well as deriving summary statistics of interest for medical and population genetics applications. The whole dataset is pre-processed and summarized into a data mart accessible through a web interface. The query system allows the combination and comparison of each available population sample, while searching by rs-number list, chromosome region, or genes of interest. Frequency and FST filters are available to further refine queries, while results can be visually compared with other large-scale Single Nucleotide Polymorphism (SNP) repositories such as HapMap or Perlegen. ENGINES is capable of accessing large-scale variation data repositories in a fast and comprehensive manner. It allows quick browsing of whole genome variation, while providing statistical information for each variant site such as allele frequency, heterozygosity or FST values for genetic differentiation. Access to the data mart generating scripts and to the web interface is granted from http://spsmart.cesga.es/engines.php. © 2011 Amigo et al; licensee BioMed Central Ltd.

  6. MEMOSys: Bioinformatics platform for genome-scale metabolic models

    PubMed Central

    2011-01-01

    Background Recent advances in genomic sequencing have enabled the use of genome sequencing in standard biological and biotechnological research projects. The challenge is how to integrate the large amount of data in order to gain novel biological insights. One way to leverage sequence data is to use genome-scale metabolic models. We have therefore designed and implemented a bioinformatics platform which supports the development of such metabolic models. Results MEMOSys (MEtabolic MOdel research and development System) is a versatile platform for the management, storage, and development of genome-scale metabolic models. It supports the development of new models by providing a built-in version control system which offers access to the complete developmental history. Moreover, the integrated web board, the authorization system, and the definition of user roles allow collaborations across departments and institutions. Research on existing models is facilitated by a search system, references to external databases, and a feature-rich comparison mechanism. MEMOSys provides customizable data exchange mechanisms using the SBML format to enable analysis in external tools. The web application is based on the Java EE framework and offers an intuitive user interface. It currently contains six annotated microbial metabolic models. Conclusions We have developed a web-based system designed to provide researchers a novel application facilitating the management and development of metabolic models. The system is freely available at http://www.icbi.at/MEMOSys. PMID:21276275

  7. Genome-Wide Analysis of Transposon and Retroviral Insertions Reveals Preferential Integrations in Regions of DNA Flexibility.

    PubMed

    Vrljicak, Pavle; Tao, Shijie; Varshney, Gaurav K; Quach, Helen Ngoc Bao; Joshi, Adita; LaFave, Matthew C; Burgess, Shawn M; Sampath, Karuna

    2016-04-07

    DNA transposons and retroviruses are important transgenic tools for genome engineering. An important consideration affecting the choice of transgenic vector is their insertion site preferences. Previous large-scale analyses of Ds transposon integration sites in plants were done on the basis of reporter gene expression or germ-line transmission, making it difficult to discern vertebrate integration preferences. Here, we compare over 1300 Ds transposon integration sites in zebrafish with Tol2 transposon and retroviral integration sites. Genome-wide analysis shows that Ds integration sites in the presence or absence of marker selection are remarkably similar and distributed throughout the genome. No strict motif was found, but a preference for structural features in the target DNA associated with DNA flexibility (Twist, Tilt, Rise, Roll, Shift, and Slide) was observed. Remarkably, this feature is also found in transposon and retroviral integrations in maize and mouse cells. Our findings show that structural features influence the integration of heterologous DNA in genomes, and have implications for targeted genome engineering. Copyright © 2016 Vrljicak et al.

  8. Proteinortho: detection of (co-)orthologs in large-scale analysis.

    PubMed

    Lechner, Marcus; Findeiss, Sven; Steiner, Lydia; Marz, Manja; Stadler, Peter F; Prohaska, Sonja J

    2011-04-28

    Orthology analysis is an important part of data analysis in many areas of bioinformatics such as comparative genomics and molecular phylogenetics. The ever-increasing flood of sequence data, and hence the rapidly increasing number of genomes that can be compared simultaneously, calls for efficient software tools as brute-force approaches with quadratic memory requirements become infeasible in practise. The rapid pace at which new data become available, furthermore, makes it desirable to compute genome-wide orthology relations for a given dataset rather than relying on relations listed in databases. The program Proteinortho described here is a stand-alone tool that is geared towards large datasets and makes use of distributed computing techniques when run on multi-core hardware. It implements an extended version of the reciprocal best alignment heuristic. We apply Proteinortho to compute orthologous proteins in the complete set of all 717 eubacterial genomes available at NCBI at the beginning of 2009. We identified thirty proteins present in 99% of all bacterial proteomes. Proteinortho significantly reduces the required amount of memory for orthology analysis compared to existing tools, allowing such computations to be performed on off-the-shelf hardware.

  9. CoryneRegNet: an ontology-based data warehouse of corynebacterial transcription factors and regulatory networks.

    PubMed

    Baumbach, Jan; Brinkrolf, Karina; Czaja, Lisa F; Rahmann, Sven; Tauch, Andreas

    2006-02-14

    The application of DNA microarray technology in post-genomic analysis of bacterial genome sequences has allowed the generation of huge amounts of data related to regulatory networks. This data along with literature-derived knowledge on regulation of gene expression has opened the way for genome-wide reconstruction of transcriptional regulatory networks. These large-scale reconstructions can be converted into in silico models of bacterial cells that allow a systematic analysis of network behavior in response to changing environmental conditions. CoryneRegNet was designed to facilitate the genome-wide reconstruction of transcriptional regulatory networks of corynebacteria relevant in biotechnology and human medicine. During the import and integration process of data derived from experimental studies or literature knowledge CoryneRegNet generates links to genome annotations, to identified transcription factors and to the corresponding cis-regulatory elements. CoryneRegNet is based on a multi-layered, hierarchical and modular concept of transcriptional regulation and was implemented by using the relational database management system MySQL and an ontology-based data structure. Reconstructed regulatory networks can be visualized by using the yFiles JAVA graph library. As an application example of CoryneRegNet, we have reconstructed the global transcriptional regulation of a cellular module involved in SOS and stress response of corynebacteria. CoryneRegNet is an ontology-based data warehouse that allows a pertinent data management of regulatory interactions along with the genome-scale reconstruction of transcriptional regulatory networks. These models can further be combined with metabolic networks to build integrated models of cellular function including both metabolism and its transcriptional regulation.

  10. The COG database: a tool for genome-scale analysis of protein functions and evolution

    PubMed Central

    Tatusov, Roman L.; Galperin, Michael Y.; Natale, Darren A.; Koonin, Eugene V.

    2000-01-01

    Rational classification of proteins encoded in sequenced genomes is critical for making the genome sequences maximally useful for functional and evolutionary studies. The database of Clusters of Orthologous Groups of proteins (COGs) is an attempt on a phylogenetic classification of the proteins encoded in 21 complete genomes of bacteria, archaea and eukaryotes (http://www.ncbi.nlm.nih.gov/COG ). The COGs were constructed by applying the criterion of consistency of genome-specific best hits to the results of an exhaustive comparison of all protein sequences from these genomes. The database comprises 2091 COGs that include 56–83% of the gene products from each of the complete bacterial and archaeal genomes and ~35% of those from the yeast Saccharomyces cerevisiae genome. The COG database is accompanied by the COGNITOR program that is used to fit new proteins into the COGs and can be applied to functional and phylogenetic annotation of newly sequenced genomes. PMID:10592175

  11. SOBA: sequence ontology bioinformatics analysis.

    PubMed

    Moore, Barry; Fan, Guozhen; Eilbeck, Karen

    2010-07-01

    The advent of cheaper, faster sequencing technologies has pushed the task of sequence annotation from the exclusive domain of large-scale multi-national sequencing projects to that of research laboratories and small consortia. The bioinformatics burden placed on these laboratories, some with very little programming experience can be daunting. Fortunately, there exist software libraries and pipelines designed with these groups in mind, to ease the transition from an assembled genome to an annotated and accessible genome resource. We have developed the Sequence Ontology Bioinformatics Analysis (SOBA) tool to provide a simple statistical and graphical summary of an annotated genome. We envisage its use during annotation jamborees, genome comparison and for use by developers for rapid feedback during annotation software development and testing. SOBA also provides annotation consistency feedback to ensure correct use of terminology within annotations, and guides users to add new terms to the Sequence Ontology when required. SOBA is available at http://www.sequenceontology.org/cgi-bin/soba.cgi.

  12. RegPrecise 3.0--a resource for genome-scale exploration of transcriptional regulation in bacteria.

    PubMed

    Novichkov, Pavel S; Kazakov, Alexey E; Ravcheev, Dmitry A; Leyn, Semen A; Kovaleva, Galina Y; Sutormin, Roman A; Kazanov, Marat D; Riehl, William; Arkin, Adam P; Dubchak, Inna; Rodionov, Dmitry A

    2013-11-01

    Genome-scale prediction of gene regulation and reconstruction of transcriptional regulatory networks in prokaryotes is one of the critical tasks of modern genomics. Bacteria from different taxonomic groups, whose lifestyles and natural environments are substantially different, possess highly diverged transcriptional regulatory networks. The comparative genomics approaches are useful for in silico reconstruction of bacterial regulons and networks operated by both transcription factors (TFs) and RNA regulatory elements (riboswitches). RegPrecise (http://regprecise.lbl.gov) is a web resource for collection, visualization and analysis of transcriptional regulons reconstructed by comparative genomics. We significantly expanded a reference collection of manually curated regulons we introduced earlier. RegPrecise 3.0 provides access to inferred regulatory interactions organized by phylogenetic, structural and functional properties. Taxonomy-specific collections include 781 TF regulogs inferred in more than 160 genomes representing 14 taxonomic groups of Bacteria. TF-specific collections include regulogs for a selected subset of 40 TFs reconstructed across more than 30 taxonomic lineages. Novel collections of regulons operated by RNA regulatory elements (riboswitches) include near 400 regulogs inferred in 24 bacterial lineages. RegPrecise 3.0 provides four classifications of the reference regulons implemented as controlled vocabularies: 55 TF protein families; 43 RNA motif families; ~150 biological processes or metabolic pathways; and ~200 effectors or environmental signals. Genome-wide visualization of regulatory networks and metabolic pathways covered by the reference regulons are available for all studied genomes. A separate section of RegPrecise 3.0 contains draft regulatory networks in 640 genomes obtained by an conservative propagation of the reference regulons to closely related genomes. RegPrecise 3.0 gives access to the transcriptional regulons reconstructed in bacterial genomes. Analytical capabilities include exploration of: regulon content, structure and function; TF binding site motifs; conservation and variations in genome-wide regulatory networks across all taxonomic groups of Bacteria. RegPrecise 3.0 was selected as a core resource on transcriptional regulation of the Department of Energy Systems Biology Knowledgebase, an emerging software and data environment designed to enable researchers to collaboratively generate, test and share new hypotheses about gene and protein functions, perform large-scale analyses, and model interactions in microbes, plants, and their communities.

  13. Cloud-based interactive analytics for terabytes of genomic variants data.

    PubMed

    Pan, Cuiping; McInnes, Gregory; Deflaux, Nicole; Snyder, Michael; Bingham, Jonathan; Datta, Somalee; Tsao, Philip S

    2017-12-01

    Large scale genomic sequencing is now widely used to decipher questions in diverse realms such as biological function, human diseases, evolution, ecosystems, and agriculture. With the quantity and diversity these data harbor, a robust and scalable data handling and analysis solution is desired. We present interactive analytics using a cloud-based columnar database built on Dremel to perform information compression, comprehensive quality controls, and biological information retrieval in large volumes of genomic data. We demonstrate such Big Data computing paradigms can provide orders of magnitude faster turnaround for common genomic analyses, transforming long-running batch jobs submitted via a Linux shell into questions that can be asked from a web browser in seconds. Using this method, we assessed a study population of 475 deeply sequenced human genomes for genomic call rate, genotype and allele frequency distribution, variant density across the genome, and pharmacogenomic information. Our analysis framework is implemented in Google Cloud Platform and BigQuery. Codes are available at https://github.com/StanfordBioinformatics/mvp_aaa_codelabs. cuiping@stanford.edu or ptsao@stanford.edu. Supplementary data are available at Bioinformatics online. Published by Oxford University Press 2017. This work is written by US Government employees and are in the public domain in the US.

  14. Population-based 3D genome structure analysis reveals driving forces in spatial genome organization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tjong, Harianto; Li, Wenyuan; Kalhor, Reza

    Conformation capture technologies (e.g., Hi-C) chart physical interactions between chromatin regions on a genome-wide scale. However, the structural variability of the genome between cells poses a great challenge to interpreting ensemble-averaged Hi-C data, particularly for long-range and interchromosomal interactions. Here, we present a probabilistic approach for deconvoluting Hi-C data into a model population of distinct diploid 3D genome structures, which facilitates the detection of chromatin interactions likely to co-occur in individual cells. Here, our approach incorporates the stochastic nature of chromosome conformations and allows a detailed analysis of alternative chromatin structure states. For example, we predict and experimentally confirm themore » presence of large centromere clusters with distinct chromosome compositions varying between individual cells. The stability of these clusters varies greatly with their chromosome identities. We show that these chromosome-specific clusters can play a key role in the overall chromosome positioning in the nucleus and stabilizing specific chromatin interactions. By explicitly considering genome structural variability, our population-based method provides an important tool for revealing novel insights into the key factors shaping the spatial genome organization.« less

  15. Population-based 3D genome structure analysis reveals driving forces in spatial genome organization

    DOE PAGES

    Tjong, Harianto; Li, Wenyuan; Kalhor, Reza; ...

    2016-03-07

    Conformation capture technologies (e.g., Hi-C) chart physical interactions between chromatin regions on a genome-wide scale. However, the structural variability of the genome between cells poses a great challenge to interpreting ensemble-averaged Hi-C data, particularly for long-range and interchromosomal interactions. Here, we present a probabilistic approach for deconvoluting Hi-C data into a model population of distinct diploid 3D genome structures, which facilitates the detection of chromatin interactions likely to co-occur in individual cells. Here, our approach incorporates the stochastic nature of chromosome conformations and allows a detailed analysis of alternative chromatin structure states. For example, we predict and experimentally confirm themore » presence of large centromere clusters with distinct chromosome compositions varying between individual cells. The stability of these clusters varies greatly with their chromosome identities. We show that these chromosome-specific clusters can play a key role in the overall chromosome positioning in the nucleus and stabilizing specific chromatin interactions. By explicitly considering genome structural variability, our population-based method provides an important tool for revealing novel insights into the key factors shaping the spatial genome organization.« less

  16. Cloud-based interactive analytics for terabytes of genomic variants data

    PubMed Central

    Pan, Cuiping; McInnes, Gregory; Deflaux, Nicole; Snyder, Michael; Bingham, Jonathan; Datta, Somalee; Tsao, Philip S

    2017-01-01

    Abstract Motivation Large scale genomic sequencing is now widely used to decipher questions in diverse realms such as biological function, human diseases, evolution, ecosystems, and agriculture. With the quantity and diversity these data harbor, a robust and scalable data handling and analysis solution is desired. Results We present interactive analytics using a cloud-based columnar database built on Dremel to perform information compression, comprehensive quality controls, and biological information retrieval in large volumes of genomic data. We demonstrate such Big Data computing paradigms can provide orders of magnitude faster turnaround for common genomic analyses, transforming long-running batch jobs submitted via a Linux shell into questions that can be asked from a web browser in seconds. Using this method, we assessed a study population of 475 deeply sequenced human genomes for genomic call rate, genotype and allele frequency distribution, variant density across the genome, and pharmacogenomic information. Availability and implementation Our analysis framework is implemented in Google Cloud Platform and BigQuery. Codes are available at https://github.com/StanfordBioinformatics/mvp_aaa_codelabs. Contact cuiping@stanford.edu or ptsao@stanford.edu Supplementary information Supplementary data are available at Bioinformatics online. PMID:28961771

  17. A Method to Constrain Genome-Scale Models with 13C Labeling Data

    PubMed Central

    García Martín, Héctor; Kumar, Vinay Satish; Weaver, Daniel; Ghosh, Amit; Chubukov, Victor; Mukhopadhyay, Aindrila; Arkin, Adam; Keasling, Jay D.

    2015-01-01

    Current limitations in quantitatively predicting biological behavior hinder our efforts to engineer biological systems to produce biofuels and other desired chemicals. Here, we present a new method for calculating metabolic fluxes, key targets in metabolic engineering, that incorporates data from 13C labeling experiments and genome-scale models. The data from 13C labeling experiments provide strong flux constraints that eliminate the need to assume an evolutionary optimization principle such as the growth rate optimization assumption used in Flux Balance Analysis (FBA). This effective constraining is achieved by making the simple but biologically relevant assumption that flux flows from core to peripheral metabolism and does not flow back. The new method is significantly more robust than FBA with respect to errors in genome-scale model reconstruction. Furthermore, it can provide a comprehensive picture of metabolite balancing and predictions for unmeasured extracellular fluxes as constrained by 13C labeling data. A comparison shows that the results of this new method are similar to those found through 13C Metabolic Flux Analysis (13C MFA) for central carbon metabolism but, additionally, it provides flux estimates for peripheral metabolism. The extra validation gained by matching 48 relative labeling measurements is used to identify where and why several existing COnstraint Based Reconstruction and Analysis (COBRA) flux prediction algorithms fail. We demonstrate how to use this knowledge to refine these methods and improve their predictive capabilities. This method provides a reliable base upon which to improve the design of biological systems. PMID:26379153

  18. IMG-ABC. A knowledge base to fuel discovery of biosynthetic gene clusters and novel secondary metabolites

    DOE PAGES

    Hadjithomas, Michalis; Chen, I-Min Amy; Chu, Ken; ...

    2015-07-14

    In the discovery of secondary metabolites, analysis of sequence data is a promising exploration path that remains largely underutilized due to the lack of computational platforms that enable such a systematic approach on a large scale. In this work, we present IMG-ABC (https://img.jgi.doe.gov/abc), an atlas of biosynthetic gene clusters within the Integrated Microbial Genomes (IMG) system, which is aimed at harnessing the power of “big” genomic data for discovering small molecules. IMG-ABC relies on IMG’s comprehensive integrated structural and functional genomic data for the analysis of biosynthetic gene clusters (BCs) and associated secondary metabolites (SMs). SMs and BCs serve asmore » the two main classes of objects in IMG-ABC, each with a rich collection of attributes. A unique feature of IMG-ABC is the incorporation of both experimentally validated and computationally predicted BCs in genomes as well as metagenomes, thus identifying BCs in uncultured populations and rare taxa. We demonstrate the strength of IMG-ABC’s focused integrated analysis tools in enabling the exploration of microbial secondary metabolism on a global scale, through the discovery of phenazine-producing clusters for the first time in lphaproteobacteria. IMG-ABC strives to fill the long-existent void of resources for computational exploration of the secondary metabolism universe; its underlying scalable framework enables traversal of uncovered phylogenetic and chemical structure space, serving as a doorway to a new era in the discovery of novel molecules. IMG-ABC is the largest publicly available database of predicted and experimental biosynthetic gene clusters and the secondary metabolites they produce. The system also includes powerful search and analysis tools that are integrated with IMG’s extensive genomic/metagenomic data and analysis tool kits. As new research on biosynthetic gene clusters and secondary metabolites is published and more genomes are sequenced, IMG-ABC will continue to expand, with the goal of becoming an essential component of any bioinformatic exploration of the secondary metabolism world.« less

  19. Genome-scale fluxes predicted under the guidance of enzyme abundance using a novel hyper-cube shrink algorithm.

    PubMed

    Xie, Zhengwei; Zhang, Tianyu; Ouyang, Qi

    2018-02-01

    One of the long-expected goals of genome-scale metabolic modelling is to evaluate the influence of the perturbed enzymes on flux distribution. Both ordinary differential equation (ODE) models and constraint-based models, like Flux balance analysis (FBA), lack the capacity to perform metabolic control analysis (MCA) for large-scale networks. In this study, we developed a hyper-cube shrink algorithm (HCSA) to incorporate the enzymatic properties into the FBA model by introducing a pseudo reaction V constrained by enzymatic parameters. Our algorithm uses the enzymatic information quantitatively rather than qualitatively. We first demonstrate the concept by applying HCSA to a simple three-node network, whereby we obtained a good correlation between flux and enzyme abundance. We then validate its prediction by comparison with ODE and with a synthetic network producing voilacein and analogues in Saccharomyces cerevisiae. We show that HCSA can mimic the state-state results of ODE. Finally, we show its capability of predicting the flux distribution in genome-scale networks by applying it to sporulation in yeast. We show the ability of HCSA to operate without biomass flux and perform MCA to determine rate-limiting reactions. Algorithm was implemented by Matlab and C ++. The code is available at https://github.com/kekegg/HCSA. xiezhengwei@hsc.pku.edu.cn or qi@pku.edu.cn. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

  20. Human Genome Sequencing in Health and Disease

    PubMed Central

    Gonzaga-Jauregui, Claudia; Lupski, James R.; Gibbs, Richard A.

    2013-01-01

    Following the “finished,” euchromatic, haploid human reference genome sequence, the rapid development of novel, faster, and cheaper sequencing technologies is making possible the era of personalized human genomics. Personal diploid human genome sequences have been generated, and each has contributed to our better understanding of variation in the human genome. We have consequently begun to appreciate the vastness of individual genetic variation from single nucleotide to structural variants. Translation of genome-scale variation into medically useful information is, however, in its infancy. This review summarizes the initial steps undertaken in clinical implementation of personal genome information, and describes the application of whole-genome and exome sequencing to identify the cause of genetic diseases and to suggest adjuvant therapies. Better analysis tools and a deeper understanding of the biology of our genome are necessary in order to decipher, interpret, and optimize clinical utility of what the variation in the human genome can teach us. Personal genome sequencing may eventually become an instrument of common medical practice, providing information that assists in the formulation of a differential diagnosis. We outline herein some of the remaining challenges. PMID:22248320

  1. Plant functional genomics

    NASA Astrophysics Data System (ADS)

    Holtorf, Hauke; Guitton, Marie-Christine; Reski, Ralf

    2002-04-01

    Functional genome analysis of plants has entered the high-throughput stage. The complete genome information from key species such as Arabidopsis thaliana and rice is now available and will further boost the application of a range of new technologies to functional plant gene analysis. To broadly assign functions to unknown genes, different fast and multiparallel approaches are currently used and developed. These new technologies are based on known methods but are adapted and improved to accommodate for comprehensive, large-scale gene analysis, i.e. such techniques are novel in the sense that their design allows researchers to analyse many genes at the same time and at an unprecedented pace. Such methods allow analysis of the different constituents of the cell that help to deduce gene function, namely the transcripts, proteins and metabolites. Similarly the phenotypic variations of entire mutant collections can now be analysed in a much faster and more efficient way than before. The different methodologies have developed to form their own fields within the functional genomics technological platform and are termed transcriptomics, proteomics, metabolomics and phenomics. Gene function, however, cannot solely be inferred by using only one such approach. Rather, it is only by bringing together all the information collected by different functional genomic tools that one will be able to unequivocally assign functions to unknown plant genes. This review focuses on current technical developments and their impact on the field of plant functional genomics. The lower plant Physcomitrella is introduced as a new model system for gene function analysis, owing to its high rate of homologous recombination.

  2. Legume genome evolution viewed through the Medicago truncatula and Lotus japonicus genomes

    PubMed Central

    Cannon, Steven B.; Sterck, Lieven; Rombauts, Stephane; Sato, Shusei; Cheung, Foo; Gouzy, Jérôme; Wang, Xiaohong; Mudge, Joann; Vasdewani, Jayprakash; Schiex, Thomas; Spannagl, Manuel; Monaghan, Erin; Nicholson, Christine; Humphray, Sean J.; Schoof, Heiko; Mayer, Klaus F. X.; Rogers, Jane; Quétier, Francis; Oldroyd, Giles E.; Debellé, Frédéric; Cook, Douglas R.; Retzel, Ernest F.; Roe, Bruce A.; Town, Christopher D.; Tabata, Satoshi; Van de Peer, Yves; Young, Nevin D.

    2006-01-01

    Genome sequencing of the model legumes, Medicago truncatula and Lotus japonicus, provides an opportunity for large-scale sequence-based comparison of two genomes in the same plant family. Here we report synteny comparisons between these species, including details about chromosome relationships, large-scale synteny blocks, microsynteny within blocks, and genome regions lacking clear correspondence. The Lotus and Medicago genomes share a minimum of 10 large-scale synteny blocks, each with substantial collinearity and frequently extending the length of whole chromosome arms. The proportion of genes syntenic and collinear within each synteny block is relatively homogeneous. Medicago–Lotus comparisons also indicate similar and largely homogeneous gene densities, although gene-containing regions in Mt occupy 20–30% more space than Lj counterparts, primarily because of larger numbers of Mt retrotransposons. Because the interpretation of genome comparisons is complicated by large-scale genome duplications, we describe synteny, synonymous substitutions and phylogenetic analyses to identify and date a probable whole-genome duplication event. There is no direct evidence for any recent large-scale genome duplication in either Medicago or Lotus but instead a duplication predating speciation. Phylogenetic comparisons place this duplication within the Rosid I clade, clearly after the split between legumes and Salicaceae (poplar). PMID:17003129

  3. GEnomes Management Application (GEM.app): a new software tool for large-scale collaborative genome analysis.

    PubMed

    Gonzalez, Michael A; Lebrigio, Rafael F Acosta; Van Booven, Derek; Ulloa, Rick H; Powell, Eric; Speziani, Fiorella; Tekin, Mustafa; Schüle, Rebecca; Züchner, Stephan

    2013-06-01

    Novel genes are now identified at a rapid pace for many Mendelian disorders, and increasingly, for genetically complex phenotypes. However, new challenges have also become evident: (1) effectively managing larger exome and/or genome datasets, especially for smaller labs; (2) direct hands-on analysis and contextual interpretation of variant data in large genomic datasets; and (3) many small and medium-sized clinical and research-based investigative teams around the world are generating data that, if combined and shared, will significantly increase the opportunities for the entire community to identify new genes. To address these challenges, we have developed GEnomes Management Application (GEM.app), a software tool to annotate, manage, visualize, and analyze large genomic datasets (https://genomics.med.miami.edu/). GEM.app currently contains ∼1,600 whole exomes from 50 different phenotypes studied by 40 principal investigators from 15 different countries. The focus of GEM.app is on user-friendly analysis for nonbioinformaticians to make next-generation sequencing data directly accessible. Yet, GEM.app provides powerful and flexible filter options, including single family filtering, across family/phenotype queries, nested filtering, and evaluation of segregation in families. In addition, the system is fast, obtaining results within 4 sec across ∼1,200 exomes. We believe that this system will further enhance identification of genetic causes of human disease. © 2013 Wiley Periodicals, Inc.

  4. Recurrence time statistics: versatile tools for genomic DNA sequence analysis.

    PubMed

    Cao, Yinhe; Tung, Wen-Wen; Gao, J B

    2004-01-01

    With the completion of the human and a few model organisms' genomes, and the genomes of many other organisms waiting to be sequenced, it has become increasingly important to develop faster computational tools which are capable of easily identifying the structures and extracting features from DNA sequences. One of the more important structures in a DNA sequence is repeat-related. Often they have to be masked before protein coding regions along a DNA sequence are to be identified or redundant expressed sequence tags (ESTs) are to be sequenced. Here we report a novel recurrence time based method for sequence analysis. The method can conveniently study all kinds of periodicity and exhaustively find all repeat-related features from a genomic DNA sequence. An efficient codon index is also derived from the recurrence time statistics, which has the salient features of being largely species-independent and working well on very short sequences. Efficient codon indices are key elements of successful gene finding algorithms, and are particularly useful for determining whether a suspected EST belongs to a coding or non-coding region. We illustrate the power of the method by studying the genomes of E. coli, the yeast S. cervisivae, the nematode worm C. elegans, and the human, Homo sapiens. Computationally, our method is very efficient. It allows us to carry out analysis of genomes on the whole genomic scale by a PC.

  5. The detailed 3D multi-loop aggregate/rosette chromatin architecture and functional dynamic organization of the human and mouse genomes.

    PubMed

    Knoch, Tobias A; Wachsmuth, Malte; Kepper, Nick; Lesnussa, Michael; Abuseiris, Anis; Ali Imam, A M; Kolovos, Petros; Zuin, Jessica; Kockx, Christel E M; Brouwer, Rutger W W; van de Werken, Harmen J G; van IJcken, Wilfred F J; Wendt, Kerstin S; Grosveld, Frank G

    2016-01-01

    The dynamic three-dimensional chromatin architecture of genomes and its co-evolutionary connection to its function-the storage, expression, and replication of genetic information-is still one of the central issues in biology. Here, we describe the much debated 3D architecture of the human and mouse genomes from the nucleosomal to the megabase pair level by a novel approach combining selective high-throughput high-resolution chromosomal interaction capture ( T2C ), polymer simulations, and scaling analysis of the 3D architecture and the DNA sequence. The genome is compacted into a chromatin quasi-fibre with ~5 ± 1 nucleosomes/11 nm, folded into stable ~30-100 kbp loops forming stable loop aggregates/rosettes connected by similar sized linkers. Minor but significant variations in the architecture are seen between cell types and functional states. The architecture and the DNA sequence show very similar fine-structured multi-scaling behaviour confirming their co-evolution and the above. This architecture, its dynamics, and accessibility, balance stability and flexibility ensuring genome integrity and variation enabling gene expression/regulation by self-organization of (in)active units already in proximity. Our results agree with the heuristics of the field and allow "architectural sequencing" at a genome mechanics level to understand the inseparable systems genomic properties.

  6. A Novel Genome-Information Content-Based Statistic for Genome-Wide Association Analysis Designed for Next-Generation Sequencing Data

    PubMed Central

    Luo, Li; Zhu, Yun

    2012-01-01

    Abstract The genome-wide association studies (GWAS) designed for next-generation sequencing data involve testing association of genomic variants, including common, low frequency, and rare variants. The current strategies for association studies are well developed for identifying association of common variants with the common diseases, but may be ill-suited when large amounts of allelic heterogeneity are present in sequence data. Recently, group tests that analyze their collective frequency differences between cases and controls shift the current variant-by-variant analysis paradigm for GWAS of common variants to the collective test of multiple variants in the association analysis of rare variants. However, group tests ignore differences in genetic effects among SNPs at different genomic locations. As an alternative to group tests, we developed a novel genome-information content-based statistics for testing association of the entire allele frequency spectrum of genomic variation with the diseases. To evaluate the performance of the proposed statistics, we use large-scale simulations based on whole genome low coverage pilot data in the 1000 Genomes Project to calculate the type 1 error rates and power of seven alternative statistics: a genome-information content-based statistic, the generalized T2, collapsing method, multivariate and collapsing (CMC) method, individual χ2 test, weighted-sum statistic, and variable threshold statistic. Finally, we apply the seven statistics to published resequencing dataset from ANGPTL3, ANGPTL4, ANGPTL5, and ANGPTL6 genes in the Dallas Heart Study. We report that the genome-information content-based statistic has significantly improved type 1 error rates and higher power than the other six statistics in both simulated and empirical datasets. PMID:22651812

  7. A novel genome-information content-based statistic for genome-wide association analysis designed for next-generation sequencing data.

    PubMed

    Luo, Li; Zhu, Yun; Xiong, Momiao

    2012-06-01

    The genome-wide association studies (GWAS) designed for next-generation sequencing data involve testing association of genomic variants, including common, low frequency, and rare variants. The current strategies for association studies are well developed for identifying association of common variants with the common diseases, but may be ill-suited when large amounts of allelic heterogeneity are present in sequence data. Recently, group tests that analyze their collective frequency differences between cases and controls shift the current variant-by-variant analysis paradigm for GWAS of common variants to the collective test of multiple variants in the association analysis of rare variants. However, group tests ignore differences in genetic effects among SNPs at different genomic locations. As an alternative to group tests, we developed a novel genome-information content-based statistics for testing association of the entire allele frequency spectrum of genomic variation with the diseases. To evaluate the performance of the proposed statistics, we use large-scale simulations based on whole genome low coverage pilot data in the 1000 Genomes Project to calculate the type 1 error rates and power of seven alternative statistics: a genome-information content-based statistic, the generalized T(2), collapsing method, multivariate and collapsing (CMC) method, individual χ(2) test, weighted-sum statistic, and variable threshold statistic. Finally, we apply the seven statistics to published resequencing dataset from ANGPTL3, ANGPTL4, ANGPTL5, and ANGPTL6 genes in the Dallas Heart Study. We report that the genome-information content-based statistic has significantly improved type 1 error rates and higher power than the other six statistics in both simulated and empirical datasets.

  8. Assembly of 500,000 inter-specific catfish expressed sequence tags and large scale gene-associated marker development for whole genome association studies

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Catfish Genome Consortium; Wang, Shaolin; Peatman, Eric

    2010-03-23

    Background-Through the Community Sequencing Program, a catfish EST sequencing project was carried out through a collaboration between the catfish research community and the Department of Energy's Joint Genome Institute. Prior to this project, only a limited EST resource from catfish was available for the purpose of SNP identification. Results-A total of 438,321 quality ESTs were generated from 8 channel catfish (Ictalurus punctatus) and 4 blue catfish (Ictalurus furcatus) libraries, bringing the number of catfish ESTs to nearly 500,000. Assembly of all catfish ESTs resulted in 45,306 contigs and 66,272 singletons. Over 35percent of the unique sequences had significant similarities tomore » known genes, allowing the identification of 14,776 unique genes in catfish. Over 300,000 putative SNPs have been identified, of which approximately 48,000 are high-quality SNPs identified from contigs with at least four sequences and the minor allele presence of at least two sequences in the contig. The EST resource should be valuable for identification of microsatellites, genome annotation, large-scale expression analysis, and comparative genome analysis. Conclusions-This project generated a large EST resource for catfish that captured the majority of the catfish transcriptome. The parallel analysis of ESTs from two closely related Ictalurid catfishes should also provide powerful means for the evaluation of ancient and recent gene duplications, and for the development of high-density microarrays in catfish. The inter- and intra-specific SNPs identified from all catfish EST dataset assembly will greatly benefit the catfish introgression breeding program and whole genome association studies.« less

  9. Copy-number and gene dependency analysis reveals partial copy loss of wild-type SF3B1 as a novel cancer vulnerability. | Office of Cancer Genomics

    Cancer.gov

    Genomic instability is a hallmark of human cancer, and results in widespread somatic copy number alterations. We used a genome-scale shRNA viability screen in human cancer cell lines to systematically identify genes that are essential in the context of particular copy-number alterations (copy-number associated gene dependencies). The most enriched class of copy-number associated gene dependencies was CYCLOPS (Copy-number alterations Yielding Cancer Liabilities Owing to Partial losS) genes, and spliceosome components were the most prevalent.

  10. Kernel methods for large-scale genomic data analysis

    PubMed Central

    Xing, Eric P.; Schaid, Daniel J.

    2015-01-01

    Machine learning, particularly kernel methods, has been demonstrated as a promising new tool to tackle the challenges imposed by today’s explosive data growth in genomics. They provide a practical and principled approach to learning how a large number of genetic variants are associated with complex phenotypes, to help reveal the complexity in the relationship between the genetic markers and the outcome of interest. In this review, we highlight the potential key role it will have in modern genomic data processing, especially with regard to integration with classical methods for gene prioritizing, prediction and data fusion. PMID:25053743

  11. Ergatis: a web interface and scalable software system for bioinformatics workflows

    PubMed Central

    Orvis, Joshua; Crabtree, Jonathan; Galens, Kevin; Gussman, Aaron; Inman, Jason M.; Lee, Eduardo; Nampally, Sreenath; Riley, David; Sundaram, Jaideep P.; Felix, Victor; Whitty, Brett; Mahurkar, Anup; Wortman, Jennifer; White, Owen; Angiuoli, Samuel V.

    2010-01-01

    Motivation: The growth of sequence data has been accompanied by an increasing need to analyze data on distributed computer clusters. The use of these systems for routine analysis requires scalable and robust software for data management of large datasets. Software is also needed to simplify data management and make large-scale bioinformatics analysis accessible and reproducible to a wide class of target users. Results: We have developed a workflow management system named Ergatis that enables users to build, execute and monitor pipelines for computational analysis of genomics data. Ergatis contains preconfigured components and template pipelines for a number of common bioinformatics tasks such as prokaryotic genome annotation and genome comparisons. Outputs from many of these components can be loaded into a Chado relational database. Ergatis was designed to be accessible to a broad class of users and provides a user friendly, web-based interface. Ergatis supports high-throughput batch processing on distributed compute clusters and has been used for data management in a number of genome annotation and comparative genomics projects. Availability: Ergatis is an open-source project and is freely available at http://ergatis.sourceforge.net Contact: jorvis@users.sourceforge.net PMID:20413634

  12. Annotated Draft Genome Assemblies for the Northern Bobwhite (Colinus virginianus) and the Scaled Quail (Callipepla squamata) Reveal Disparate Estimates of Modern Genome Diversity and Historic Effective Population Size.

    PubMed

    Oldeschulte, David L; Halley, Yvette A; Wilson, Miranda L; Bhattarai, Eric K; Brashear, Wesley; Hill, Joshua; Metz, Richard P; Johnson, Charles D; Rollins, Dale; Peterson, Markus J; Bickhart, Derek M; Decker, Jared E; Sewell, John F; Seabury, Christopher M

    2017-09-07

    Northern bobwhite ( Colinus virginianus ; hereafter bobwhite) and scaled quail ( Callipepla squamata ) populations have suffered precipitous declines across most of their US ranges. Illumina-based first- (v1.0) and second- (v2.0) generation draft genome assemblies for the scaled quail and the bobwhite produced N50 scaffold sizes of 1.035 and 2.042 Mb, thereby producing a 45-fold improvement in contiguity over the existing bobwhite assembly, and ≥90% of the assembled genomes were captured within 1313 and 8990 scaffolds, respectively. The scaled quail assembly (v1.0 = 1.045 Gb) was ∼20% smaller than the bobwhite (v2.0 = 1.254 Gb), which was supported by kmer-based estimates of genome size. Nevertheless, estimates of GC content (41.72%; 42.66%), genome-wide repetitive content (10.40%; 10.43%), and MAKER-predicted protein coding genes (17,131; 17,165) were similar for the scaled quail (v1.0) and bobwhite (v2.0) assemblies, respectively. BUSCO analyses utilizing 3023 single-copy orthologs revealed a high level of assembly completeness for the scaled quail (v1.0; 84.8%) and the bobwhite (v2.0; 82.5%), as verified by comparison with well-established avian genomes. We also detected 273 putative segmental duplications in the scaled quail genome (v1.0), and 711 in the bobwhite genome (v2.0), including some that were shared among both species. Autosomal variant prediction revealed ∼2.48 and 4.17 heterozygous variants per kilobase within the scaled quail (v1.0) and bobwhite (v2.0) genomes, respectively, and estimates of historic effective population size were uniformly higher for the bobwhite across all time points in a coalescent model. However, large-scale declines were predicted for both species beginning ∼15-20 KYA. Copyright © 2017 Oldeschulte et al.

  13. Allying with armored snails: the complete genome of gammaproteobacterial endosymbiont.

    PubMed

    Nakagawa, Satoshi; Shimamura, Shigeru; Takaki, Yoshihiro; Suzuki, Yohey; Murakami, Shun-ichi; Watanabe, Tamaki; Fujiyoshi, So; Mino, Sayaka; Sawabe, Tomoo; Maeda, Takahiro; Makita, Hiroko; Nemoto, Suguru; Nishimura, Shin-Ichiro; Watanabe, Hiromi; Watsuji, Tomo-o; Takai, Ken

    2014-01-01

    Deep-sea vents harbor dense populations of various animals that have their specific symbiotic bacteria. Scaly-foot gastropods, which are snails with mineralized scales covering the sides of its foot, have a gammaproteobacterial endosymbiont in their enlarged esophageal glands and diverse epibionts on the surface of their scales. In this study, we report the complete genome sequencing of gammaproteobacterial endosymbiont. The endosymbiont genome displays features consistent with ongoing genome reduction such as large proportions of pseudogenes and insertion elements. The genome encodes functions commonly found in deep-sea vent chemoautotrophs such as sulfur oxidation and carbon fixation. Stable carbon isotope ((13)C)-labeling experiments confirmed the endosymbiont chemoautotrophy. The genome also includes an intact hydrogenase gene cluster that potentially has been horizontally transferred from phylogenetically distant bacteria. Notable findings include the presence and transcription of genes for flagellar assembly, through which proteins are potentially exported from bacterium to the host. Symbionts of snail individuals exhibited extreme genetic homogeneity, showing only two synonymous changes in 19 different genes (13 810 positions in total) determined for 32 individual gastropods collected from a single colony at one time. The extremely low genetic individuality in endosymbionts probably reflects that the stringent symbiont selection by host prevents the random genetic drift in the small population of horizontally transmitted symbiont. This study is the first complete genome analysis of gastropod endosymbiont and offers an opportunity to study genome evolution in a recently evolved endosymbiont.

  14. Genome-scale reconstruction of the Streptococcus pyogenes M49 metabolic network reveals growth requirements and indicates potential drug targets.

    PubMed

    Levering, Jennifer; Fiedler, Tomas; Sieg, Antje; van Grinsven, Koen W A; Hering, Silvio; Veith, Nadine; Olivier, Brett G; Klett, Lara; Hugenholtz, Jeroen; Teusink, Bas; Kreikemeyer, Bernd; Kummer, Ursula

    2016-08-20

    Genome-scale metabolic models comprise stoichiometric relations between metabolites, as well as associations between genes and metabolic reactions and facilitate the analysis of metabolism. We computationally reconstructed the metabolic network of the lactic acid bacterium Streptococcus pyogenes M49. Initially, we based the reconstruction on genome annotations and already existing and curated metabolic networks of Bacillus subtilis, Escherichia coli, Lactobacillus plantarum and Lactococcus lactis. This initial draft was manually curated with the final reconstruction accounting for 480 genes associated with 576 reactions and 558 metabolites. In order to constrain the model further, we performed growth experiments of wild type and arcA deletion strains of S. pyogenes M49 in a chemically defined medium and calculated nutrient uptake and production fluxes. We additionally performed amino acid auxotrophy experiments to test the consistency of the model. The established genome-scale model can be used to understand the growth requirements of the human pathogen S. pyogenes and define optimal and suboptimal conditions, but also to describe differences and similarities between S. pyogenes and related lactic acid bacteria such as L. lactis in order to find strategies to reduce the growth of the pathogen and propose drug targets. Copyright © 2016 Elsevier B.V. All rights reserved.

  15. Atlas2 Cloud: a framework for personal genome analysis in the cloud

    PubMed Central

    2012-01-01

    Background Until recently, sequencing has primarily been carried out in large genome centers which have invested heavily in developing the computational infrastructure that enables genomic sequence analysis. The recent advancements in next generation sequencing (NGS) have led to a wide dissemination of sequencing technologies and data, to highly diverse research groups. It is expected that clinical sequencing will become part of diagnostic routines shortly. However, limited accessibility to computational infrastructure and high quality bioinformatic tools, and the demand for personnel skilled in data analysis and interpretation remains a serious bottleneck. To this end, the cloud computing and Software-as-a-Service (SaaS) technologies can help address these issues. Results We successfully enabled the Atlas2 Cloud pipeline for personal genome analysis on two different cloud service platforms: a community cloud via the Genboree Workbench, and a commercial cloud via the Amazon Web Services using Software-as-a-Service model. We report a case study of personal genome analysis using our Atlas2 Genboree pipeline. We also outline a detailed cost structure for running Atlas2 Amazon on whole exome capture data, providing cost projections in terms of storage, compute and I/O when running Atlas2 Amazon on a large data set. Conclusions We find that providing a web interface and an optimized pipeline clearly facilitates usage of cloud computing for personal genome analysis, but for it to be routinely used for large scale projects there needs to be a paradigm shift in the way we develop tools, in standard operating procedures, and in funding mechanisms. PMID:23134663

  16. Atlas2 Cloud: a framework for personal genome analysis in the cloud.

    PubMed

    Evani, Uday S; Challis, Danny; Yu, Jin; Jackson, Andrew R; Paithankar, Sameer; Bainbridge, Matthew N; Jakkamsetti, Adinarayana; Pham, Peter; Coarfa, Cristian; Milosavljevic, Aleksandar; Yu, Fuli

    2012-01-01

    Until recently, sequencing has primarily been carried out in large genome centers which have invested heavily in developing the computational infrastructure that enables genomic sequence analysis. The recent advancements in next generation sequencing (NGS) have led to a wide dissemination of sequencing technologies and data, to highly diverse research groups. It is expected that clinical sequencing will become part of diagnostic routines shortly. However, limited accessibility to computational infrastructure and high quality bioinformatic tools, and the demand for personnel skilled in data analysis and interpretation remains a serious bottleneck. To this end, the cloud computing and Software-as-a-Service (SaaS) technologies can help address these issues. We successfully enabled the Atlas2 Cloud pipeline for personal genome analysis on two different cloud service platforms: a community cloud via the Genboree Workbench, and a commercial cloud via the Amazon Web Services using Software-as-a-Service model. We report a case study of personal genome analysis using our Atlas2 Genboree pipeline. We also outline a detailed cost structure for running Atlas2 Amazon on whole exome capture data, providing cost projections in terms of storage, compute and I/O when running Atlas2 Amazon on a large data set. We find that providing a web interface and an optimized pipeline clearly facilitates usage of cloud computing for personal genome analysis, but for it to be routinely used for large scale projects there needs to be a paradigm shift in the way we develop tools, in standard operating procedures, and in funding mechanisms.

  17. Integration, warehousing, and analysis strategies of Omics data.

    PubMed

    Gedela, Srinubabu

    2011-01-01

    "-Omics" is a current suffix for numerous types of large-scale biological data generation procedures, which naturally demand the development of novel algorithms for data storage and analysis. With next generation genome sequencing burgeoning, it is pivotal to decipher a coding site on the genome, a gene's function, and information on transcripts next to the pure availability of sequence information. To explore a genome and downstream molecular processes, we need umpteen results at the various levels of cellular organization by utilizing different experimental designs, data analysis strategies and methodologies. Here comes the need for controlled vocabularies and data integration to annotate, store, and update the flow of experimental data. This chapter explores key methodologies to merge Omics data by semantic data carriers, discusses controlled vocabularies as eXtensible Markup Languages (XML), and provides practical guidance, databases, and software links supporting the integration of Omics data.

  18. Revealing Less Derived Nature of Cartilaginous Fish Genomes with Their Evolutionary Time Scale Inferred with Nuclear Genes

    PubMed Central

    Renz, Adina J.; Meyer, Axel; Kuraku, Shigehiro

    2013-01-01

    Cartilaginous fishes, divided into Holocephali (chimaeras) and Elasmoblanchii (sharks, rays and skates), occupy a key phylogenetic position among extant vertebrates in reconstructing their evolutionary processes. Their accurate evolutionary time scale is indispensable for better understanding of the relationship between phenotypic and molecular evolution of cartilaginous fishes. However, our current knowledge on the time scale of cartilaginous fish evolution largely relies on estimates using mitochondrial DNA sequences. In this study, making the best use of the still partial, but large-scale sequencing data of cartilaginous fish species, we estimate the divergence times between the major cartilaginous fish lineages employing nuclear genes. By rigorous orthology assessment based on available genomic and transcriptomic sequence resources for cartilaginous fishes, we selected 20 protein-coding genes in the nuclear genome, spanning 2973 amino acid residues. Our analysis based on the Bayesian inference resulted in the mean divergence time of 421 Ma, the late Silurian, for the Holocephali-Elasmobranchii split, and 306 Ma, the late Carboniferous, for the split between sharks and rays/skates. By applying these results and other documented divergence times, we measured the relative evolutionary rate of the Hox A cluster sequences in the cartilaginous fish lineages, which resulted in a lower substitution rate with a factor of at least 2.4 in comparison to tetrapod lineages. The obtained time scale enables mapping phenotypic and molecular changes in a quantitative framework. It is of great interest to corroborate the less derived nature of cartilaginous fish at the molecular level as a genome-wide phenomenon. PMID:23825540

  19. Revealing less derived nature of cartilaginous fish genomes with their evolutionary time scale inferred with nuclear genes.

    PubMed

    Renz, Adina J; Meyer, Axel; Kuraku, Shigehiro

    2013-01-01

    Cartilaginous fishes, divided into Holocephali (chimaeras) and Elasmoblanchii (sharks, rays and skates), occupy a key phylogenetic position among extant vertebrates in reconstructing their evolutionary processes. Their accurate evolutionary time scale is indispensable for better understanding of the relationship between phenotypic and molecular evolution of cartilaginous fishes. However, our current knowledge on the time scale of cartilaginous fish evolution largely relies on estimates using mitochondrial DNA sequences. In this study, making the best use of the still partial, but large-scale sequencing data of cartilaginous fish species, we estimate the divergence times between the major cartilaginous fish lineages employing nuclear genes. By rigorous orthology assessment based on available genomic and transcriptomic sequence resources for cartilaginous fishes, we selected 20 protein-coding genes in the nuclear genome, spanning 2973 amino acid residues. Our analysis based on the Bayesian inference resulted in the mean divergence time of 421 Ma, the late Silurian, for the Holocephali-Elasmobranchii split, and 306 Ma, the late Carboniferous, for the split between sharks and rays/skates. By applying these results and other documented divergence times, we measured the relative evolutionary rate of the Hox A cluster sequences in the cartilaginous fish lineages, which resulted in a lower substitution rate with a factor of at least 2.4 in comparison to tetrapod lineages. The obtained time scale enables mapping phenotypic and molecular changes in a quantitative framework. It is of great interest to corroborate the less derived nature of cartilaginous fish at the molecular level as a genome-wide phenomenon.

  20. Rice-Map: a new-generation rice genome browser.

    PubMed

    Wang, Jun; Kong, Lei; Zhao, Shuqi; Zhang, He; Tang, Liang; Li, Zhe; Gu, Xiaocheng; Luo, Jingchu; Gao, Ge

    2011-03-30

    The concurrent release of rice genome sequences for two subspecies (Oryza sativa L. ssp. japonica and Oryza sativa L. ssp. indica) facilitates rice studies at the whole genome level. Since the advent of high-throughput analysis, huge amounts of functional genomics data have been delivered rapidly, making an integrated online genome browser indispensable for scientists to visualize and analyze these data. Based on next-generation web technologies and high-throughput experimental data, we have developed Rice-Map, a novel genome browser for researchers to navigate, analyze and annotate rice genome interactively. More than one hundred annotation tracks (81 for japonica and 82 for indica) have been compiled and loaded into Rice-Map. These pre-computed annotations cover gene models, transcript evidences, expression profiling, epigenetic modifications, inter-species and intra-species homologies, genetic markers and other genomic features. In addition to these pre-computed tracks, registered users can interactively add comments and research notes to Rice-Map as User-Defined Annotation entries. By smoothly scrolling, dragging and zooming, users can browse various genomic features simultaneously at multiple scales. On-the-fly analysis for selected entries could be performed through dedicated bioinformatic analysis platforms such as WebLab and Galaxy. Furthermore, a BioMart-powered data warehouse "Rice Mart" is offered for advanced users to fetch bulk datasets based on complex criteria. Rice-Map delivers abundant up-to-date japonica and indica annotations, providing a valuable resource for both computational and bench biologists. Rice-Map is publicly accessible at http://www.ricemap.org/, with all data available for free downloading.

  1. Network-assisted crop systems genetics: network inference and integrative analysis.

    PubMed

    Lee, Tak; Kim, Hyojin; Lee, Insuk

    2015-04-01

    Although next-generation sequencing (NGS) technology has enabled the decoding of many crop species genomes, most of the underlying genetic components for economically important crop traits remain to be determined. Network approaches have proven useful for the study of the reference plant, Arabidopsis thaliana, and the success of network-based crop genetics will also require the availability of a genome-scale functional networks for crop species. In this review, we discuss how to construct functional networks and elucidate the holistic view of a crop system. The crop gene network then can be used for gene prioritization and the analysis of resequencing-based genome-wide association study (GWAS) data, the amount of which will rapidly grow in the field of crop science in the coming years. Copyright © 2015 Elsevier Ltd. All rights reserved.

  2. Modeling and Simulation of Optimal Resource Management during the Diurnal Cycle in Emiliania huxleyi by Genome-Scale Reconstruction and an Extended Flux Balance Analysis Approach.

    PubMed

    Knies, David; Wittmüß, Philipp; Appel, Sebastian; Sawodny, Oliver; Ederer, Michael; Feuer, Ronny

    2015-10-28

    The coccolithophorid unicellular alga Emiliania huxleyi is known to form large blooms, which have a strong effect on the marine carbon cycle. As a photosynthetic organism, it is subjected to a circadian rhythm due to the changing light conditions throughout the day. For a better understanding of the metabolic processes under these periodically-changing environmental conditions, a genome-scale model based on a genome reconstruction of the E. huxleyi strain CCMP 1516 was created. It comprises 410 reactions and 363 metabolites. Biomass composition is variable based on the differentiation into functional biomass components and storage metabolites. The model is analyzed with a flux balance analysis approach called diurnal flux balance analysis (diuFBA) that was designed for organisms with a circadian rhythm. It allows storage metabolites to accumulate or be consumed over the diurnal cycle, while keeping the structure of a classical FBA problem. A feature of this approach is that the production and consumption of storage metabolites is not defined externally via the biomass composition, but the result of optimal resource management adapted to the diurnally-changing environmental conditions. The model in combination with this approach is able to simulate the variable biomass composition during the diurnal cycle in proximity to literature data.

  3. Comparative genomics of parasitic silkworm microsporidia reveal an association between genome expansion and host adaptation

    PubMed Central

    2013-01-01

    Background Microsporidian Nosema bombycis has received much attention because the pébrine disease of domesticated silkworms results in great economic losses in the silkworm industry. So far, no effective treatment could be found for pébrine. Compared to other known Nosema parasites, N. bombycis can unusually parasitize a broad range of hosts. To gain some insights into the underlying genetic mechanism of pathological ability and host range expansion in this parasite, a comparative genomic approach is conducted. The genome of two Nosema parasites, N. bombycis and N. antheraeae (an obligatory parasite to undomesticated silkworms Antheraea pernyi), were sequenced and compared with their distantly related species, N. ceranae (an obligatory parasite to honey bees). Results Our comparative genomics analysis show that the N. bombycis genome has greatly expanded due to the following three molecular mechanisms: 1) the proliferation of host-derived transposable elements, 2) the acquisition of many horizontally transferred genes from bacteria, and 3) the production of abundnant gene duplications. To our knowledge, duplicated genes derived not only from small-scale events (e.g., tandem duplications) but also from large-scale events (e.g., segmental duplications) have never been seen so abundant in any reported microsporidia genomes. Our relative dating analysis further indicated that these duplication events have arisen recently over very short evolutionary time. Furthermore, several duplicated genes involving in the cytotoxic metabolic pathway were found to undergo positive selection, suggestive of the role of duplicated genes on the adaptive evolution of pathogenic ability. Conclusions Genome expansion is rarely considered as the evolutionary outcome acting on those highly reduced and compact parasitic microsporidian genomes. This study, for the first time, demonstrates that the parasitic genomes can expand, instead of shrink, through several common molecular mechanisms such as gene duplication, horizontal gene transfer, and transposable element expansion. We also showed that the duplicated genes can serve as raw materials for evolutionary innovations possibly contributing to the increase of pathologenic ability. Based on our research, we propose that duplicated genes of N. bombycis should be treated as primary targets for treatment designs against pébrine. PMID:23496955

  4. Missing Data and Influential Sites: Choice of Sites for Phylogenetic Analysis Can Be As Important As Taxon Sampling and Model Choice

    PubMed Central

    Shavit Grievink, Liat; Penny, David; Holland, Barbara R.

    2013-01-01

    Phylogenetic studies based on molecular sequence alignments are expected to become more accurate as the number of sites in the alignments increases. With the advent of genomic-scale data, where alignments have very large numbers of sites, bootstrap values close to 100% and posterior probabilities close to 1 are the norm, suggesting that the number of sites is now seldom a limiting factor on phylogenetic accuracy. This provokes the question, should we be fussy about the sites we choose to include in a genomic-scale phylogenetic analysis? If some sites contain missing data, ambiguous character states, or gaps, then why not just throw them away before conducting the phylogenetic analysis? Indeed, this is exactly the approach taken in many phylogenetic studies. Here, we present an example where the decision on how to treat sites with missing data is of equal importance to decisions on taxon sampling and model choice, and we introduce a graphical method for illustrating this. PMID:23471508

  5. Finding elementary flux modes in metabolic networks based on flux balance analysis and flux coupling analysis: application to the analysis of Escherichia coli metabolism.

    PubMed

    Tabe-Bordbar, Shayan; Marashi, Sayed-Amir

    2013-12-01

    Elementary modes (EMs) are steady-state metabolic flux vectors with minimal set of active reactions. Each EM corresponds to a metabolic pathway. Therefore, studying EMs is helpful for analyzing the production of biotechnologically important metabolites. However, memory requirements for computing EMs may hamper their applicability as, in most genome-scale metabolic models, no EM can be computed due to running out of memory. In this study, we present a method for computing randomly sampled EMs. In this approach, a network reduction algorithm is used for EM computation, which is based on flux balance-based methods. We show that this approach can be used to recover the EMs in the medium- and genome-scale metabolic network models, while the EMs are sampled in an unbiased way. The applicability of such results is shown by computing “estimated” control-effective flux values in Escherichia coli metabolic network.

  6. Effect of amino acid supplementation on titer and glycosylation distribution in hybridoma cell cultures-Systems biology-based interpretation using genome-scale metabolic flux balance model and multivariate data analysis.

    PubMed

    Reimonn, Thomas M; Park, Seo-Young; Agarabi, Cyrus D; Brorson, Kurt A; Yoon, Seongkyu

    2016-09-01

    Genome-scale flux balance analysis (FBA) is a powerful systems biology tool to characterize intracellular reaction fluxes during cell cultures. FBA estimates intracellular reaction rates by optimizing an objective function, subject to the constraints of a metabolic model and media uptake/excretion rates. A dynamic extension to FBA, dynamic flux balance analysis (DFBA), can calculate intracellular reaction fluxes as they change during cell cultures. In a previous study by Read et al. (2013), a series of informed amino acid supplementation experiments were performed on twelve parallel murine hybridoma cell cultures, and this data was leveraged for further analysis (Read et al., Biotechnol Prog. 2013;29:745-753). In order to understand the effects of media changes on the model murine hybridoma cell line, a systems biology approach is applied in the current study. Dynamic flux balance analysis was performed using a genome-scale mouse metabolic model, and multivariate data analysis was used for interpretation. The calculated reaction fluxes were examined using partial least squares and partial least squares discriminant analysis. The results indicate media supplementation increases product yield because it raises nutrient levels extending the growth phase, and the increased cell density allows for greater culture performance. At the same time, the directed supplementation does not change the overall metabolism of the cells. This supports the conclusion that product quality, as measured by glycoform assays, remains unchanged because the metabolism remains in a similar state. Additionally, the DFBA shows that metabolic state varies more at the beginning of the culture but less by the middle of the growth phase, possibly due to stress on the cells during inoculation. © 2016 American Institute of Chemical Engineers Biotechnol. Prog., 32:1163-1173, 2016. © 2016 American Institute of Chemical Engineers.

  7. The USC Epigenome Center.

    PubMed

    Laird, Peter W

    2009-10-01

    The University of Southern California (USC, CA, USA) has a long tradition of excellence in epigenetics. With the recent explosive growth and technological maturation of the field of epigenetics, it became clear that a dedicated high-throughput epigenomic data production facility would be needed to remain at the forefront of epigenetic research. To address this need, USC launched the USC Epigenome Center as the first large-scale center in academics dedicated to epigenomic research. The Center is providing high-throughput data production for large-scale genomic and epigenomic studies, and developing novel analysis tools for epigenomic research. This unique facility promises to be a valuable resource for multidisciplinary research, education and training in genomics, epigenomics, bioinformatics, and translational medicine.

  8. LDSplitDB: a database for studies of meiotic recombination hotspots in MHC using human genomic data.

    PubMed

    Guo, Jing; Chen, Hao; Yang, Peng; Lee, Yew Ti; Wu, Min; Przytycka, Teresa M; Kwoh, Chee Keong; Zheng, Jie

    2018-04-20

    Meiotic recombination happens during the process of meiosis when chromosomes inherited from two parents exchange genetic materials to generate chromosomes in the gamete cells. The recombination events tend to occur in narrow genomic regions called recombination hotspots. Its dysregulation could lead to serious human diseases such as birth defects. Although the regulatory mechanism of recombination events is still unclear, DNA sequence polymorphisms have been found to play crucial roles in the regulation of recombination hotspots. To facilitate the studies of the underlying mechanism, we developed a database named LDSplitDB which provides an integrative and interactive data mining and visualization platform for the genome-wide association studies of recombination hotspots. It contains the pre-computed association maps of the major histocompatibility complex (MHC) region in the 1000 Genomes Project and the HapMap Phase III datasets, and a genome-scale study of the European population from the HapMap Phase II dataset. Besides the recombination profiles, related data of genes, SNPs and different types of epigenetic modifications, which could be associated with meiotic recombination, are provided for comprehensive analysis. To meet the computational requirement of the rapidly increasing population genomics data, we prepared a lookup table of 400 haplotypes for recombination rate estimation using the well-known LDhat algorithm which includes all possible two-locus haplotype configurations. To the best of our knowledge, LDSplitDB is the first large-scale database for the association analysis of human recombination hotspots with DNA sequence polymorphisms. It provides valuable resources for the discovery of the mechanism of meiotic recombination hotspots. The information about MHC in this database could help understand the roles of recombination in human immune system. DATABASE URL: http://histone.scse.ntu.edu.sg/LDSplitDB.

  9. CoryneRegNet: An ontology-based data warehouse of corynebacterial transcription factors and regulatory networks

    PubMed Central

    Baumbach, Jan; Brinkrolf, Karina; Czaja, Lisa F; Rahmann, Sven; Tauch, Andreas

    2006-01-01

    Background The application of DNA microarray technology in post-genomic analysis of bacterial genome sequences has allowed the generation of huge amounts of data related to regulatory networks. This data along with literature-derived knowledge on regulation of gene expression has opened the way for genome-wide reconstruction of transcriptional regulatory networks. These large-scale reconstructions can be converted into in silico models of bacterial cells that allow a systematic analysis of network behavior in response to changing environmental conditions. Description CoryneRegNet was designed to facilitate the genome-wide reconstruction of transcriptional regulatory networks of corynebacteria relevant in biotechnology and human medicine. During the import and integration process of data derived from experimental studies or literature knowledge CoryneRegNet generates links to genome annotations, to identified transcription factors and to the corresponding cis-regulatory elements. CoryneRegNet is based on a multi-layered, hierarchical and modular concept of transcriptional regulation and was implemented by using the relational database management system MySQL and an ontology-based data structure. Reconstructed regulatory networks can be visualized by using the yFiles JAVA graph library. As an application example of CoryneRegNet, we have reconstructed the global transcriptional regulation of a cellular module involved in SOS and stress response of corynebacteria. Conclusion CoryneRegNet is an ontology-based data warehouse that allows a pertinent data management of regulatory interactions along with the genome-scale reconstruction of transcriptional regulatory networks. These models can further be combined with metabolic networks to build integrated models of cellular function including both metabolism and its transcriptional regulation. PMID:16478536

  10. Toward systems metabolic engineering of Aspergillus and Pichia species for the production of chemicals and biofuels.

    PubMed

    Caspeta, Luis; Nielsen, Jens

    2013-05-01

    Recently genome sequence data have become available for Aspergillus and Pichia species of industrial interest. This has stimulated the use of systems biology approaches for large-scale analysis of the molecular and metabolic responses of Aspergillus and Pichia under defined conditions, which has resulted in much new biological information. Case-specific contextualization of this information has been performed using comparative and functional genomic tools. Genomics data are also the basis for constructing genome-scale metabolic models, and these models have helped in the contextualization of knowledge on the fundamental biology of Aspergillus and Pichia species. Furthermore, with the availability of these models, the engineering of Aspergillus and Pichia is moving from traditional approaches, such as random mutagenesis, to a systems metabolic engineering approach. Here we review the recent trends in systems biology of Aspergillus and Pichia species, highlighting the relevance of these developments for systems metabolic engineering of these organisms for the production of hydrolytic enzymes, biofuels and chemicals from biomass. Copyright © 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  11. Application of industrial scale genomics to discovery of therapeutic targets in heart failure.

    PubMed

    Mehraban, F; Tomlinson, J E

    2001-12-01

    In recent years intense activity in both academic and industrial sectors has provided a wealth of information on the human genome with an associated impressive increase in the number of novel gene sequences deposited in sequence data repositories and patent applications. This genomic industrial revolution has transformed the way in which drug target discovery is now approached. In this article we discuss how various differential gene expression (DGE) technologies are being utilized for cardiovascular disease (CVD) drug target discovery. Other approaches such as sequencing cDNA from cardiovascular derived tissues and cells coupled with bioinformatic sequence analysis are used with the aim of identifying novel gene sequences that may be exploited towards target discovery. Additional leverage from gene sequence information is obtained through identification of polymorphisms that may confer disease susceptibility and/or affect drug responsiveness. Pharmacogenomic studies are described wherein gene expression-based techniques are used to evaluate drug response and/or efficacy. Industrial-scale genomics supports and addresses not only novel target gene discovery but also the burgeoning issues in pharmaceutical and clinical cardiovascular medicine relative to polymorphic gene responses.

  12. Combining Genome-Scale Experimental and Computational Methods To Identify Essential Genes in Rhodobacter sphaeroides

    DOE PAGES

    Burger, Brian T.; Imam, Saheed; Scarborough, Matthew J.; ...

    2017-06-06

    Rhodobacter sphaeroides is one of the best-studied alphaproteobacteria from biochemical, genetic, and genomic perspectives. To gain a better systems-level understanding of this organism, we generated a large transposon mutant library and used transposon sequencing (Tn-seq) to identify genes that are essential under several growth conditions. Using newly developed Tn-seq analysis software (TSAS), we identified 493 genes as essential for aerobic growth on a rich medium. We then used the mutant library to identify conditionally essential genes under two laboratory growth conditions, identifying 85 additional genes required for aerobic growth in a minimal medium and 31 additional genes required for photosyntheticmore » growth. In all instances, our analyses confirmed essentiality for many known genes and identified genes not previously considered to be essential. We used the resulting Tn-seq data to refine and improve a genome-scale metabolic network model (GEM) for R. sphaeroides. Together, we demonstrate how genetic, genomic, and computational approaches can be combined to obtain a systems-level understanding of the genetic framework underlying metabolic diversity in bacterial species.« less

  13. Whole genome comparisons of Fragaria, Prunus and Malus reveal different modes of evolution between Rosaceous subfamilies

    PubMed Central

    2012-01-01

    Background Rosaceae include numerous economically important and morphologically diverse species. Comparative mapping between the member species in Rosaceae have indicated some level of synteny. Recently the whole genome of three crop species, peach, apple and strawberry, which belong to different genera of the Rosaceae family, have been sequenced, allowing in-depth comparison of these genomes. Results Our analysis using the whole genome sequences of peach, apple and strawberry identified 1399 orthologous regions between the three genomes, with a mean length of around 100 kb. Each peach chromosome showed major orthology mostly to one strawberry chromosome, but to more than two apple chromosomes, suggesting that the apple genome went through more chromosomal fissions in addition to the whole genome duplication after the divergence of the three genera. However, the distribution of contiguous ancestral regions, identified using the multiple genome rearrangements and ancestors (MGRA) algorithm, suggested that the Fragaria genome went through a greater number of small scale rearrangements compared to the other genomes since they diverged from a common ancestor. Using the contiguous ancestral regions, we reconstructed a hypothetical ancestral genome for the Rosaceae 7 composed of nine chromosomes and propose the evolutionary steps from the ancestral genome to the extant Fragaria, Prunus and Malus genomes. Conclusion Our analysis shows that different modes of evolution may have played major roles in different subfamilies of Rosaceae. The hypothetical ancestral genome of Rosaceae and the evolutionary steps that lead to three different lineages of Rosaceae will facilitate our understanding of plant genome evolution as well as have a practical impact on knowledge transfer among member species of Rosaceae. PMID:22475018

  14. Reconstruction of the genome-scale co-expression network for the Hippo signaling pathway in colorectal cancer.

    PubMed

    Dehghanian, Fariba; Hojati, Zohreh; Hosseinkhan, Nazanin; Mousavian, Zaynab; Masoudi-Nejad, Ali

    2018-05-26

    The Hippo signaling pathway (HSP) has been identified as an essential and complex signaling pathway for tumor suppression that coordinates proliferation, differentiation, cell death, cell growth and stemness. In the present study, we conducted a genome-scale co-expression analysis to reconstruct the HSP in colorectal cancer (CRC). Five key modules were detected through network clustering, and a detailed discussion of two modules containing respectively 18 and 13 over and down-regulated members of HSP was provided. Our results suggest new potential regulatory factors in the HSP. The detected modules also suggest novel genes contributing to CRC. Moreover, differential expression analysis confirmed the differential expression pattern of HSP members and new suggested regulatory factors between tumor and normal samples. These findings can further reveal the importance of HSP in CRC. Copyright © 2018 Elsevier Ltd. All rights reserved.

  15. Single-molecule optical genome mapping of a human HapMap and a colorectal cancer cell line.

    PubMed

    Teo, Audrey S M; Verzotto, Davide; Yao, Fei; Nagarajan, Niranjan; Hillmer, Axel M

    2015-01-01

    Next-generation sequencing (NGS) technologies have changed our understanding of the variability of the human genome. However, the identification of genome structural variations based on NGS approaches with read lengths of 35-300 bases remains a challenge. Single-molecule optical mapping technologies allow the analysis of DNA molecules of up to 2 Mb and as such are suitable for the identification of large-scale genome structural variations, and for de novo genome assemblies when combined with short-read NGS data. Here we present optical mapping data for two human genomes: the HapMap cell line GM12878 and the colorectal cancer cell line HCT116. High molecular weight DNA was obtained by embedding GM12878 and HCT116 cells, respectively, in agarose plugs, followed by DNA extraction under mild conditions. Genomic DNA was digested with KpnI and 310,000 and 296,000 DNA molecules (≥ 150 kb and 10 restriction fragments), respectively, were analyzed per cell line using the Argus optical mapping system. Maps were aligned to the human reference by OPTIMA, a new glocal alignment method. Genome coverage of 6.8× and 5.7× was obtained, respectively; 2.9× and 1.7× more than the coverage obtained with previously available software. Optical mapping allows the resolution of large-scale structural variations of the genome, and the scaffold extension of NGS-based de novo assemblies. OPTIMA is an efficient new alignment method; our optical mapping data provide a resource for genome structure analyses of the human HapMap reference cell line GM12878, and the colorectal cancer cell line HCT116.

  16. Proteinortho: Detection of (Co-)orthologs in large-scale analysis

    PubMed Central

    2011-01-01

    Background Orthology analysis is an important part of data analysis in many areas of bioinformatics such as comparative genomics and molecular phylogenetics. The ever-increasing flood of sequence data, and hence the rapidly increasing number of genomes that can be compared simultaneously, calls for efficient software tools as brute-force approaches with quadratic memory requirements become infeasible in practise. The rapid pace at which new data become available, furthermore, makes it desirable to compute genome-wide orthology relations for a given dataset rather than relying on relations listed in databases. Results The program Proteinortho described here is a stand-alone tool that is geared towards large datasets and makes use of distributed computing techniques when run on multi-core hardware. It implements an extended version of the reciprocal best alignment heuristic. We apply Proteinortho to compute orthologous proteins in the complete set of all 717 eubacterial genomes available at NCBI at the beginning of 2009. We identified thirty proteins present in 99% of all bacterial proteomes. Conclusions Proteinortho significantly reduces the required amount of memory for orthology analysis compared to existing tools, allowing such computations to be performed on off-the-shelf hardware. PMID:21526987

  17. Applicability of SCAR markers to food genomics: olive oil traceability.

    PubMed

    Pafundo, Simona; Agrimonti, Caterina; Maestri, Elena; Marmiroli, Nelson

    2007-07-25

    DNA analysis with molecular markers has opened a shortcut toward a genomic comprehension of complex organisms. The availability of micro-DNA extraction methods, coupled with selective amplification of the smallest extracted fragments with molecular markers, could equally bring a breakthrough in food genomics: the identification of original components in food. Amplified fragment length polymorphisms (AFLPs) have been instrumental in plant genomics because they may allow rapid and reliable analysis of multiple and potentially polymorphic sites. Nevertheless, their direct application to the analysis of DNA extracted from food matrixes is complicated by the low quality of DNA extracted: its high degradation and the presence of inhibitors of enzymatic reactions. The conversion of an AFLP fragment to a robust and specific single-locus PCR-based marker, therefore, could extend the use of molecular markers to large-scale analysis of complex agro-food matrixes. In the present study is reported the development of sequence characterized amplified regions (SCARs) starting from AFLP profiles of monovarietal olive oils analyzed on agarose gel; one of these was used to identify differences among 56 olive cultivars. All the developed markers were purposefully amplified in olive oils to apply them to olive oil traceability.

  18. Genome-scale biological models for industrial microbial systems.

    PubMed

    Xu, Nan; Ye, Chao; Liu, Liming

    2018-04-01

    The primary aims and challenges associated with microbial fermentation include achieving faster cell growth, higher productivity, and more robust production processes. Genome-scale biological models, predicting the formation of an interaction among genetic materials, enzymes, and metabolites, constitute a systematic and comprehensive platform to analyze and optimize the microbial growth and production of biological products. Genome-scale biological models can help optimize microbial growth-associated traits by simulating biomass formation, predicting growth rates, and identifying the requirements for cell growth. With regard to microbial product biosynthesis, genome-scale biological models can be used to design product biosynthetic pathways, accelerate production efficiency, and reduce metabolic side effects, leading to improved production performance. The present review discusses the development of microbial genome-scale biological models since their emergence and emphasizes their pertinent application in improving industrial microbial fermentation of biological products.

  19. Complete Genome Sequence of a New Ruminococcaceae Bacterium Isolated from Anaerobic Biomass Hydrolysis.

    PubMed

    Hahnke, Sarah; Abendroth, Christian; Langer, Thomas; Codoñer, Francisco M; Ramm, Patrice; Porcar, Manuel; Luschnig, Olaf; Klocke, Michael

    2018-04-05

    A new Ruminococcaceae bacterium, strain HV4-5-B5C, participating in the anaerobic digestion of grass, was isolated from a mesophilic two-stage laboratory-scale leach bed biogas system. The draft annotated genome sequence presented in this study and 16S rRNA gene sequence analysis indicated the affiliation of HV4-5-B5C with the family Ruminococcaceae outside recently described genera. Copyright © 2018 Hahnke et al.

  20. A New Perspective on Polyploid Fragaria (Strawberry) Genome Composition Based on Large-Scale, Multi-Locus Phylogenetic Analysis

    PubMed Central

    Yang, Yilong

    2017-01-01

    Abstract The subgenomic compositions of the octoploid (2n = 8× = 56) strawberry (Fragaria) species, including the economically important cultivated species Fragaria x ananassa, have been a topic of long-standing interest. Phylogenomic approaches utilizing next-generation sequencing technologies offer a new window into species relationships and the subgenomic compositions of polyploids. We have conducted a large-scale phylogenetic analysis of Fragaria (strawberry) species using the Fluidigm Access Array system and 454 sequencing platform. About 24 single-copy or low-copy nuclear genes distributed across the genome were amplified and sequenced from 96 genomic DNA samples representing 16 Fragaria species from diploid (2×) to decaploid (10×), including the most extensive sampling of octoploid taxa yet reported. Individual gene trees were constructed by different tree-building methods. Mosaic genomic structures of diploid Fragaria species consisting of sequences at different phylogenetic positions were observed. Our findings support the presence in octoploid species of genetic signatures from at least five diploid ancestors (F. vesca, F. iinumae, F. bucharica, F. viridis, and at least one additional allele contributor of unknown identity), and questions the extent to which distinct subgenomes are preserved over evolutionary time in the allopolyploid Fragaria species. In addition, our data support divergence between the two wild octoploid species, F. virginiana and F. chiloensis. PMID:29045639

  1. Fast flux module detection using matroid theory.

    PubMed

    Reimers, Arne C; Bruggeman, Frank J; Olivier, Brett G; Stougie, Leen

    2015-05-01

    Flux balance analysis (FBA) is one of the most often applied methods on genome-scale metabolic networks. Although FBA uniquely determines the optimal yield, the pathway that achieves this is usually not unique. The analysis of the optimal-yield flux space has been an open challenge. Flux variability analysis is only capturing some properties of the flux space, while elementary mode analysis is intractable due to the enormous number of elementary modes. However, it has been found by Kelk et al. (2012) that the space of optimal-yield fluxes decomposes into flux modules. These decompositions allow a much easier but still comprehensive analysis of the optimal-yield flux space. Using the mathematical definition of module introduced by Müller and Bockmayr (2013b), we discovered useful connections to matroid theory, through which efficient algorithms enable us to compute the decomposition into modules in a few seconds for genome-scale networks. Using that every module can be represented by one reaction that represents its function, in this article, we also present a method that uses this decomposition to visualize the interplay of modules. We expect the new method to replace flux variability analysis in the pipelines for metabolic networks.

  2. An enhanced genome-scale metabolic reconstruction of Streptomyces clavuligerus identifies novel strain improvement strategies.

    PubMed

    Toro, León; Pinilla, Laura; Avignone-Rossa, Claudio; Ríos-Estepa, Rigoberto

    2018-05-01

    In this work, we expanded and updated a genome-scale metabolic model of Streptomyces clavuligerus. The model includes 1021 genes and 1494 biochemical reactions; genome-reaction information was curated and new features related to clavam metabolism and to the biomass synthesis equation were incorporated. The model was validated using experimental data from the literature and simulations were performed to predict cellular growth and clavulanic acid biosynthesis. Flux balance analysis (FBA) showed that limiting concentrations of phosphate and an excess of ammonia accumulation are unfavorable for growth and clavulanic acid biosynthesis. The evaluation of different objective functions for FBA showed that maximization of ATP yields the best predictions for cellular behavior in continuous cultures, while the maximization of growth rate provides better predictions for batch cultures. Through gene essentiality analysis, 130 essential genes were found using a limited in silico media, while 100 essential genes were identified in amino acid-supplemented media. Finally, a strain design was carried out to identify candidate genes to be overexpressed or knocked out so as to maximize antibiotic biosynthesis. Interestingly, potential metabolic engineering targets, identified in this study, have not been tested experimentally.

  3. An integrated approach to reconstructing genome-scale transcriptional regulatory networks

    DOE PAGES

    Imam, Saheed; Noguera, Daniel R.; Donohue, Timothy J.; ...

    2015-02-27

    Transcriptional regulatory networks (TRNs) program cells to dynamically alter their gene expression in response to changing internal or environmental conditions. In this study, we develop a novel workflow for generating large-scale TRN models that integrates comparative genomics data, global gene expression analyses, and intrinsic properties of transcription factors (TFs). An assessment of this workflow using benchmark datasets for the well-studied γ-proteobacterium Escherichia coli showed that it outperforms expression-based inference approaches, having a significantly larger area under the precision-recall curve. Further analysis indicated that this integrated workflow captures different aspects of the E. coli TRN than expression-based approaches, potentially making themmore » highly complementary. We leveraged this new workflow and observations to build a large-scale TRN model for the α-Proteobacterium Rhodobacter sphaeroides that comprises 120 gene clusters, 1211 genes (including 93 TFs), 1858 predicted protein-DNA interactions and 76 DNA binding motifs. We found that ~67% of the predicted gene clusters in this TRN are enriched for functions ranging from photosynthesis or central carbon metabolism to environmental stress responses. We also found that members of many of the predicted gene clusters were consistent with prior knowledge in R. sphaeroides and/or other bacteria. Experimental validation of predictions from this R. sphaeroides TRN model showed that high precision and recall was also obtained for TFs involved in photosynthesis (PpsR), carbon metabolism (RSP_0489) and iron homeostasis (RSP_3341). In addition, this integrative approach enabled generation of TRNs with increased information content relative to R. sphaeroides TRN models built via other approaches. We also show how this approach can be used to simultaneously produce TRN models for each related organism used in the comparative genomics analysis. Our results highlight the advantages of integrating comparative genomics of closely related organisms with gene expression data to assemble large-scale TRN models with high-quality predictions.« less

  4. How life changes itself: the Read-Write (RW) genome.

    PubMed

    Shapiro, James A

    2013-09-01

    The genome has traditionally been treated as a Read-Only Memory (ROM) subject to change by copying errors and accidents. In this review, I propose that we need to change that perspective and understand the genome as an intricately formatted Read-Write (RW) data storage system constantly subject to cellular modifications and inscriptions. Cells operate under changing conditions and are continually modifying themselves by genome inscriptions. These inscriptions occur over three distinct time-scales (cell reproduction, multicellular development and evolutionary change) and involve a variety of different processes at each time scale (forming nucleoprotein complexes, epigenetic formatting and changes in DNA sequence structure). Research dating back to the 1930s has shown that genetic change is the result of cell-mediated processes, not simply accidents or damage to the DNA. This cell-active view of genome change applies to all scales of DNA sequence variation, from point mutations to large-scale genome rearrangements and whole genome duplications (WGDs). This conceptual change to active cell inscriptions controlling RW genome functions has profound implications for all areas of the life sciences. © 2013 Elsevier B.V. All rights reserved.

  5. Lessons learnt on the analysis of large sequence data in animal genomics.

    PubMed

    Biscarini, F; Cozzi, P; Orozco-Ter Wengel, P

    2018-04-06

    The 'omics revolution has made a large amount of sequence data available to researchers and the industry. This has had a profound impact in the field of bioinformatics, stimulating unprecedented advancements in this discipline. Mostly, this is usually looked at from the perspective of human 'omics, in particular human genomics. Plant and animal genomics, however, have also been deeply influenced by next-generation sequencing technologies, with several genomics applications now popular among researchers and the breeding industry. Genomics tends to generate huge amounts of data, and genomic sequence data account for an increasing proportion of big data in biological sciences, due largely to decreasing sequencing and genotyping costs and to large-scale sequencing and resequencing projects. The analysis of big data poses a challenge to scientists, as data gathering currently takes place at a faster pace than does data processing and analysis, and the associated computational burden is increasingly taxing, making even simple manipulation, visualization and transferring of data a cumbersome operation. The time consumed by the processing and analysing of huge data sets may be at the expense of data quality assessment and critical interpretation. Additionally, when analysing lots of data, something is likely to go awry-the software may crash or stop-and it can be very frustrating to track the error. We herein review the most relevant issues related to tackling these challenges and problems, from the perspective of animal genomics, and provide researchers that lack extensive computing experience with guidelines that will help when processing large genomic data sets. © 2018 Stichting International Foundation for Animal Genetics.

  6. PSP: rapid identification of orthologous coding genes under positive selection across multiple closely related prokaryotic genomes.

    PubMed

    Su, Fei; Ou, Hong-Yu; Tao, Fei; Tang, Hongzhi; Xu, Ping

    2013-12-27

    With genomic sequences of many closely related bacterial strains made available by deep sequencing, it is now possible to investigate trends in prokaryotic microevolution. Positive selection is a sub-process of microevolution, in which a particular mutation is favored, causing the allele frequency to continuously shift in one direction. Wide scanning of prokaryotic genomes has shown that positive selection at the molecular level is much more frequent than expected. Genes with significant positive selection may play key roles in bacterial adaption to different environmental pressures. However, selection pressure analyses are computationally intensive and awkward to configure. Here we describe an open access web server, which is designated as PSP (Positive Selection analysis for Prokaryotic genomes) for performing evolutionary analysis on orthologous coding genes, specially designed for rapid comparison of dozens of closely related prokaryotic genomes. Remarkably, PSP facilitates functional exploration at the multiple levels by assignments and enrichments of KO, GO or COG terms. To illustrate this user-friendly tool, we analyzed Escherichia coli and Bacillus cereus genomes and found that several genes, which play key roles in human infection and antibiotic resistance, show significant evidence of positive selection. PSP is freely available to all users without any login requirement at: http://db-mml.sjtu.edu.cn/PSP/. PSP ultimately allows researchers to do genome-scale analysis for evolutionary selection across multiple prokaryotic genomes rapidly and easily, and identify the genes undergoing positive selection, which may play key roles in the interactions of host-pathogen and/or environmental adaptation.

  7. In silico identification and comparative analysis of differentially expressed genes in human and mouse tissues

    PubMed Central

    Pao, Sheng-Ying; Lin, Win-Li; Hwang, Ming-Jing

    2006-01-01

    Background Screening for differentially expressed genes on the genomic scale and comparative analysis of the expression profiles of orthologous genes between species to study gene function and regulation are becoming increasingly feasible. Expressed sequence tags (ESTs) are an excellent source of data for such studies using bioinformatic approaches because of the rich libraries and tremendous amount of data now available in the public domain. However, any large-scale EST-based bioinformatics analysis must deal with the heterogeneous, and often ambiguous, tissue and organ terms used to describe EST libraries. Results To deal with the issue of tissue source, in this work, we carefully screened and organized more than 8 million human and mouse ESTs into 157 human and 108 mouse tissue/organ categories, to which we applied an established statistic test using different thresholds of the p value to identify genes differentially expressed in different tissues. Further analysis of the tissue distribution and level of expression of human and mouse orthologous genes showed that tissue-specific orthologs tended to have more similar expression patterns than those lacking significant tissue specificity. On the other hand, a number of orthologs were found to have significant disparity in their expression profiles, hinting at novel functions, divergent regulation, or new ortholog relationships. Conclusion Comprehensive statistics on the tissue-specific expression of human and mouse genes were obtained in this very large-scale, EST-based analysis. These statistical results have been organized into a database, freely accessible at our website , for easy searching of human and mouse tissue-specific genes and for investigating gene expression profiles in the context of comparative genomics. Comparative analysis showed that, although highly tissue-specific genes tend to exhibit similar expression profiles in human and mouse, there are significant exceptions, indicating that orthologous genes, while sharing basic genomic properties, could result in distinct phenotypes. PMID:16626500

  8. Evaluation of a genome-scale in silico metabolic model for Geobacter metallireducens by using proteomic data from a field biostimulation experiment.

    PubMed

    Fang, Yilin; Wilkins, Michael J; Yabusaki, Steven B; Lipton, Mary S; Long, Philip E

    2012-12-01

    Accurately predicting the interactions between microbial metabolism and the physical subsurface environment is necessary to enhance subsurface energy development, soil and groundwater cleanup, and carbon management. This study was an initial attempt to confirm the metabolic functional roles within an in silico model using environmental proteomic data collected during field experiments. Shotgun global proteomics data collected during a subsurface biostimulation experiment were used to validate a genome-scale metabolic model of Geobacter metallireducens-specifically, the ability of the metabolic model to predict metal reduction, biomass yield, and growth rate under dynamic field conditions. The constraint-based in silico model of G. metallireducens relates an annotated genome sequence to the physiological functions with 697 reactions controlled by 747 enzyme-coding genes. Proteomic analysis showed that 180 of the 637 G. metallireducens proteins detected during the 2008 experiment were associated with specific metabolic reactions in the in silico model. When the field-calibrated Fe(III) terminal electron acceptor process reaction in a reactive transport model for the field experiments was replaced with the genome-scale model, the model predicted that the largest metabolic fluxes through the in silico model reactions generally correspond to the highest abundances of proteins that catalyze those reactions. Central metabolism predicted by the model agrees well with protein abundance profiles inferred from proteomic analysis. Model discrepancies with the proteomic data, such as the relatively low abundances of proteins associated with amino acid transport and metabolism, revealed pathways or flux constraints in the in silico model that could be updated to more accurately predict metabolic processes that occur in the subsurface environment.

  9. A genomic investigation of ecological differentiation between free-living and Drosophila-associated bacteria.

    PubMed

    Winans, Nathan J; Walter, Alec; Chouaia, Bessem; Chaston, John M; Douglas, Angela E; Newell, Peter D

    2017-09-01

    Various bacterial taxa have been identified both in association with animals and in the external environment, but the extent to which related bacteria from the two habitat types are ecologically and evolutionarily distinct is largely unknown. This study investigated the scale and pattern of genetic differentiation between bacteria of the family Acetobacteraceae isolated from the guts of Drosophila fruit flies, plant material and industrial fermentations. Genome-scale analysis of the phylogenetic relationships and predicted functions was conducted on 44 Acetobacteraceae isolates, including newly sequenced genomes from 18 isolates from wild and laboratory Drosophila. Isolates from the external environment and Drosophila could not be assigned to distinct phylogenetic groups, nor are their genomes enriched for any different sets of genes or category of predicted gene functions. In contrast, analysis of bacteria from laboratory Drosophila showed they were genetically distinct in their universal capacity to degrade uric acid (a major nitrogenous waste product of Drosophila) and absence of flagellar motility, while these traits vary among wild Drosophila isolates. Analysis of the competitive fitness of Acetobacter discordant for these traits revealed a significant fitness deficit for bacteria that cannot degrade uric acid in culture with Drosophila. We propose that, for wild populations, frequent cycling of Acetobacter between Drosophila and the external environment prevents genetic differentiation by maintaining selection for traits adaptive in both the gut and external habitats. However, laboratory isolates bear the signs of adaptation to persistent association with the Drosophila host under tightly defined environmental conditions. © 2017 John Wiley & Sons Ltd.

  10. Resources and costs for microbial sequence analysis evaluated using virtual machines and cloud computing.

    PubMed

    Angiuoli, Samuel V; White, James R; Matalka, Malcolm; White, Owen; Fricke, W Florian

    2011-01-01

    The widespread popularity of genomic applications is threatened by the "bioinformatics bottleneck" resulting from uncertainty about the cost and infrastructure needed to meet increasing demands for next-generation sequence analysis. Cloud computing services have been discussed as potential new bioinformatics support systems but have not been evaluated thoroughly. We present benchmark costs and runtimes for common microbial genomics applications, including 16S rRNA analysis, microbial whole-genome shotgun (WGS) sequence assembly and annotation, WGS metagenomics and large-scale BLAST. Sequence dataset types and sizes were selected to correspond to outputs typically generated by small- to midsize facilities equipped with 454 and Illumina platforms, except for WGS metagenomics where sampling of Illumina data was used. Automated analysis pipelines, as implemented in the CloVR virtual machine, were used in order to guarantee transparency, reproducibility and portability across different operating systems, including the commercial Amazon Elastic Compute Cloud (EC2), which was used to attach real dollar costs to each analysis type. We found considerable differences in computational requirements, runtimes and costs associated with different microbial genomics applications. While all 16S analyses completed on a single-CPU desktop in under three hours, microbial genome and metagenome analyses utilized multi-CPU support of up to 120 CPUs on Amazon EC2, where each analysis completed in under 24 hours for less than $60. Representative datasets were used to estimate maximum data throughput on different cluster sizes and to compare costs between EC2 and comparable local grid servers. Although bioinformatics requirements for microbial genomics depend on dataset characteristics and the analysis protocols applied, our results suggests that smaller sequencing facilities (up to three Roche/454 or one Illumina GAIIx sequencer) invested in 16S rRNA amplicon sequencing, microbial single-genome and metagenomics WGS projects can achieve cost-efficient bioinformatics support using CloVR in combination with Amazon EC2 as an alternative to local computing centers.

  11. Resources and Costs for Microbial Sequence Analysis Evaluated Using Virtual Machines and Cloud Computing

    PubMed Central

    Angiuoli, Samuel V.; White, James R.; Matalka, Malcolm; White, Owen; Fricke, W. Florian

    2011-01-01

    Background The widespread popularity of genomic applications is threatened by the “bioinformatics bottleneck” resulting from uncertainty about the cost and infrastructure needed to meet increasing demands for next-generation sequence analysis. Cloud computing services have been discussed as potential new bioinformatics support systems but have not been evaluated thoroughly. Results We present benchmark costs and runtimes for common microbial genomics applications, including 16S rRNA analysis, microbial whole-genome shotgun (WGS) sequence assembly and annotation, WGS metagenomics and large-scale BLAST. Sequence dataset types and sizes were selected to correspond to outputs typically generated by small- to midsize facilities equipped with 454 and Illumina platforms, except for WGS metagenomics where sampling of Illumina data was used. Automated analysis pipelines, as implemented in the CloVR virtual machine, were used in order to guarantee transparency, reproducibility and portability across different operating systems, including the commercial Amazon Elastic Compute Cloud (EC2), which was used to attach real dollar costs to each analysis type. We found considerable differences in computational requirements, runtimes and costs associated with different microbial genomics applications. While all 16S analyses completed on a single-CPU desktop in under three hours, microbial genome and metagenome analyses utilized multi-CPU support of up to 120 CPUs on Amazon EC2, where each analysis completed in under 24 hours for less than $60. Representative datasets were used to estimate maximum data throughput on different cluster sizes and to compare costs between EC2 and comparable local grid servers. Conclusions Although bioinformatics requirements for microbial genomics depend on dataset characteristics and the analysis protocols applied, our results suggests that smaller sequencing facilities (up to three Roche/454 or one Illumina GAIIx sequencer) invested in 16S rRNA amplicon sequencing, microbial single-genome and metagenomics WGS projects can achieve cost-efficient bioinformatics support using CloVR in combination with Amazon EC2 as an alternative to local computing centers. PMID:22028928

  12. Is this the real time for genomics?

    PubMed

    Guarnaccia, Maria; Gentile, Giulia; Alessi, Enrico; Schneider, Claudio; Petralia, Salvatore; Cavallaro, Sebastiano

    2014-01-01

    In the last decades, molecular biology has moved from gene-by-gene analysis to more complex studies using a genome-wide scale. Thanks to high-throughput genomic technologies, such as microarrays and next-generation sequencing, a huge amount of information has been generated, expanding our knowledge on the genetic basis of various diseases. Although some of this information could be transferred to clinical diagnostics, the technologies available are not suitable for this purpose. In this review, we will discuss the drawbacks associated with the use of traditional DNA microarrays in diagnostics, pointing out emerging platforms that could overcome these obstacles and offer a more reproducible, qualitative and quantitative multigenic analysis. New miniaturized and automated devices, called Lab-on-Chip, begin to integrate PCR and microarray on the same platform, offering integrated sample-to-result systems. The introduction of this kind of innovative devices may facilitate the transition of genome-based tests into clinical routine. Copyright © 2014. Published by Elsevier Inc.

  13. Gene expression profiling--Opening the black box of plant ecosystem responses to global change

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Leakey, A.D.B.; Ainsworth, E.A.; Bernard, S.M.

    The use of genomic techniques to address ecological questions is emerging as the field of genomic ecology. Experimentation under environmentally realistic conditions to investigate the molecular response of plants to meaningful changes in growth conditions and ecological interactions is the defining feature of genomic ecology. Since the impact of global change factors on plant performance are mediated by direct effects at the molecular, biochemical and physiological scales, gene expression analysis promises important advances in understanding factors that have previously been consigned to the 'black box' of unknown mechanism. Various tools and approaches are available for assessing gene expression in modelmore » and non-model species as part of global change biology studies. Each approach has its own unique advantages and constraints. A first generation of genomic ecology studies in managed ecosystems and mesocosms have provided a testbed for the approach and have begun to reveal how the experimental design and data analysis of gene expression studies can be tailored for use in an ecological context.« less

  14. Sequence and Analysis of the Tomato JOINTLESS Locus1

    PubMed Central

    Mao, Long; Begum, Dilara; Goff, Stephen A.; Wing, Rod A.

    2001-01-01

    A 119-kb bacterial artificial chromosome from the JOINTLESS locus on the tomato (Lycopersicon esculentum) chromosome 11 contained 15 putative genes. Repetitive sequences in this region include one copia-like LTR retrotransposon, 13 simple sequence repeats, three copies of a novel type III foldback transposon, and four putative short DNA repeats. Database searches showed that the foldback transposon and the short DNA repeats seemed to be associated preferably with genes. The predicted tomato genes were compared with the complete Arabidopsis genome. Eleven out of 15 tomato open reading frames were found to be colinear with segments on five Arabidopsis bacterial artificial chromosome/P1-derived artificial chromosome clones. The synteny patterns, however, did not reveal duplicated segments in Arabidopsis, where over half of the genome is duplicated. Our analysis indicated that the microsynteny between the tomato and Arabidopsis genomes was still conserved at a very small scale but was complicated by the large number of gene families in the Arabidopsis genome. PMID:11457984

  15. Genomic Sequence around Butterfly Wing Development Genes: Annotation and Comparative Analysis

    PubMed Central

    Conceição, Inês C.; Long, Anthony D.; Gruber, Jonathan D.; Beldade, Patrícia

    2011-01-01

    Background Analysis of genomic sequence allows characterization of genome content and organization, and access beyond gene-coding regions for identification of functional elements. BAC libraries, where relatively large genomic regions are made readily available, are especially useful for species without a fully sequenced genome and can increase genomic coverage of phylogenetic and biological diversity. For example, no butterfly genome is yet available despite the unique genetic and biological properties of this group, such as diversified wing color patterns. The evolution and development of these patterns is being studied in a few target species, including Bicyclus anynana, where a whole-genome BAC library allows targeted access to large genomic regions. Methodology/Principal Findings We characterize ∼1.3 Mb of genomic sequence around 11 selected genes expressed in B. anynana developing wings. Extensive manual curation of in silico predictions, also making use of a large dataset of expressed genes for this species, identified repetitive elements and protein coding sequence, and highlighted an expansion of Alcohol dehydrogenase genes. Comparative analysis with orthologous regions of the lepidopteran reference genome allowed assessment of conservation of fine-scale synteny (with detection of new inversions and translocations) and of DNA sequence (with detection of high levels of conservation of non-coding regions around some, but not all, developmental genes). Conclusions The general properties and organization of the available B. anynana genomic sequence are similar to the lepidopteran reference, despite the more than 140 MY divergence. Our results lay the groundwork for further studies of new interesting findings in relation to both coding and non-coding sequence: 1) the Alcohol dehydrogenase expansion with higher similarity between the five tandemly-repeated B. anynana paralogs than with the corresponding B. mori orthologs, and 2) the high conservation of non-coding sequence around the genes wingless and Ecdysone receptor, both involved in multiple developmental processes including wing pattern formation. PMID:21909358

  16. Analysis of Multiallelic CNVs by Emulsion Haplotype Fusion PCR.

    PubMed

    Tyson, Jess; Armour, John A L

    2017-01-01

    Emulsion-fusion PCR recovers long-range sequence information by combining products in cis from individual genomic DNA molecules. Emulsion droplets act as very numerous small reaction chambers in which different PCR products from a single genomic DNA molecule are condensed into short joint products, to unite sequences in cis from widely separated genomic sites. These products can therefore provide information about the arrangement of sequences and variants at a larger scale than established long-read sequencing methods. The method has been useful in defining the phase of variants in haplotypes, the typing of inversions, and determining the configuration of sequence variants in multiallelic CNVs. In this description we outline the rationale for the application of emulsion-fusion PCR methods to the analysis of multiallelic CNVs, and give practical details for our own implementation of the method in that context.

  17. CanvasDB: a local database infrastructure for analysis of targeted- and whole genome re-sequencing projects

    PubMed Central

    Ameur, Adam; Bunikis, Ignas; Enroth, Stefan; Gyllensten, Ulf

    2014-01-01

    CanvasDB is an infrastructure for management and analysis of genetic variants from massively parallel sequencing (MPS) projects. The system stores SNP and indel calls in a local database, designed to handle very large datasets, to allow for rapid analysis using simple commands in R. Functional annotations are included in the system, making it suitable for direct identification of disease-causing mutations in human exome- (WES) or whole-genome sequencing (WGS) projects. The system has a built-in filtering function implemented to simultaneously take into account variant calls from all individual samples. This enables advanced comparative analysis of variant distribution between groups of samples, including detection of candidate causative mutations within family structures and genome-wide association by sequencing. In most cases, these analyses are executed within just a matter of seconds, even when there are several hundreds of samples and millions of variants in the database. We demonstrate the scalability of canvasDB by importing the individual variant calls from all 1092 individuals present in the 1000 Genomes Project into the system, over 4.4 billion SNPs and indels in total. Our results show that canvasDB makes it possible to perform advanced analyses of large-scale WGS projects on a local server. Database URL: https://github.com/UppsalaGenomeCenter/CanvasDB PMID:25281234

  18. CanvasDB: a local database infrastructure for analysis of targeted- and whole genome re-sequencing projects.

    PubMed

    Ameur, Adam; Bunikis, Ignas; Enroth, Stefan; Gyllensten, Ulf

    2014-01-01

    CanvasDB is an infrastructure for management and analysis of genetic variants from massively parallel sequencing (MPS) projects. The system stores SNP and indel calls in a local database, designed to handle very large datasets, to allow for rapid analysis using simple commands in R. Functional annotations are included in the system, making it suitable for direct identification of disease-causing mutations in human exome- (WES) or whole-genome sequencing (WGS) projects. The system has a built-in filtering function implemented to simultaneously take into account variant calls from all individual samples. This enables advanced comparative analysis of variant distribution between groups of samples, including detection of candidate causative mutations within family structures and genome-wide association by sequencing. In most cases, these analyses are executed within just a matter of seconds, even when there are several hundreds of samples and millions of variants in the database. We demonstrate the scalability of canvasDB by importing the individual variant calls from all 1092 individuals present in the 1000 Genomes Project into the system, over 4.4 billion SNPs and indels in total. Our results show that canvasDB makes it possible to perform advanced analyses of large-scale WGS projects on a local server. Database URL: https://github.com/UppsalaGenomeCenter/CanvasDB. © The Author(s) 2014. Published by Oxford University Press.

  19. Comparison of carnivore, omnivore, and herbivore mammalian genomes with a new leopard assembly.

    PubMed

    Kim, Soonok; Cho, Yun Sung; Kim, Hak-Min; Chung, Oksung; Kim, Hyunho; Jho, Sungwoong; Seomun, Hong; Kim, Jeongho; Bang, Woo Young; Kim, Changmu; An, Junghwa; Bae, Chang Hwan; Bhak, Youngjune; Jeon, Sungwon; Yoon, Hyejun; Kim, Yumi; Jun, JeHoon; Lee, HyeJin; Cho, Suan; Uphyrkina, Olga; Kostyria, Aleksey; Goodrich, John; Miquelle, Dale; Roelke, Melody; Lewis, John; Yurchenko, Andrey; Bankevich, Anton; Cho, Juok; Lee, Semin; Edwards, Jeremy S; Weber, Jessica A; Cook, Jo; Kim, Sangsoo; Lee, Hang; Manica, Andrea; Lee, Ilbeum; O'Brien, Stephen J; Bhak, Jong; Yeo, Joo-Hong

    2016-10-11

    There are three main dietary groups in mammals: carnivores, omnivores, and herbivores. Currently, there is limited comparative genomics insight into the evolution of dietary specializations in mammals. Due to recent advances in sequencing technologies, we were able to perform in-depth whole genome analyses of representatives of these three dietary groups. We investigated the evolution of carnivory by comparing 18 representative genomes from across Mammalia with carnivorous, omnivorous, and herbivorous dietary specializations, focusing on Felidae (domestic cat, tiger, lion, cheetah, and leopard), Hominidae, and Bovidae genomes. We generated a new high-quality leopard genome assembly, as well as two wild Amur leopard whole genomes. In addition to a clear contraction in gene families for starch and sucrose metabolism, the carnivore genomes showed evidence of shared evolutionary adaptations in genes associated with diet, muscle strength, agility, and other traits responsible for successful hunting and meat consumption. Additionally, an analysis of highly conserved regions at the family level revealed molecular signatures of dietary adaptation in each of Felidae, Hominidae, and Bovidae. However, unlike carnivores, omnivores and herbivores showed fewer shared adaptive signatures, indicating that carnivores are under strong selective pressure related to diet. Finally, felids showed recent reductions in genetic diversity associated with decreased population sizes, which may be due to the inflexible nature of their strict diet, highlighting their vulnerability and critical conservation status. Our study provides a large-scale family level comparative genomic analysis to address genomic changes associated with dietary specialization. Our genomic analyses also provide useful resources for diet-related genetic and health research.

  20. Development of a database system for mapping insertional mutations onto the mouse genome with large-scale experimental data

    PubMed Central

    2009-01-01

    Background Insertional mutagenesis is an effective method for functional genomic studies in various organisms. It can rapidly generate easily tractable mutations. A large-scale insertional mutagenesis with the piggyBac (PB) transposon is currently performed in mice at the Institute of Developmental Biology and Molecular Medicine (IDM), Fudan University in Shanghai, China. This project is carried out via collaborations among multiple groups overseeing interconnected experimental steps and generates a large volume of experimental data continuously. Therefore, the project calls for an efficient database system for recording, management, statistical analysis, and information exchange. Results This paper presents a database application called MP-PBmice (insertional mutation mapping system of PB Mutagenesis Information Center), which is developed to serve the on-going large-scale PB insertional mutagenesis project. A lightweight enterprise-level development framework Struts-Spring-Hibernate is used here to ensure constructive and flexible support to the application. The MP-PBmice database system has three major features: strict access-control, efficient workflow control, and good expandability. It supports the collaboration among different groups that enter data and exchange information on daily basis, and is capable of providing real time progress reports for the whole project. MP-PBmice can be easily adapted for other large-scale insertional mutation mapping projects and the source code of this software is freely available at http://www.idmshanghai.cn/PBmice. Conclusion MP-PBmice is a web-based application for large-scale insertional mutation mapping onto the mouse genome, implemented with the widely used framework Struts-Spring-Hibernate. This system is already in use by the on-going genome-wide PB insertional mutation mapping project at IDM, Fudan University. PMID:19958505

  1. A Genome-Wide Scan for Breast Cancer Risk Haplotypes among African American Women

    PubMed Central

    Song, Chi; Chen, Gary K.; Millikan, Robert C.; Ambrosone, Christine B.; John, Esther M.; Bernstein, Leslie; Zheng, Wei; Hu, Jennifer J.; Ziegler, Regina G.; Nyante, Sarah; Bandera, Elisa V.; Ingles, Sue A.; Press, Michael F.; Deming, Sandra L.; Rodriguez-Gil, Jorge L.; Chanock, Stephen J.; Wan, Peggy; Sheng, Xin; Pooler, Loreall C.; Van Den Berg, David J.; Le Marchand, Loic; Kolonel, Laurence N.; Henderson, Brian E.; Haiman, Chris A.; Stram, Daniel O.

    2013-01-01

    Genome-wide association studies (GWAS) simultaneously investigating hundreds of thousands of single nucleotide polymorphisms (SNP) have become a powerful tool in the investigation of new disease susceptibility loci. Haplotypes are sometimes thought to be superior to SNPs and are promising in genetic association analyses. The application of genome-wide haplotype analysis, however, is hindered by the complexity of haplotypes themselves and sophistication in computation. We systematically analyzed the haplotype effects for breast cancer risk among 5,761 African American women (3,016 cases and 2,745 controls) using a sliding window approach on the genome-wide scale. Three regions on chromosomes 1, 4 and 18 exhibited moderate haplotype effects. Furthermore, among 21 breast cancer susceptibility loci previously established in European populations, 10p15 and 14q24 are likely to harbor novel haplotype effects. We also proposed a heuristic of determining the significance level and the effective number of independent tests by the permutation analysis on chromosome 22 data. It suggests that the effective number was approximately half of the total (7,794 out of 15,645), thus the half number could serve as a quick reference to evaluating genome-wide significance if a similar sliding window approach of haplotype analysis is adopted in similar populations using similar genotype density. PMID:23468962

  2. EuPathDB: the eukaryotic pathogen genomics database resource

    PubMed Central

    Aurrecoechea, Cristina; Barreto, Ana; Basenko, Evelina Y.; Brestelli, John; Brunk, Brian P.; Cade, Shon; Crouch, Kathryn; Doherty, Ryan; Falke, Dave; Fischer, Steve; Gajria, Bindu; Harb, Omar S.; Heiges, Mark; Hertz-Fowler, Christiane; Hu, Sufen; Iodice, John; Kissinger, Jessica C.; Lawrence, Cris; Li, Wei; Pinney, Deborah F.; Pulman, Jane A.; Roos, David S.; Shanmugasundram, Achchuthan; Silva-Franco, Fatima; Steinbiss, Sascha; Stoeckert, Christian J.; Spruill, Drew; Wang, Haiming; Warrenfeltz, Susanne; Zheng, Jie

    2017-01-01

    The Eukaryotic Pathogen Genomics Database Resource (EuPathDB, http://eupathdb.org) is a collection of databases covering 170+ eukaryotic pathogens (protists & fungi), along with relevant free-living and non-pathogenic species, and select pathogen hosts. To facilitate the discovery of meaningful biological relationships, the databases couple preconfigured searches with visualization and analysis tools for comprehensive data mining via intuitive graphical interfaces and APIs. All data are analyzed with the same workflows, including creation of gene orthology profiles, so data are easily compared across data sets, data types and organisms. EuPathDB is updated with numerous new analysis tools, features, data sets and data types. New tools include GO, metabolic pathway and word enrichment analyses plus an online workspace for analysis of personal, non-public, large-scale data. Expanded data content is mostly genomic and functional genomic data while new data types include protein microarray, metabolic pathways, compounds, quantitative proteomics, copy number variation, and polysomal transcriptomics. New features include consistent categorization of searches, data sets and genome browser tracks; redesigned gene pages; effective integration of alternative transcripts; and a EuPathDB Galaxy instance for private analyses of a user's data. Forthcoming upgrades include user workspaces for private integration of data with existing EuPathDB data and improved integration and presentation of host–pathogen interactions. PMID:27903906

  3. VariantSpark: population scale clustering of genotype information.

    PubMed

    O'Brien, Aidan R; Saunders, Neil F W; Guo, Yi; Buske, Fabian A; Scott, Rodney J; Bauer, Denis C

    2015-12-10

    Genomic information is increasingly used in medical practice giving rise to the need for efficient analysis methodology able to cope with thousands of individuals and millions of variants. The widely used Hadoop MapReduce architecture and associated machine learning library, Mahout, provide the means for tackling computationally challenging tasks. However, many genomic analyses do not fit the Map-Reduce paradigm. We therefore utilise the recently developed SPARK engine, along with its associated machine learning library, MLlib, which offers more flexibility in the parallelisation of population-scale bioinformatics tasks. The resulting tool, VARIANTSPARK provides an interface from MLlib to the standard variant format (VCF), offers seamless genome-wide sampling of variants and provides a pipeline for visualising results. To demonstrate the capabilities of VARIANTSPARK, we clustered more than 3,000 individuals with 80 Million variants each to determine the population structure in the dataset. VARIANTSPARK is 80 % faster than the SPARK-based genome clustering approach, ADAM, the comparable implementation using Hadoop/Mahout, as well as ADMIXTURE, a commonly used tool for determining individual ancestries. It is over 90 % faster than traditional implementations using R and Python. The benefits of speed, resource consumption and scalability enables VARIANTSPARK to open up the usage of advanced, efficient machine learning algorithms to genomic data.

  4. Smoking Gun or Circumstantial Evidence? Comparison of Statistical Learning Methods using Functional Annotations for Prioritizing Risk Variants.

    PubMed

    Gagliano, Sarah A; Ravji, Reena; Barnes, Michael R; Weale, Michael E; Knight, Jo

    2015-08-24

    Although technology has triumphed in facilitating routine genome sequencing, new challenges have been created for the data-analyst. Genome-scale surveys of human variation generate volumes of data that far exceed capabilities for laboratory characterization. By incorporating functional annotations as predictors, statistical learning has been widely investigated for prioritizing genetic variants likely to be associated with complex disease. We compared three published prioritization procedures, which use different statistical learning algorithms and different predictors with regard to the quantity, type and coding. We also explored different combinations of algorithm and annotation set. As an application, we tested which methodology performed best for prioritizing variants using data from a large schizophrenia meta-analysis by the Psychiatric Genomics Consortium. Results suggest that all methods have considerable (and similar) predictive accuracies (AUCs 0.64-0.71) in test set data, but there is more variability in the application to the schizophrenia GWAS. In conclusion, a variety of algorithms and annotations seem to have a similar potential to effectively enrich true risk variants in genome-scale datasets, however none offer more than incremental improvement in prediction. We discuss how methods might be evolved for risk variant prediction to address the impending bottleneck of the new generation of genome re-sequencing studies.

  5. A hybrid computational strategy to address WGS variant analysis in >5000 samples.

    PubMed

    Huang, Zhuoyi; Rustagi, Navin; Veeraraghavan, Narayanan; Carroll, Andrew; Gibbs, Richard; Boerwinkle, Eric; Venkata, Manjunath Gorentla; Yu, Fuli

    2016-09-10

    The decreasing costs of sequencing are driving the need for cost effective and real time variant calling of whole genome sequencing data. The scale of these projects are far beyond the capacity of typical computing resources available with most research labs. Other infrastructures like the cloud AWS environment and supercomputers also have limitations due to which large scale joint variant calling becomes infeasible, and infrastructure specific variant calling strategies either fail to scale up to large datasets or abandon joint calling strategies. We present a high throughput framework including multiple variant callers for single nucleotide variant (SNV) calling, which leverages hybrid computing infrastructure consisting of cloud AWS, supercomputers and local high performance computing infrastructures. We present a novel binning approach for large scale joint variant calling and imputation which can scale up to over 10,000 samples while producing SNV callsets with high sensitivity and specificity. As a proof of principle, we present results of analysis on Cohorts for Heart And Aging Research in Genomic Epidemiology (CHARGE) WGS freeze 3 dataset in which joint calling, imputation and phasing of over 5300 whole genome samples was produced in under 6 weeks using four state-of-the-art callers. The callers used were SNPTools, GATK-HaplotypeCaller, GATK-UnifiedGenotyper and GotCloud. We used Amazon AWS, a 4000-core in-house cluster at Baylor College of Medicine, IBM power PC Blue BioU at Rice and Rhea at Oak Ridge National Laboratory (ORNL) for the computation. AWS was used for joint calling of 180 TB of BAM files, and ORNL and Rice supercomputers were used for the imputation and phasing step. All other steps were carried out on the local compute cluster. The entire operation used 5.2 million core hours and only transferred a total of 6 TB of data across the platforms. Even with increasing sizes of whole genome datasets, ensemble joint calling of SNVs for low coverage data can be accomplished in a scalable, cost effective and fast manner by using heterogeneous computing platforms without compromising on the quality of variants.

  6. Integrated network analysis and effective tools in plant systems biology

    PubMed Central

    Fukushima, Atsushi; Kanaya, Shigehiko; Nishida, Kozo

    2014-01-01

    One of the ultimate goals in plant systems biology is to elucidate the genotype-phenotype relationship in plant cellular systems. Integrated network analysis that combines omics data with mathematical models has received particular attention. Here we focus on the latest cutting-edge computational advances that facilitate their combination. We highlight (1) network visualization tools, (2) pathway analyses, (3) genome-scale metabolic reconstruction, and (4) the integration of high-throughput experimental data and mathematical models. Multi-omics data that contain the genome, transcriptome, proteome, and metabolome and mathematical models are expected to integrate and expand our knowledge of complex plant metabolisms. PMID:25408696

  7. A genomic regulatory network for development

    NASA Technical Reports Server (NTRS)

    Davidson, Eric H.; Rast, Jonathan P.; Oliveri, Paola; Ransick, Andrew; Calestani, Cristina; Yuh, Chiou-Hwa; Minokawa, Takuya; Amore, Gabriele; Hinman, Veronica; Arenas-Mena, Cesar; hide

    2002-01-01

    Development of the body plan is controlled by large networks of regulatory genes. A gene regulatory network that controls the specification of endoderm and mesoderm in the sea urchin embryo is summarized here. The network was derived from large-scale perturbation analyses, in combination with computational methodologies, genomic data, cis-regulatory analysis, and molecular embryology. The network contains over 40 genes at present, and each node can be directly verified at the DNA sequence level by cis-regulatory analysis. Its architecture reveals specific and general aspects of development, such as how given cells generate their ordained fates in the embryo and why the process moves inexorably forward in developmental time.

  8. Inferring transposons activity chronology by TRANScendence - TEs database and de-novo mining tool.

    PubMed

    Startek, Michał Piotr; Nogły, Jakub; Gromadka, Agnieszka; Grzebelus, Dariusz; Gambin, Anna

    2017-10-16

    The constant progress in sequencing technology leads to ever increasing amounts of genomic data. In the light of current evidence transposable elements (TEs for short) are becoming useful tools for learning about the evolution of host genome. Therefore the software for genome-wide detection and analysis of TEs is of great interest. Here we describe the computational tool for mining, classifying and storing TEs from newly sequenced genomes. This is an online, web-based, user-friendly service, enabling users to upload their own genomic data, and perform de-novo searches for TEs. The detected TEs are automatically analyzed, compared to reference databases, annotated, clustered into families, and stored in TEs repository. Also, the genome-wide nesting structure of found elements are detected and analyzed by new method for inferring evolutionary history of TEs. We illustrate the functionality of our tool by performing a full-scale analyses of TE landscape in Medicago truncatula genome. TRANScendence is an effective tool for the de-novo annotation and classification of transposable elements in newly-acquired genomes. Its streamlined interface makes it well-suited for evolutionary studies.

  9. Developing eThread pipeline using SAGA-pilot abstraction for large-scale structural bioinformatics.

    PubMed

    Ragothaman, Anjani; Boddu, Sairam Chowdary; Kim, Nayong; Feinstein, Wei; Brylinski, Michal; Jha, Shantenu; Kim, Joohyun

    2014-01-01

    While most of computational annotation approaches are sequence-based, threading methods are becoming increasingly attractive because of predicted structural information that could uncover the underlying function. However, threading tools are generally compute-intensive and the number of protein sequences from even small genomes such as prokaryotes is large typically containing many thousands, prohibiting their application as a genome-wide structural systems biology tool. To leverage its utility, we have developed a pipeline for eThread--a meta-threading protein structure modeling tool, that can use computational resources efficiently and effectively. We employ a pilot-based approach that supports seamless data and task-level parallelism and manages large variation in workload and computational requirements. Our scalable pipeline is deployed on Amazon EC2 and can efficiently select resources based upon task requirements. We present runtime analysis to characterize computational complexity of eThread and EC2 infrastructure. Based on results, we suggest a pathway to an optimized solution with respect to metrics such as time-to-solution or cost-to-solution. Our eThread pipeline can scale to support a large number of sequences and is expected to be a viable solution for genome-scale structural bioinformatics and structure-based annotation, particularly, amenable for small genomes such as prokaryotes. The developed pipeline is easily extensible to other types of distributed cyberinfrastructure.

  10. Developing eThread Pipeline Using SAGA-Pilot Abstraction for Large-Scale Structural Bioinformatics

    PubMed Central

    Ragothaman, Anjani; Feinstein, Wei; Jha, Shantenu; Kim, Joohyun

    2014-01-01

    While most of computational annotation approaches are sequence-based, threading methods are becoming increasingly attractive because of predicted structural information that could uncover the underlying function. However, threading tools are generally compute-intensive and the number of protein sequences from even small genomes such as prokaryotes is large typically containing many thousands, prohibiting their application as a genome-wide structural systems biology tool. To leverage its utility, we have developed a pipeline for eThread—a meta-threading protein structure modeling tool, that can use computational resources efficiently and effectively. We employ a pilot-based approach that supports seamless data and task-level parallelism and manages large variation in workload and computational requirements. Our scalable pipeline is deployed on Amazon EC2 and can efficiently select resources based upon task requirements. We present runtime analysis to characterize computational complexity of eThread and EC2 infrastructure. Based on results, we suggest a pathway to an optimized solution with respect to metrics such as time-to-solution or cost-to-solution. Our eThread pipeline can scale to support a large number of sequences and is expected to be a viable solution for genome-scale structural bioinformatics and structure-based annotation, particularly, amenable for small genomes such as prokaryotes. The developed pipeline is easily extensible to other types of distributed cyberinfrastructure. PMID:24995285

  11. Morphometrics of Daucus (Apiaceae): A counterpart to a phylogenomic study

    USDA-ARS?s Scientific Manuscript database

    Molecular phylogenetics of genome-scale data sets (phylogenomics) often produces phylogenetic trees with unprecedented resolution. A companion phylogenomics analysis of Daucus (carrots) using 94 conserved nuclear orthologs supported many of the traditional species but showed unexpected results that ...

  12. Combining functional genomics and chemical biology to identify targets of bioactive compounds.

    PubMed

    Ho, Cheuk Hei; Piotrowski, Jeff; Dixon, Scott J; Baryshnikova, Anastasia; Costanzo, Michael; Boone, Charles

    2011-02-01

    Genome sequencing projects have revealed thousands of suspected genes, challenging researchers to develop efficient large-scale functional analysis methodologies. Determining the function of a gene product generally requires a means to alter its function. Genetically tractable model organisms have been widely exploited for the isolation and characterization of activating and inactivating mutations in genes encoding proteins of interest. Chemical genetics represents a complementary approach involving the use of small molecules capable of either inactivating or activating their targets. Saccharomyces cerevisiae has been an important test bed for the development and application of chemical genomic assays aimed at identifying targets and modes of action of known and uncharacterized compounds. Here we review yeast chemical genomic assays strategies for drug target identification. Copyright © 2010 Elsevier Ltd. All rights reserved.

  13. A Simple Measure of the Dynamics of Segmented Genomes: An Application to Influenza

    NASA Astrophysics Data System (ADS)

    Aris-Brosou, Stéphane

    The severity of influenza epidemics, which can potentially become a pandemic, has been very difficult to predict. However, past efforts were focusing on gene-by-gene approaches, while it is acknowledged that the whole genome dynamics contribute to the severity of an epidemic. Here, putting this rationale into action, I describe a simple measure of the amount of reassortment that affects influenza at a genomic scale during a particular year. The analysis of 530 complete genomes of the H1N1 subtype, sampled over eleven years, shows that the proposed measure explains 58% of the variance in the prevalence of H1 influenza in the US population. The proposed measure, denoted nRF, could therefore improve influenza surveillance programs at a minimal cost.

  14. Looking for Trouble: Preventive Genomic Sequencing in the General Population and the Role of Patient Choice

    PubMed Central

    Lázaro-Muñoz, Gabriel; Conley, John M.; Davis, Arlene M.; Van Riper, Marcia; Walker, Rebecca L.; Juengst, Eric T.

    2015-01-01

    Advances in genomics have led to calls for developing population-based preventive genomic sequencing (PGS) programs with the goal of identifying genetic health risks in adults without known risk factors. One critical issue for minimizing the harms and maximizing the benefits of PGS is determining the kind and degree of control individuals should have over the generation, use, and handling of their genomic information. In this article we examine whether PGS programs should offer individuals the opportunity to selectively opt-out of the sequencing or analysis of specific genomic conditions (the menu approach) or whether PGS should be implemented using an all-or-nothing panel approach. We conclude that any responsible scale up of PGS will require a menu approach that may seem impractical to some, but which draws its justification from a rich mix of normative, legal, and practical considerations. PMID:26147254

  15. k-merSNP discovery: Software for alignment-and reference-free scalable SNP discovery, phylogenetics, and annotation for hundreds of microbial genomes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    With the flood of whole genome finished and draft microbial sequences, we need faster, more scalable bioinformatics tools for sequence comparison. An algorithm is described to find single nucleotide polymorphisms (SNPs) in whole genome data. It scales to hundreds of bacterial or viral genomes, and can be used for finished and/or draft genomes available as unassembled contigs or raw, unassembled reads. The method is fast to compute, finding SNPs and building a SNP phylogeny in minutes to hours, depending on the size and diversity of the input sequences. The SNP-based trees that result are consistent with known taxonomy and treesmore » determined in other studies. The approach we describe can handle many gigabases of sequence in a single run. The algorithm is based on k-mer analysis.« less

  16. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gardner, Shea; Slezak, Tom

    With the flood of whole genome finished and draft microbial sequences, we need faster, more scalable bioinformatics tools for sequence comparison. An algorithm is described to find single nucleotide polymorphisms (SNPs) in whole genome data. It scales to hundreds of bacterial or viral genomes, and can be used for finished and/or draft genomes available as unassembled contigs. The method is fast to compute, finding SNPs and building a SNP phylogeny in seconds to hours. We use it to identify thousands of putative SNPs from all publicly available Filoviridae, Poxviridae, foot-and-mouth disease virus, Bacillus, and Escherichia coli genomes and plasmids. Themore » SNP-based trees that result are consistent with known taxonomy and trees determined in other studies. The approach we describe can handle as input hundreds of gigabases of sequence in a single run. The algorithm is based on k-mer analysis using a suffix array, so we call it saSNP.« less

  17. Development of Bioinformatics Infrastructure for Genomics Research.

    PubMed

    Mulder, Nicola J; Adebiyi, Ezekiel; Adebiyi, Marion; Adeyemi, Seun; Ahmed, Azza; Ahmed, Rehab; Akanle, Bola; Alibi, Mohamed; Armstrong, Don L; Aron, Shaun; Ashano, Efejiro; Baichoo, Shakuntala; Benkahla, Alia; Brown, David K; Chimusa, Emile R; Fadlelmola, Faisal M; Falola, Dare; Fatumo, Segun; Ghedira, Kais; Ghouila, Amel; Hazelhurst, Scott; Isewon, Itunuoluwa; Jung, Segun; Kassim, Samar Kamal; Kayondo, Jonathan K; Mbiyavanga, Mamana; Meintjes, Ayton; Mohammed, Somia; Mosaku, Abayomi; Moussa, Ahmed; Muhammd, Mustafa; Mungloo-Dilmohamud, Zahra; Nashiru, Oyekanmi; Odia, Trust; Okafor, Adaobi; Oladipo, Olaleye; Osamor, Victor; Oyelade, Jellili; Sadki, Khalid; Salifu, Samson Pandam; Soyemi, Jumoke; Panji, Sumir; Radouani, Fouzia; Souiai, Oussama; Tastan Bishop, Özlem

    2017-06-01

    Although pockets of bioinformatics excellence have developed in Africa, generally, large-scale genomic data analysis has been limited by the availability of expertise and infrastructure. H3ABioNet, a pan-African bioinformatics network, was established to build capacity specifically to enable H3Africa (Human Heredity and Health in Africa) researchers to analyze their data in Africa. Since the inception of the H3Africa initiative, H3ABioNet's role has evolved in response to changing needs from the consortium and the African bioinformatics community. H3ABioNet set out to develop core bioinformatics infrastructure and capacity for genomics research in various aspects of data collection, transfer, storage, and analysis. Various resources have been developed to address genomic data management and analysis needs of H3Africa researchers and other scientific communities on the continent. NetMap was developed and used to build an accurate picture of network performance within Africa and between Africa and the rest of the world, and Globus Online has been rolled out to facilitate data transfer. A participant recruitment database was developed to monitor participant enrollment, and data is being harmonized through the use of ontologies and controlled vocabularies. The standardized metadata will be integrated to provide a search facility for H3Africa data and biospecimens. Because H3Africa projects are generating large-scale genomic data, facilities for analysis and interpretation are critical. H3ABioNet is implementing several data analysis platforms that provide a large range of bioinformatics tools or workflows, such as Galaxy, the Job Management System, and eBiokits. A set of reproducible, portable, and cloud-scalable pipelines to support the multiple H3Africa data types are also being developed and dockerized to enable execution on multiple computing infrastructures. In addition, new tools have been developed for analysis of the uniquely divergent African data and for downstream interpretation of prioritized variants. To provide support for these and other bioinformatics queries, an online bioinformatics helpdesk backed by broad consortium expertise has been established. Further support is provided by means of various modes of bioinformatics training. For the past 4 years, the development of infrastructure support and human capacity through H3ABioNet, have significantly contributed to the establishment of African scientific networks, data analysis facilities, and training programs. Here, we describe the infrastructure and how it has affected genomics and bioinformatics research in Africa. Copyright © 2017 World Heart Federation (Geneva). Published by Elsevier B.V. All rights reserved.

  18. iCN718, an Updated and Improved Genome-Scale Metabolic Network Reconstruction of Acinetobacter baumannii AYE.

    PubMed

    Norsigian, Charles J; Kavvas, Erol; Seif, Yara; Palsson, Bernhard O; Monk, Jonathan M

    2018-01-01

    Acinetobacter baumannii has become an urgent clinical threat due to the recent emergence of multi-drug resistant strains. There is thus a significant need to discover new therapeutic targets in this organism. One means for doing so is through the use of high-quality genome-scale reconstructions. Well-curated and accurate genome-scale models (GEMs) of A. baumannii would be useful for improving treatment options. We present an updated and improved genome-scale reconstruction of A. baumannii AYE, named iCN718, that improves and standardizes previous A. baumannii AYE reconstructions. iCN718 has 80% accuracy for predicting gene essentiality data and additionally can predict large-scale phenotypic data with as much as 89% accuracy, a new capability for an A. baumannii reconstruction. We further demonstrate that iCN718 can be used to analyze conserved metabolic functions in the A. baumannii core genome and to build strain-specific GEMs of 74 other A. baumannii strains from genome sequence alone. iCN718 will serve as a resource to integrate and synthesize new experimental data being generated for this urgent threat pathogen.

  19. Modeling Neisseria meningitidis metabolism: from genome to metabolic fluxes

    PubMed Central

    Baart, Gino JE; Zomer, Bert; de Haan, Alex; van der Pol, Leo A; Beuvery, E Coen; Tramper, Johannes; Martens, Dirk E

    2007-01-01

    Background Neisseria meningitidis is a human pathogen that can infect diverse sites within the human host. The major diseases caused by N. meningitidis are responsible for death and disability, especially in young infants. In general, most of the recent work on N. meningitidis focuses on potential antigens and their functions, immunogenicity, and pathogenicity mechanisms. Very little work has been carried out on Neisseria primary metabolism over the past 25 years. Results Using the genomic database of N. meningitidis serogroup B together with biochemical and physiological information in the literature we constructed a genome-scale flux model for the primary metabolism of N. meningitidis. The validity of a simplified metabolic network derived from the genome-scale metabolic network was checked using flux-balance analysis in chemostat cultures. Several useful predictions were obtained from in silico experiments, including substrate preference. A minimal medium for growth of N. meningitidis was designed and tested succesfully in batch and chemostat cultures. Conclusion The verified metabolic model describes the primary metabolism of N. meningitidis in a chemostat in steady state. The genome-scale model is valuable because it offers a framework to study N. meningitidis metabolism as a whole, or certain aspects of it, and it can also be used for the purpose of vaccine process development (for example, the design of growth media). The flux distribution of the main metabolic pathways (that is, the pentose phosphate pathway and the Entner-Douderoff pathway) indicates that the major part of pyruvate (69%) is synthesized through the ED-cleavage, a finding that is in good agreement with literature. PMID:17617894

  20. The Psychiatric Genomics Consortium Posttraumatic Stress Disorder Workgroup: Posttraumatic Stress Disorder Enters the Age of Large-Scale Genomic Collaboration

    PubMed Central

    Logue, Mark W; Amstadter, Ananda B; Baker, Dewleen G; Duncan, Laramie; Koenen, Karestan C; Liberzon, Israel; Miller, Mark W; Morey, Rajendra A; Nievergelt, Caroline M; Ressler, Kerry J; Smith, Alicia K; Smoller, Jordan W; Stein, Murray B; Sumner, Jennifer A; Uddin, Monica

    2015-01-01

    The development of posttraumatic stress disorder (PTSD) is influenced by genetic factors. Although there have been some replicated candidates, the identification of risk variants for PTSD has lagged behind genetic research of other psychiatric disorders such as schizophrenia, autism, and bipolar disorder. Psychiatric genetics has moved beyond examination of specific candidate genes in favor of the genome-wide association study (GWAS) strategy of very large numbers of samples, which allows for the discovery of previously unsuspected genes and molecular pathways. The successes of genetic studies of schizophrenia and bipolar disorder have been aided by the formation of a large-scale GWAS consortium: the Psychiatric Genomics Consortium (PGC). In contrast, only a handful of GWAS of PTSD have appeared in the literature to date. Here we describe the formation of a group dedicated to large-scale study of PTSD genetics: the PGC-PTSD. The PGC-PTSD faces challenges related to the contingency on trauma exposure and the large degree of ancestral genetic diversity within and across participating studies. Using the PGC analysis pipeline supplemented by analyses tailored to address these challenges, we anticipate that our first large-scale GWAS of PTSD will comprise over 10 000 cases and 30 000 trauma-exposed controls. Following in the footsteps of our PGC forerunners, this collaboration—of a scope that is unprecedented in the field of traumatic stress—will lead the search for replicable genetic associations and new insights into the biological underpinnings of PTSD. PMID:25904361

  1. Genetic Variation in the Acorn Barnacle from Allozymes to Population Genomics

    PubMed Central

    Flight, Patrick A.; Rand, David M.

    2012-01-01

    Understanding the patterns of genetic variation within and among populations is a central problem in population and evolutionary genetics. We examine this question in the acorn barnacle, Semibalanus balanoides, in which the allozyme loci Mpi and Gpi have been implicated in balancing selection due to varying selective pressures at different spatial scales. We review the patterns of genetic variation at the Mpi locus, compare this to levels of population differentiation at mtDNA and microsatellites, and place these data in the context of genome-wide variation from high-throughput sequencing of population samples spanning the North Atlantic. Despite considerable geographic variation in the patterns of selection at the Mpi allozyme, this locus shows rather low levels of population differentiation at ecological and trans-oceanic scales (FST ∼ 5%). Pooled population sequencing was performed on samples from Rhode Island (RI), Maine (ME), and Southwold, England (UK). Analysis of more than 650 million reads identified approximately 335,000 high-quality SNPs in 19 million base pairs of the S. balanoides genome. Much variation is shared across the Atlantic, but there are significant examples of strong population differentiation among samples from RI, ME, and UK. An FST outlier screen of more than 22,000 contigs provided a genome-wide context for interpretation of earlier studies on allozymes, mtDNA, and microsatellites. FST values for allozymes, mtDNA and microsatellites are close to the genome-wide average for random SNPs, with the exception of the trans-Atlantic FST for mtDNA. The majority of FST outliers were unique between individual pairs of populations, but some genes show shared patterns of excess differentiation. These data indicate that gene flow is high, that selection is strong on a subset of genes, and that a variety of genes are experiencing diversifying selection at large spatial scales. This survey of polymorphism in S. balanoides provides a number of genomic tools that promise to make this a powerful model for ecological genomics of the rocky intertidal. PMID:22767487

  2. Metabolic reconstruction and flux analysis of industrial Pichia yeasts.

    PubMed

    Chung, Bevan Kai-Sheng; Lakshmanan, Meiyappan; Klement, Maximilian; Ching, Chi Bun; Lee, Dong-Yup

    2013-03-01

    Pichia yeasts have been recognized as important microbial cell factories in the biotechnological industry. Notably, the Pichia pastoris and Pichia stipitis species have attracted much research interest due to their unique cellular physiology and metabolic capability: P. pastoris has the ability to utilize methanol for cell growth and recombinant protein production, while P. stipitis is capable of assimilating xylose to produce ethanol under oxygen-limited conditions. To harness these characteristics for biotechnological applications, it is highly required to characterize their metabolic behavior. Recently, following the genome sequencing of these two Pichia species, genome-scale metabolic networks have been reconstructed to model the yeasts' metabolism from a systems perspective. To date, there are three genome-scale models available for each of P. pastoris and P. stipitis. In this mini-review, we provide an overview of the models, discuss certain limitations of previous studies, and propose potential future works that can be conducted to better understand and engineer Pichia yeasts for industrial applications.

  3. Genome-scale modelling of microbial metabolism with temporal and spatial resolution.

    PubMed

    Henson, Michael A

    2015-12-01

    Most natural microbial systems have evolved to function in environments with temporal and spatial variations. A major limitation to understanding such complex systems is the lack of mathematical modelling frameworks that connect the genomes of individual species and temporal and spatial variations in the environment to system behaviour. The goal of this review is to introduce the emerging field of spatiotemporal metabolic modelling based on genome-scale reconstructions of microbial metabolism. The extension of flux balance analysis (FBA) to account for both temporal and spatial variations in the environment is termed spatiotemporal FBA (SFBA). Following a brief overview of FBA and its established dynamic extension, the SFBA problem is introduced and recent progress is described. Three case studies are reviewed to illustrate the current state-of-the-art and possible future research directions are outlined. The author posits that SFBA is the next frontier for microbial metabolic modelling and a rapid increase in methods development and system applications is anticipated. © 2015 Authors; published by Portland Press Limited.

  4. CyanoGEBA: A Better Understanding of Cynobacterial Diversity through Large-Scale Genomics (JGI Seventh Annual User Meeting 2012: Genomics of Energy and Environment)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shih, Patrick

    2012-03-22

    Patrick Shih, representing both the University of California, Berkeley and JGI, gives a talk titled "CyanoGEBA: A Better Understanding of Cynobacterial Diversity through Large-scale Genomics" at the JGI 7th Annual Users Meeting: Genomics of Energy & Environment Meeting on March 22, 2012 in Walnut Creek, California.

  5. CyanoGEBA: A Better Understanding of Cynobacterial Diversity through Large-Scale Genomics (JGI Seventh Annual User Meeting 2012: Genomics of Energy and Environment)

    ScienceCinema

    Shih, Patrick

    2018-01-10

    Patrick Shih, representing both the University of California, Berkeley and JGI, gives a talk titled "CyanoGEBA: A Better Understanding of Cynobacterial Diversity through Large-scale Genomics" at the JGI 7th Annual Users Meeting: Genomics of Energy & Environment Meeting on March 22, 2012 in Walnut Creek, California.

  6. Ensembl comparative genomics resources.

    PubMed

    Herrero, Javier; Muffato, Matthieu; Beal, Kathryn; Fitzgerald, Stephen; Gordon, Leo; Pignatelli, Miguel; Vilella, Albert J; Searle, Stephen M J; Amode, Ridwan; Brent, Simon; Spooner, William; Kulesha, Eugene; Yates, Andrew; Flicek, Paul

    2016-01-01

    Evolution provides the unifying framework with which to understand biology. The coherent investigation of genic and genomic data often requires comparative genomics analyses based on whole-genome alignments, sets of homologous genes and other relevant datasets in order to evaluate and answer evolutionary-related questions. However, the complexity and computational requirements of producing such data are substantial: this has led to only a small number of reference resources that are used for most comparative analyses. The Ensembl comparative genomics resources are one such reference set that facilitates comprehensive and reproducible analysis of chordate genome data. Ensembl computes pairwise and multiple whole-genome alignments from which large-scale synteny, per-base conservation scores and constrained elements are obtained. Gene alignments are used to define Ensembl Protein Families, GeneTrees and homologies for both protein-coding and non-coding RNA genes. These resources are updated frequently and have a consistent informatics infrastructure and data presentation across all supported species. Specialized web-based visualizations are also available including synteny displays, collapsible gene tree plots, a gene family locator and different alignment views. The Ensembl comparative genomics infrastructure is extensively reused for the analysis of non-vertebrate species by other projects including Ensembl Genomes and Gramene and much of the information here is relevant to these projects. The consistency of the annotation across species and the focus on vertebrates makes Ensembl an ideal system to perform and support vertebrate comparative genomic analyses. We use robust software and pipelines to produce reference comparative data and make it freely available. Database URL: http://www.ensembl.org. © The Author(s) 2016. Published by Oxford University Press.

  7. Ensembl comparative genomics resources

    PubMed Central

    Muffato, Matthieu; Beal, Kathryn; Fitzgerald, Stephen; Gordon, Leo; Pignatelli, Miguel; Vilella, Albert J.; Searle, Stephen M. J.; Amode, Ridwan; Brent, Simon; Spooner, William; Kulesha, Eugene; Yates, Andrew; Flicek, Paul

    2016-01-01

    Evolution provides the unifying framework with which to understand biology. The coherent investigation of genic and genomic data often requires comparative genomics analyses based on whole-genome alignments, sets of homologous genes and other relevant datasets in order to evaluate and answer evolutionary-related questions. However, the complexity and computational requirements of producing such data are substantial: this has led to only a small number of reference resources that are used for most comparative analyses. The Ensembl comparative genomics resources are one such reference set that facilitates comprehensive and reproducible analysis of chordate genome data. Ensembl computes pairwise and multiple whole-genome alignments from which large-scale synteny, per-base conservation scores and constrained elements are obtained. Gene alignments are used to define Ensembl Protein Families, GeneTrees and homologies for both protein-coding and non-coding RNA genes. These resources are updated frequently and have a consistent informatics infrastructure and data presentation across all supported species. Specialized web-based visualizations are also available including synteny displays, collapsible gene tree plots, a gene family locator and different alignment views. The Ensembl comparative genomics infrastructure is extensively reused for the analysis of non-vertebrate species by other projects including Ensembl Genomes and Gramene and much of the information here is relevant to these projects. The consistency of the annotation across species and the focus on vertebrates makes Ensembl an ideal system to perform and support vertebrate comparative genomic analyses. We use robust software and pipelines to produce reference comparative data and make it freely available. Database URL: http://www.ensembl.org. PMID:26896847

  8. The genome sequence of the biocontrol fungus Metarhizium anisopliae and comparative genomics of Metarhizium species.

    PubMed

    Pattemore, Julie A; Hane, James K; Williams, Angela H; Wilson, Bree A L; Stodart, Ben J; Ash, Gavin J

    2014-08-07

    Metarhizium anisopliae is an important fungal biocontrol agent of insect pests of agricultural crops. Genomics can aid the successful commercialization of biopesticides by identification of key genes differentiating closely related species, selection of virulent microbial isolates which are amenable to industrial scale production and formulation and through the reduction of phenotypic variability. The genome of Metarhizium isolate ARSEF23 was recently published as a model for M. anisopliae, however phylogenetic analysis has since re-classified this isolate as M. robertsii. We present a new annotated genome sequence of M. anisopliae (isolate Ma69) and whole genome comparison to M. robertsii (ARSEF23) and M. acridum (CQMa 102). Whole genome analysis of M. anisopliae indicates significant macrosynteny with M. robertsii but with some large genomic inversions. In comparison to M. acridum, the genome of M. anisopliae shares lower sequence homology. While alignments overall are co-linear, the genome of M. acridum is not contiguous enough to conclusively observe macrosynteny. Mating type gene analysis revealed both MAT1-1 and MAT1-2 genes present in M. anisopliae suggesting putative homothallism, despite having no known teleomorph, in contrast with the putatively heterothallic M. acridum isolate CQMa 102 (MAT1-2) and M. robertsii isolate ARSEF23 (altered MAT1-1). Repetitive DNA and RIP analysis revealed M. acridum to have twice the repetitive content of the other two species and M. anisopliae to be five times more RIP affected than M. robertsii. We also present an initial bioinformatic survey of candidate pathogenicity genes in M. anisopliae. The annotated genome of M. anisopliae is an important resource for the identification of virulence genes specific to M. anisopliae and development of species- and strain- specific assays. New insight into the possibility of homothallism and RIP affectedness has important implications for the development of M. anisopliae as a biopesticide as it may indicate the potential for greater inherent diversity in this species than the other species. This could present opportunities to select isolates with unique combinations of pathogenicity factors, or it may point to instability in the species, a negative attribute in a biopesticide.

  9. A new tool called DISSECT for analysing large genomic data sets using a Big Data approach

    PubMed Central

    Canela-Xandri, Oriol; Law, Andy; Gray, Alan; Woolliams, John A.; Tenesa, Albert

    2015-01-01

    Large-scale genetic and genomic data are increasingly available and the major bottleneck in their analysis is a lack of sufficiently scalable computational tools. To address this problem in the context of complex traits analysis, we present DISSECT. DISSECT is a new and freely available software that is able to exploit the distributed-memory parallel computational architectures of compute clusters, to perform a wide range of genomic and epidemiologic analyses, which currently can only be carried out on reduced sample sizes or under restricted conditions. We demonstrate the usefulness of our new tool by addressing the challenge of predicting phenotypes from genotype data in human populations using mixed-linear model analysis. We analyse simulated traits from 470,000 individuals genotyped for 590,004 SNPs in ∼4 h using the combined computational power of 8,400 processor cores. We find that prediction accuracies in excess of 80% of the theoretical maximum could be achieved with large sample sizes. PMID:26657010

  10. GAMES identifies and annotates mutations in next-generation sequencing projects.

    PubMed

    Sana, Maria Elena; Iascone, Maria; Marchetti, Daniela; Palatini, Jeff; Galasso, Marco; Volinia, Stefano

    2011-01-01

    Next-generation sequencing (NGS) methods have the potential for changing the landscape of biomedical science, but at the same time pose several problems in analysis and interpretation. Currently, there are many commercial and public software packages that analyze NGS data. However, the limitations of these applications include output which is insufficiently annotated and of difficult functional comprehension to end users. We developed GAMES (Genomic Analysis of Mutations Extracted by Sequencing), a pipeline aiming to serve as an efficient middleman between data deluge and investigators. GAMES attains multiple levels of filtering and annotation, such as aligning the reads to a reference genome, performing quality control and mutational analysis, integrating results with genome annotations and sorting each mismatch/deletion according to a range of parameters. Variations are matched to known polymorphisms. The prediction of functional mutations is achieved by using different approaches. Overall GAMES enables an effective complexity reduction in large-scale DNA-sequencing projects. GAMES is available free of charge to academic users and may be obtained from http://aqua.unife.it/GAMES.

  11. A mixed-integer linear programming approach to the reduction of genome-scale metabolic networks.

    PubMed

    Röhl, Annika; Bockmayr, Alexander

    2017-01-03

    Constraint-based analysis has become a widely used method to study metabolic networks. While some of the associated algorithms can be applied to genome-scale network reconstructions with several thousands of reactions, others are limited to small or medium-sized models. In 2015, Erdrich et al. introduced a method called NetworkReducer, which reduces large metabolic networks to smaller subnetworks, while preserving a set of biological requirements that can be specified by the user. Already in 2001, Burgard et al. developed a mixed-integer linear programming (MILP) approach for computing minimal reaction sets under a given growth requirement. Here we present an MILP approach for computing minimum subnetworks with the given properties. The minimality (with respect to the number of active reactions) is not guaranteed by NetworkReducer, while the method by Burgard et al. does not allow specifying the different biological requirements. Our procedure is about 5-10 times faster than NetworkReducer and can enumerate all minimum subnetworks in case there exist several ones. This allows identifying common reactions that are present in all subnetworks, and reactions appearing in alternative pathways. Applying complex analysis methods to genome-scale metabolic networks is often not possible in practice. Thus it may become necessary to reduce the size of the network while keeping important functionalities. We propose a MILP solution to this problem. Compared to previous work, our approach is more efficient and allows computing not only one, but even all minimum subnetworks satisfying the required properties.

  12. Network Thermodynamic Curation of Human and Yeast Genome-Scale Metabolic Models

    PubMed Central

    Martínez, Verónica S.; Quek, Lake-Ee; Nielsen, Lars K.

    2014-01-01

    Genome-scale models are used for an ever-widening range of applications. Although there has been much focus on specifying the stoichiometric matrix, the predictive power of genome-scale models equally depends on reaction directions. Two-thirds of reactions in the two eukaryotic reconstructions Homo sapiens Recon 1 and Yeast 5 are specified as irreversible. However, these specifications are mainly based on biochemical textbooks or on their similarity to other organisms and are rarely underpinned by detailed thermodynamic analysis. In this study, a to our knowledge new workflow combining network-embedded thermodynamic and flux variability analysis was used to evaluate existing irreversibility constraints in Recon 1 and Yeast 5 and to identify new ones. A total of 27 and 16 new irreversible reactions were identified in Recon 1 and Yeast 5, respectively, whereas only four reactions were found with directions incorrectly specified against thermodynamics (three in Yeast 5 and one in Recon 1). The workflow further identified for both models several isolated internal loops that require further curation. The framework also highlighted the need for substrate channeling (in human) and ATP hydrolysis (in yeast) for the essential reaction catalyzed by phosphoribosylaminoimidazole carboxylase in purine metabolism. Finally, the framework highlighted differences in proline metabolism between yeast (cytosolic anabolism and mitochondrial catabolism) and humans (exclusively mitochondrial metabolism). We conclude that network-embedded thermodynamics facilitates the specification and validation of irreversibility constraints in compartmentalized metabolic models, at the same time providing further insight into network properties. PMID:25028891

  13. Large Scale Comparative Visualisation of Regulatory Networks with TRNDiff

    DOE PAGES

    Chua, Xin-Yi; Buckingham, Lawrence; Hogan, James M.; ...

    2015-06-01

    The advent of Next Generation Sequencing (NGS) technologies has seen explosive growth in genomic datasets, and dense coverage of related organisms, supporting study of subtle, strain-specific variations as a determinant of function. Such data collections present fresh and complex challenges for bioinformatics, those of comparing models of complex relationships across hundreds and even thousands of sequences. Transcriptional Regulatory Network (TRN) structures document the influence of regulatory proteins called Transcription Factors (TFs) on associated Target Genes (TGs). TRNs are routinely inferred from model systems or iterative search, and analysis at these scales requires simultaneous displays of multiple networks well beyond thosemore » of existing network visualisation tools [1]. In this paper we describe TRNDiff, an open source system supporting the comparative analysis and visualization of TRNs (and similarly structured data) from many genomes, allowing rapid identification of functional variations within species. The approach is demonstrated through a small scale multiple TRN analysis of the Fur iron-uptake system of Yersinia, suggesting a number of candidate virulence factors; and through a larger study exploiting integration with the RegPrecise database (http://regprecise.lbl.gov; [2]) - a collection of hundreds of manually curated and predicted transcription factor regulons drawn from across the entire spectrum of prokaryotic organisms.« less

  14. Integrative microbial community analysis reveals full-scale enhanced biological phosphorus removal under tropical conditions

    PubMed Central

    Law, Yingyu; Kirkegaard, Rasmus Hansen; Cokro, Angel Anisa; Liu, Xianghui; Arumugam, Krithika; Xie, Chao; Stokholm-Bjerregaard, Mikkel; Drautz-Moses, Daniela I.; Nielsen, Per Halkjær; Wuertz, Stefan; Williams, Rohan B. H.

    2016-01-01

    Management of phosphorus discharge from human waste is essential for the control of eutrophication in surface waters. Enhanced biological phosphorus removal (EBPR) is a sustainable, efficient way of removing phosphorus from waste water without employing chemical precipitation, but is assumed unachievable in tropical temperatures due to conditions that favour glycogen accumulating organisms (GAOs) over polyphosphate accumulating organisms (PAOs). Here, we show these assumptions are unfounded by studying comparative community dynamics in a full-scale plant following systematic perturbation of operational conditions, which modified community abundance, function and physicochemical state. A statistically significant increase in the relative abundance of the PAO Accumulibacter was associated with improved EBPR activity. GAO relative abundance also increased, challenging the assumption of competition. An Accumulibacter bin-genome was identified from a whole community metagenomic survey, and comparative analysis against extant Accumulibacter genomes suggests a close relationship to Type II. Analysis of the associated metatranscriptome data revealed that genes encoding proteins involved in the tricarboxylic acid cycle and glycolysis pathways were highly expressed, consistent with metabolic modelling results. Our findings show that tropical EBPR is indeed possible, highlight the translational potential of studying competition dynamics in full-scale waste water communities and carry implications for plant design in tropical regions. PMID:27193869

  15. Integrative microbial community analysis reveals full-scale enhanced biological phosphorus removal under tropical conditions

    NASA Astrophysics Data System (ADS)

    Law, Yingyu; Kirkegaard, Rasmus Hansen; Cokro, Angel Anisa; Liu, Xianghui; Arumugam, Krithika; Xie, Chao; Stokholm-Bjerregaard, Mikkel; Drautz-Moses, Daniela I.; Nielsen, Per Halkjær; Wuertz, Stefan; Williams, Rohan B. H.

    2016-05-01

    Management of phosphorus discharge from human waste is essential for the control of eutrophication in surface waters. Enhanced biological phosphorus removal (EBPR) is a sustainable, efficient way of removing phosphorus from waste water without employing chemical precipitation, but is assumed unachievable in tropical temperatures due to conditions that favour glycogen accumulating organisms (GAOs) over polyphosphate accumulating organisms (PAOs). Here, we show these assumptions are unfounded by studying comparative community dynamics in a full-scale plant following systematic perturbation of operational conditions, which modified community abundance, function and physicochemical state. A statistically significant increase in the relative abundance of the PAO Accumulibacter was associated with improved EBPR activity. GAO relative abundance also increased, challenging the assumption of competition. An Accumulibacter bin-genome was identified from a whole community metagenomic survey, and comparative analysis against extant Accumulibacter genomes suggests a close relationship to Type II. Analysis of the associated metatranscriptome data revealed that genes encoding proteins involved in the tricarboxylic acid cycle and glycolysis pathways were highly expressed, consistent with metabolic modelling results. Our findings show that tropical EBPR is indeed possible, highlight the translational potential of studying competition dynamics in full-scale waste water communities and carry implications for plant design in tropical regions.

  16. Inexpensive and Highly Reproducible Cloud-Based Variant Calling of 2,535 Human Genomes

    PubMed Central

    Shringarpure, Suyash S.; Carroll, Andrew; De La Vega, Francisco M.; Bustamante, Carlos D.

    2015-01-01

    Population scale sequencing of whole human genomes is becoming economically feasible; however, data management and analysis remains a formidable challenge for many research groups. Large sequencing studies, like the 1000 Genomes Project, have improved our understanding of human demography and the effect of rare genetic variation in disease. Variant calling on datasets of hundreds or thousands of genomes is time-consuming, expensive, and not easily reproducible given the myriad components of a variant calling pipeline. Here, we describe a cloud-based pipeline for joint variant calling in large samples using the Real Time Genomics population caller. We deployed the population caller on the Amazon cloud with the DNAnexus platform in order to achieve low-cost variant calling. Using our pipeline, we were able to identify 68.3 million variants in 2,535 samples from Phase 3 of the 1000 Genomes Project. By performing the variant calling in a parallel manner, the data was processed within 5 days at a compute cost of $7.33 per sample (a total cost of $18,590 for completed jobs and $21,805 for all jobs). Analysis of cost dependence and running time on the data size suggests that, given near linear scalability, cloud computing can be a cheap and efficient platform for analyzing even larger sequencing studies in the future. PMID:26110529

  17. Distinctive characters of Nostoc genomes in cyanolichens.

    PubMed

    Gagunashvili, Andrey N; Andrésson, Ólafur S

    2018-06-05

    Cyanobacteria of the genus Nostoc are capable of forming symbioses with a wide range of organism, including a diverse assemblage of cyanolichens. Only certain lineages of Nostoc appear to be able to form a close, stable symbiosis, raising the question whether symbiotic competence is determined by specific sets of genes and functionalities. We present the complete genome sequencing, annotation and analysis of two lichen Nostoc strains. Comparison with other Nostoc genomes allowed identification of genes potentially involved in symbioses with a broad range of partners including lichen mycobionts. The presence of additional genes necessary for symbiotic competence is likely reflected in larger genome sizes of symbiotic Nostoc strains. Some of the identified genes are presumably involved in the initial recognition and establishment of the symbiotic association, while others may confer advantage to cyanobionts during cohabitation with a mycobiont in the lichen symbiosis. Our study presents the first genome sequencing and genome-scale analysis of lichen-associated Nostoc strains. These data provide insight into the molecular nature of the cyanolichen symbiosis and pinpoint candidate genes for further studies aimed at deciphering the genetic mechanisms behind the symbiotic competence of Nostoc. Since many phylogenetic studies have shown that Nostoc is a polyphyletic group that includes several lineages, this work also provides an improved molecular basis for demarcation of a Nostoc clade with symbiotic competence.

  18. Evolution and Diversity in Human Herpes Simplex Virus Genomes

    PubMed Central

    Gatherer, Derek; Ochoa, Alejandro; Greenbaum, Benjamin; Dolan, Aidan; Bowden, Rory J.; Enquist, Lynn W.; Legendre, Matthieu; Davison, Andrew J.

    2014-01-01

    Herpes simplex virus 1 (HSV-1) causes a chronic, lifelong infection in >60% of adults. Multiple recent vaccine trials have failed, with viral diversity likely contributing to these failures. To understand HSV-1 diversity better, we comprehensively compared 20 newly sequenced viral genomes from China, Japan, Kenya, and South Korea with six previously sequenced genomes from the United States, Europe, and Japan. In this diverse collection of passaged strains, we found that one-fifth of the newly sequenced members share a gene deletion and one-third exhibit homopolymeric frameshift mutations (HFMs). Individual strains exhibit genotypic and potential phenotypic variation via HFMs, deletions, short sequence repeats, and single-nucleotide polymorphisms, although the protein sequence identity between strains exceeds 90% on average. In the first genome-scale analysis of positive selection in HSV-1, we found signs of selection in specific proteins and residues, including the fusion protein glycoprotein H. We also confirmed previous results suggesting that recombination has occurred with high frequency throughout the HSV-1 genome. Despite this, the HSV-1 strains analyzed clustered by geographic origin during whole-genome distance analysis. These data shed light on likely routes of HSV-1 adaptation to changing environments and will aid in the selection of vaccine antigens that are invariant worldwide. PMID:24227835

  19. Advances in the integration of transcriptional regulatory information into genome-scale metabolic models.

    PubMed

    Vivek-Ananth, R P; Samal, Areejit

    2016-09-01

    A major goal of systems biology is to build predictive computational models of cellular metabolism. Availability of complete genome sequences and wealth of legacy biochemical information has led to the reconstruction of genome-scale metabolic networks in the last 15 years for several organisms across the three domains of life. Due to paucity of information on kinetic parameters associated with metabolic reactions, the constraint-based modelling approach, flux balance analysis (FBA), has proved to be a vital alternative to investigate the capabilities of reconstructed metabolic networks. In parallel, advent of high-throughput technologies has led to the generation of massive amounts of omics data on transcriptional regulation comprising mRNA transcript levels and genome-wide binding profile of transcriptional regulators. A frontier area in metabolic systems biology has been the development of methods to integrate the available transcriptional regulatory information into constraint-based models of reconstructed metabolic networks in order to increase the predictive capabilities of computational models and understand the regulation of cellular metabolism. Here, we review the existing methods to integrate transcriptional regulatory information into constraint-based models of metabolic networks. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.

  20. Genome sequence of the progenitor of wheat A subgenome Triticum urartu.

    PubMed

    Ling, Hong-Qing; Ma, Bin; Shi, Xiaoli; Liu, Hui; Dong, Lingli; Sun, Hua; Cao, Yinghao; Gao, Qiang; Zheng, Shusong; Li, Ye; Yu, Ying; Du, Huilong; Qi, Ming; Li, Yan; Lu, Hongwei; Yu, Hua; Cui, Yan; Wang, Ning; Chen, Chunlin; Wu, Huilan; Zhao, Yan; Zhang, Juncheng; Li, Yiwen; Zhou, Wenjuan; Zhang, Bairu; Hu, Weijuan; van Eijk, Michiel J T; Tang, Jifeng; Witsenboer, Hanneke M A; Zhao, Shancen; Li, Zhensheng; Zhang, Aimin; Wang, Daowen; Liang, Chengzhi

    2018-05-09

    Triticum urartu (diploid, AA) is the progenitor of the A subgenome of tetraploid (Triticum turgidum, AABB) and hexaploid (Triticum aestivum, AABBDD) wheat 1,2 . Genomic studies of T. urartu have been useful for investigating the structure, function and evolution of polyploid wheat genomes. Here we report the generation of a high-quality genome sequence of T. urartu by combining bacterial artificial chromosome (BAC)-by-BAC sequencing, single molecule real-time whole-genome shotgun sequencing 3 , linked reads and optical mapping 4,5 . We assembled seven chromosome-scale pseudomolecules and identified protein-coding genes, and we suggest a model for the evolution of T. urartu chromosomes. Comparative analyses with genomes of other grasses showed gene loss and amplification in the numbers of transposable elements in the T. urartu genome. Population genomics analysis of 147 T. urartu accessions from across the Fertile Crescent showed clustering of three groups, with differences in altitude and biostress, such as powdery mildew disease. The T. urartu genome assembly provides a valuable resource for studying genetic variation in wheat and related grasses, and promises to facilitate the discovery of genes that could be useful for wheat improvement.

  1. WhopGenome: high-speed access to whole-genome variation and sequence data in R.

    PubMed

    Wittelsbürger, Ulrich; Pfeifer, Bastian; Lercher, Martin J

    2015-02-01

    The statistical programming language R has become a de facto standard for the analysis of many types of biological data, and is well suited for the rapid development of new algorithms. However, variant call data from population-scale resequencing projects are typically too large to be read and processed efficiently with R's built-in I/O capabilities. WhopGenome can efficiently read whole-genome variation data stored in the widely used variant call format (VCF) file format into several R data types. VCF files can be accessed either on local hard drives or on remote servers. WhopGenome can associate variants with annotations such as those available from the UCSC genome browser, and can accelerate the reading process by filtering loci according to user-defined criteria. WhopGenome can also read other Tabix-indexed files and create indices to allow fast selective access to FASTA-formatted sequence files. The WhopGenome R package is available on CRAN at http://cran.r-project.org/web/packages/WhopGenome/. A Bioconductor package has been submitted. lercher@cs.uni-duesseldorf.de. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  2. Oncogenomic portals for the visualization and analysis of genome-wide cancer data

    PubMed Central

    Klonowska, Katarzyna; Czubak, Karol; Wojciechowska, Marzena; Handschuh, Luiza; Zmienko, Agnieszka; Figlerowicz, Marek; Dams-Kozlowska, Hanna; Kozlowski, Piotr

    2016-01-01

    Somatically acquired genomic alterations that drive oncogenic cellular processes are of great scientific and clinical interest. Since the initiation of large-scale cancer genomic projects (e.g., the Cancer Genome Project, The Cancer Genome Atlas, and the International Cancer Genome Consortium cancer genome projects), a number of web-based portals have been created to facilitate access to multidimensional oncogenomic data and assist with the interpretation of the data. The portals provide the visualization of small-size mutations, copy number variations, methylation, and gene/protein expression data that can be correlated with the available clinical, epidemiological, and molecular features. Additionally, the portals enable to analyze the gathered data with the use of various user-friendly statistical tools. Herein, we present a highly illustrated review of seven portals, i.e., Tumorscape, UCSC Cancer Genomics Browser, ICGC Data Portal, COSMIC, cBioPortal, IntOGen, and BioProfiling.de. All of the selected portals are user-friendly and can be exploited by scientists from different cancer-associated fields, including those without bioinformatics background. It is expected that the use of the portals will contribute to a better understanding of cancer molecular etiology and will ultimately accelerate the translation of genomic knowledge into clinical practice. PMID:26484415

  3. Oncogenomic portals for the visualization and analysis of genome-wide cancer data.

    PubMed

    Klonowska, Katarzyna; Czubak, Karol; Wojciechowska, Marzena; Handschuh, Luiza; Zmienko, Agnieszka; Figlerowicz, Marek; Dams-Kozlowska, Hanna; Kozlowski, Piotr

    2016-01-05

    Somatically acquired genomic alterations that drive oncogenic cellular processes are of great scientific and clinical interest. Since the initiation of large-scale cancer genomic projects (e.g., the Cancer Genome Project, The Cancer Genome Atlas, and the International Cancer Genome Consortium cancer genome projects), a number of web-based portals have been created to facilitate access to multidimensional oncogenomic data and assist with the interpretation of the data. The portals provide the visualization of small-size mutations, copy number variations, methylation, and gene/protein expression data that can be correlated with the available clinical, epidemiological, and molecular features. Additionally, the portals enable to analyze the gathered data with the use of various user-friendly statistical tools. Herein, we present a highly illustrated review of seven portals, i.e., Tumorscape, UCSC Cancer Genomics Browser, ICGC Data Portal, COSMIC, cBioPortal, IntOGen, and BioProfiling.de. All of the selected portals are user-friendly and can be exploited by scientists from different cancer-associated fields, including those without bioinformatics background. It is expected that the use of the portals will contribute to a better understanding of cancer molecular etiology and will ultimately accelerate the translation of genomic knowledge into clinical practice.

  4. Genome-wide association meta-analysis in 269,867 individuals identifies new genetic and functional links to intelligence.

    PubMed

    Savage, Jeanne E; Jansen, Philip R; Stringer, Sven; Watanabe, Kyoko; Bryois, Julien; de Leeuw, Christiaan A; Nagel, Mats; Awasthi, Swapnil; Barr, Peter B; Coleman, Jonathan R I; Grasby, Katrina L; Hammerschlag, Anke R; Kaminski, Jakob A; Karlsson, Robert; Krapohl, Eva; Lam, Max; Nygaard, Marianne; Reynolds, Chandra A; Trampush, Joey W; Young, Hannah; Zabaneh, Delilah; Hägg, Sara; Hansell, Narelle K; Karlsson, Ida K; Linnarsson, Sten; Montgomery, Grant W; Muñoz-Manchado, Ana B; Quinlan, Erin B; Schumann, Gunter; Skene, Nathan G; Webb, Bradley T; White, Tonya; Arking, Dan E; Avramopoulos, Dimitrios; Bilder, Robert M; Bitsios, Panos; Burdick, Katherine E; Cannon, Tyrone D; Chiba-Falek, Ornit; Christoforou, Andrea; Cirulli, Elizabeth T; Congdon, Eliza; Corvin, Aiden; Davies, Gail; Deary, Ian J; DeRosse, Pamela; Dickinson, Dwight; Djurovic, Srdjan; Donohoe, Gary; Conley, Emily Drabant; Eriksson, Johan G; Espeseth, Thomas; Freimer, Nelson A; Giakoumaki, Stella; Giegling, Ina; Gill, Michael; Glahn, David C; Hariri, Ahmad R; Hatzimanolis, Alex; Keller, Matthew C; Knowles, Emma; Koltai, Deborah; Konte, Bettina; Lahti, Jari; Le Hellard, Stephanie; Lencz, Todd; Liewald, David C; London, Edythe; Lundervold, Astri J; Malhotra, Anil K; Melle, Ingrid; Morris, Derek; Need, Anna C; Ollier, William; Palotie, Aarno; Payton, Antony; Pendleton, Neil; Poldrack, Russell A; Räikkönen, Katri; Reinvang, Ivar; Roussos, Panos; Rujescu, Dan; Sabb, Fred W; Scult, Matthew A; Smeland, Olav B; Smyrnis, Nikolaos; Starr, John M; Steen, Vidar M; Stefanis, Nikos C; Straub, Richard E; Sundet, Kjetil; Tiemeier, Henning; Voineskos, Aristotle N; Weinberger, Daniel R; Widen, Elisabeth; Yu, Jin; Abecasis, Goncalo; Andreassen, Ole A; Breen, Gerome; Christiansen, Lene; Debrabant, Birgit; Dick, Danielle M; Heinz, Andreas; Hjerling-Leffler, Jens; Ikram, M Arfan; Kendler, Kenneth S; Martin, Nicholas G; Medland, Sarah E; Pedersen, Nancy L; Plomin, Robert; Polderman, Tinca J C; Ripke, Stephan; van der Sluis, Sophie; Sullivan, Patrick F; Vrieze, Scott I; Wright, Margaret J; Posthuma, Danielle

    2018-06-25

    Intelligence is highly heritable 1 and a major determinant of human health and well-being 2 . Recent genome-wide meta-analyses have identified 24 genomic loci linked to variation in intelligence 3-7 , but much about its genetic underpinnings remains to be discovered. Here, we present a large-scale genetic association study of intelligence (n = 269,867), identifying 205 associated genomic loci (190 new) and 1,016 genes (939 new) via positional mapping, expression quantitative trait locus (eQTL) mapping, chromatin interaction mapping, and gene-based association analysis. We find enrichment of genetic effects in conserved and coding regions and associations with 146 nonsynonymous exonic variants. Associated genes are strongly expressed in the brain, specifically in striatal medium spiny neurons and hippocampal pyramidal neurons. Gene set analyses implicate pathways related to nervous system development and synaptic structure. We confirm previous strong genetic correlations with multiple health-related outcomes, and Mendelian randomization analysis results suggest protective effects of intelligence for Alzheimer's disease and ADHD and bidirectional causation with pleiotropic effects for schizophrenia. These results are a major step forward in understanding the neurobiology of cognitive function as well as genetically related neurological and psychiatric disorders.

  5. Genome-wide population structure and evolutionary history of the Frizarta dairy sheep.

    PubMed

    Kominakis, A; Hager-Theodorides, A L; Saridaki, A; Antonakos, G; Tsiamis, G

    2017-10-01

    In the present study, we used genomic data, generated with a medium density single nucleotide polymorphisms (SNP) array, to acquire more information on the population structure and evolutionary history of the synthetic Frizarta dairy sheep. First, two typical measures of linkage disequilibrium (LD) were estimated at various physical distances that were then used to make inferences on the effective population size at key past time points. Population structure was also assessed by both multidimensional scaling analysis and k-means clustering on the distance matrix obtained from the animals' genomic relationships. The Wright's fixation F ST index was also employed to assess herds' genetic homogeneity and to indirectly estimate past migration rates. The Wright's fixation F IS index and genomic inbreeding coefficients based on the genomic relationship matrix as well as on runs of homozygosity were also estimated. The Frizarta breed displays relatively low LD levels with r 2 and |D'| equal to 0.18 and 0.50, respectively, at an average inter-marker distance of 31 kb. Linkage disequilibrium decayed rapidly by distance and persisted over just a few thousand base pairs. Rate of LD decay (β) varied widely among the 26 autosomes with larger values estimated for shorter chromosomes (e.g. β=0.057, for OAR6) and smaller values for longer ones (e.g. β=0.022, for OAR2). The inferred effective population size at the beginning of the breed's formation was as high as 549, was then reduced to 463 in 1981 (end of the breed's formation) and further declined to 187, one generation ago. Multidimensional scaling analysis and k-means clustering suggested a genetically homogenous population, F ST estimates indicated relatively low genetic differentiation between herds, whereas a heat map of the animals' genomic kinship relationships revealed a stratified population, at a herd level. Estimates of genomic inbreeding coefficients suggested that most recent parental relatedness may have been a major determinant of the current effective population size. A denser than the 50k SNP panel may be more beneficial when performing genome wide association studies in the breed.

  6. Large-Scale Cognitive GWAS Meta-Analysis Reveals Tissue-Specific Neural Expression and Potential Nootropic Drug Targets.

    PubMed

    Lam, Max; Trampush, Joey W; Yu, Jin; Knowles, Emma; Davies, Gail; Liewald, David C; Starr, John M; Djurovic, Srdjan; Melle, Ingrid; Sundet, Kjetil; Christoforou, Andrea; Reinvang, Ivar; DeRosse, Pamela; Lundervold, Astri J; Steen, Vidar M; Espeseth, Thomas; Räikkönen, Katri; Widen, Elisabeth; Palotie, Aarno; Eriksson, Johan G; Giegling, Ina; Konte, Bettina; Roussos, Panos; Giakoumaki, Stella; Burdick, Katherine E; Payton, Antony; Ollier, William; Chiba-Falek, Ornit; Attix, Deborah K; Need, Anna C; Cirulli, Elizabeth T; Voineskos, Aristotle N; Stefanis, Nikos C; Avramopoulos, Dimitrios; Hatzimanolis, Alex; Arking, Dan E; Smyrnis, Nikolaos; Bilder, Robert M; Freimer, Nelson A; Cannon, Tyrone D; London, Edythe; Poldrack, Russell A; Sabb, Fred W; Congdon, Eliza; Conley, Emily Drabant; Scult, Matthew A; Dickinson, Dwight; Straub, Richard E; Donohoe, Gary; Morris, Derek; Corvin, Aiden; Gill, Michael; Hariri, Ahmad R; Weinberger, Daniel R; Pendleton, Neil; Bitsios, Panos; Rujescu, Dan; Lahti, Jari; Le Hellard, Stephanie; Keller, Matthew C; Andreassen, Ole A; Deary, Ian J; Glahn, David C; Malhotra, Anil K; Lencz, Todd

    2017-11-28

    Here, we present a large (n = 107,207) genome-wide association study (GWAS) of general cognitive ability ("g"), further enhanced by combining results with a large-scale GWAS of educational attainment. We identified 70 independent genomic loci associated with general cognitive ability. Results showed significant enrichment for genes causing Mendelian disorders with an intellectual disability phenotype. Competitive pathway analysis implicated the biological processes of neurogenesis and synaptic regulation, as well as the gene targets of two pharmacologic agents: cinnarizine, a T-type calcium channel blocker, and LY97241, a potassium channel inhibitor. Transcriptome-wide and epigenome-wide analysis revealed that the implicated loci were enriched for genes expressed across all brain regions (most strongly in the cerebellum). Enrichment was exclusive to genes expressed in neurons but not oligodendrocytes or astrocytes. Finally, we report genetic correlations between cognitive ability and disparate phenotypes including psychiatric disorders, several autoimmune disorders, longevity, and maternal age at first birth. Copyright © 2017 The Author(s). Published by Elsevier Inc. All rights reserved.

  7. Ethanol production improvement driven by genome-scale metabolic modeling and sensitivity analysis in Scheffersomyces stipitis

    PubMed Central

    2017-01-01

    The yeast Scheffersomyces stipitis naturally produces ethanol from xylose, however reaching high ethanol yields is strongly dependent on aeration conditions. It has been reported that changes in the availability of NAD(H/+) cofactors can improve fermentation in some microorganisms. In this work genome-scale metabolic modeling and phenotypic phase plane analysis were used to characterize metabolic response on a range of uptake rates. Sensitivity analysis was used to assess the effect of ARC on ethanol production indicating that modifying ARC by inhibiting the respiratory chain ethanol production can be improved. It was shown experimentally in batch culture using Rotenone as an inhibitor of the mitochondrial NADH dehydrogenase complex I (CINADH), increasing ethanol yield by 18%. Furthermore, trajectories for uptakes rates, specific productivity and specific growth rate were determined by modeling the batch culture, to calculate ARC associated to the addition of CINADH inhibitor. Results showed that the increment in ethanol production via respiratory inhibition is due to excess in ARC, which generates an increase in ethanol production. Thus ethanol production improvement could be predicted by a change in ARC. PMID:28658270

  8. Ethanol production improvement driven by genome-scale metabolic modeling and sensitivity analysis in Scheffersomyces stipitis.

    PubMed

    Acevedo, Alejandro; Conejeros, Raúl; Aroca, Germán

    2017-01-01

    The yeast Scheffersomyces stipitis naturally produces ethanol from xylose, however reaching high ethanol yields is strongly dependent on aeration conditions. It has been reported that changes in the availability of NAD(H/+) cofactors can improve fermentation in some microorganisms. In this work genome-scale metabolic modeling and phenotypic phase plane analysis were used to characterize metabolic response on a range of uptake rates. Sensitivity analysis was used to assess the effect of ARC on ethanol production indicating that modifying ARC by inhibiting the respiratory chain ethanol production can be improved. It was shown experimentally in batch culture using Rotenone as an inhibitor of the mitochondrial NADH dehydrogenase complex I (CINADH), increasing ethanol yield by 18%. Furthermore, trajectories for uptakes rates, specific productivity and specific growth rate were determined by modeling the batch culture, to calculate ARC associated to the addition of CINADH inhibitor. Results showed that the increment in ethanol production via respiratory inhibition is due to excess in ARC, which generates an increase in ethanol production. Thus ethanol production improvement could be predicted by a change in ARC.

  9. A New Perspective on Polyploid Fragaria (Strawberry) Genome Composition Based on Large-Scale, Multi-Locus Phylogenetic Analysis.

    PubMed

    Yang, Yilong; Davis, Thomas M

    2017-12-01

    The subgenomic compositions of the octoploid (2n = 8× = 56) strawberry (Fragaria) species, including the economically important cultivated species Fragaria x ananassa, have been a topic of long-standing interest. Phylogenomic approaches utilizing next-generation sequencing technologies offer a new window into species relationships and the subgenomic compositions of polyploids. We have conducted a large-scale phylogenetic analysis of Fragaria (strawberry) species using the Fluidigm Access Array system and 454 sequencing platform. About 24 single-copy or low-copy nuclear genes distributed across the genome were amplified and sequenced from 96 genomic DNA samples representing 16 Fragaria species from diploid (2×) to decaploid (10×), including the most extensive sampling of octoploid taxa yet reported. Individual gene trees were constructed by different tree-building methods. Mosaic genomic structures of diploid Fragaria species consisting of sequences at different phylogenetic positions were observed. Our findings support the presence in octoploid species of genetic signatures from at least five diploid ancestors (F. vesca, F. iinumae, F. bucharica, F. viridis, and at least one additional allele contributor of unknown identity), and questions the extent to which distinct subgenomes are preserved over evolutionary time in the allopolyploid Fragaria species. In addition, our data support divergence between the two wild octoploid species, F. virginiana and F. chiloensis. © The Author 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

  10. FGWAS: Functional genome wide association analysis.

    PubMed

    Huang, Chao; Thompson, Paul; Wang, Yalin; Yu, Yang; Zhang, Jingwen; Kong, Dehan; Colen, Rivka R; Knickmeyer, Rebecca C; Zhu, Hongtu

    2017-10-01

    Functional phenotypes (e.g., subcortical surface representation), which commonly arise in imaging genetic studies, have been used to detect putative genes for complexly inherited neuropsychiatric and neurodegenerative disorders. However, existing statistical methods largely ignore the functional features (e.g., functional smoothness and correlation). The aim of this paper is to develop a functional genome-wide association analysis (FGWAS) framework to efficiently carry out whole-genome analyses of functional phenotypes. FGWAS consists of three components: a multivariate varying coefficient model, a global sure independence screening procedure, and a test procedure. Compared with the standard multivariate regression model, the multivariate varying coefficient model explicitly models the functional features of functional phenotypes through the integration of smooth coefficient functions and functional principal component analysis. Statistically, compared with existing methods for genome-wide association studies (GWAS), FGWAS can substantially boost the detection power for discovering important genetic variants influencing brain structure and function. Simulation studies show that FGWAS outperforms existing GWAS methods for searching sparse signals in an extremely large search space, while controlling for the family-wise error rate. We have successfully applied FGWAS to large-scale analysis of data from the Alzheimer's Disease Neuroimaging Initiative for 708 subjects, 30,000 vertices on the left and right hippocampal surfaces, and 501,584 SNPs. Copyright © 2017 Elsevier Inc. All rights reserved.

  11. Multi-scale structural community organisation of the human genome.

    PubMed

    Boulos, Rasha E; Tremblay, Nicolas; Arneodo, Alain; Borgnat, Pierre; Audit, Benjamin

    2017-04-11

    Structural interaction frequency matrices between all genome loci are now experimentally achievable thanks to high-throughput chromosome conformation capture technologies. This ensues a new methodological challenge for computational biology which consists in objectively extracting from these data the structural motifs characteristic of genome organisation. We deployed the fast multi-scale community mining algorithm based on spectral graph wavelets to characterise the networks of intra-chromosomal interactions in human cell lines. We observed that there exist structural domains of all sizes up to chromosome length and demonstrated that the set of structural communities forms a hierarchy of chromosome segments. Hence, at all scales, chromosome folding predominantly involves interactions between neighbouring sites rather than the formation of links between distant loci. Multi-scale structural decomposition of human chromosomes provides an original framework to question structural organisation and its relationship to functional regulation across the scales. By construction the proposed methodology is independent of the precise assembly of the reference genome and is thus directly applicable to genomes whose assembly is not fully determined.

  12. Substantial genome synteny preservation among woody angiosperm species: comparative genomics of Chinese chestnut (Castanea mollissima) and plant reference genomes.

    PubMed

    Staton, Margaret; Zhebentyayeva, Tetyana; Olukolu, Bode; Fang, Guang Chen; Nelson, Dana; Carlson, John E; Abbott, Albert G

    2015-10-05

    Chinese chestnut (Castanea mollissima) has emerged as a model species for the Fagaceae family with extensive genomic resources including a physical map, a dense genetic map and quantitative trait loci (QTLs) for chestnut blight resistance. These resources enable comparative genomics analyses relative to model plants. We assessed the degree of conservation between the chestnut genome and other well annotated and assembled plant genomic sequences, focusing on the QTL regions of most interest to the chestnut breeding community. The integrated physical and genetic map of Chinese chestnut has been improved to now include 858 shared sequence-based markers. The utility of the integrated map has also been improved through the addition of 42,970 BAC (bacterial artificial chromosome) end sequences spanning over 26 million bases of the estimated 800 Mb chestnut genome. Synteny between chestnut and ten model plant species was conducted on a macro-syntenic scale using sequences from both individual probes and BAC end sequences across the chestnut physical map. Blocks of synteny with chestnut were found in all ten reference species, with the percent of the chestnut physical map that could be aligned ranging from 10 to 39 %. The integrated genetic and physical map was utilized to identify BACs that spanned the three previously identified QTL regions conferring blight resistance. The clones were pooled and sequenced, yielding 396 sequence scaffolds covering 13.9 Mbp. Comparative genomic analysis on a microsytenic scale, using the QTL-associated genomic sequence, identified synteny from chestnut to other plant genomes ranging from 5.4 to 12.9 % of the genome sequences aligning. On both the macro- and micro-synteny levels, the peach, grape and poplar genomes were found to be the most structurally conserved with chestnut. Interestingly, these results did not strictly follow the expectation that decreased phylogenetic distance would correspond to increased levels of genome preservation, but rather suggest the additional influence of life-history traits on preservation of synteny. The regions of synteny that were detected provide an important tool for defining and cataloging genes in the QTL regions for advancing chestnut blight resistance research.

  13. Genomes, free radicals and plant cell invasion: recent developments in plant pathogenic fungi.

    PubMed

    Egan, Martin J; Talbot, Nicholas J

    2008-08-01

    This review describes current advances in our understanding of fungal-plant interactions. The widespread application of whole genome sequencing to a diverse range of fungal species has allowed new insight into the evolution of fungal pathogenesis and the definition of the gene inventories associated with important plant pathogens. This has also led to functional genomic approaches to carry out large-scale gene functional analysis. There has also been significant progress in understanding appressorium-mediated plant infection by fungi and its underlying genetic basis. The nature of biotrophic proliferation of fungal pathogens in host tissue has recently revealed new potential mechanisms for cell-to-cell movement by invading pathogens.

  14. Evidence for Sequential and Increasing Activation of Replication Origins along Replication Timing Gradients in the Human Genome

    PubMed Central

    Guilbaud, Guillaume; Rappailles, Aurélien; Baker, Antoine; Chen, Chun-Long; Arneodo, Alain; Goldar, Arach; d'Aubenton-Carafa, Yves; Thermes, Claude; Audit, Benjamin; Hyrien, Olivier

    2011-01-01

    Genome-wide replication timing studies have suggested that mammalian chromosomes consist of megabase-scale domains of coordinated origin firing separated by large originless transition regions. Here, we report a quantitative genome-wide analysis of DNA replication kinetics in several human cell types that contradicts this view. DNA combing in HeLa cells sorted into four temporal compartments of S phase shows that replication origins are spaced at 40 kb intervals and fire as small clusters whose synchrony increases during S phase and that replication fork velocity (mean 0.7 kb/min, maximum 2.0 kb/min) remains constant and narrowly distributed through S phase. However, multi-scale analysis of a genome-wide replication timing profile shows a broad distribution of replication timing gradients with practically no regions larger than 100 kb replicating at less than 2 kb/min. Therefore, HeLa cells lack large regions of unidirectional fork progression. Temporal transition regions are replicated by sequential activation of origins at a rate that increases during S phase and replication timing gradients are set by the delay and the spacing between successive origin firings rather than by the velocity of single forks. Activation of internal origins in a specific temporal transition region is directly demonstrated by DNA combing of the IGH locus in HeLa cells. Analysis of published origin maps in HeLa cells and published replication timing and DNA combing data in several other cell types corroborate these findings, with the interesting exception of embryonic stem cells where regions of unidirectional fork progression seem more abundant. These results can be explained if origins fire independently of each other but under the control of long-range chromatin structure, or if replication forks progressing from early origins stimulate initiation in nearby unreplicated DNA. These findings shed a new light on the replication timing program of mammalian genomes and provide a general model for their replication kinetics. PMID:22219720

  15. Transposon diversity in Arabidopsis thaliana

    PubMed Central

    Le, Quang Hien; Wright, Stephen; Yu, Zhihui; Bureau, Thomas

    2000-01-01

    Recent availability of extensive genome sequence information offers new opportunities to analyze genome organization, including transposon diversity and accumulation, at a level of resolution that was previously unattainable. In this report, we used sequence similarity search and analysis protocols to perform a fine-scale analysis of a large sample (≈17.2 Mb) of the Arabidopsis thaliana (Columbia) genome for transposons. Consistent with previous studies, we report that the A. thaliana genome harbors diverse representatives of most known superfamilies of transposons. However, our survey reveals a higher density of transposons of which over one-fourth could be classified into a single novel transposon family designated as Basho, which appears unrelated to any previously known superfamily. We have also identified putative transposase-coding ORFs for miniature inverted-repeat transposable elements (MITEs), providing clues into the mechanism of mobility and origins of the most abundant transposons associated with plant genes. In addition, we provide evidence that most mined transposons have a clear distribution preference for A + T-rich sequences and show that structural variation for many mined transposons is partly due to interelement recombination. Taken together, these findings further underscore the complexity of transposons within the compact genome of A. thaliana. PMID:10861007

  16. Component identification of electron transport chains in curdlan-producing Agrobacterium sp. ATCC 31749 and its genome-specific prediction using comparative genome and phylogenetic trees analysis.

    PubMed

    Zhang, Hongtao; Setubal, Joao Carlos; Zhan, Xiaobei; Zheng, Zhiyong; Yu, Lijun; Wu, Jianrong; Chen, Dingqiang

    2011-06-01

    Agrobacterium sp. ATCC 31749 (formerly named Alcaligenes faecalis var. myxogenes) is a non-pathogenic aerobic soil bacterium used in large scale biotechnological production of curdlan. However, little is known about its genomic information. DNA partial sequence of electron transport chains (ETCs) protein genes were obtained in order to understand the components of ETC and genomic-specificity in Agrobacterium sp. ATCC 31749. Degenerate primers were designed according to ETC conserved sequences in other reported species. DNA partial sequences of ETC genes in Agrobacterium sp. ATCC 31749 were cloned by the PCR method using degenerate primers. Based on comparative genomic analysis, nine electron transport elements were ascertained, including NADH ubiquinone oxidoreductase, succinate dehydrogenase complex II, complex III, cytochrome c, ubiquinone biosynthesis protein ubiB, cytochrome d terminal oxidase, cytochrome bo terminal oxidase, cytochrome cbb (3)-type terminal oxidase and cytochrome caa (3)-type terminal oxidase. Similarity and phylogenetic analyses of these genes revealed that among fully sequenced Agrobacterium species, Agrobacterium sp. ATCC 31749 is closest to Agrobacterium tumefaciens C58. Based on these results a comprehensive ETC model for Agrobacterium sp. ATCC 31749 is proposed.

  17. Parallel and serial computing tools for testing single-locus and epistatic SNP effects of quantitative traits in genome-wide association studies

    PubMed Central

    Ma, Li; Runesha, H Birali; Dvorkin, Daniel; Garbe, John R; Da, Yang

    2008-01-01

    Background Genome-wide association studies (GWAS) using single nucleotide polymorphism (SNP) markers provide opportunities to detect epistatic SNPs associated with quantitative traits and to detect the exact mode of an epistasis effect. Computational difficulty is the main bottleneck for epistasis testing in large scale GWAS. Results The EPISNPmpi and EPISNP computer programs were developed for testing single-locus and epistatic SNP effects on quantitative traits in GWAS, including tests of three single-locus effects for each SNP (SNP genotypic effect, additive and dominance effects) and five epistasis effects for each pair of SNPs (two-locus interaction, additive × additive, additive × dominance, dominance × additive, and dominance × dominance) based on the extended Kempthorne model. EPISNPmpi is the parallel computing program for epistasis testing in large scale GWAS and achieved excellent scalability for large scale analysis and portability for various parallel computing platforms. EPISNP is the serial computing program based on the EPISNPmpi code for epistasis testing in small scale GWAS using commonly available operating systems and computer hardware. Three serial computing utility programs were developed for graphical viewing of test results and epistasis networks, and for estimating CPU time and disk space requirements. Conclusion The EPISNPmpi parallel computing program provides an effective computing tool for epistasis testing in large scale GWAS, and the epiSNP serial computing programs are convenient tools for epistasis analysis in small scale GWAS using commonly available computer hardware. PMID:18644146

  18. Genome-wide inference of regulatory networks in Streptomyces coelicolor.

    PubMed

    Castro-Melchor, Marlene; Charaniya, Salim; Karypis, George; Takano, Eriko; Hu, Wei-Shou

    2010-10-18

    The onset of antibiotics production in Streptomyces species is co-ordinated with differentiation events. An understanding of the genetic circuits that regulate these coupled biological phenomena is essential to discover and engineer the pharmacologically important natural products made by these species. The availability of genomic tools and access to a large warehouse of transcriptome data for the model organism, Streptomyces coelicolor, provides incentive to decipher the intricacies of the regulatory cascades and develop biologically meaningful hypotheses. In this study, more than 500 samples of genome-wide temporal transcriptome data, comprising wild-type and more than 25 regulatory gene mutants of Streptomyces coelicolor probed across multiple stress and medium conditions, were investigated. Information based on transcript and functional similarity was used to update a previously-predicted whole-genome operon map and further applied to predict transcriptional networks constituting modules enriched in diverse functions such as secondary metabolism, and sigma factor. The predicted network displays a scale-free architecture with a small-world property observed in many biological networks. The networks were further investigated to identify functionally-relevant modules that exhibit functional coherence and a consensus motif in the promoter elements indicative of DNA-binding elements. Despite the enormous experimental as well as computational challenges, a systems approach for integrating diverse genome-scale datasets to elucidate complex regulatory networks is beginning to emerge. We present an integrated analysis of transcriptome data and genomic features to refine a whole-genome operon map and to construct regulatory networks at the cistron level in Streptomyces coelicolor. The functionally-relevant modules identified in this study pose as potential targets for further studies and verification.

  19. The Role of Genome Accessibility in Transcription Factor Binding in Bacteria.

    PubMed

    Gomes, Antonio L C; Wang, Harris H

    2016-04-01

    ChIP-seq enables genome-scale identification of regulatory regions that govern gene expression. However, the biological insights generated from ChIP-seq analysis have been limited to predictions of binding sites and cooperative interactions. Furthermore, ChIP-seq data often poorly correlate with in vitro measurements or predicted motifs, highlighting that binding affinity alone is insufficient to explain transcription factor (TF)-binding in vivo. One possibility is that binding sites are not equally accessible across the genome. A more comprehensive biophysical representation of TF-binding is required to improve our ability to understand, predict, and alter gene expression. Here, we show that genome accessibility is a key parameter that impacts TF-binding in bacteria. We developed a thermodynamic model that parameterizes ChIP-seq coverage in terms of genome accessibility and binding affinity. The role of genome accessibility is validated using a large-scale ChIP-seq dataset of the M. tuberculosis regulatory network. We find that accounting for genome accessibility led to a model that explains 63% of the ChIP-seq profile variance, while a model based in motif score alone explains only 35% of the variance. Moreover, our framework enables de novo ChIP-seq peak prediction and is useful for inferring TF-binding peaks in new experimental conditions by reducing the need for additional experiments. We observe that the genome is more accessible in intergenic regions, and that increased accessibility is positively correlated with gene expression and anti-correlated with distance to the origin of replication. Our biophysically motivated model provides a more comprehensive description of TF-binding in vivo from first principles towards a better representation of gene regulation in silico, with promising applications in systems biology.

  20. Design of DNA pooling to allow incorporation of covariates in rare variants analysis.

    PubMed

    Guan, Weihua; Li, Chun

    2014-01-01

    Rapid advances in next-generation sequencing technologies facilitate genetic association studies of an increasingly wide array of rare variants. To capture the rare or less common variants, a large number of individuals will be needed. However, the cost of a large scale study using whole genome or exome sequencing is still high. DNA pooling can serve as a cost-effective approach, but with a potential limitation that the identity of individual genomes would be lost and therefore individual characteristics and environmental factors could not be adjusted in association analysis, which may result in power loss and a biased estimate of genetic effect. For case-control studies, we propose a design strategy for pool creation and an analysis strategy that allows covariate adjustment, using multiple imputation technique. Simulations show that our approach can obtain reasonable estimate for genotypic effect with only slight loss of power compared to the much more expensive approach of sequencing individual genomes. Our design and analysis strategies enable more powerful and cost-effective sequencing studies of complex diseases, while allowing incorporation of covariate adjustment.

  1. Next-Generation Sequencing: The Translational Medicine Approach from “Bench to Bedside to Population”

    PubMed Central

    Beigh, Mohammad Muzafar

    2016-01-01

    Humans have predicted the relationship between heredity and diseases for a long time. Only in the beginning of the last century, scientists begin to discover the connotations between different genes and disease phenotypes. Recent trends in next-generation sequencing (NGS) technologies have brought a great momentum in biomedical research that in turn has remarkably augmented our basic understanding of human biology and its associated diseases. State-of-the-art next generation biotechnologies have started making huge strides in our current understanding of mechanisms of various chronic illnesses like cancers, metabolic disorders, neurodegenerative anomalies, etc. We are experiencing a renaissance in biomedical research primarily driven by next generation biotechnologies like genomics, transcriptomics, proteomics, metabolomics, lipidomics etc. Although genomic discoveries are at the forefront of next generation omics technologies, however, their implementation into clinical arena had been painstakingly slow mainly because of high reaction costs and unavailability of requisite computational tools for large-scale data analysis. However rapid innovations and steadily lowering cost of sequence-based chemistries along with the development of advanced bioinformatics tools have lately prompted launching and implementation of large-scale massively parallel genome sequencing programs in different fields ranging from medical genetics, infectious biology, agriculture sciences etc. Recent advances in large-scale omics-technologies is bringing healthcare research beyond the traditional “bench to bedside” approach to more of a continuum that will include improvements, in public healthcare and will be primarily based on predictive, preventive, personalized, and participatory medicine approach (P4). Recent large-scale research projects in genetic and infectious disease biology have indicated that massively parallel whole-genome/whole-exome sequencing, transcriptome analysis, and other functional genomic tools can reveal large number of unique functional elements and/or markers that otherwise would be undetected by traditional sequencing methodologies. Therefore, latest trends in the biomedical research is giving birth to the new branch in medicine commonly referred to as personalized and/or precision medicine. Developments in the post-genomic era are believed to completely restructure the present clinical pattern of disease prevention and treatment as well as methods of diagnosis and prognosis. The next important step in the direction of the precision/personalized medicine approach should be its early adoption in clinics for future medical interventions. Consequently, in coming year’s next generation biotechnologies will reorient medical practice more towards disease prediction and prevention approaches rather than curing them at later stages of their development and progression, even at wider population level(s) for general public healthcare system. PMID:28930123

  2. BRAD, the genetics and genomics database for Brassica plants.

    PubMed

    Cheng, Feng; Liu, Shengyi; Wu, Jian; Fang, Lu; Sun, Silong; Liu, Bo; Li, Pingxia; Hua, Wei; Wang, Xiaowu

    2011-10-13

    Brassica species include both vegetable and oilseed crops, which are very important to the daily life of common human beings. Meanwhile, the Brassica species represent an excellent system for studying numerous aspects of plant biology, specifically for the analysis of genome evolution following polyploidy, so it is also very important for scientific research. Now, the genome of Brassica rapa has already been assembled, it is the time to do deep mining of the genome data. BRAD, the Brassica database, is a web-based resource focusing on genome scale genetic and genomic data for important Brassica crops. BRAD was built based on the first whole genome sequence and on further data analysis of the Brassica A genome species, Brassica rapa (Chiifu-401-42). It provides datasets, such as the complete genome sequence of B. rapa, which was de novo assembled from Illumina GA II short reads and from BAC clone sequences, predicted genes and associated annotations, non coding RNAs, transposable elements (TE), B. rapa genes' orthologous to those in A. thaliana, as well as genetic markers and linkage maps. BRAD offers useful searching and data mining tools, including search across annotation datasets, search for syntenic or non-syntenic orthologs, and to search the flanking regions of a certain target, as well as the tools of BLAST and Gbrowse. BRAD allows users to enter almost any kind of information, such as a B. rapa or A. thaliana gene ID, physical position or genetic marker. BRAD, a new database which focuses on the genetics and genomics of the Brassica plants has been developed, it aims at helping scientists and breeders to fully and efficiently use the information of genome data of Brassica plants. BRAD will be continuously updated and can be accessed through http://brassicadb.org.

  3. Transcriptome analysis reveals the time of the fourth round of genome duplication in common carp (Cyprinus carpio)

    PubMed Central

    2012-01-01

    Background Common carp (Cyprinus carpio) is thought to have undergone one extra round of genome duplication compared to zebrafish. Transcriptome analysis has been used to study the existence and timing of genome duplication in species for which genome sequences are incomplete. Large-scale transcriptome data for the common carp genome should help reveal the timing of the additional duplication event. Results We have sequenced the transcriptome of common carp using 454 pyrosequencing. After assembling the 454 contigs and the published common carp sequences together, we obtained 49,669 contigs and identified genes using homology searches and an ab initio method. We identified 4,651 orthologous pairs between common carp and zebrafish and found 129,984 paralogous pairs within the common carp. An estimation of the synonymous substitution rate in the orthologous pairs indicated that common carp and zebrafish diverged 120 million years ago (MYA). We identified one round of genome duplication in common carp and estimated that it had occurred 5.6 to 11.3 MYA. In zebrafish, no genome duplication event after speciation was observed, suggesting that, compared to zebrafish, common carp had undergone an additional genome duplication event. We annotated the common carp contigs with Gene Ontology terms and KEGG pathways. Compared with zebrafish gene annotations, we found that a set of biological processes and pathways were enriched in common carp. Conclusions The assembled contigs helped us to estimate the time of the fourth-round of genome duplication in common carp. The resource that we have built as part of this study will help advance functional genomics and genome annotation studies in the future. PMID:22424280

  4. Systems biology of embryonic development: Prospects for a complete understanding of the Caenorhabditis elegans embryo.

    PubMed

    Murray, John Isaac

    2018-05-01

    The convergence of developmental biology and modern genomics tools brings the potential for a comprehensive understanding of developmental systems. This is especially true for the Caenorhabditis elegans embryo because its small size, invariant developmental lineage, and powerful genetic and genomic tools provide the prospect of a cellular resolution understanding of messenger RNA (mRNA) expression and regulation across the organism. We describe here how a systems biology framework might allow large-scale determination of the embryonic regulatory relationships encoded in the C. elegans genome. This framework consists of two broad steps: (a) defining the "parts list"-all genes expressed in all cells at each time during development and (b) iterative steps of computational modeling and refinement of these models by experimental perturbation. Substantial progress has been made towards defining the parts list through imaging methods such as large-scale green fluorescent protein (GFP) reporter analysis. Imaging results are now being augmented by high-resolution transcriptome methods such as single-cell RNA sequencing, and it is likely the complete expression patterns of all genes across the embryo will be known within the next few years. In contrast, the modeling and perturbation experiments performed so far have focused largely on individual cell types or genes, and improved methods will be needed to expand them to the full genome and organism. This emerging comprehensive map of embryonic expression and regulatory function will provide a powerful resource for developmental biologists, and would also allow scientists to ask questions not accessible without a comprehensive picture. This article is categorized under: Invertebrate Organogenesis > Worms Technologies > Analysis of the Transcriptome Gene Expression and Transcriptional Hierarchies > Gene Networks and Genomics. © 2018 Wiley Periodicals, Inc.

  5. GenoMycDB: a database for comparative analysis of mycobacterial genes and genomes.

    PubMed

    Catanho, Marcos; Mascarenhas, Daniel; Degrave, Wim; Miranda, Antonio Basílio de

    2006-03-31

    Several databases and computational tools have been created with the aim of organizing, integrating and analyzing the wealth of information generated by large-scale sequencing projects of mycobacterial genomes and those of other organisms. However, with very few exceptions, these databases and tools do not allow for massive and/or dynamic comparison of these data. GenoMycDB (http://www.dbbm.fiocruz.br/GenoMycDB) is a relational database built for large-scale comparative analyses of completely sequenced mycobacterial genomes, based on their predicted protein content. Its central structure is composed of the results obtained after pair-wise sequence alignments among all the predicted proteins coded by the genomes of six mycobacteria: Mycobacterium tuberculosis (strains H37Rv and CDC1551), M. bovis AF2122/97, M. avium subsp. paratuberculosis K10, M. leprae TN, and M. smegmatis MC2 155. The database stores the computed similarity parameters of every aligned pair, providing for each protein sequence the predicted subcellular localization, the assigned cluster of orthologous groups, the features of the corresponding gene, and links to several important databases. Tables containing pairs or groups of potential homologs between selected species/strains can be produced dynamically by user-defined criteria, based on one or multiple sequence similarity parameters. In addition, searches can be restricted according to the predicted subcellular localization of the protein, the DNA strand of the corresponding gene and/or the description of the protein. Massive data search and/or retrieval are available, and different ways of exporting the result are offered. GenoMycDB provides an on-line resource for the functional classification of mycobacterial proteins as well as for the analysis of genome structure, organization, and evolution.

  6. The Divided Bacterial Genome: Structure, Function, and Evolution.

    PubMed

    diCenzo, George C; Finan, Turlough M

    2017-09-01

    Approximately 10% of bacterial genomes are split between two or more large DNA fragments, a genome architecture referred to as a multipartite genome. This multipartite organization is found in many important organisms, including plant symbionts, such as the nitrogen-fixing rhizobia, and plant, animal, and human pathogens, including the genera Brucella , Vibrio , and Burkholderia . The availability of many complete bacterial genome sequences means that we can now examine on a broad scale the characteristics of the different types of DNA molecules in a genome. Recent work has begun to shed light on the unique properties of each class of replicon, the unique functional role of chromosomal and nonchromosomal DNA molecules, and how the exploitation of novel niches may have driven the evolution of the multipartite genome. The aims of this review are to (i) outline the literature regarding bacterial genomes that are divided into multiple fragments, (ii) provide a meta-analysis of completed bacterial genomes from 1,708 species as a way of reviewing the abundant information present in these genome sequences, and (iii) provide an encompassing model to explain the evolution and function of the multipartite genome structure. This review covers, among other topics, salient genome terminology; mechanisms of multipartite genome formation; the phylogenetic distribution of multipartite genomes; how each part of a genome differs with respect to genomic signatures, genetic variability, and gene functional annotation; how each DNA molecule may interact; as well as the costs and benefits of this genome structure. Copyright © 2017 American Society for Microbiology.

  7. Reconstructing Past Admixture Processes from Local Genomic Ancestry Using Wavelet Transformation

    PubMed Central

    Sanderson, Jean; Sudoyo, Herawati; Karafet, Tatiana M.; Hammer, Michael F.; Cox, Murray P.

    2015-01-01

    Admixture between long-separated populations is a defining feature of the genomes of many species. The mosaic block structure of admixed genomes can provide information about past contact events, including the time and extent of admixture. Here, we describe an improved wavelet-based technique that better characterizes ancestry block structure from observed genomic patterns. principal components analysis is first applied to genomic data to identify the primary population structure, followed by wavelet decomposition to develop a new characterization of local ancestry information along the chromosomes. For testing purposes, this method is applied to human genome-wide genotype data from Indonesia, as well as virtual genetic data generated using genome-scale sequential coalescent simulations under a wide range of admixture scenarios. Time of admixture is inferred using an approximate Bayesian computation framework, providing robust estimates of both admixture times and their associated levels of uncertainty. Crucially, we demonstrate that this revised wavelet approach, which we have released as the R package adwave, provides improved statistical power over existing wavelet-based techniques and can be used to address a broad range of admixture questions. PMID:25852078

  8. Automated array-based genomic profiling in chronic lymphocytic leukemia: Development of a clinical tool and discovery of recurrent genomic alterations

    PubMed Central

    Schwaenen, Carsten; Nessling, Michelle; Wessendorf, Swen; Salvi, Tatjana; Wrobel, Gunnar; Radlwimmer, Bernhard; Kestler, Hans A.; Haslinger, Christian; Stilgenbauer, Stephan; Döhner, Hartmut; Bentz, Martin; Lichter, Peter

    2004-01-01

    B cell chronic lymphocytic leukemia (B-CLL) is characterized by a highly variable clinical course. Recurrent chromosomal imbalances provide significant prognostic markers. Risk-adapted therapy based on genomic alterations has become an option that is currently being tested in clinical trials. To supply a robust tool for such large scale studies, we developed a comprehensive DNA microarray dedicated to the automated analysis of recurrent genomic imbalances in B-CLL by array-based comparative genomic hybridization (matrix–CGH). Validation of this chip in a series of 106 B-CLL cases revealed a high specificity and sensitivity that fulfils the criteria for application in clinical oncology. This chip is immediately applicable within clinical B-CLL treatment trials that evaluate whether B-CLL cases with distinct chromosomal abnormalities should be treated with chemotherapy of different intensities and/or stem cell transplantation. Through the control set of DNA fragments equally distributed over the genome, recurrent genomic imbalances were discovered: trisomy of chromosome 19 and gain of the MYCN oncogene correlating with an elevation of MYCN mRNA expression. PMID:14730057

  9. Effective normalization for copy number variation detection from whole genome sequencing.

    PubMed

    Janevski, Angel; Varadan, Vinay; Kamalakaran, Sitharthan; Banerjee, Nilanjana; Dimitrova, Nevenka

    2012-01-01

    Whole genome sequencing enables a high resolution view of the human genome and provides unique insights into genome structure at an unprecedented scale. There have been a number of tools to infer copy number variation in the genome. These tools, while validated, also include a number of parameters that are configurable to genome data being analyzed. These algorithms allow for normalization to account for individual and population-specific effects on individual genome CNV estimates but the impact of these changes on the estimated CNVs is not well characterized. We evaluate in detail the effect of normalization methodologies in two CNV algorithms FREEC and CNV-seq using whole genome sequencing data from 8 individuals spanning four populations. We apply FREEC and CNV-seq to a sequencing data set consisting of 8 genomes. We use multiple configurations corresponding to different read-count normalization methodologies in FREEC, and statistically characterize the concordance of the CNV calls between FREEC configurations and the analogous output from CNV-seq. The normalization methodologies evaluated in FREEC are: GC content, mappability and control genome. We further stratify the concordance analysis within genic, non-genic, and a collection of validated variant regions. The GC content normalization methodology generates the highest number of altered copy number regions. Both mappability and control genome normalization reduce the total number and length of copy number regions. Mappability normalization yields Jaccard indices in the 0.07 - 0.3 range, whereas using a control genome normalization yields Jaccard index values around 0.4 with normalization based on GC content. The most critical impact of using mappability as a normalization factor is substantial reduction of deletion CNV calls. The output of another method based on control genome normalization, CNV-seq, resulted in comparable CNV call profiles, and substantial agreement in variable gene and CNV region calls. Choice of read-count normalization methodology has a substantial effect on CNV calls and the use of genomic mappability or an appropriately chosen control genome can optimize the output of CNV analysis.

  10. Multiscale Embedded Gene Co-expression Network Analysis

    PubMed Central

    Song, Won-Min; Zhang, Bin

    2015-01-01

    Gene co-expression network analysis has been shown effective in identifying functional co-expressed gene modules associated with complex human diseases. However, existing techniques to construct co-expression networks require some critical prior information such as predefined number of clusters, numerical thresholds for defining co-expression/interaction, or do not naturally reproduce the hallmarks of complex systems such as the scale-free degree distribution of small-worldness. Previously, a graph filtering technique called Planar Maximally Filtered Graph (PMFG) has been applied to many real-world data sets such as financial stock prices and gene expression to extract meaningful and relevant interactions. However, PMFG is not suitable for large-scale genomic data due to several drawbacks, such as the high computation complexity O(|V|3), the presence of false-positives due to the maximal planarity constraint, and the inadequacy of the clustering framework. Here, we developed a new co-expression network analysis framework called Multiscale Embedded Gene Co-expression Network Analysis (MEGENA) by: i) introducing quality control of co-expression similarities, ii) parallelizing embedded network construction, and iii) developing a novel clustering technique to identify multi-scale clustering structures in Planar Filtered Networks (PFNs). We applied MEGENA to a series of simulated data and the gene expression data in breast carcinoma and lung adenocarcinoma from The Cancer Genome Atlas (TCGA). MEGENA showed improved performance over well-established clustering methods and co-expression network construction approaches. MEGENA revealed not only meaningful multi-scale organizations of co-expressed gene clusters but also novel targets in breast carcinoma and lung adenocarcinoma. PMID:26618778

  11. Multiscale Embedded Gene Co-expression Network Analysis.

    PubMed

    Song, Won-Min; Zhang, Bin

    2015-11-01

    Gene co-expression network analysis has been shown effective in identifying functional co-expressed gene modules associated with complex human diseases. However, existing techniques to construct co-expression networks require some critical prior information such as predefined number of clusters, numerical thresholds for defining co-expression/interaction, or do not naturally reproduce the hallmarks of complex systems such as the scale-free degree distribution of small-worldness. Previously, a graph filtering technique called Planar Maximally Filtered Graph (PMFG) has been applied to many real-world data sets such as financial stock prices and gene expression to extract meaningful and relevant interactions. However, PMFG is not suitable for large-scale genomic data due to several drawbacks, such as the high computation complexity O(|V|3), the presence of false-positives due to the maximal planarity constraint, and the inadequacy of the clustering framework. Here, we developed a new co-expression network analysis framework called Multiscale Embedded Gene Co-expression Network Analysis (MEGENA) by: i) introducing quality control of co-expression similarities, ii) parallelizing embedded network construction, and iii) developing a novel clustering technique to identify multi-scale clustering structures in Planar Filtered Networks (PFNs). We applied MEGENA to a series of simulated data and the gene expression data in breast carcinoma and lung adenocarcinoma from The Cancer Genome Atlas (TCGA). MEGENA showed improved performance over well-established clustering methods and co-expression network construction approaches. MEGENA revealed not only meaningful multi-scale organizations of co-expressed gene clusters but also novel targets in breast carcinoma and lung adenocarcinoma.

  12. Genome-Scale Architecture of Small Molecule Regulatory Networks and the Fundamental Trade-Off between Regulation and Enzymatic Activity

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Reznik, Ed; Christodoulou, Dimitris; Goldford, Joshua E.

    Metabolic flux is in part regulated by endogenous small molecules that modulate the catalytic activity of an enzyme, e.g., allosteric inhibition. In contrast to transcriptional regulation of enzymes, technical limitations have hindered the production of a genome-scale atlas of small molecule-enzyme regulatory interactions. Here, we develop a framework leveraging the vast, but fragmented, biochemical literature to reconstruct and analyze the small molecule regulatory network (SMRN) of the model organism Escherichia coli, including the primary metabolite regulators and enzyme targets. Using metabolic control analysis, we prove a fundamental trade-off between regulation and enzymatic activity, and we combine it with metabolomic measurementsmore » and the SMRN to make inferences on the sensitivity of enzymes to their regulators. By generalizing the analysis to other organisms, we identify highly conserved regulatory interactions across evolutionarily divergent species, further emphasizing a critical role for small molecule interactions in the maintenance of metabolic homeostasis.« less

  13. InterProScan 5: genome-scale protein function classification

    PubMed Central

    Jones, Philip; Binns, David; Chang, Hsin-Yu; Fraser, Matthew; Li, Weizhong; McAnulla, Craig; McWilliam, Hamish; Maslen, John; Mitchell, Alex; Nuka, Gift; Pesseat, Sebastien; Quinn, Antony F.; Sangrador-Vegas, Amaia; Scheremetjew, Maxim; Yong, Siew-Yit; Lopez, Rodrigo; Hunter, Sarah

    2014-01-01

    Motivation: Robust large-scale sequence analysis is a major challenge in modern genomic science, where biologists are frequently trying to characterize many millions of sequences. Here, we describe a new Java-based architecture for the widely used protein function prediction software package InterProScan. Developments include improvements and additions to the outputs of the software and the complete reimplementation of the software framework, resulting in a flexible and stable system that is able to use both multiprocessor machines and/or conventional clusters to achieve scalable distributed data analysis. InterProScan is freely available for download from the EMBl-EBI FTP site and the open source code is hosted at Google Code. Availability and implementation: InterProScan is distributed via FTP at ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/ and the source code is available from http://code.google.com/p/interproscan/. Contact: http://www.ebi.ac.uk/support or interhelp@ebi.ac.uk or mitchell@ebi.ac.uk PMID:24451626

  14. Genome-Scale Architecture of Small Molecule Regulatory Networks and the Fundamental Trade-Off between Regulation and Enzymatic Activity

    DOE PAGES

    Reznik, Ed; Christodoulou, Dimitris; Goldford, Joshua E.; ...

    2017-09-12

    Metabolic flux is in part regulated by endogenous small molecules that modulate the catalytic activity of an enzyme, e.g., allosteric inhibition. In contrast to transcriptional regulation of enzymes, technical limitations have hindered the production of a genome-scale atlas of small molecule-enzyme regulatory interactions. Here, we develop a framework leveraging the vast, but fragmented, biochemical literature to reconstruct and analyze the small molecule regulatory network (SMRN) of the model organism Escherichia coli, including the primary metabolite regulators and enzyme targets. Using metabolic control analysis, we prove a fundamental trade-off between regulation and enzymatic activity, and we combine it with metabolomic measurementsmore » and the SMRN to make inferences on the sensitivity of enzymes to their regulators. By generalizing the analysis to other organisms, we identify highly conserved regulatory interactions across evolutionarily divergent species, further emphasizing a critical role for small molecule interactions in the maintenance of metabolic homeostasis.« less

  15. Comparative Analysis of Genome Diversity in Bullmastiff Dogs

    PubMed Central

    Mortlock, Sally-Anne; Khatkar, Mehar S.; Williamson, Peter

    2016-01-01

    Management and preservation of genomic diversity in dog breeds is a major objective for maintaining health. The present study was undertaken to characterise genomic diversity in Bullmastiff dogs using both genealogical and molecular analysis. Genealogical analysis of diversity was conducted using a database consisting of 16,378 Bullmastiff pedigrees from year 1980 to 2013. Additionally, a total of 188 Bullmastiff dogs were genotyped using the 170,000 SNP Illumina CanineHD Beadchip. Genealogical parameters revealed a mean inbreeding coefficient of 0.047; 142 total founders (f); an effective number of founders (fe) of 79; an effective number of ancestors (fa) of 62; and an effective population size of the reference population of 41. Genetic diversity and the degree of genome-wide homogeneity within the breed were also investigated using molecular data. Multiple-locus heterozygosity (MLH) was equal to 0.206; runs of homozygosity (ROH) as proportion of the genome, averaged 16.44%; effective population size was 29.1, with an average inbreeding coefficient of 0.035, all estimated using SNP Data. Fine-scale population structure was analysed using NETVIEW, a population analysis pipeline. Visualisation of the high definition network captured relationships among individuals within and between subpopulations. Effects of unequal founder use, and ancestral inbreeding and selection, were evident. While current levels of Bullmastiff heterozygosity, inbreeding and homozygosity are not unusual, a relatively small effective population size indicates that a breeding strategy to reduce the inbreeding rate may be beneficial. PMID:26824579

  16. Meta-analysis of sex-specific genome-wide association studies.

    PubMed

    Magi, Reedik; Lindgren, Cecilia M; Morris, Andrew P

    2010-12-01

    Despite the success of genome-wide association studies, much of the genetic contribution to complex human traits is still unexplained. One potential source of genetic variation that may contribute to this "missing heritability" is that which differs in magnitude and/or direction between males and females, which could result from sexual dimorphism in gene expression. Such sex-differentiated effects are common in model organisms, and are becoming increasingly evident in human complex traits through large-scale male- and female-specific meta-analyses. In this article, we review the methodology for meta-analysis of sex-specific genome-wide association studies, and propose a sex-differentiated test of association with quantitative or dichotomous traits, which allows for heterogeneity of allelic effects between males and females. We perform detailed simulations to compare the power of the proposed sex-differentiated meta-analysis with the more traditional "sex-combined" approach, which is ambivalent to gender. The results of this study highlight only a small loss in power for the sex-differentiated meta-analysis when the allelic effects of the causal variant are the same in males and females. However, over a range of models of heterogeneity in allelic effects between genders, our sex-differentiated meta-analysis strategy offers substantial gains in power, and thus has the potential to discover novel loci contributing effects to complex human traits with existing genome-wide association data. © 2010 Wiley-Liss, Inc.

  17. The Recombination Landscape in Wild House Mice Inferred Using Population Genomic Data.

    PubMed

    Booker, Tom R; Ness, Rob W; Keightley, Peter D

    2017-09-01

    Characterizing variation in the rate of recombination across the genome is important for understanding several evolutionary processes. Previous analysis of the recombination landscape in laboratory mice has revealed that the different subspecies have different suites of recombination hotspots. It is unknown, however, whether hotspots identified in laboratory strains reflect the hotspot diversity of natural populations or whether broad-scale variation in the rate of recombination is conserved between subspecies. In this study, we constructed fine-scale recombination rate maps for a natural population of the Eastern house mouse, Mus musculus castaneus We performed simulations to assess the accuracy of recombination rate inference in the presence of phase errors, and we used a novel approach to quantify phase error. The spatial distribution of recombination events is strongly positively correlated between our castaneus map, and a map constructed using inbred lines derived predominantly from M. m. domesticus Recombination hotspots in wild castaneus show little overlap, however, with the locations of double-strand breaks in wild-derived house mouse strains. Finally, we also find that genetic diversity in M. m. castaneus is positively correlated with the rate of recombination, consistent with pervasive natural selection operating in the genome. Our study suggests that recombination rate variation is conserved at broad scales between house mouse subspecies, but it is not strongly conserved at fine scales. Copyright © 2017 by the Genetics Society of America.

  18. Genomic Analysis of the Necrotrophic Fungal Pathogens Sclerotinia sclerotiorum and Botrytis cinerea

    PubMed Central

    Benito, Ernesto P.; Couloux, Arnaud; Coutinho, Pedro M.; de Vries, Ronald P.; Dyer, Paul S.; Fillinger, Sabine; Fournier, Elisabeth; Gout, Lilian; Hahn, Matthias; Kohn, Linda; Lapalu, Nicolas; Plummer, Kim M.; Pradier, Jean-Marc; Quévillon, Emmanuel; Sharon, Amir; Simon, Adeline; ten Have, Arjen; Tudzynski, Bettina; Tudzynski, Paul; Wincker, Patrick; Andrew, Marion; Anthouard, Véronique; Beffa, Rolland; Benoit, Isabelle; Bouzid, Ourdia; Brault, Baptiste; Chen, Zehua; Choquer, Mathias; Collémare, Jérome; Cotton, Pascale; Danchin, Etienne G.; Da Silva, Corinne; Gautier, Angélique; Giraud, Corinne; Giraud, Tatiana; Gonzalez, Celedonio; Grossetete, Sandrine; Güldener, Ulrich; Henrissat, Bernard; Howlett, Barbara J.; Kodira, Chinnappa; Kretschmer, Matthias; Lappartient, Anne; Leroch, Michaela; Levis, Caroline; Mauceli, Evan; Neuvéglise, Cécile; Oeser, Birgitt; Pearson, Matthew; Poulain, Julie; Poussereau, Nathalie; Quesneville, Hadi; Rascle, Christine; Schumacher, Julia; Ségurens, Béatrice; Sexton, Adrienne; Silva, Evelyn; Sirven, Catherine; Soanes, Darren M.; Talbot, Nicholas J.; Templeton, Matt; Yandava, Chandri; Yarden, Oded; Zeng, Qiandong; Rollins, Jeffrey A.; Lebrun, Marc-Henri; Dickman, Marty

    2011-01-01

    Sclerotinia sclerotiorum and Botrytis cinerea are closely related necrotrophic plant pathogenic fungi notable for their wide host ranges and environmental persistence. These attributes have made these species models for understanding the complexity of necrotrophic, broad host-range pathogenicity. Despite their similarities, the two species differ in mating behaviour and the ability to produce asexual spores. We have sequenced the genomes of one strain of S. sclerotiorum and two strains of B. cinerea. The comparative analysis of these genomes relative to one another and to other sequenced fungal genomes is provided here. Their 38–39 Mb genomes include 11,860–14,270 predicted genes, which share 83% amino acid identity on average between the two species. We have mapped the S. sclerotiorum assembly to 16 chromosomes and found large-scale co-linearity with the B. cinerea genomes. Seven percent of the S. sclerotiorum genome comprises transposable elements compared to <1% of B. cinerea. The arsenal of genes associated with necrotrophic processes is similar between the species, including genes involved in plant cell wall degradation and oxalic acid production. Analysis of secondary metabolism gene clusters revealed an expansion in number and diversity of B. cinerea–specific secondary metabolites relative to S. sclerotiorum. The potential diversity in secondary metabolism might be involved in adaptation to specific ecological niches. Comparative genome analysis revealed the basis of differing sexual mating compatibility systems between S. sclerotiorum and B. cinerea. The organization of the mating-type loci differs, and their structures provide evidence for the evolution of heterothallism from homothallism. These data shed light on the evolutionary and mechanistic bases of the genetically complex traits of necrotrophic pathogenicity and sexual mating. This resource should facilitate the functional studies designed to better understand what makes these fungi such successful and persistent pathogens of agronomic crops. PMID:21876677

  19. A genome-wide 20 K citrus microarray for gene expression analysis

    PubMed Central

    Martinez-Godoy, M Angeles; Mauri, Nuria; Juarez, Jose; Marques, M Carmen; Santiago, Julia; Forment, Javier; Gadea, Jose

    2008-01-01

    Background Understanding of genetic elements that contribute to key aspects of citrus biology will impact future improvements in this economically important crop. Global gene expression analysis demands microarray platforms with a high genome coverage. In the last years, genome-wide EST collections have been generated in citrus, opening the possibility to create new tools for functional genomics in this crop plant. Results We have designed and constructed a publicly available genome-wide cDNA microarray that include 21,081 putative unigenes of citrus. As a functional companion to the microarray, a web-browsable database [1] was created and populated with information about the unigenes represented in the microarray, including cDNA libraries, isolated clones, raw and processed nucleotide and protein sequences, and results of all the structural and functional annotation of the unigenes, like general description, BLAST hits, putative Arabidopsis orthologs, microsatellites, putative SNPs, GO classification and PFAM domains. We have performed a Gene Ontology comparison with the full set of Arabidopsis proteins to estimate the genome coverage of the microarray. We have also performed microarray hybridizations to check its usability. Conclusion This new cDNA microarray replaces the first 7K microarray generated two years ago and allows gene expression analysis at a more global scale. We have followed a rational design to minimize cross-hybridization while maintaining its utility for different citrus species. Furthermore, we also provide access to a website with full structural and functional annotation of the unigenes represented in the microarray, along with the ability to use this site to directly perform gene expression analysis using standard tools at different publicly available servers. Furthermore, we show how this microarray offers a good representation of the citrus genome and present the usefulness of this genomic tool for global studies in citrus by using it to catalogue genes expressed in citrus globular embryos. PMID:18598343

  20. In silico prediction of splice-altering single nucleotide variants in the human genome.

    PubMed

    Jian, Xueqiu; Boerwinkle, Eric; Liu, Xiaoming

    2014-12-16

    In silico tools have been developed to predict variants that may have an impact on pre-mRNA splicing. The major limitation of the application of these tools to basic research and clinical practice is the difficulty in interpreting the output. Most tools only predict potential splice sites given a DNA sequence without measuring splicing signal changes caused by a variant. Another limitation is the lack of large-scale evaluation studies of these tools. We compared eight in silico tools on 2959 single nucleotide variants within splicing consensus regions (scSNVs) using receiver operating characteristic analysis. The Position Weight Matrix model and MaxEntScan outperformed other methods. Two ensemble learning methods, adaptive boosting and random forests, were used to construct models that take advantage of individual methods. Both models further improved prediction, with outputs of directly interpretable prediction scores. We applied our ensemble scores to scSNVs from the Catalogue of Somatic Mutations in Cancer database. Analysis showed that predicted splice-altering scSNVs are enriched in recurrent scSNVs and known cancer genes. We pre-computed our ensemble scores for all potential scSNVs across the human genome, providing a whole genome level resource for identifying splice-altering scSNVs discovered from large-scale sequencing studies.

  1. Large-scale label-free quantitative proteomics of the pea aphid-Buchnera symbiosis.

    PubMed

    Poliakov, Anton; Russell, Calum W; Ponnala, Lalit; Hoops, Harold J; Sun, Qi; Douglas, Angela E; van Wijk, Klaas J

    2011-06-01

    Many insects are nutritionally dependent on symbiotic microorganisms that have tiny genomes and are housed in specialized host cells called bacteriocytes. The obligate symbiosis between the pea aphid Acyrthosiphon pisum and the γ-proteobacterium Buchnera aphidicola (only 584 predicted proteins) is particularly amenable for molecular analysis because the genomes of both partners have been sequenced. To better define the symbiotic relationship between this aphid and Buchnera, we used large-scale, high accuracy tandem mass spectrometry (nanoLC-LTQ-Orbtrap) to identify aphid and Buchnera proteins in the whole aphid body, purified bacteriocytes, isolated Buchnera cells and the residual bacteriocyte fraction. More than 1900 aphid and 400 Buchnera proteins were identified. All enzymes in amino acid metabolism annotated in the Buchnera genome were detected, reflecting the high (68%) coverage of the proteome and supporting the core function of Buchnera in the aphid symbiosis. Transporters mediating the transport of predicted metabolites were present in the bacteriocyte. Label-free spectral counting combined with hierarchical clustering, allowed to define the quantitative distribution of a subset of these proteins across both symbiotic partners, yielding no evidence for the selective transfer of protein among the partners in either direction. This is the first quantitative proteome analysis of bacteriocyte symbiosis, providing a wealth of information about molecular function of both the host cell and bacterial symbiont.

  2. PATIKA: an integrated visual environment for collaborative construction and analysis of cellular pathways.

    PubMed

    Demir, E; Babur, O; Dogrusoz, U; Gursoy, A; Nisanci, G; Cetin-Atalay, R; Ozturk, M

    2002-07-01

    Availability of the sequences of entire genomes shifts the scientific curiosity towards the identification of function of the genomes in large scale as in genome studies. In the near future, data produced about cellular processes at molecular level will accumulate with an accelerating rate as a result of proteomics studies. In this regard, it is essential to develop tools for storing, integrating, accessing, and analyzing this data effectively. We define an ontology for a comprehensive representation of cellular events. The ontology presented here enables integration of fragmented or incomplete pathway information and supports manipulation and incorporation of the stored data, as well as multiple levels of abstraction. Based on this ontology, we present the architecture of an integrated environment named Patika (Pathway Analysis Tool for Integration and Knowledge Acquisition). Patika is composed of a server-side, scalable, object-oriented database and client-side editors to provide an integrated, multi-user environment for visualizing and manipulating network of cellular events. This tool features automated pathway layout, functional computation support, advanced querying and a user-friendly graphical interface. We expect that Patika will be a valuable tool for rapid knowledge acquisition, microarray generated large-scale data interpretation, disease gene identification, and drug development. A prototype of Patika is available upon request from the authors.

  3. Identifying Pleiotropic Genes in Genome-Wide Association Studies for Multivariate Phenotypes with Mixed Measurement Scales

    PubMed Central

    Williams, L. Keoki; Buu, Anne

    2017-01-01

    We propose a multivariate genome-wide association test for mixed continuous, binary, and ordinal phenotypes. A latent response model is used to estimate the correlation between phenotypes with different measurement scales so that the empirical distribution of the Fisher’s combination statistic under the null hypothesis is estimated efficiently. The simulation study shows that our proposed correlation estimation methods have high levels of accuracy. More importantly, our approach conservatively estimates the variance of the test statistic so that the type I error rate is controlled. The simulation also shows that the proposed test maintains the power at the level very close to that of the ideal analysis based on known latent phenotypes while controlling the type I error. In contrast, conventional approaches–dichotomizing all observed phenotypes or treating them as continuous variables–could either reduce the power or employ a linear regression model unfit for the data. Furthermore, the statistical analysis on the database of the Study of Addiction: Genetics and Environment (SAGE) demonstrates that conducting a multivariate test on multiple phenotypes can increase the power of identifying markers that may not be, otherwise, chosen using marginal tests. The proposed method also offers a new approach to analyzing the Fagerström Test for Nicotine Dependence as multivariate phenotypes in genome-wide association studies. PMID:28081206

  4. ESTimating plant phylogeny: lessons from partitioning

    PubMed Central

    de la Torre, Jose EB; Egan, Mary G; Katari, Manpreet S; Brenner, Eric D; Stevenson, Dennis W; Coruzzi, Gloria M; DeSalle, Rob

    2006-01-01

    Background While Expressed Sequence Tags (ESTs) have proven a viable and efficient way to sample genomes, particularly those for which whole-genome sequencing is impractical, phylogenetic analysis using ESTs remains difficult. Sequencing errors and orthology determination are the major problems when using ESTs as a source of characters for systematics. Here we develop methods to incorporate EST sequence information in a simultaneous analysis framework to address controversial phylogenetic questions regarding the relationships among the major groups of seed plants. We use an automated, phylogenetically derived approach to orthology determination called OrthologID generate a phylogeny based on 43 process partitions, many of which are derived from ESTs, and examine several measures of support to assess the utility of EST data for phylogenies. Results A maximum parsimony (MP) analysis resulted in a single tree with relatively high support at all nodes in the tree despite rampant conflict among trees generated from the separate analysis of individual partitions. In a comparison of broader-scale groupings based on cellular compartment (ie: chloroplast, mitochondrial or nuclear) or function, only the nuclear partition tree (based largely on EST data) was found to be topologically identical to the tree based on the simultaneous analysis of all data. Despite topological conflict among the broader-scale groupings examined, only the tree based on morphological data showed statistically significant differences. Conclusion Based on the amount of character support contributed by EST data which make up a majority of the nuclear data set, and the lack of conflict of the nuclear data set with the simultaneous analysis tree, we conclude that the inclusion of EST data does provide a viable and efficient approach to address phylogenetic questions within a parsimony framework on a genomic scale, if problems of orthology determination and potential sequencing errors can be overcome. In addition, approaches that examine conflict and support in a simultaneous analysis framework allow for a more precise understanding of the evolutionary history of individual process partitions and may be a novel way to understand functional aspects of different kinds of cellular classes of gene products. PMID:16776834

  5. Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project.

    PubMed

    Gerstein, Mark B; Lu, Zhi John; Van Nostrand, Eric L; Cheng, Chao; Arshinoff, Bradley I; Liu, Tao; Yip, Kevin Y; Robilotto, Rebecca; Rechtsteiner, Andreas; Ikegami, Kohta; Alves, Pedro; Chateigner, Aurelien; Perry, Marc; Morris, Mitzi; Auerbach, Raymond K; Feng, Xin; Leng, Jing; Vielle, Anne; Niu, Wei; Rhrissorrakrai, Kahn; Agarwal, Ashish; Alexander, Roger P; Barber, Galt; Brdlik, Cathleen M; Brennan, Jennifer; Brouillet, Jeremy Jean; Carr, Adrian; Cheung, Ming-Sin; Clawson, Hiram; Contrino, Sergio; Dannenberg, Luke O; Dernburg, Abby F; Desai, Arshad; Dick, Lindsay; Dosé, Andréa C; Du, Jiang; Egelhofer, Thea; Ercan, Sevinc; Euskirchen, Ghia; Ewing, Brent; Feingold, Elise A; Gassmann, Reto; Good, Peter J; Green, Phil; Gullier, Francois; Gutwein, Michelle; Guyer, Mark S; Habegger, Lukas; Han, Ting; Henikoff, Jorja G; Henz, Stefan R; Hinrichs, Angie; Holster, Heather; Hyman, Tony; Iniguez, A Leo; Janette, Judith; Jensen, Morten; Kato, Masaomi; Kent, W James; Kephart, Ellen; Khivansara, Vishal; Khurana, Ekta; Kim, John K; Kolasinska-Zwierz, Paulina; Lai, Eric C; Latorre, Isabel; Leahey, Amber; Lewis, Suzanna; Lloyd, Paul; Lochovsky, Lucas; Lowdon, Rebecca F; Lubling, Yaniv; Lyne, Rachel; MacCoss, Michael; Mackowiak, Sebastian D; Mangone, Marco; McKay, Sheldon; Mecenas, Desirea; Merrihew, Gennifer; Miller, David M; Muroyama, Andrew; Murray, John I; Ooi, Siew-Loon; Pham, Hoang; Phippen, Taryn; Preston, Elicia A; Rajewsky, Nikolaus; Rätsch, Gunnar; Rosenbaum, Heidi; Rozowsky, Joel; Rutherford, Kim; Ruzanov, Peter; Sarov, Mihail; Sasidharan, Rajkumar; Sboner, Andrea; Scheid, Paul; Segal, Eran; Shin, Hyunjin; Shou, Chong; Slack, Frank J; Slightam, Cindie; Smith, Richard; Spencer, William C; Stinson, E O; Taing, Scott; Takasaki, Teruaki; Vafeados, Dionne; Voronina, Ksenia; Wang, Guilin; Washington, Nicole L; Whittle, Christina M; Wu, Beijing; Yan, Koon-Kiu; Zeller, Georg; Zha, Zheng; Zhong, Mei; Zhou, Xingliang; Ahringer, Julie; Strome, Susan; Gunsalus, Kristin C; Micklem, Gos; Liu, X Shirley; Reinke, Valerie; Kim, Stuart K; Hillier, LaDeana W; Henikoff, Steven; Piano, Fabio; Snyder, Michael; Stein, Lincoln; Lieb, Jason D; Waterston, Robert H

    2010-12-24

    We systematically generated large-scale data sets to improve genome annotation for the nematode Caenorhabditis elegans, a key model organism. These data sets include transcriptome profiling across a developmental time course, genome-wide identification of transcription factor-binding sites, and maps of chromatin organization. From this, we created more complete and accurate gene models, including alternative splice forms and candidate noncoding RNAs. We constructed hierarchical networks of transcription factor-binding and microRNA interactions and discovered chromosomal locations bound by an unusually large number of transcription factors. Different patterns of chromatin composition and histone modification were revealed between chromosome arms and centers, with similarly prominent differences between autosomes and the X chromosome. Integrating data types, we built statistical models relating chromatin, transcription factor binding, and gene expression. Overall, our analyses ascribed putative functions to most of the conserved genome.

  6. 'Big data', Hadoop and cloud computing in genomics.

    PubMed

    O'Driscoll, Aisling; Daugelaite, Jurate; Sleator, Roy D

    2013-10-01

    Since the completion of the Human Genome project at the turn of the Century, there has been an unprecedented proliferation of genomic sequence data. A consequence of this is that the medical discoveries of the future will largely depend on our ability to process and analyse large genomic data sets, which continue to expand as the cost of sequencing decreases. Herein, we provide an overview of cloud computing and big data technologies, and discuss how such expertise can be used to deal with biology's big data sets. In particular, big data technologies such as the Apache Hadoop project, which provides distributed and parallelised data processing and analysis of petabyte (PB) scale data sets will be discussed, together with an overview of the current usage of Hadoop within the bioinformatics community. Copyright © 2013 Elsevier Inc. All rights reserved.

  7. A site specific model and analysis of the neutral somatic mutation rate in whole-genome cancer data.

    PubMed

    Bertl, Johanna; Guo, Qianyun; Juul, Malene; Besenbacher, Søren; Nielsen, Morten Muhlig; Hornshøj, Henrik; Pedersen, Jakob Skou; Hobolth, Asger

    2018-04-19

    Detailed modelling of the neutral mutational process in cancer cells is crucial for identifying driver mutations and understanding the mutational mechanisms that act during cancer development. The neutral mutational process is very complex: whole-genome analyses have revealed that the mutation rate differs between cancer types, between patients and along the genome depending on the genetic and epigenetic context. Therefore, methods that predict the number of different types of mutations in regions or specific genomic elements must consider local genomic explanatory variables. A major drawback of most methods is the need to average the explanatory variables across the entire region or genomic element. This procedure is particularly problematic if the explanatory variable varies dramatically in the element under consideration. To take into account the fine scale of the explanatory variables, we model the probabilities of different types of mutations for each position in the genome by multinomial logistic regression. We analyse 505 cancer genomes from 14 different cancer types and compare the performance in predicting mutation rate for both regional based models and site-specific models. We show that for 1000 randomly selected genomic positions, the site-specific model predicts the mutation rate much better than regional based models. We use a forward selection procedure to identify the most important explanatory variables. The procedure identifies site-specific conservation (phyloP), replication timing, and expression level as the best predictors for the mutation rate. Finally, our model confirms and quantifies certain well-known mutational signatures. We find that our site-specific multinomial regression model outperforms the regional based models. The possibility of including genomic variables on different scales and patient specific variables makes it a versatile framework for studying different mutational mechanisms. Our model can serve as the neutral null model for the mutational process; regions that deviate from the null model are candidates for elements that drive cancer development.

  8. Next-generation genome-scale models for metabolic engineering.

    PubMed

    King, Zachary A; Lloyd, Colton J; Feist, Adam M; Palsson, Bernhard O

    2015-12-01

    Constraint-based reconstruction and analysis (COBRA) methods have become widely used tools for metabolic engineering in both academic and industrial laboratories. By employing a genome-scale in silico representation of the metabolic network of a host organism, COBRA methods can be used to predict optimal genetic modifications that improve the rate and yield of chemical production. A new generation of COBRA models and methods is now being developed--encompassing many biological processes and simulation strategies-and next-generation models enable new types of predictions. Here, three key examples of applying COBRA methods to strain optimization are presented and discussed. Then, an outlook is provided on the next generation of COBRA models and the new types of predictions they will enable for systems metabolic engineering. Copyright © 2014 Elsevier Ltd. All rights reserved.

  9. GSuite HyperBrowser: integrative analysis of dataset collections across the genome and epigenome.

    PubMed

    Simovski, Boris; Vodák, Daniel; Gundersen, Sveinung; Domanska, Diana; Azab, Abdulrahman; Holden, Lars; Holden, Marit; Grytten, Ivar; Rand, Knut; Drabløs, Finn; Johansen, Morten; Mora, Antonio; Lund-Andersen, Christin; Fromm, Bastian; Eskeland, Ragnhild; Gabrielsen, Odd Stokke; Ferkingstad, Egil; Nakken, Sigve; Bengtsen, Mads; Nederbragt, Alexander Johan; Thorarensen, Hildur Sif; Akse, Johannes Andreas; Glad, Ingrid; Hovig, Eivind; Sandve, Geir Kjetil

    2017-07-01

    Recent large-scale undertakings such as ENCODE and Roadmap Epigenomics have generated experimental data mapped to the human reference genome (as genomic tracks) representing a variety of functional elements across a large number of cell types. Despite the high potential value of these publicly available data for a broad variety of investigations, little attention has been given to the analytical methodology necessary for their widespread utilisation. We here present a first principled treatment of the analysis of collections of genomic tracks. We have developed novel computational and statistical methodology to permit comparative and confirmatory analyses across multiple and disparate data sources. We delineate a set of generic questions that are useful across a broad range of investigations and discuss the implications of choosing different statistical measures and null models. Examples include contrasting analyses across different tissues or diseases. The methodology has been implemented in a comprehensive open-source software system, the GSuite HyperBrowser. To make the functionality accessible to biologists, and to facilitate reproducible analysis, we have also developed a web-based interface providing an expertly guided and customizable way of utilizing the methodology. With this system, many novel biological questions can flexibly be posed and rapidly answered. Through a combination of streamlined data acquisition, interoperable representation of dataset collections, and customizable statistical analysis with guided setup and interpretation, the GSuite HyperBrowser represents a first comprehensive solution for integrative analysis of track collections across the genome and epigenome. The software is available at: https://hyperbrowser.uio.no. © The Author 2017. Published by Oxford University Press.

  10. GenomeLandscaper: Landscape analysis of genome-fingerprints maps assessing chromosome architecture.

    PubMed

    Ai, Hannan; Ai, Yuncan; Meng, Fanmei

    2018-01-18

    Assessing correctness of an assembled chromosome architecture is a central challenge. We create a geometric analysis method (called GenomeLandscaper) to conduct landscape analysis of genome-fingerprints maps (GFM), trace large-scale repetitive regions, and assess their impacts on the global architectures of assembled chromosomes. We develop an alignment-free method for phylogenetics analysis. The human Y chromosomes (GRCh.chrY, HuRef.chrY and YH.chrY) are analysed as a proof-of-concept study. We construct a galaxy of genome-fingerprints maps (GGFM) for them, and a landscape compatibility among relatives is observed. But a long sharp straight line on the GGFM breaks such a landscape compatibility, distinguishing GRCh38p1.chrY (and throughout GRCh38p7.chrY) from GRCh37p13.chrY, HuRef.chrY and YH.chrY. We delete a 1.30-Mbp target segment to rescue the landscape compatibility, matching the antecedent GRCh37p13.chrY. We re-locate it into the modelled centromeric and pericentromeric region of GRCh38p10.chrY, matching a gap placeholder of GRCh37p13.chrY. We decompose it into sub-constituents (such as BACs, interspersed repeats, and tandem repeats) and trace their homologues by phylogenetics analysis. We elucidate that most examined tandem repeats are of reasonable quality, but the BAC-sized repeats, 173U1020C (176.46 Kbp) and 5U41068C (205.34 Kbp), are likely over-repeated. These results offer unique insights into the centromeric and pericentromeric regions of the human Y chromosomes.

  11. GT-WGS: an efficient and economic tool for large-scale WGS analyses based on the AWS cloud service.

    PubMed

    Wang, Yiqi; Li, Gen; Ma, Mark; He, Fazhong; Song, Zhuo; Zhang, Wei; Wu, Chengkun

    2018-01-19

    Whole-genome sequencing (WGS) plays an increasingly important role in clinical practice and public health. Due to the big data size, WGS data analysis is usually compute-intensive and IO-intensive. Currently it usually takes 30 to 40 h to finish a 50× WGS analysis task, which is far from the ideal speed required by the industry. Furthermore, the high-end infrastructure required by WGS computing is costly in terms of time and money. In this paper, we aim to improve the time efficiency of WGS analysis and minimize the cost by elastic cloud computing. We developed a distributed system, GT-WGS, for large-scale WGS analyses utilizing the Amazon Web Services (AWS). Our system won the first prize on the Wind and Cloud challenge held by Genomics and Cloud Technology Alliance conference (GCTA) committee. The system makes full use of the dynamic pricing mechanism of AWS. We evaluate the performance of GT-WGS with a 55× WGS dataset (400GB fastq) provided by the GCTA 2017 competition. In the best case, it only took 18.4 min to finish the analysis and the AWS cost of the whole process is only 16.5 US dollars. The accuracy of GT-WGS is 99.9% consistent with that of the Genome Analysis Toolkit (GATK) best practice. We also evaluated the performance of GT-WGS performance on a real-world dataset provided by the XiangYa hospital, which consists of 5× whole-genome dataset with 500 samples, and on average GT-WGS managed to finish one 5× WGS analysis task in 2.4 min at a cost of $3.6. WGS is already playing an important role in guiding therapeutic intervention. However, its application is limited by the time cost and computing cost. GT-WGS excelled as an efficient and affordable WGS analyses tool to address this problem. The demo video and supplementary materials of GT-WGS can be accessed at https://github.com/Genetalks/wgs_analysis_demo .

  12. Genome-wide analysis of Tol2 transposon reintegration in zebrafish.

    PubMed

    Kondrychyn, Igor; Garcia-Lecea, Marta; Emelyanov, Alexander; Parinov, Sergey; Korzh, Vladimir

    2009-09-08

    Tol2, a member of the hAT family of transposons, has become a useful tool for genetic manipulation of model animals, but information about its interactions with vertebrate genomes is still limited. Furthermore, published reports on Tol2 have mainly been based on random integration of the transposon system after co-injection of a plasmid DNA harboring the transposon and a transposase mRNA. It is important to understand how Tol2 would behave upon activation after integration into the genome. We performed a large-scale enhancer trap (ET) screen and generated 338 insertions of the Tol2 transposon-based ET cassette into the zebrafish genome. These insertions were generated by remobilizing the transposon from two different donor sites in two transgenic lines. We found that 39% of Tol2 insertions occurred in transcription units, mostly into introns. Analysis of the transposon target sites revealed no strict specificity at the DNA sequence level. However, Tol2 was prone to target AT-rich regions with weak palindromic consensus sequences centered at the insertion site. Our systematic analysis of sequential remobilizations of the Tol2 transposon from two independent sites within a vertebrate genome has revealed properties such as a tendency to integrate into transcription units and into AT-rich palindrome-like sequences. This information will influence the development of various applications involving DNA transposons and Tol2 in particular.

  13. Identification of 15 genetic loci associated with risk of major depression in individuals of European descent

    PubMed Central

    Hyde, Craig L.; Nagle, Mike W.; Tian, Chao; Chen, Xing; Paciga, Sara A.; Wendland, Jens R.; Tung, Joyce; Hinds, David A.; Perlis, Roy H.; Winslow, Ashley R.

    2016-01-01

    Despite strong evidence supporting the heritability of Major Depressive Disorder, previous genome-wide studies were unable to identify risk loci among individuals of European descent. We used self-reported data from 75,607 individuals reporting clinical diagnosis of depression and 231,747 reporting no history of depression through 23andMe, and meta-analyzed these results with published MDD GWAS results. We identified five independent variants from four regions associated with self-report of clinical diagnosis or treatment for depression. Loci with pval<1.0×10−5 in the meta-analysis were further analyzed in a replication dataset (45,773 cases and 106,354 controls) from 23andMe. A total of 17 independent SNPs from 15 regions reached genome-wide significance after joint-analysis over all three datasets. Some of these loci were also implicated in GWAS of related psychiatric traits. These studies provide evidence for large-scale consumer genomic data as a powerful and efficient complement to traditional means of ascertainment for neuropsychiatric disease genomics. PMID:27479909

  14. Whole genome sequence analysis of Geitlerinema sp. FC II unveils competitive edge of the strain in marine cultivation system for biofuel production.

    PubMed

    Batchu, Navish Kumar; Khater, Shradha; Patil, Sonal; Nagle, Vinod; Das, Gautam; Bhadra, Bhaskar; Sapre, Ajit; Dasgupta, Santanu

    2018-03-05

    A filamentous cyanobacteria, Geitlerinema sp. FC II, was isolated from marine algae culture pond at Reliance Industries Limited (RIL), India. The 6.7 Mb draft genome of FC II encodes for 6697 protein coding genes. Analysis of the whole genome sequence revealed presence of nif gene cluster, supporting its capability to fix atmospheric nitrogen. FC II genome contains two variants of sulfide:quinone oxidoreductases (SQR), which is a crucial elector donor in cyanobacterial metabolic processes. FC II is characterized by the presence of multiple CRISPR- Cas (Clustered Regularly Interspaced Short Palindrome Repeats - CRISPR associated proteins) clusters, multiple variants of genes encoding photosystem reaction centres, biosynthetic gene clusters of alkane, polyketides and non-ribosomal peptides. Presence of these pathways will help FC II in gaining an ecological advantage over other strains for biomass production in large scale cultivation system. Hence, FC II may be used for production of biofuel and other industrially important metabolites. Copyright © 2018 Elsevier Inc. All rights reserved.

  15. Genome-Scale Metabolic Reconstructions and Theoretical Investigation of Methane Conversion in Methylomicrobium buryatense Strain 5G(B1)

    DOE PAGES

    de la Torre, Andrea; Metivier, Aisha; Chu, Frances; ...

    2015-11-25

    Methane-utilizing bacteria (methanotrophs) are capable of growth on methane and are attractive systems for bio-catalysis. However, the application of natural methanotrophic strains to large-scale production of value-added chemicals/biofuels requires a number of physiological and genetic alterations. An accurate metabolic model coupled with flux balance analysis can provide a solid interpretative framework for experimental data analyses and integration.

  16. Linear score tests for variance components in linear mixed models and applications to genetic association studies.

    PubMed

    Qu, Long; Guennel, Tobias; Marshall, Scott L

    2013-12-01

    Following the rapid development of genome-scale genotyping technologies, genetic association mapping has become a popular tool to detect genomic regions responsible for certain (disease) phenotypes, especially in early-phase pharmacogenomic studies with limited sample size. In response to such applications, a good association test needs to be (1) applicable to a wide range of possible genetic models, including, but not limited to, the presence of gene-by-environment or gene-by-gene interactions and non-linearity of a group of marker effects, (2) accurate in small samples, fast to compute on the genomic scale, and amenable to large scale multiple testing corrections, and (3) reasonably powerful to locate causal genomic regions. The kernel machine method represented in linear mixed models provides a viable solution by transforming the problem into testing the nullity of variance components. In this study, we consider score-based tests by choosing a statistic linear in the score function. When the model under the null hypothesis has only one error variance parameter, our test is exact in finite samples. When the null model has more than one variance parameter, we develop a new moment-based approximation that performs well in simulations. Through simulations and analysis of real data, we demonstrate that the new test possesses most of the aforementioned characteristics, especially when compared to existing quadratic score tests or restricted likelihood ratio tests. © 2013, The International Biometric Society.

  17. Deppdb--DNA electrostatic potential properties database: electrostatic properties of genome DNA.

    PubMed

    Osypov, Alexander A; Krutinin, Gleb G; Kamzolova, Svetlana G

    2010-06-01

    The electrostatic properties of genome DNA influence its interactions with different proteins, in particular, the regulation of transcription by RNA-polymerases. DEPPDB--DNA Electrostatic Potential Properties Database--was developed to hold and provide all available information on the electrostatic properties of genome DNA combined with its sequence and annotation of biological and structural properties of genome elements and whole genomes. Genomes in DEPPDB are organized on a taxonomical basis. Currently, the database contains all the completely sequenced bacterial and viral genomes according to NCBI RefSeq. General properties of the genome DNA electrostatic potential profile and principles of its formation are revealed. This potential correlates with the GC content but does not correspond to it exactly and strongly depends on both the sequence arrangement and its context (flanking regions). Analysis of the promoter regions for bacterial and viral RNA polymerases revealed a correspondence between the scale of these proteins' physical properties and electrostatic profile patterns. We also discovered a direct correlation between the potential value and the binding frequency of RNA polymerase to DNA, supporting the idea of the role of electrostatics in these interactions. This matches a pronounced tendency of the promoter regions to possess higher values of the electrostatic potential.

  18. A High-Resolution InDel (Insertion–Deletion) Markers-Anchored Consensus Genetic Map Identifies Major QTLs Governing Pod Number and Seed Yield in Chickpea

    PubMed Central

    Srivastava, Rishi; Singh, Mohar; Bajaj, Deepak; Parida, Swarup K.

    2016-01-01

    Development and large-scale genotyping of user-friendly informative genome/gene-derived InDel markers in natural and mapping populations is vital for accelerating genomics-assisted breeding applications of chickpea with minimal resource expenses. The present investigation employed a high-throughput whole genome next-generation resequencing strategy in low and high pod number parental accessions and homozygous individuals constituting the bulks from each of two inter-specific mapping populations [(Pusa 1103 × ILWC 46) and (Pusa 256 × ILWC 46)] to develop non-erroneous InDel markers at a genome-wide scale. Comparing these high-quality genomic sequences, 82,360 InDel markers with reference to kabuli genome and 13,891 InDel markers exhibiting differentiation between low and high pod number parental accessions and bulks of aforementioned mapping populations were developed. These informative markers were structurally and functionally annotated in diverse coding and non-coding sequence components of genome/genes of kabuli chickpea. The functional significance of regulatory and coding (frameshift and large-effect mutations) InDel markers for establishing marker-trait linkages through association/genetic mapping was apparent. The markers detected a greater amplification (97%) and intra-specific polymorphic potential (58–87%) among a diverse panel of cultivated desi, kabuli, and wild accessions even by using a simpler cost-efficient agarose gel-based assay implicating their utility in large-scale genetic analysis especially in domesticated chickpea with narrow genetic base. Two high-density inter-specific genetic linkage maps generated using aforesaid mapping populations were integrated to construct a consensus 1479 InDel markers-anchored high-resolution (inter-marker distance: 0.66 cM) genetic map for efficient molecular mapping of major QTLs governing pod number and seed yield per plant in chickpea. Utilizing these high-density genetic maps as anchors, three major genomic regions harboring each of pod number and seed yield robust QTLs (15–28% phenotypic variation explained) were identified on chromosomes 2, 4, and 6. The integration of genetic and physical maps at these QTLs mapped on chromosomes scaled-down the long major QTL intervals into high-resolution short pod number and seed yield robust QTL physical intervals (0.89–2.94 Mb) which were essentially got validated in multiple genetic backgrounds of two chickpea mapping populations. The genome-wide InDel markers including natural allelic variants and genomic loci/genes delineated at major six especially in one colocalized novel congruent robust pod number and seed yield robust QTLs mapped on a high-density consensus genetic map were found most promising in chickpea. These functionally relevant molecular tags can drive marker-assisted genetic enhancement to develop high-yielding cultivars with increased seed/pod number and yield in chickpea. PMID:27695461

  19. Molecular evolution and expression of archosaurian β-keratins: diversification and expansion of archosaurian β-keratins and the origin of feather β-keratins.

    PubMed

    Greenwold, Matthew J; Sawyer, Roger H

    2013-09-01

    The archosauria consist of two living groups, crocodilians, and birds. Here we compare the structure, expression, and phylogeny of the beta (β)-keratins in two crocodilian genomes and two avian genomes to gain a better understanding of the evolutionary origin of the feather β-keratins. Unlike squamates such as the green anole with 40 β-keratins in its genome, the chicken and zebra finch genomes have over 100 β-keratin genes in their genomes, while the American alligator has 20 β-keratin genes, and the saltwater crocodile has 21 β-keratin genes. The crocodilian β-keratins are similar to those of birds and these structural proteins have a central filament domain and N- and C-termini, which contribute to the matrix material between the twisted β-sheets, which form the 2-3 nm filament. Overall the expression of alligator β-keratin genes in the integument increases during development. Phylogenetic analysis demonstrates that a crocodilian β-keratin clade forms a monophyletic group with the avian scale and feather β-keratins, suggesting that avian scale and feather β-keratins along with a subset of crocodilian β-keratins evolved from a common ancestral gene/s. Overall, our analyses support the view that the epidermal appendages of basal archosaurs used a diverse array of β-keratins, which evolved into crocodilian and avian specific clades. In birds, the scale and feather subfamilies appear to have evolved independently in the avian lineage from a subset of archosaurian claw β-keratins. The expansion of the avian specific feather β-keratin genes accompanied the diversification of birds and the evolution of feathers. Copyright © 2013 Wiley Periodicals, Inc.

  20. Improved evidence-based genome-scale metabolic models for maize leaf, embryo, and endosperm

    PubMed Central

    Seaver, Samuel M. D.; Bradbury, Louis M. T.; Frelin, Océane; Zarecki, Raphy; Ruppin, Eytan; Hanson, Andrew D.; Henry, Christopher S.

    2015-01-01

    There is a growing demand for genome-scale metabolic reconstructions for plants, fueled by the need to understand the metabolic basis of crop yield and by progress in genome and transcriptome sequencing. Methods are also required to enable the interpretation of plant transcriptome data to study how cellular metabolic activity varies under different growth conditions or even within different organs, tissues, and developmental stages. Such methods depend extensively on the accuracy with which genes have been mapped to the biochemical reactions in the plant metabolic pathways. Errors in these mappings lead to metabolic reconstructions with an inflated number of reactions and possible generation of unreliable metabolic phenotype predictions. Here we introduce a new evidence-based genome-scale metabolic reconstruction of maize, with significant improvements in the quality of the gene-reaction associations included within our model. We also present a new approach for applying our model to predict active metabolic genes based on transcriptome data. This method includes a minimal set of reactions associated with low expression genes to enable activity of a maximum number of reactions associated with high expression genes. We apply this method to construct an organ-specific model for the maize leaf, and tissue specific models for maize embryo and endosperm cells. We validate our models using fluxomics data for the endosperm and embryo, demonstrating an improved capacity of our models to fit the available fluxomics data. All models are publicly available via the DOE Systems Biology Knowledgebase and PlantSEED, and our new method is generally applicable for analysis transcript profiles from any plant, paving the way for further in silico studies with a wide variety of plant genomes. PMID:25806041

  1. Improved evidence-based genome-scale metabolic models for maize leaf, embryo, and endosperm

    DOE PAGES

    Seaver, Samuel M.D.; Bradbury, Louis M.T.; Frelin, Océane; ...

    2015-03-10

    There is a growing demand for genome-scale metabolic reconstructions for plants, fueled by the need to understand the metabolic basis of crop yield and by progress in genome and transcriptome sequencing. Methods are also required to enable the interpretation of plant transcriptome data to study how cellular metabolic activity varies under different growth conditions or even within different organs, tissues, and developmental stages. Such methods depend extensively on the accuracy with which genes have been mapped to the biochemical reactions in the plant metabolic pathways. Errors in these mappings lead to metabolic reconstructions with an inflated number of reactions andmore » possible generation of unreliable metabolic phenotype predictions. Here we introduce a new evidence-based genome-scale metabolic reconstruction of maize, with significant improvements in the quality of the gene-reaction associations included within our model. We also present a new approach for applying our model to predict active metabolic genes based on transcriptome data. This method includes a minimal set of reactions associated with low expression genes to enable activity of a maximum number of reactions associated with high expression genes. We apply this method to construct an organ-specific model for the maize leaf, and tissue specific models for maize embryo and endosperm cells. We validate our models using fluxomics data for the endosperm and embryo, demonstrating an improved capacity of our models to fit the available fluxomics data. All models are publicly available via the DOE Systems Biology Knowledgebase and PlantSEED, and our new method is generally applicable for analysis transcript profiles from any plant, paving the way for further in silico studies with a wide variety of plant genomes.« less

  2. Signatures of host specialization and a recent transposable element burst in the dynamic one-speed genome of the fungal barley powdery mildew pathogen.

    PubMed

    Frantzeskakis, Lamprinos; Kracher, Barbara; Kusch, Stefan; Yoshikawa-Maekawa, Makoto; Bauer, Saskia; Pedersen, Carsten; Spanu, Pietro D; Maekawa, Takaki; Schulze-Lefert, Paul; Panstruga, Ralph

    2018-05-22

    Powdery mildews are biotrophic pathogenic fungi infecting a number of economically important plants. The grass powdery mildew, Blumeria graminis, has become a model organism to study host specialization of obligate biotrophic fungal pathogens. We resolved the large-scale genomic architecture of B. graminis forma specialis hordei (Bgh) to explore the potential influence of its genome organization on the co-evolutionary process with its host plant, barley (Hordeum vulgare). The near-chromosome level assemblies of the Bgh reference isolate DH14 and one of the most diversified isolates, RACE1, enabled a comparative analysis of these haploid genomes, which are highly enriched with transposable elements (TEs). We found largely retained genome synteny and gene repertoires, yet detected copy number variation (CNV) of secretion signal peptide-containing protein-coding genes (SPs) and locally disrupted synteny blocks. Genes coding for sequence-related SPs are often locally clustered, but neither the SPs nor the TEs reside preferentially in genomic regions with unique features. Extended comparative analysis with different host-specific B. graminis formae speciales revealed the existence of a core suite of SPs, but also isolate-specific SP sets as well as congruence of SP CNV and phylogenetic relationship. We further detected evidence for a recent, lineage-specific expansion of TEs in the Bgh genome. The characteristics of the Bgh genome (largely retained synteny, CNV of SP genes, recently proliferated TEs and a lack of significant compartmentalization) are consistent with a "one-speed" genome that differs in its architecture and (co-)evolutionary pattern from the "two-speed" genomes reported for several other filamentous phytopathogens.

  3. Analysis of genetic diversity using SNP markers in oat

    USDA-ARS?s Scientific Manuscript database

    A large-scale single nucleotide polymorphism (SNP) discovery was carried out in cultivated oat using Roche 454 sequencing methods. DNA sequences were generated from cDNAs originating from a panel of 20 diverse oat cultivars, and from Diversity Array Technology (DArT) genomic complexity reductions fr...

  4. Large-scale gene-centric meta-analysis across 32 studies identifies multiple lipid loci

    USDA-ARS?s Scientific Manuscript database

    Genome-wide association studies (GWASs) have identified many SNPs underlying variations in plasma-lipid levels. We explore whether additional loci associated with plasma-lipid phenotypes, such as high-density lipoprotein cholesterol (HDL-C), low-density lipoprotein cholesterol (LDL-C), total cholest...

  5. A refined genome-scale reconstruction of Chlamydomonas metabolism provides a platform for systems-level analyses.

    PubMed

    Imam, Saheed; Schäuble, Sascha; Valenzuela, Jacob; López García de Lomana, Adrián; Carter, Warren; Price, Nathan D; Baliga, Nitin S

    2015-12-01

    Microalgae have reemerged as organisms of prime biotechnological interest due to their ability to synthesize a suite of valuable chemicals. To harness the capabilities of these organisms, we need a comprehensive systems-level understanding of their metabolism, which can be fundamentally achieved through large-scale mechanistic models of metabolism. In this study, we present a revised and significantly improved genome-scale metabolic model for the widely-studied microalga, Chlamydomonas reinhardtii. The model, iCre1355, represents a major advance over previous models, both in content and predictive power. iCre1355 encompasses a broad range of metabolic functions encoded across the nuclear, chloroplast and mitochondrial genomes accounting for 1355 genes (1460 transcripts), 2394 and 1133 metabolites. We found improved performance over the previous metabolic model based on comparisons of predictive accuracy across 306 phenotypes (from 81 mutants), lipid yield analysis and growth rates derived from chemostat-grown cells (under three conditions). Measurement of macronutrient uptake revealed carbon and phosphate to be good predictors of growth rate, while nitrogen consumption appeared to be in excess. We analyzed high-resolution time series transcriptomics data using iCre1355 to uncover dynamic pathway-level changes that occur in response to nitrogen starvation and changes in light intensity. This approach enabled accurate prediction of growth rates, the cessation of growth and accumulation of triacylglycerols during nitrogen starvation, and the temporal response of different growth-associated pathways to increased light intensity. Thus, iCre1355 represents an experimentally validated genome-scale reconstruction of C. reinhardtii metabolism that should serve as a useful resource for studying the metabolic processes of this and related microalgae. © 2015 The Authors The Plant Journal © 2015 John Wiley & Sons Ltd.

  6. A refined genome-scale reconstruction of Chlamydomonas metabolism provides a platform for systems-level analyses

    DOE PAGES

    Imam, Saheed; Schäuble, Sascha; Valenzuela, Jacob; ...

    2015-10-20

    Microalgae have reemerged as organisms of prime biotechnological interest due to their ability to synthesize a suite of valuable chemicals. To harness the capabilities of these organisms, we need a comprehensive systems-level understanding of their metabolism, which can be fundamentally achieved through large-scale mechanistic models of metabolism. In this study, we present a revised and significantly improved genome-scale metabolic model for the widely-studied microalga, Chlamydomonas reinhardtii. The model, iCre1355, represents a major advance over previous models, both in content and predictive power. iCre1355 encompasses a broad range of metabolic functions encoded across the nuclear, chloroplast and mitochondrial genomes accounting formore » 1355 genes (1460 transcripts), 2394 and 1133 metabolites. We found improved performance over the previous metabolic model based on comparisons of predictive accuracy across 306 phenotypes (from 81 mutants), lipid yield analysis and growth rates derived from chemostat-grown cells (under three conditions). Measurement of macronutrient uptake revealed carbon and phosphate to be good predictors of growth rate, while nitrogen consumption appeared to be in excess. We analyzed high-resolution time series transcriptomics data using iCre1355 to uncover dynamic pathway-level changes that occur in response to nitrogen starvation and changes in light intensity. This approach enabled accurate prediction of growth rates, the cessation of growth and accumulation of triacylglycerols during nitrogen starvation, and the temporal response of different growth-associated pathways to increased light intensity. Thus, iCre1355 represents an experimentally validated genome-scale reconstruction of C. reinhardtii metabolism that should serve as a useful resource for studying the metabolic processes of this and related microalgae.« less

  7. Physiologically Shrinking the Solution Space of a Saccharomyces cerevisiae Genome-Scale Model Suggests the Role of the Metabolic Network in Shaping Gene Expression Noise.

    PubMed

    Chi, Baofang; Tao, Shiheng; Liu, Yanlin

    2015-01-01

    Sampling the solution space of genome-scale models is generally conducted to determine the feasible region for metabolic flux distribution. Because the region for actual metabolic states resides only in a small fraction of the entire space, it is necessary to shrink the solution space to improve the predictive power of a model. A common strategy is to constrain models by integrating extra datasets such as high-throughput datasets and C13-labeled flux datasets. However, studies refining these approaches by performing a meta-analysis of massive experimental metabolic flux measurements, which are closely linked to cellular phenotypes, are limited. In the present study, experimentally identified metabolic flux data from 96 published reports were systematically reviewed. Several strong associations among metabolic flux phenotypes were observed. These phenotype-phenotype associations at the flux level were quantified and integrated into a Saccharomyces cerevisiae genome-scale model as extra physiological constraints. By sampling the shrunken solution space of the model, the metabolic flux fluctuation level, which is an intrinsic trait of metabolic reactions determined by the network, was estimated and utilized to explore its relationship to gene expression noise. Although no correlation was observed in all enzyme-coding genes, a relationship between metabolic flux fluctuation and expression noise of genes associated with enzyme-dosage sensitive reactions was detected, suggesting that the metabolic network plays a role in shaping gene expression noise. Such correlation was mainly attributed to the genes corresponding to non-essential reactions, rather than essential ones. This was at least partially, due to regulations underlying the flux phenotype-phenotype associations. Altogether, this study proposes a new approach in shrinking the solution space of a genome-scale model, of which sampling provides new insights into gene expression noise.

  8. Comparative Genomic Hybridization–Array Analysis Enhances the Detection of Aneuploidies and Submicroscopic Imbalances in Spontaneous Miscarriages

    PubMed Central

    Schaeffer, Anthony J. ; Chung, June ; Heretis, Konstantina ; Wong, Andrew ; Ledbetter, David H. ; Lese Martin, Christa 

    2004-01-01

    Miscarriage is a condition that affects 10%–15% of all clinically recognized pregnancies, most of which occur in the first trimester. Approximately 50% of first-trimester miscarriages result from fetal chromosome abnormalities. Currently, G-banded chromosome analysis is used to determine if large-scale genetic imbalances are the cause of these pregnancy losses. This technique relies on the culture of cells derived from the fetus, a technique that has many limitations, including a high rate of culture failure, maternal overgrowth of fetal cells, and poor chromosome morphology. Comparative genomic hybridization (CGH)–array analysis is a powerful new molecular cytogenetic technique that allows genomewide analysis of DNA copy number. By hybridizing patient DNA and normal reference DNA to arrays of genomic clones, unbalanced gains or losses of genetic material across the genome can be detected. In this study, 41 product-of-conception (POC) samples, which were previously analyzed by G-banding, were tested using CGH arrays to determine not only if the array could identify all reported abnormalities, but also whether any previously undetected genomic imbalances would be discovered. The array methodology detected all abnormalities as reported by G-banding analysis and revealed new abnormalities in 4/41 (9.8%) cases. Of those, one trisomy 21 POC was also mosaic for trisomy 20, one had a duplication of the 10q telomere region, one had an interstitial deletion of chromosome 9p, and the fourth had an interstitial duplication of the Prader-Willi/Angelman syndrome region on chromosome 15q, which, if maternally inherited, has been implicated in autism. This retrospective study demonstrates that the DNA-based CGH-array technology overcomes many of the limitations of routine cytogenetic analysis of POC samples while enhancing the detection of fetal chromosome aberrations. PMID:15127362

  9. DNA Data Bank of Japan (DDBJ) for genome scale research in life science

    PubMed Central

    Tateno, Y.; Imanishi, T.; Miyazaki, S.; Fukami-Kobayashi, K.; Saitou, N.; Sugawara, H.; Gojobori, T.

    2002-01-01

    The DNA Data Bank of Japan (DDBJ, http://www.ddbj.nig.ac.jp) has made an effort to collect as much data as possible mainly from Japanese researchers. The increase rates of the data we collected, annotated and released to the public in the past year are 43% for the number of entries and 52% for the number of bases. The increase rates are accelerated even after the human genome was sequenced, because sequencing technology has been remarkably advanced and simplified, and research in life science has been shifted from the gene scale to the genome scale. In addition, we have developed the Genome Information Broker (GIB, http://gib.genes.nig.ac.jp) that now includes more than 50 complete microbial genome and Arabidopsis genome data. We have also developed a database of the human genome, the Human Genomics Studio (HGS, http://studio.nig.ac.jp). HGS provides one with a set of sequences being as continuous as possible in any one of the 24 chromosomes. Both GIB and HGS have been updated incorporating newly available data and retrieval tools. PMID:11752245

  10. An interactive web-based application for Comprehensive Analysis of RNAi-screen Data.

    PubMed

    Dutta, Bhaskar; Azhir, Alaleh; Merino, Louis-Henri; Guo, Yongjian; Revanur, Swetha; Madhamshettiwar, Piyush B; Germain, Ronald N; Smith, Jennifer A; Simpson, Kaylene J; Martin, Scott E; Buehler, Eugen; Beuhler, Eugen; Fraser, Iain D C

    2016-02-23

    RNAi screens are widely used in functional genomics. Although the screen data can be susceptible to a number of experimental biases, many of these can be corrected by computational analysis. For this purpose, here we have developed a web-based platform for integrated analysis and visualization of RNAi screen data named CARD (for Comprehensive Analysis of RNAi Data; available at https://card.niaid.nih.gov). CARD allows the user to seamlessly carry out sequential steps in a rigorous data analysis workflow, including normalization, off-target analysis, integration of gene expression data, optimal thresholds for hit selection and network/pathway analysis. To evaluate the utility of CARD, we describe analysis of three genome-scale siRNA screens and demonstrate: (i) a significant increase both in selection of subsequently validated hits and in rejection of false positives, (ii) an increased overlap of hits from independent screens of the same biology and (iii) insight to microRNA (miRNA) activity based on siRNA seed enrichment.

  11. An interactive web-based application for Comprehensive Analysis of RNAi-screen Data

    PubMed Central

    Dutta, Bhaskar; Azhir, Alaleh; Merino, Louis-Henri; Guo, Yongjian; Revanur, Swetha; Madhamshettiwar, Piyush B.; Germain, Ronald N.; Smith, Jennifer A.; Simpson, Kaylene J.; Martin, Scott E.; Beuhler, Eugen; Fraser, Iain D. C.

    2016-01-01

    RNAi screens are widely used in functional genomics. Although the screen data can be susceptible to a number of experimental biases, many of these can be corrected by computational analysis. For this purpose, here we have developed a web-based platform for integrated analysis and visualization of RNAi screen data named CARD (for Comprehensive Analysis of RNAi Data; available at https://card.niaid.nih.gov). CARD allows the user to seamlessly carry out sequential steps in a rigorous data analysis workflow, including normalization, off-target analysis, integration of gene expression data, optimal thresholds for hit selection and network/pathway analysis. To evaluate the utility of CARD, we describe analysis of three genome-scale siRNA screens and demonstrate: (i) a significant increase both in selection of subsequently validated hits and in rejection of false positives, (ii) an increased overlap of hits from independent screens of the same biology and (iii) insight to microRNA (miRNA) activity based on siRNA seed enrichment. PMID:26902267

  12. SQC: secure quality control for meta-analysis of genome-wide association studies.

    PubMed

    Huang, Zhicong; Lin, Huang; Fellay, Jacques; Kutalik, Zoltán; Hubaux, Jean-Pierre

    2017-08-01

    Due to the limited power of small-scale genome-wide association studies (GWAS), researchers tend to collaborate and establish a larger consortium in order to perform large-scale GWAS. Genome-wide association meta-analysis (GWAMA) is a statistical tool that aims to synthesize results from multiple independent studies to increase the statistical power and reduce false-positive findings of GWAS. However, it has been demonstrated that the aggregate data of individual studies are subject to inference attacks, hence privacy concerns arise when researchers share study data in GWAMA. In this article, we propose a secure quality control (SQC) protocol, which enables checking the quality of data in a privacy-preserving way without revealing sensitive information to a potential adversary. SQC employs state-of-the-art cryptographic and statistical techniques for privacy protection. We implement the solution in a meta-analysis pipeline with real data to demonstrate the efficiency and scalability on commodity machines. The distributed execution of SQC on a cluster of 128 cores for one million genetic variants takes less than one hour, which is a modest cost considering the 10-month time span usually observed for the completion of the QC procedure that includes timing of logistics. SQC is implemented in Java and is publicly available at https://github.com/acs6610987/secureqc. jean-pierre.hubaux@epfl.ch. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

  13. Functional phylogenomics analysis of bacteria and archaea using consistent genome annotation with UniFam

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chai, Juanjuan; Kora, Guruprasad; Ahn, Tae-Hyuk

    2014-10-09

    To supply some background, phylogenetic studies have provided detailed knowledge on the evolutionary mechanisms of genes and species in Bacteria and Archaea. However, the evolution of cellular functions, represented by metabolic pathways and biological processes, has not been systematically characterized. Many clades in the prokaryotic tree of life have now been covered by sequenced genomes in GenBank. This enables a large-scale functional phylogenomics study of many computationally inferred cellular functions across all sequenced prokaryotes. Our results show a total of 14,727 GenBank prokaryotic genomes were re-annotated using a new protein family database, UniFam, to obtain consistent functional annotations for accuratemore » comparison. The functional profile of a genome was represented by the biological process Gene Ontology (GO) terms in its annotation. The GO term enrichment analysis differentiated the functional profiles between selected archaeal taxa. 706 prokaryotic metabolic pathways were inferred from these genomes using Pathway Tools and MetaCyc. The consistency between the distribution of metabolic pathways in the genomes and the phylogenetic tree of the genomes was measured using parsimony scores and retention indices. The ancestral functional profiles at the internal nodes of the phylogenetic tree were reconstructed to track the gains and losses of metabolic pathways in evolutionary history. In conclusion, our functional phylogenomics analysis shows divergent functional profiles of taxa and clades. Such function-phylogeny correlation stems from a set of clade-specific cellular functions with low parsimony scores. On the other hand, many cellular functions are sparsely dispersed across many clades with high parsimony scores. These different types of cellular functions have distinct evolutionary patterns reconstructed from the prokaryotic tree.« less

  14. Genome-scale resources for Thermoanaerobacterium saccharolyticum.

    PubMed

    Currie, Devin H; Raman, Babu; Gowen, Christopher M; Tschaplinski, Timothy J; Land, Miriam L; Brown, Steven D; Covalla, Sean F; Klingeman, Dawn M; Yang, Zamin K; Engle, Nancy L; Johnson, Courtney M; Rodriguez, Miguel; Shaw, A Joe; Kenealy, William R; Lynd, Lee R; Fong, Stephen S; Mielenz, Jonathan R; Davison, Brian H; Hogsett, David A; Herring, Christopher D

    2015-06-26

    Thermoanaerobacterium saccharolyticum is a hemicellulose-degrading thermophilic anaerobe that was previously engineered to produce ethanol at high yield. A major project was undertaken to develop this organism into an industrial biocatalyst, but the lack of genome information and resources were recognized early on as a key limitation. Here we present a set of genome-scale resources to enable the systems level investigation and development of this potentially important industrial organism. Resources include a complete genome sequence for strain JW/SL-YS485, a genome-scale reconstruction of metabolism, tiled microarray data showing transcription units, mRNA expression data from 71 different growth conditions or timepoints and GC/MS-based metabolite analysis data from 42 different conditions or timepoints. Growth conditions include hemicellulose hydrolysate, the inhibitors HMF, furfural, diamide, and ethanol, as well as high levels of cellulose, xylose, cellobiose or maltodextrin. The genome consists of a 2.7 Mbp chromosome and a 110 Kbp megaplasmid. An active prophage was also detected, and the expression levels of CRISPR genes were observed to increase in association with those of the phage. Hemicellulose hydrolysate elicited a response of carbohydrate transport and catabolism genes, as well as poorly characterized genes suggesting a redox challenge. In some conditions, a time series of combined transcription and metabolite measurements were made to allow careful study of microbial physiology under process conditions. As a demonstration of the potential utility of the metabolic reconstruction, the OptKnock algorithm was used to predict a set of gene knockouts that maximize growth-coupled ethanol production. The predictions validated intuitive strain designs and matched previous experimental results. These data will be a useful asset for efforts to develop T. saccharolyticum for efficient industrial production of biofuels. The resources presented herein may also be useful on a comparative basis for development of other lignocellulose degrading microbes, such as Clostridium thermocellum.

  15. Emergent Self-Organized Criticality in Gene Expression Dynamics: Temporal Development of Global Phase Transition Revealed in a Cancer Cell Line

    PubMed Central

    Tsuchiya, Masa; Giuliani, Alessandro; Hashimoto, Midori; Erenpreisa, Jekaterina; Yoshikawa, Kenichi

    2015-01-01

    Background The underlying mechanism of dynamic control of the genome-wide expression is a fundamental issue in bioscience. We addressed it in terms of phase transition by a systemic approach based on both density analysis and characteristics of temporal fluctuation for the time-course mRNA expression in differentiating MCF-7 breast cancer cells. Methodology In a recent work, we suggested criticality as an essential aspect of dynamic control of genome-wide gene expression. Criticality was evident by a unimodal-bimodal transition through flattened unimodal expression profile. The flatness on the transition suggests the existence of a critical transition at which up- and down-regulated expression is balanced. Mean field (averaging) behavior of mRNAs based on the temporal expression changes reveals a sandpile type of transition in the flattened profile. Furthermore, around the transition, a self-similar unimodal-bimodal transition of the whole expression occurs in the density profile of an ensemble of mRNA expression. These singular and scaling behaviors identify the transition as the expression phase transition driven by self-organized criticality (SOC). Principal Findings Emergent properties of SOC through a mean field approach are revealed: i) SOC, as a form of genomic phase transition, consolidates distinct critical states of expression, ii) Coupling of coherent stochastic oscillations between critical states on different time-scales gives rise to SOC, and iii) Specific gene clusters (barcode genes) ranging in size from kbp to Mbp reveal similar SOC to genome-wide mRNA expression and ON-OFF synchronization to critical states. This suggests that the cooperative gene regulation of topological genome sub-units is mediated by the coherent phase transitions of megadomain-scaled conformations between compact and swollen chromatin states. Conclusion and Significance In summary, our study provides not only a systemic method to demonstrate SOC in whole-genome expression, but also introduces novel, physically grounded concepts for a breakthrough in the study of biological regulation. PMID:26067993

  16. Genome-scale case-control analysis of CD4+ T-cell DNA methylation in juvenile idiopathic arthritis reveals potential targets involved in disease.

    PubMed

    Ellis, Justine A; Munro, Jane E; Chavez, Raul A; Gordon, Lavinia; Joo, Jihoon E; Akikusa, Jonathan D; Allen, Roger C; Ponsonby, Anne-Louise; Craig, Jeffrey M; Saffery, Richard

    2012-11-13

    Juvenile Idiopathic Arthritis (JIA) is a complex autoimmune rheumatic disease of largely unknown cause. Evidence is growing that epigenetic variation, particularly DNA methylation, is associated with autoimmune disease. However, nothing is currently known about the potential role of aberrant DNA methylation in JIA. As a first step to addressing this knowledge gap, we have profiled DNA methylation in purified CD4+ T cells from JIA subjects and controls. Genomic DNA was isolated from peripheral blood CD4+ T cells from 14 oligoarticular and polyarticular JIA cases with active disease, and healthy age- and sex-matched controls. Genome-scale methylation analysis was carried out using the Illumina Infinium HumanMethylation27 BeadChip. Methylation data at >25,000 CpGs was compared in a case-control study design. Methylation levels were significantly different (FDR adjusted p<0.1) at 145 loci. Removal of four samples exposed to methotrexate had a striking impact on the outcome of the analysis, reducing the number of differentially methylated loci to 11. The methotrexate-naive analysis identified reduced methylation at the gene encoding the pro-inflammatory cytokine IL32, which was subsequently replicated using a second analysis platform and a second set of case-control pairs. Our data suggests that differential T cell DNA methylation may be a feature of JIA, and that reduced methylation at IL32 is associated with this disease. Further work in larger prospective and longitudinal sample collections is required to confirm these findings, assess whether the identified differences are causal or consequential of disease, and further investigate the epigenetic modifying properties of therapeutic regimens.

  17. Somatic, positive and negative domains of the Center for Epidemiological Studies Depression (CES-D) scale: a meta-analysis of genome-wide association studies.

    PubMed

    Demirkan, A; Lahti, J; Direk, N; Viktorin, A; Lunetta, K L; Terracciano, A; Nalls, M A; Tanaka, T; Hek, K; Fornage, M; Wellmann, J; Cornelis, M C; Ollila, H M; Yu, L; Smith, J A; Pilling, L C; Isaacs, A; Palotie, A; Zhuang, W V; Zonderman, A; Faul, J D; Sutin, A; Meirelles, O; Mulas, A; Hofman, A; Uitterlinden, A; Rivadeneira, F; Perola, M; Zhao, W; Salomaa, V; Yaffe, K; Luik, A I; Liu, Y; Ding, J; Lichtenstein, P; Landén, M; Widen, E; Weir, D R; Llewellyn, D J; Murray, A; Kardia, S L R; Eriksson, J G; Koenen, K; Magnusson, P K E; Ferrucci, L; Mosley, T H; Cucca, F; Oostra, B A; Bennett, D A; Paunio, T; Berger, K; Harris, T B; Pedersen, N L; Murabito, J M; Tiemeier, H; van Duijn, C M; Räikkönen, K

    2016-06-01

    Major depressive disorder (MDD) is moderately heritable, however genome-wide association studies (GWAS) for MDD, as well as for related continuous outcomes, have not shown consistent results. Attempts to elucidate the genetic basis of MDD may be hindered by heterogeneity in diagnosis. The Center for Epidemiological Studies Depression (CES-D) scale provides a widely used tool for measuring depressive symptoms clustered in four different domains which can be combined together into a total score but also can be analysed as separate symptom domains. We performed a meta-analysis of GWAS of the CES-D symptom clusters. We recruited 12 cohorts with the 20- or 10-item CES-D scale (32 528 persons). One single nucleotide polymorphism (SNP), rs713224, located near the brain-expressed melatonin receptor (MTNR1A) gene, was associated with the somatic complaints domain of depression symptoms, with borderline genome-wide significance (p discovery = 3.82 × 10-8). The SNP was analysed in an additional five cohorts comprising the replication sample (6813 persons). However, the association was not consistent among the replication sample (p discovery+replication = 1.10 × 10-6) with evidence of heterogeneity. Despite the effort to harmonize the phenotypes across cohorts and participants, our study is still underpowered to detect consistent association for depression, even by means of symptom classification. On the contrary, the SNP-based heritability and co-heritability estimation results suggest that a very minor part of the variation could be captured by GWAS, explaining the reason of sparse findings.

  18. Dissecting metabolic behavior of lipid over-producing strain of Mucor circinelloides through genome-scale metabolic network and multi-level data integration.

    PubMed

    Vongsangnak, Wanwipa; Kingkaw, Amornthep; Yang, Junhuan; Song, Yuanda; Laoteng, Kobkul

    2018-09-05

    Lipid accumulation is an important cellular process of oleaginous microorganisms. To dissect metabolic behavior of oleaginous Zygomycetes, the lipid over-producing strain, Mucor circinelloides WJ11, was subjected for omics-scale analysis. The genome annotation was improved and used for construction of genome-scale metabolic network of WJ11 strain. Then, the quality of the metabolic network was enhanced by incorporating gene and protein expression data. In addition to the known oleaginous genes, our results showed a number of newly identified unique genes of WJ11 strain, which involved in central carbon metabolism, lipid, amino acid and nitrogen metabolisms. The systematic compilations indicated the additional metabolic routes with the involvement in supplying precursors (acetyl-CoA, NADPH and fatty acyl substrate) for fatty acid and lipid biosynthesis. Interestingly, amino acid metabolism played a substantial role in responsive mechanism of the fungal cells to nutrient imbalance circumstance through lipogenesis as the finding of reporter metabolites (l-methionine, l-glutamate, l-aspartate, l-asparagine and l-glutamine) at lipid-accumulating stage. The cooperative function of certain lipid-degrading enzymes at the particular growth stage was elucidated by integrating the metabolic networks with gene expression data. The unique feature of carotenoid biosynthetic route in WJ11 strain was also identified by protein domain analysis. Taken together, there were cross-functional metabolisms in regulating lipid biosynthesis and retaining high level of cellular lipids in the representative of lipid over-producing strains. Copyright © 2018 Elsevier B.V. All rights reserved.

  19. Integrating Kinetic Model of E. coli with Genome Scale Metabolic Fluxes Overcomes Its Open System Problem and Reveals Bistability in Central Metabolism

    PubMed Central

    Mannan, Ahmad A.; Toya, Yoshihiro; Shimizu, Kazuyuki; McFadden, Johnjoe; Kierzek, Andrzej M.; Rocco, Andrea

    2015-01-01

    An understanding of the dynamics of the metabolic profile of a bacterial cell is sought from a dynamical systems analysis of kinetic models. This modelling formalism relies on a deterministic mathematical description of enzyme kinetics and their metabolite regulation. However, it is severely impeded by the lack of available kinetic information, limiting the size of the system that can be modelled. Furthermore, the subsystem of the metabolic network whose dynamics can be modelled is faced with three problems: how to parameterize the model with mostly incomplete steady state data, how to close what is now an inherently open system, and how to account for the impact on growth. In this study we address these challenges of kinetic modelling by capitalizing on multi-‘omics’ steady state data and a genome-scale metabolic network model. We use these to generate parameters that integrate knowledge embedded in the genome-scale metabolic network model, into the most comprehensive kinetic model of the central carbon metabolism of E. coli realized to date. As an application, we performed a dynamical systems analysis of the resulting enriched model. This revealed bistability of the central carbon metabolism and thus its potential to express two distinct metabolic states. Furthermore, since our model-informing technique ensures both stable states are constrained by the same thermodynamically feasible steady state growth rate, the ensuing bistability represents a temporal coexistence of the two states, and by extension, reveals the emergence of a phenotypically heterogeneous population. PMID:26469081

  20. Network thermodynamic curation of human and yeast genome-scale metabolic models.

    PubMed

    Martínez, Verónica S; Quek, Lake-Ee; Nielsen, Lars K

    2014-07-15

    Genome-scale models are used for an ever-widening range of applications. Although there has been much focus on specifying the stoichiometric matrix, the predictive power of genome-scale models equally depends on reaction directions. Two-thirds of reactions in the two eukaryotic reconstructions Homo sapiens Recon 1 and Yeast 5 are specified as irreversible. However, these specifications are mainly based on biochemical textbooks or on their similarity to other organisms and are rarely underpinned by detailed thermodynamic analysis. In this study, a to our knowledge new workflow combining network-embedded thermodynamic and flux variability analysis was used to evaluate existing irreversibility constraints in Recon 1 and Yeast 5 and to identify new ones. A total of 27 and 16 new irreversible reactions were identified in Recon 1 and Yeast 5, respectively, whereas only four reactions were found with directions incorrectly specified against thermodynamics (three in Yeast 5 and one in Recon 1). The workflow further identified for both models several isolated internal loops that require further curation. The framework also highlighted the need for substrate channeling (in human) and ATP hydrolysis (in yeast) for the essential reaction catalyzed by phosphoribosylaminoimidazole carboxylase in purine metabolism. Finally, the framework highlighted differences in proline metabolism between yeast (cytosolic anabolism and mitochondrial catabolism) and humans (exclusively mitochondrial metabolism). We conclude that network-embedded thermodynamics facilitates the specification and validation of irreversibility constraints in compartmentalized metabolic models, at the same time providing further insight into network properties. Copyright © 2014 Biophysical Society. Published by Elsevier Inc. All rights reserved.

  1. Revealing genome-scale transcriptional regulatory landscape of OmpR highlights its expanded regulatory roles under osmotic stress in Escherichia coli K-12 MG1655.

    PubMed

    Seo, Sang Woo; Gao, Ye; Kim, Donghyuk; Szubin, Richard; Yang, Jina; Cho, Byung-Kwan; Palsson, Bernhard O

    2017-05-19

    A transcription factor (TF), OmpR, plays a critical role in transcriptional regulation of the osmotic stress response in bacteria. Here, we reveal a genome-scale OmpR regulon in Escherichia coli K-12 MG1655. Integrative data analysis reveals that a total of 37 genes in 24 transcription units (TUs) belong to OmpR regulon. Among them, 26 genes show more than two-fold changes in expression level in an OmpR knock-out strain. Specifically, we find that: 1) OmpR regulates mostly membrane-located gene products involved in diverse fundamental biological processes, such as narU (encoding nitrate/nitrite transporter), ompX (encoding outer membrane protein X), and nuoN (encoding NADH:ubiquinone oxidoreductase); 2) by investigating co-regulation of entire sets of genes regulated by other stress-response TFs, stresses are surprisingly independently regulated among each other; and, 3) a detailed investigation of the physiological roles of the newly discovered OmpR regulon genes reveals that activation of narU represents a novel strategy to significantly improve osmotic stress tolerance of E. coli. Thus, the genome-scale approach to elucidating regulons comprehensively identifies regulated genes and leads to fundamental discoveries related to stress responses.

  2. Construction of a Genome-Scale Metabolic Model of Arthrospira platensis NIES-39 and Metabolic Design for Cyanobacterial Bioproduction

    PubMed Central

    Yoshikawa, Katsunori; Aikawa, Shimpei; Kojima, Yuta; Toya, Yoshihiro; Furusawa, Chikara; Kondo, Akihiko; Shimizu, Hiroshi

    2015-01-01

    Arthrospira (Spirulina) platensis is a promising feedstock and host strain for bioproduction because of its high accumulation of glycogen and superior characteristics for industrial production. Metabolic simulation using a genome-scale metabolic model and flux balance analysis is a powerful method that can be used to design metabolic engineering strategies for the improvement of target molecule production. In this study, we constructed a genome-scale metabolic model of A. platensis NIES-39 including 746 metabolic reactions and 673 metabolites, and developed novel strategies to improve the production of valuable metabolites, such as glycogen and ethanol. The simulation results obtained using the metabolic model showed high consistency with experimental results for growth rates under several trophic conditions and growth capabilities on various organic substrates. The metabolic model was further applied to design a metabolic network to improve the autotrophic production of glycogen and ethanol. Decreased flux of reactions related to the TCA cycle and phosphoenolpyruvate reaction were found to improve glycogen production. Furthermore, in silico knockout simulation indicated that deletion of genes related to the respiratory chain, such as NAD(P)H dehydrogenase and cytochrome-c oxidase, could enhance ethanol production by using ammonium as a nitrogen source. PMID:26640947

  3. Origin of a cryptic lineage in a threatened reptile through isolation and historical hybridization.

    PubMed

    Sovic, M G; Fries, A C; Gibbs, H L

    2016-11-01

    Identifying phylogenetically distinct lineages and understanding the evolutionary processes by which they have arisen are important goals of phylogeography. This information can also help define conservation units in endangered species. Such analyses are being transformed by the availability of genomic-scale data sets and novel analytical approaches for statistically comparing different historical scenarios as causes of phylogeographic patterns. Here, we use genomic-scale restriction-site-associated DNA sequencing (RADseq) data to test for distinct lineages in the endangered Eastern Massasauga Rattlesnake (Sistrurus catenatus). We then use coalescent-based modeling techniques to identify the evolutionary mechanisms responsible for the origin of the lineages in this species. We find equivocal evidence for distinct phylogenetic lineages within S. catenatus east of the Mississippi River, but strong support for a previously unrecognized lineage on the western edge of the range of this snake, represented by populations from Iowa, USA. Snakes from these populations show patterns of genetic admixture with a nearby non-threatened sister species (Sistrurus tergeminus). Tests of historical demographic models support the hypothesis that the genetic distinctiveness of Iowa snakes is due to a combination of isolation and historical introgression between S. catenatus and S. tergeminus. Our work provides an example of how model-based analysis of genomic-scale data can help identify conservation units in rare species.

  4. Quantitative Assessment of Thermodynamic Constraints on the Solution Space of Genome-Scale Metabolic Models

    PubMed Central

    Hamilton, Joshua J.; Dwivedi, Vivek; Reed, Jennifer L.

    2013-01-01

    Constraint-based methods provide powerful computational techniques to allow understanding and prediction of cellular behavior. These methods rely on physiochemical constraints to eliminate infeasible behaviors from the space of available behaviors. One such constraint is thermodynamic feasibility, the requirement that intracellular flux distributions obey the laws of thermodynamics. The past decade has seen several constraint-based methods that interpret this constraint in different ways, including those that are limited to small networks, rely on predefined reaction directions, and/or neglect the relationship between reaction free energies and metabolite concentrations. In this work, we utilize one such approach, thermodynamics-based metabolic flux analysis (TMFA), to make genome-scale, quantitative predictions about metabolite concentrations and reaction free energies in the absence of prior knowledge of reaction directions, while accounting for uncertainties in thermodynamic estimates. We applied TMFA to a genome-scale network reconstruction of Escherichia coli and examined the effect of thermodynamic constraints on the flux space. We also assessed the predictive performance of TMFA against gene essentiality and quantitative metabolomics data, under both aerobic and anaerobic, and optimal and suboptimal growth conditions. Based on these results, we propose that TMFA is a useful tool for validating phenotypes and generating hypotheses, and that additional types of data and constraints can improve predictions of metabolite concentrations. PMID:23870272

  5. Computational complexity of algorithms for sequence comparison, short-read assembly and genome alignment.

    PubMed

    Baichoo, Shakuntala; Ouzounis, Christos A

    A multitude of algorithms for sequence comparison, short-read assembly and whole-genome alignment have been developed in the general context of molecular biology, to support technology development for high-throughput sequencing, numerous applications in genome biology and fundamental research on comparative genomics. The computational complexity of these algorithms has been previously reported in original research papers, yet this often neglected property has not been reviewed previously in a systematic manner and for a wider audience. We provide a review of space and time complexity of key sequence analysis algorithms and highlight their properties in a comprehensive manner, in order to identify potential opportunities for further research in algorithm or data structure optimization. The complexity aspect is poised to become pivotal as we will be facing challenges related to the continuous increase of genomic data on unprecedented scales and complexity in the foreseeable future, when robust biological simulation at the cell level and above becomes a reality. Copyright © 2017 Elsevier B.V. All rights reserved.

  6. Molecular epidemiology of Epizootic haematopoietic necrosis virus (EHNV).

    PubMed

    Hick, Paul M; Subramaniam, Kuttichantran; Thompson, Patrick M; Waltzek, Thomas B; Becker, Joy A; Whittington, Richard J

    2017-11-01

    Low genetic diversity of Epizootic haematopoietic necrosis virus (EHNV) was determined for the complete genome of 16 isolates spanning the natural range of hosts, geography and time since the first outbreaks of disease. Genomes ranged from 125,591-127,487 nucleotides with 97.47% pairwise identity and 106-109 genes. All isolates shared 101 core genes with 121 potential genes predicted within the pan-genome of this collection. There was high conservation within 90,181 nucleotides of the core genes with isolates separated by average genetic distance of 3.43 × 10 -4 substitutions per site. Evolutionary analysis of the core genome strongly supported historical epidemiological evidence of iatrogenic spread of EHNV to naïve hosts and establishment of endemic status in discrete ecological niches. There was no evidence of structural genome reorganization, however, the complement of non-core genes and variation in repeat elements enabled fine scale molecular epidemiological investigation of this unpredictable pathogen of fish. Copyright © 2017 Elsevier Inc. All rights reserved.

  7. Piggy: a rapid, large-scale pan-genome analysis tool for intergenic regions in bacteria.

    PubMed

    Thorpe, Harry A; Bayliss, Sion C; Sheppard, Samuel K; Feil, Edward J

    2018-04-01

    The concept of the "pan-genome," which refers to the total complement of genes within a given sample or species, is well established in bacterial genomics. Rapid and scalable pipelines are available for managing and interpreting pan-genomes from large batches of annotated assemblies. However, despite overwhelming evidence that variation in intergenic regions in bacteria can directly influence phenotypes, most current approaches for analyzing pan-genomes focus exclusively on protein-coding sequences. To address this we present Piggy, a novel pipeline that emulates Roary except that it is based only on intergenic regions. A key utility provided by Piggy is the detection of highly divergent ("switched") intergenic regions (IGRs) upstream of genes. We demonstrate the use of Piggy on large datasets of clinically important lineages of Staphylococcus aureus and Escherichia coli. For S. aureus, we show that highly divergent (switched) IGRs are associated with differences in gene expression and we establish a multilocus reference database of IGR alleles (igMLST; implemented in BIGSdb).

  8. Large-scale contamination of microbial isolate genomes by Illumina PhiX control.

    PubMed

    Mukherjee, Supratim; Huntemann, Marcel; Ivanova, Natalia; Kyrpides, Nikos C; Pati, Amrita

    2015-01-01

    With the rapid growth and development of sequencing technologies, genomes have become the new go-to for exploring solutions to some of the world's biggest challenges such as searching for alternative energy sources and exploration of genomic dark matter. However, progress in sequencing has been accompanied by its share of errors that can occur during template or library preparation, sequencing, imaging or data analysis. In this study we screened over 18,000 publicly available microbial isolate genome sequences in the Integrated Microbial Genomes database and identified more than 1000 genomes that are contaminated with PhiX, a control frequently used during Illumina sequencing runs. Approximately 10% of these genomes have been published in literature and 129 contaminated genomes were sequenced under the Human Microbiome Project. Raw sequence reads are prone to contamination from various sources and are usually eliminated during downstream quality control steps. Detection of PhiX contaminated genomes indicates a lapse in either the application or effectiveness of proper quality control measures. The presence of PhiX contamination in several publicly available isolate genomes can result in additional errors when such data are used in comparative genomics analyses. Such contamination of public databases have far-reaching consequences in the form of erroneous data interpretation and analyses, and necessitates better measures to proofread raw sequences before releasing them to the broader scientific community.

  9. Genome-Based Taxonomic Classification of Bacteroidetes

    DOE PAGES

    Hahnke, Richard L.; Meier-Kolthoff, Jan P.; García-López, Marina; ...

    2016-12-20

    The bacterial phylum Bacteroidetes, characterized by a distinct gliding motility, occurs in a broad variety of ecosystems, habitats, life styles, and physiologies. Accordingly, taxonomic classification of the phylum, based on a limited number of features, proved difficult and controversial in the past, for example, when decisions were based on unresolved phylogenetic trees of the 16S rRNA gene sequence. Here we use a large collection of type-strain genomes from Bacteroidetes and closely related phyla for assessing their taxonomy based on the principles of phylogenetic classification and trees inferred from genome-scale data. No significant conflict between 16S rRNA gene and whole-genome phylogeneticmore » analysis is found, whereas many but not all of the involved taxa are supported as monophyletic groups, particularly in the genome-scale trees. Phenotypic and phylogenomic features support the separation of Balneolaceae as new phylum Balneolaeota from Rhodothermaeota and of Saprospiraceae as new class Saprospiria from Chitinophagia. Epilithonimonas is nested within the older genus Chryseobacterium and without significant phenotypic differences; thus merging the two genera is proposed. Similarly, Vitellibacter is proposed to be included in Aequorivita. Flexibacter is confirmed as being heterogeneous and dissected, yielding six distinct genera. Hallella seregens is a later heterotypic synonym of Prevotella dentalis. Compared to values directly calculated from genome sequences, the G+C content mentioned in many species descriptions is too imprecise; moreover, corrected G+C content values have a significantly better fit to the phylogeny. Corresponding emendations of species descriptions are provided where necessary. Whereas most observed conflict with the current classification of Bacteroidetes is already visible in 16S rRNA gene trees, as expected whole-genome phylogenies are much better resolved.« less

  10. Genome-Based Taxonomic Classification of Bacteroidetes

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hahnke, Richard L.; Meier-Kolthoff, Jan P.; García-López, Marina

    The bacterial phylum Bacteroidetes, characterized by a distinct gliding motility, occurs in a broad variety of ecosystems, habitats, life styles, and physiologies. Accordingly, taxonomic classification of the phylum, based on a limited number of features, proved difficult and controversial in the past, for example, when decisions were based on unresolved phylogenetic trees of the 16S rRNA gene sequence. Here we use a large collection of type-strain genomes from Bacteroidetes and closely related phyla for assessing their taxonomy based on the principles of phylogenetic classification and trees inferred from genome-scale data. No significant conflict between 16S rRNA gene and whole-genome phylogeneticmore » analysis is found, whereas many but not all of the involved taxa are supported as monophyletic groups, particularly in the genome-scale trees. Phenotypic and phylogenomic features support the separation of Balneolaceae as new phylum Balneolaeota from Rhodothermaeota and of Saprospiraceae as new class Saprospiria from Chitinophagia. Epilithonimonas is nested within the older genus Chryseobacterium and without significant phenotypic differences; thus merging the two genera is proposed. Similarly, Vitellibacter is proposed to be included in Aequorivita. Flexibacter is confirmed as being heterogeneous and dissected, yielding six distinct genera. Hallella seregens is a later heterotypic synonym of Prevotella dentalis. Compared to values directly calculated from genome sequences, the G+C content mentioned in many species descriptions is too imprecise; moreover, corrected G+C content values have a significantly better fit to the phylogeny. Corresponding emendations of species descriptions are provided where necessary. Whereas most observed conflict with the current classification of Bacteroidetes is already visible in 16S rRNA gene trees, as expected whole-genome phylogenies are much better resolved.« less

  11. The Peach v2.0 release: high-resolution linkage mapping and deep resequencing improve chromosome-scale assembly and contiguity.

    PubMed

    Verde, Ignazio; Jenkins, Jerry; Dondini, Luca; Micali, Sabrina; Pagliarani, Giulia; Vendramin, Elisa; Paris, Roberta; Aramini, Valeria; Gazza, Laura; Rossini, Laura; Bassi, Daniele; Troggio, Michela; Shu, Shengqiang; Grimwood, Jane; Tartarini, Stefano; Dettori, Maria Teresa; Schmutz, Jeremy

    2017-03-11

    The availability of the peach genome sequence has fostered relevant research in peach and related Prunus species enabling the identification of genes underlying important horticultural traits as well as the development of advanced tools for genetic and genomic analyses. The first release of the peach genome (Peach v1.0) represented a high-quality WGS (Whole Genome Shotgun) chromosome-scale assembly with high contiguity (contig L50 214.2 kb), large portions of mapped sequences (96%) and high base accuracy (99.96%). The aim of this work was to improve the quality of the first assembly by increasing the portion of mapped and oriented sequences, correcting misassemblies and improving the contiguity and base accuracy using high-throughput linkage mapping and deep resequencing approaches. Four linkage maps with 3,576 molecular markers were used to improve the portion of mapped and oriented sequences (from 96.0% and 85.6% of Peach v1.0 to 99.2% and 98.2% of v2.0, respectively) and enabled a more detailed identification of discernible misassemblies (10.4 Mb in total). The deep resequencing approach fixed 859 homozygous SNPs (Single Nucleotide Polymorphisms) and 1347 homozygous indels. Moreover, the assembled NGS contigs enabled the closing of 212 gaps with an improvement in the contig L50 of 19.2%. The improved high quality peach genome assembly (Peach v2.0) represents a valuable tool for the analysis of the genetic diversity, domestication, and as a vehicle for genetic improvement of peach and related Prunus species. Moreover, the important phylogenetic position of peach and the absence of recent whole genome duplication (WGD) events make peach a pivotal species for comparative genomics studies aiming at elucidating plant speciation and diversification processes.

  12. Genome-Based Taxonomic Classification of Bacteroidetes

    PubMed Central

    Hahnke, Richard L.; Meier-Kolthoff, Jan P.; García-López, Marina; Mukherjee, Supratim; Huntemann, Marcel; Ivanova, Natalia N.; Woyke, Tanja; Kyrpides, Nikos C.; Klenk, Hans-Peter; Göker, Markus

    2016-01-01

    The bacterial phylum Bacteroidetes, characterized by a distinct gliding motility, occurs in a broad variety of ecosystems, habitats, life styles, and physiologies. Accordingly, taxonomic classification of the phylum, based on a limited number of features, proved difficult and controversial in the past, for example, when decisions were based on unresolved phylogenetic trees of the 16S rRNA gene sequence. Here we use a large collection of type-strain genomes from Bacteroidetes and closely related phyla for assessing their taxonomy based on the principles of phylogenetic classification and trees inferred from genome-scale data. No significant conflict between 16S rRNA gene and whole-genome phylogenetic analysis is found, whereas many but not all of the involved taxa are supported as monophyletic groups, particularly in the genome-scale trees. Phenotypic and phylogenomic features support the separation of Balneolaceae as new phylum Balneolaeota from Rhodothermaeota and of Saprospiraceae as new class Saprospiria from Chitinophagia. Epilithonimonas is nested within the older genus Chryseobacterium and without significant phenotypic differences; thus merging the two genera is proposed. Similarly, Vitellibacter is proposed to be included in Aequorivita. Flexibacter is confirmed as being heterogeneous and dissected, yielding six distinct genera. Hallella seregens is a later heterotypic synonym of Prevotella dentalis. Compared to values directly calculated from genome sequences, the G+C content mentioned in many species descriptions is too imprecise; moreover, corrected G+C content values have a significantly better fit to the phylogeny. Corresponding emendations of species descriptions are provided where necessary. Whereas most observed conflict with the current classification of Bacteroidetes is already visible in 16S rRNA gene trees, as expected whole-genome phylogenies are much better resolved. PMID:28066339

  13. In-silico Taxonomic Classification of 373 Genomes Reveals Species Misidentification and New Genospecies within the Genus Pseudomonas.

    PubMed

    Tran, Phuong N; Savka, Michael A; Gan, Han Ming

    2017-01-01

    The genus Pseudomonas has one of the largest diversity of species within the Bacteria kingdom. To date, its taxonomy is still being revised and updated. Due to the non-standardized procedure and ambiguous thresholds at species level, largely based on 16S rRNA gene or conventional biochemical assay, species identification of publicly available Pseudomonas genomes remains questionable. In this study, we performed a large-scale analysis of all Pseudomonas genomes with species designation (excluding the well-defined P. aeruginosa ) and re-evaluated their taxonomic assignment via in silico genome-genome hybridization and/or genetic comparison with valid type species. Three-hundred and seventy-three pseudomonad genomes were analyzed and subsequently clustered into 145 distinct genospecies. We detected 207 erroneous labels and corrected 43 to the proper species based on Average Nucleotide Identity Multilocus Sequence Typing (MLST) sequence similarity to the type strain. Surprisingly, more than half of the genomes initially designated as Pseudomonas syringae and Pseudomonas fluorescens should be classified either to a previously described species or to a new genospecies. Notably, high pairwise average nucleotide identity (>95%) indicating species-level similarity was observed between P. synxantha-P. libanensis, P. psychrotolerans - P. oryzihabitans , and P. kilonensis- P. brassicacearum , that were previously differentiated based on conventional biochemical tests and/or genome-genome hybridization techniques.

  14. An object model and database for functional genomics.

    PubMed

    Jones, Andrew; Hunt, Ela; Wastling, Jonathan M; Pizarro, Angel; Stoeckert, Christian J

    2004-07-10

    Large-scale functional genomics analysis is now feasible and presents significant challenges in data analysis, storage and querying. Data standards are required to enable the development of public data repositories and to improve data sharing. There is an established data format for microarrays (microarray gene expression markup language, MAGE-ML) and a draft standard for proteomics (PEDRo). We believe that all types of functional genomics experiments should be annotated in a consistent manner, and we hope to open up new ways of comparing multiple datasets used in functional genomics. We have created a functional genomics experiment object model (FGE-OM), developed from the microarray model, MAGE-OM and two models for proteomics, PEDRo and our own model (Gla-PSI-Glasgow Proposal for the Proteomics Standards Initiative). FGE-OM comprises three namespaces representing (i) the parts of the model common to all functional genomics experiments; (ii) microarray-specific components; and (iii) proteomics-specific components. We believe that FGE-OM should initiate discussion about the contents and structure of the next version of MAGE and the future of proteomics standards. A prototype database called RNA And Protein Abundance Database (RAPAD), based on FGE-OM, has been implemented and populated with data from microbial pathogenesis. FGE-OM and the RAPAD schema are available from http://www.gusdb.org/fge.html, along with a set of more detailed diagrams. RAPAD can be accessed by registration at the site.

  15. Structure and variation of the mitochondrial genome of fishes.

    PubMed

    Satoh, Takashi P; Miya, Masaki; Mabuchi, Kohji; Nishida, Mutsumi

    2016-09-07

    The mitochondrial (mt) genome has been used as an effective tool for phylogenetic and population genetic analyses in vertebrates. However, the structure and variability of the vertebrate mt genome are not well understood. A potential strategy for improving our understanding is to conduct a comprehensive comparative study of large mt genome data. The aim of this study was to characterize the structure and variability of the fish mt genome through comparative analysis of large datasets. An analysis of the secondary structure of proteins for 250 fish species (248 ray-finned and 2 cartilaginous fishes) illustrated that cytochrome c oxidase subunits (COI, COII, and COIII) and a cytochrome bc1 complex subunit (Cyt b) had substantial amino acid conservation. Among the four proteins, COI was the most conserved, as more than half of all amino acid sites were invariable among the 250 species. Our models identified 43 and 58 stems within 12S rRNA and 16S rRNA, respectively, with larger numbers than proposed previously for vertebrates. The models also identified 149 and 319 invariable sites in 12S rRNA and 16S rRNA, respectively, in all fishes. In particular, the present result verified that a region corresponding to the peptidyl transferase center in prokaryotic 23S rRNA, which is homologous to mt 16S rRNA, is also conserved in fish mt 16S rRNA. Concerning the gene order, we found 35 variations (in 32 families) that deviated from the common gene order in vertebrates. These gene rearrangements were mostly observed in the area spanning the ND5 gene to the control region as well as two tRNA gene cluster regions (IQM and WANCY regions). Although many of such gene rearrangements were unique to a specific taxon, some were shared polyphyletically between distantly related species. Through a large-scale comparative analysis of 250 fish species mt genomes, we elucidated various structural aspects of the fish mt genome and the encoded genes. The present results will be important for understanding functions of the mt genome and developing programs for nucleotide sequence analysis. This study demonstrated the significance of extensive comparisons for understanding the structure of the mt genome.

  16. The use of population-scale sequencing to identify CNVs impacting productive traits in different cattle breeds

    USDA-ARS?s Scientific Manuscript database

    Individualized copy number variation (CNV) maps have highlighted the need for population surveys of cattle to detect rare and common variants. While SNP and comparative genomic hybridization (CGH) arrays have provided preliminary data, next-generation sequence (NGS) data analysis offers an increased...

  17. Sequence analysis reveals genomic factors affecting EST-SSR primer performance and polymorphism

    USDA-ARS?s Scientific Manuscript database

    Search for simple sequence repeat (SSR) motifs and design of flanking primers in expressed sequence tag (EST) sequences can be easily done at a large scale using bioinformatics programs. However, failed amplification and/or detection, along with lack of polymorphism, is often seen among randomly sel...

  18. Triticeae Resources in Ensembl Plants

    PubMed Central

    Bolser, Dan M.; Kerhornou, Arnaud; Walts, Brandon; Kersey, Paul

    2015-01-01

    Recent developments in DNA sequencing have enabled the large and complex genomes of many crop species to be determined for the first time, even those previously intractable due to their polyploid nature. Indeed, over the course of the last 2 years, the genome sequences of several commercially important cereals, notably barley and bread wheat, have become available, as well as those of related wild species. While still incomplete, comparison with other, more completely assembled species suggests that coverage of genic regions is likely to be high. Ensembl Plants (http://plants.ensembl.org) is an integrative resource organizing, analyzing and visualizing genome-scale information for important crop and model plants. Available data include reference genome sequence, variant loci, gene models and functional annotation. For variant loci, individual and population genotypes, linkage information and, where available, phenotypic information are shown. Comparative analyses are performed on DNA and protein sequence alignments. The resulting genome alignments and gene trees, representing the implied evolutionary history of the gene family, are made available for visualization and analysis. Driven by the case of bread wheat, specific extensions to the analysis pipelines and web interface have recently been developed to support polyploid genomes. Data in Ensembl Plants is accessible through a genome browser incorporating various specialist interfaces for different data types, and through a variety of additional methods for programmatic access and data mining. These interfaces are consistent with those offered through the Ensembl interface for the genomes of non-plant species, including those of plant pathogens, pests and pollinators, facilitating the study of the plant in its environment. PMID:25432969

  19. Array-Based Comparative Genomic Hybridization for the Genomewide Detection of Submicroscopic Chromosomal Abnormalities

    PubMed Central

    Vissers, Lisenka E. L. M. ; de Vries, Bert B. A. ; Osoegawa, Kazutoyo ; Janssen, Irene M. ; Feuth, Ton ; Choy, Chik On ; Straatman, Huub ; van der Vliet, Walter ; Huys, Erik H. L. P. G. ; van Rijk, Anke ; Smeets, Dominique ; van Ravenswaaij-Arts, Conny M. A. ; Knoers, Nine V. ; van der Burgt, Ineke ; de Jong, Pieter J. ; Brunner, Han G. ; van Kessel, Ad Geurts ; Schoenmakers, Eric F. P. M. ; Veltman, Joris A. 

    2003-01-01

    Microdeletions and microduplications, not visible by routine chromosome analysis, are a major cause of human malformation and mental retardation. Novel high-resolution, whole-genome technologies can improve the diagnostic detection rate of these small chromosomal abnormalities. Array-based comparative genomic hybridization allows such a high-resolution screening by hybridizing differentially labeled test and reference DNAs to arrays consisting of thousands of genomic clones. In this study, we tested the diagnostic capacity of this technology using ∼3,500 flourescent in situ hybridization–verified clones selected to cover the genome with an average of 1 clone per megabase (Mb). The sensitivity and specificity of the technology were tested in normal-versus-normal control experiments and through the screening of patients with known microdeletion syndromes. Subsequently, a series of 20 cytogenetically normal patients with mental retardation and dysmorphisms suggestive of a chromosomal abnormality were analyzed. In this series, three microdeletions and two microduplications were identified and validated. Two of these genomic changes were identified also in one of the parents, indicating that these are large-scale genomic polymorphisms. Deletions and duplications as small as 1 Mb could be reliably detected by our approach. The percentage of false-positive results was reduced to a minimum by use of a dye-swap-replicate analysis, all but eliminating the need for laborious validation experiments and facilitating implementation in a routine diagnostic setting. This high-resolution assay will facilitate the identification of novel genes involved in human mental retardation and/or malformation syndromes and will provide insight into the flexibility and plasticity of the human genome. PMID:14628292

  20. SWPhylo - A Novel Tool for Phylogenomic Inferences by Comparison of Oligonucleotide Patterns and Integration of Genome-Based and Gene-Based Phylogenetic Trees.

    PubMed

    Yu, Xiaoyu; Reva, Oleg N

    2018-01-01

    Modern phylogenetic studies may benefit from the analysis of complete genome sequences of various microorganisms. Evolutionary inferences based on genome-scale analysis are believed to be more accurate than the gene-based alternative. However, the computational complexity of current phylogenomic procedures, inappropriateness of standard phylogenetic tools to process genome-wide data, and lack of reliable substitution models which correlates with alignment-free phylogenomic approaches deter microbiologists from using these opportunities. For example, the super-matrix and super-tree approaches of phylogenomics use multiple integrated genomic loci or individual gene-based trees to infer an overall consensus tree. However, these approaches potentially multiply errors of gene annotation and sequence alignment not mentioning the computational complexity and laboriousness of the methods. In this article, we demonstrate that the annotation- and alignment-free comparison of genome-wide tetranucleotide frequencies, termed oligonucleotide usage patterns (OUPs), allowed a fast and reliable inference of phylogenetic trees. These were congruent to the corresponding whole genome super-matrix trees in terms of tree topology when compared with other known approaches including 16S ribosomal RNA and GyrA protein sequence comparison, complete genome-based MAUVE, and CVTree methods. A Web-based program to perform the alignment-free OUP-based phylogenomic inferences was implemented at http://swphylo.bi.up.ac.za/. Applicability of the tool was tested on different taxa from subspecies to intergeneric levels. Distinguishing between closely related taxonomic units may be enforced by providing the program with alignments of marker protein sequences, eg, GyrA.

  1. SWPhylo – A Novel Tool for Phylogenomic Inferences by Comparison of Oligonucleotide Patterns and Integration of Genome-Based and Gene-Based Phylogenetic Trees

    PubMed Central

    Yu, Xiaoyu; Reva, Oleg N

    2018-01-01

    Modern phylogenetic studies may benefit from the analysis of complete genome sequences of various microorganisms. Evolutionary inferences based on genome-scale analysis are believed to be more accurate than the gene-based alternative. However, the computational complexity of current phylogenomic procedures, inappropriateness of standard phylogenetic tools to process genome-wide data, and lack of reliable substitution models which correlates with alignment-free phylogenomic approaches deter microbiologists from using these opportunities. For example, the super-matrix and super-tree approaches of phylogenomics use multiple integrated genomic loci or individual gene-based trees to infer an overall consensus tree. However, these approaches potentially multiply errors of gene annotation and sequence alignment not mentioning the computational complexity and laboriousness of the methods. In this article, we demonstrate that the annotation- and alignment-free comparison of genome-wide tetranucleotide frequencies, termed oligonucleotide usage patterns (OUPs), allowed a fast and reliable inference of phylogenetic trees. These were congruent to the corresponding whole genome super-matrix trees in terms of tree topology when compared with other known approaches including 16S ribosomal RNA and GyrA protein sequence comparison, complete genome-based MAUVE, and CVTree methods. A Web-based program to perform the alignment-free OUP-based phylogenomic inferences was implemented at http://swphylo.bi.up.ac.za/. Applicability of the tool was tested on different taxa from subspecies to intergeneric levels. Distinguishing between closely related taxonomic units may be enforced by providing the program with alignments of marker protein sequences, eg, GyrA. PMID:29511354

  2. Genome resequencing in Populus: Revealing large-scale genome variation and implications on specialized-trait genomics

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Muchero, Wellington; Labbe, Jessy L; Priya, Ranjan

    2014-01-01

    To date, Populus ranks among a few plant species with a complete genome sequence and other highly developed genomic resources. With the first genome sequence among all tree species, Populus has been adopted as a suitable model organism for genomic studies in trees. However, far from being just a model species, Populus is a key renewable economic resource that plays a significant role in providing raw materials for the biofuel and pulp and paper industries. Therefore, aside from leading frontiers of basic tree molecular biology and ecological research, Populus leads frontiers in addressing global economic challenges related to fuel andmore » fiber production. The latter fact suggests that research aimed at improving quality and quantity of Populus as a raw material will likely drive the pursuit of more targeted and deeper research in order to unlock the economic potential tied in molecular biology processes that drive this tree species. Advances in genome sequence-driven technologies, such as resequencing individual genotypes, which in turn facilitates large scale SNP discovery and identification of large scale polymorphisms are key determinants of future success in these initiatives. In this treatise we discuss implications of genome sequence-enable technologies on Populus genomic and genetic studies of complex and specialized-traits.« less

  3. Comprehensive reconstruction and in silico analysis of Aspergillus niger genome-scale metabolic network model that accounts for 1210 ORFs.

    PubMed

    Lu, Hongzhong; Cao, Weiqiang; Ouyang, Liming; Xia, Jianye; Huang, Mingzhi; Chu, Ju; Zhuang, Yingping; Zhang, Siliang; Noorman, Henk

    2017-03-01

    Aspergillus niger is one of the most important cell factories for industrial enzymes and organic acids production. A comprehensive genome-scale metabolic network model (GSMM) with high quality is crucial for efficient strain improvement and process optimization. The lack of accurate reaction equations and gene-protein-reaction associations (GPRs) in the current best model of A. niger named GSMM iMA871, however, limits its application scope. To overcome these limitations, we updated the A. niger GSMM by combining the latest genome annotation and literature mining technology. Compared with iMA871, the number of reactions in iHL1210 was increased from 1,380 to 1,764, and the number of unique ORFs from 871 to 1,210. With the aid of our transcriptomics analysis, the existence of 63% ORFs and 68% reactions in iHL1210 can be verified when glucose was used as the only carbon source. Physiological data from chemostat cultivations, 13 C-labeled and molecular experiments from the published literature were further used to check the performance of iHL1210. The average correlation coefficients between the predicted fluxes and estimated fluxes from 13 C-labeling data were sufficiently high (above 0.89) and the prediction of cell growth on most of the reported carbon and nitrogen sources was consistent. Using the updated genome-scale model, we evaluated gene essentiality on synthetic and yeast extract medium, as well as the effects of NADPH supply on glucoamylase production in A. niger. In summary, the new A. niger GSMM iHL1210 contains significant improvements with respect to the metabolic coverage and prediction performance, which paves the way for systematic metabolic engineering of A. niger. Biotechnol. Bioeng. 2017;114: 685-695. © 2016 Wiley Periodicals, Inc. © 2016 Wiley Periodicals, Inc.

  4. Genome-scale engineering of Saccharomyces cerevisiae with single-nucleotide precision.

    PubMed

    Bao, Zehua; HamediRad, Mohammad; Xue, Pu; Xiao, Han; Tasan, Ipek; Chao, Ran; Liang, Jing; Zhao, Huimin

    2018-07-01

    We developed a CRISPR-Cas9- and homology-directed-repair-assisted genome-scale engineering method named CHAnGE that can rapidly output tens of thousands of specific genetic variants in yeast. More than 98% of target sequences were efficiently edited with an average frequency of 82%. We validate the single-nucleotide resolution genome-editing capability of this technology by creating a genome-wide gene disruption collection and apply our method to improve tolerance to growth inhibitors.

  5. Functional regression method for whole genome eQTL epistasis analysis with sequencing data.

    PubMed

    Xu, Kelin; Jin, Li; Xiong, Momiao

    2017-05-18

    Epistasis plays an essential rule in understanding the regulation mechanisms and is an essential component of the genetic architecture of the gene expressions. However, interaction analysis of gene expressions remains fundamentally unexplored due to great computational challenges and data availability. Due to variation in splicing, transcription start sites, polyadenylation sites, post-transcriptional RNA editing across the entire gene, and transcription rates of the cells, RNA-seq measurements generate large expression variability and collectively create the observed position level read count curves. A single number for measuring gene expression which is widely used for microarray measured gene expression analysis is highly unlikely to sufficiently account for large expression variation across the gene. Simultaneously analyzing epistatic architecture using the RNA-seq and whole genome sequencing (WGS) data poses enormous challenges. We develop a nonlinear functional regression model (FRGM) with functional responses where the position-level read counts within a gene are taken as a function of genomic position, and functional predictors where genotype profiles are viewed as a function of genomic position, for epistasis analysis with RNA-seq data. Instead of testing the interaction of all possible pair-wises SNPs, the FRGM takes a gene as a basic unit for epistasis analysis, which tests for the interaction of all possible pairs of genes and use all the information that can be accessed to collectively test interaction between all possible pairs of SNPs within two genome regions. By large-scale simulations, we demonstrate that the proposed FRGM for epistasis analysis can achieve the correct type 1 error and has higher power to detect the interactions between genes than the existing methods. The proposed methods are applied to the RNA-seq and WGS data from the 1000 Genome Project. The numbers of pairs of significantly interacting genes after Bonferroni correction identified using FRGM, RPKM and DESeq were 16,2361, 260 and 51, respectively, from the 350 European samples. The proposed FRGM for epistasis analysis of RNA-seq can capture isoform and position-level information and will have a broad application. Both simulations and real data analysis highlight the potential for the FRGM to be a good choice of the epistatic analysis with sequencing data.

  6. Systems perspectives on erythromycin biosynthesis by comparative genomic and transcriptomic analyses of S. erythraea E3 and NRRL23338 strains

    PubMed Central

    2013-01-01

    Background S. erythraea is a Gram-positive filamentous bacterium used for the industrial-scale production of erythromycin A which is of high clinical importance. In this work, we sequenced the whole genome of a high-producing strain (E3) obtained by random mutagenesis and screening from the wild-type strain NRRL23338, and examined time-series expression profiles of both E3 and NRRL23338. Based on the genomic data and transcriptpmic data of these two strains, we carried out comparative analysis of high-producing strain and wild-type strain at both the genomic level and the transcriptomic level. Results We observed a large number of genetic variants including 60 insertions, 46 deletions and 584 single nucleotide variations (SNV) in E3 in comparison with NRRL23338, and the analysis of time series transcriptomic data indicated that the genes involved in erythromycin biosynthesis and feeder pathways were significantly up-regulated during the 60 hours time-course. According to our data, BldD, a previously identified ery cluster regulator, did not show any positive correlations with the expression of ery cluster, suggesting the existence of alternative regulation mechanisms of erythromycin synthesis in S. erythraea. Several potential regulators were then proposed by integration analysis of genomic and transcriptomic data. Conclusion This is a demonstration of the functional comparative genomics between an industrial S. erythraea strain and the wild-type strain. These findings help to understand the global regulation mechanisms of erythromycin biosynthesis in S. erythraea, providing useful clues for genetic and metabolic engineering in the future. PMID:23902230

  7. Comprehensive Analysis of DNA Methylation Data with RnBeads

    PubMed Central

    Walter, Jörn; Lengauer, Thomas; Bock, Christoph

    2014-01-01

    RnBeads is a software tool for large-scale analysis and interpretation of DNA methylation data, providing a user-friendly analysis workflow that yields detailed hypertext reports (http://rnbeads.mpi-inf.mpg.de). Supported assays include whole genome bisulfite sequencing, reduced representation bisulfite sequencing, Infinium microarrays, and any other protocol that produces high-resolution DNA methylation data. Important applications of RnBeads include the analysis of epigenome-wide association studies and epigenetic biomarker discovery in cancer cohorts. PMID:25262207

  8. Sequencing of Seven Haloarchaeal Genomes Reveals Patterns of Genomic Flux

    PubMed Central

    Lynch, Erin A.; Langille, Morgan G. I.; Darling, Aaron; Wilbanks, Elizabeth G.; Haltiner, Caitlin; Shao, Katie S. Y.; Starr, Michael O.; Teiling, Clotilde; Harkins, Timothy T.; Edwards, Robert A.; Eisen, Jonathan A.; Facciotti, Marc T.

    2012-01-01

    We report the sequencing of seven genomes from two haloarchaeal genera, Haloferax and Haloarcula. Ease of cultivation and the existence of well-developed genetic and biochemical tools for several diverse haloarchaeal species make haloarchaea a model group for the study of archaeal biology. The unique physiological properties of these organisms also make them good candidates for novel enzyme discovery for biotechnological applications. Seven genomes were sequenced to ∼20×coverage and assembled to an average of 50 contigs (range 5 scaffolds - 168 contigs). Comparisons of protein-coding gene compliments revealed large-scale differences in COG functional group enrichment between these genera. Analysis of genes encoding machinery for DNA metabolism reveals genera-specific expansions of the general transcription factor TATA binding protein as well as a history of extensive duplication and horizontal transfer of the proliferating cell nuclear antigen. Insights gained from this study emphasize the importance of haloarchaea for investigation of archaeal biology. PMID:22848480

  9. Conservation genetics and genomics of amphibians and reptiles.

    PubMed

    Shaffer, H Bradley; Gidiş, Müge; McCartney-Melstad, Evan; Neal, Kevin M; Oyamaguchi, Hilton M; Tellez, Marisa; Toffelmier, Erin M

    2015-01-01

    Amphibians and reptiles as a group are often secretive, reach their greatest diversity often in remote tropical regions, and contain some of the most endangered groups of organisms on earth. Particularly in the past decade, genetics and genomics have been instrumental in the conservation biology of these cryptic vertebrates, enabling work ranging from the identification of populations subject to trade and exploitation, to the identification of cryptic lineages harboring critical genetic variation, to the analysis of genes controlling key life history traits. In this review, we highlight some of the most important ways that genetic analyses have brought new insights to the conservation of amphibians and reptiles. Although genomics has only recently emerged as part of this conservation tool kit, several large-scale data sources, including full genomes, expressed sequence tags, and transcriptomes, are providing new opportunities to identify key genes, quantify landscape effects, and manage captive breeding stocks of at-risk species.

  10. Quantitative Tracking of Combinatorially Engineered Populations with Multiplexed Binary Assemblies.

    PubMed

    Zeitoun, Ramsey I; Pines, Gur; Grau, Willliam C; Gill, Ryan T

    2017-04-21

    Advances in synthetic biology and genomics have enabled full-scale genome engineering efforts on laboratory time scales. However, the absence of sufficient approaches for mapping engineered genomes at system-wide scales onto performance has limited the adoption of more sophisticated algorithms for engineering complex biological systems. Here we report on the development and application of a robust approach to quantitatively map combinatorially engineered populations at scales up to several dozen target sites. This approach works by assembling genome engineered sites with cell-specific barcodes into a format compatible with high-throughput sequencing technologies. This approach, called barcoded-TRACE (bTRACE) was applied to assess E. coli populations engineered by recursive multiplex recombineering across both 6-target sites and 31-target sites. The 31-target library was then tracked throughout growth selections in the presence and absence of isopentenol (a potential next-generation biofuel). We also use the resolution of bTRACE to compare the influence of technical and biological noise on genome engineering efforts.

  11. Genome-scale CRISPR-Cas9 knockout screening in human cells.

    PubMed

    Shalem, Ophir; Sanjana, Neville E; Hartenian, Ella; Shi, Xi; Scott, David A; Mikkelson, Tarjei; Heckl, Dirk; Ebert, Benjamin L; Root, David E; Doench, John G; Zhang, Feng

    2014-01-03

    The simplicity of programming the CRISPR (clustered regularly interspaced short palindromic repeats)-associated nuclease Cas9 to modify specific genomic loci suggests a new way to interrogate gene function on a genome-wide scale. We show that lentiviral delivery of a genome-scale CRISPR-Cas9 knockout (GeCKO) library targeting 18,080 genes with 64,751 unique guide sequences enables both negative and positive selection screening in human cells. First, we used the GeCKO library to identify genes essential for cell viability in cancer and pluripotent stem cells. Next, in a melanoma model, we screened for genes whose loss is involved in resistance to vemurafenib, a therapeutic RAF inhibitor. Our highest-ranking candidates include previously validated genes NF1 and MED12, as well as novel hits NF2, CUL3, TADA2B, and TADA1. We observe a high level of consistency between independent guide RNAs targeting the same gene and a high rate of hit confirmation, demonstrating the promise of genome-scale screening with Cas9.

  12. Large-scale genome-wide association studies in East Asians identify new genetic loci influencing metabolic traits.

    PubMed

    Kim, Young Jin; Go, Min Jin; Hu, Cheng; Hong, Chang Bum; Kim, Yun Kyoung; Lee, Ji Young; Hwang, Joo-Yeon; Oh, Ji Hee; Kim, Dong-Joon; Kim, Nam Hee; Kim, Soeui; Hong, Eun Jung; Kim, Ji-Hyun; Min, Haesook; Kim, Yeonjung; Zhang, Rong; Jia, Weiping; Okada, Yukinori; Takahashi, Atsushi; Kubo, Michiaki; Tanaka, Toshihiro; Kamatani, Naoyuki; Matsuda, Koichi; Park, Taesung; Oh, Bermseok; Kimm, Kuchan; Kang, Daehee; Shin, Chol; Cho, Nam H; Kim, Hyung-Lae; Han, Bok-Ghee; Lee, Jong-Young; Cho, Yoon Shin

    2011-09-11

    To identify the genetic bases for nine metabolic traits, we conducted a meta-analysis combining Korean genome-wide association results from the KARE project (n = 8,842) and the HEXA shared control study (n = 3,703). We verified the associations of the loci selected from the discovery meta-analysis in the replication stage (30,395 individuals from the BioBank Japan genome-wide association study and individuals comprising the Health2 and Shanghai Jiao Tong University Diabetes cohorts). We identified ten genome-wide significant signals newly associated with traits from an overall meta-analysis. The most compelling associations involved 12q24.11 (near MYL2) and 12q24.13 (in C12orf51) for high-density lipoprotein cholesterol, 2p21 (near SIX2-SIX3) for fasting plasma glucose, 19q13.33 (in RPS11) and 6q22.33 (in RSPO3) for renal traits, and 12q24.11 (near MYL2), 12q24.13 (in C12orf51 and near OAS1), 4q31.22 (in ZNF827) and 7q11.23 (near TBL2-BCL7B) for hepatic traits. These findings highlight previously unknown biological pathways for metabolic traits investigated in this study.

  13. Genomic insights into the taxonomic status of the Bacillus cereus group

    PubMed Central

    Liu, Yang; Lai, Qiliang; Göker, Markus; Meier-Kolthoff, Jan P.; Wang, Meng; Sun, Yamin; Wang, Lei; Shao, Zongze

    2015-01-01

    The identification and phylogenetic relationships of bacteria within the Bacillus cereus group are controversial. This study aimed at determining the taxonomic affiliations of these strains using the whole-genome sequence-based Genome BLAST Distance Phylogeny (GBDP) approach. The GBDP analysis clearly separated 224 strains into 30 clusters, representing eleven known, partially merged species and accordingly 19–20 putative novel species. Additionally, 16S rRNA gene analysis, a novel variant of multi-locus sequence analysis (nMLSA) and screening of virulence genes were performed. The 16S rRNA gene sequence was not sufficient to differentiate the bacteria within this group due to its high conservation. The nMLSA results were consistent with GBDP. Moreover, a fast typing method was proposed using the pycA gene, and where necessary, the ccpA gene. The pXO plasmids and cry genes were widely distributed, suggesting little correlation with the phylogenetic positions of the host bacteria. This might explain why classifications based on virulence characteristics proved unsatisfactory in the past. In summary, this is the first large-scale and systematic study of the taxonomic status of the bacteria within the B. cereus group using whole-genome sequences, and is likely to contribute to further insights into their pathogenicity, phylogeny and adaptation to diverse environments. PMID:26373441

  14. The Mobile Element Locator Tool (MELT): population-scale mobile element discovery and biology

    PubMed Central

    Gardner, Eugene J.; Lam, Vincent K.; Harris, Daniel N.; Chuang, Nelson T.; Scott, Emma C.; Pittard, W. Stephen; Mills, Ryan E.; Devine, Scott E.

    2017-01-01

    Mobile element insertions (MEIs) represent ∼25% of all structural variants in human genomes. Moreover, when they disrupt genes, MEIs can influence human traits and diseases. Therefore, MEIs should be fully discovered along with other forms of genetic variation in whole genome sequencing (WGS) projects involving population genetics, human diseases, and clinical genomics. Here, we describe the Mobile Element Locator Tool (MELT), which was developed as part of the 1000 Genomes Project to perform MEI discovery on a population scale. Using both Illumina WGS data and simulations, we demonstrate that MELT outperforms existing MEI discovery tools in terms of speed, scalability, specificity, and sensitivity, while also detecting a broader spectrum of MEI-associated features. Several run modes were developed to perform MEI discovery on local and cloud systems. In addition to using MELT to discover MEIs in modern humans as part of the 1000 Genomes Project, we also used it to discover MEIs in chimpanzees and ancient (Neanderthal and Denisovan) hominids. We detected diverse patterns of MEI stratification across these populations that likely were caused by (1) diverse rates of MEI production from source elements, (2) diverse patterns of MEI inheritance, and (3) the introgression of ancient MEIs into modern human genomes. Overall, our study provides the most comprehensive map of MEIs to date spanning chimpanzees, ancient hominids, and modern humans and reveals new aspects of MEI biology in these lineages. We also demonstrate that MELT is a robust platform for MEI discovery and analysis in a variety of experimental settings. PMID:28855259

  15. Non-equivalent contributions of maternal and paternal genomes to early plant embryogenesis.

    PubMed

    Del Toro-De León, Gerardo; García-Aguilar, Marcelina; Gillmor, C Stewart

    2014-10-30

    Zygotic genome activation in metazoans typically occurs several hours to a day after fertilization, and thus maternal RNAs and proteins drive early animal embryo development. In plants, despite several molecular studies of post-fertilization transcriptional activation, the timing of zygotic genome activation remains a matter of debate. For example, two recent reports that used different hybrid ecotype combinations for RNA sequence profiling of early Arabidopsis embryo transcriptomes came to divergent conclusions. One identified paternal contributions that varied by gene, but with overall maternal dominance, while the other found that the maternal and paternal genomes are transcriptionally equivalent. Here we assess paternal gene activation functionally in an isogenic background, by performing a large-scale genetic analysis of 49 EMBRYO DEFECTIVE genes and testing the ability of wild-type paternal alleles to complement phenotypes conditioned by mutant maternal alleles. Our results demonstrate that wild-type paternal alleles for nine of these genes are completely functional 2 days after pollination, with the remaining 40 genes showing partial activity beginning at 2, 3 or 5 days after pollination. Using our functional assay, we also demonstrate that different hybrid combinations exhibit significant variation in paternal allele activation, reconciling the apparently contradictory results of previous transcriptional studies. The variation in timing of gene function that we observe confirms that paternal genome activation does not occur in one early discrete step, provides large-scale functional evidence that maternal and paternal genomes make non-equivalent contributions to early plant embryogenesis, and uncovers an unexpectedly profound effect of hybrid genetic backgrounds on paternal gene activity.

  16. Genome-wide association analysis of more than 120,000 individuals identifies 15 new susceptibility loci for breast cancer

    PubMed Central

    Michailidou, Kyriaki; Beesley, Jonathan; Lindstrom, Sara; Canisius, Sander; Dennis, Joe; Lush, Michael; Maranian, Mel J; Bolla, Manjeet K; Wang, Qin; Shah, Mitul; Perkins, Barbara J; Czene, Kamila; Eriksson, Mikael; Darabi, Hatef; Brand, Judith S; Bojesen, Stig E; Nordestgaard, Børge G; Flyger, Henrik; Nielsen, Sune F; Rahman, Nazneen; Turnbull, Clare; Fletcher, Olivia; Peto, Julian; Gibson, Lorna; dos-Santos-Silva, Isabel; Chang-Claude, Jenny; Flesch-Janys, Dieter; Rudolph, Anja; Eilber, Ursula; Behrens, Sabine; Nevanlinna, Heli; Muranen, Taru A; Aittomäki, Kristiina; Blomqvist, Carl; Khan, Sofia; Aaltonen, Kirsimari; Ahsan, Habibul; Kibriya, Muhammad G; Whittemore, Alice S; John, Esther M; Malone, Kathleen E; Gammon, Marilie D; Santella, Regina M; Ursin, Giske; Makalic, Enes; Schmidt, Daniel F; Casey, Graham; Hunter, David J; Gapstur, Susan M; Gaudet, Mia M; Diver, W Ryan; Haiman, Christopher A; Schumacher, Fredrick; Henderson, Brian E; Le Marchand, Loic; Berg, Christine D; Chanock, Stephen; Figueroa, Jonine; Hoover, Robert N; Lambrechts, Diether; Neven, Patrick; Wildiers, Hans; van Limbergen, Erik; Schmidt, Marjanka K; Broeks, Annegien; Verhoef, Senno; Cornelissen, Sten; Couch, Fergus J; Olson, Janet E; Hallberg, Emily; Vachon, Celine; Waisfisz, Quinten; Meijers-Heijboer, Hanne; Adank, Muriel A; van der Luijt, Rob B; Li, Jingmei; Liu, Jianjun; Humphreys, Keith; Kang, Daehee; Choi, Ji-Yeob; Park, Sue K; Yoo, Keun-Young; Matsuo, Keitaro; Ito, Hidemi; Iwata, Hiroji; Tajima, Kazuo; Guénel, Pascal; Truong, Thérèse; Mulot, Claire; Sanchez, Marie; Burwinkel, Barbara; Marme, Frederik; Surowy, Harald; Sohn, Christof; Wu, Anna H; Tseng, Chiu-chen; Van Den Berg, David; Stram, Daniel O; González-Neira, Anna; Benitez, Javier; Zamora, M Pilar; Perez, Jose Ignacio Arias; Shu, Xiao-Ou; Lu, Wei; Gao, Yu-Tang; Cai, Hui; Cox, Angela; Cross, Simon S; Reed, Malcolm WR; Andrulis, Irene L; Knight, Julia A; Glendon, Gord; Mulligan, Anna Marie; Sawyer, Elinor J; Tomlinson, Ian; Kerin, Michael J; Miller, Nicola; Lindblom, Annika; Margolin, Sara; Teo, Soo Hwang; Yip, Cheng Har; Taib, Nur Aishah Mohd; TAN, Gie-Hooi; Hooning, Maartje J; Hollestelle, Antoinette; Martens, John WM; Collée, J Margriet; Blot, William; Signorello, Lisa B; Cai, Qiuyin; Hopper, John L; Southey, Melissa C; Tsimiklis, Helen; Apicella, Carmel; Shen, Chen-Yang; Hsiung, Chia-Ni; Wu, Pei-Ei; Hou, Ming-Feng; Kristensen, Vessela N; Nord, Silje; Alnaes, Grethe I Grenaker; Giles, Graham G; Milne, Roger L; McLean, Catriona; Canzian, Federico; Trichopoulos, Dmitrios; Peeters, Petra; Lund, Eiliv; Sund, Malin; Khaw, Kay-Tee; Gunter, Marc J; Palli, Domenico; Mortensen, Lotte Maxild; Dossus, Laure; Huerta, Jose-Maria; Meindl, Alfons; Schmutzler, Rita K; Sutter, Christian; Yang, Rongxi; Muir, Kenneth; Lophatananon, Artitaya; Stewart-Brown, Sarah; Siriwanarangsan, Pornthep; Hartman, Mikael; Miao, Hui; Chia, Kee Seng; Chan, Ching Wan; Fasching, Peter A; Hein, Alexander; Beckmann, Matthias W; Haeberle, Lothar; Brenner, Hermann; Dieffenbach, Aida Karina; Arndt, Volker; Stegmaier, Christa; Ashworth, Alan; Orr, Nick; Schoemaker, Minouk J; Swerdlow, Anthony J; Brinton, Louise; Garcia-Closas, Montserrat; Zheng, Wei; Halverson, Sandra L; Shrubsole, Martha; Long, Jirong; Goldberg, Mark S; Labrèche, France; Dumont, Martine; Winqvist, Robert; Pylkäs, Katri; Jukkola-Vuorinen, Arja; Grip, Mervi; Brauch, Hiltrud; Hamann, Ute; Brüning, Thomas; Radice, Paolo; Peterlongo, Paolo; Manoukian, Siranoush; Bernard, Loris; Bogdanova, Natalia V; Dörk, Thilo; Mannermaa, Arto; Kataja, Vesa; Kosma, Veli-Matti; Hartikainen, Jaana M; Devilee, Peter; Tollenaar, Robert AEM; Seynaeve, Caroline; Van Asperen, Christi J; Jakubowska, Anna; Lubinski, Jan; Jaworska, Katarzyna; Huzarski, Tomasz; Sangrajrang, Suleeporn; Gaborieau, Valerie; Brennan, Paul; McKay, James; Slager, Susan; Toland, Amanda E; Ambrosone, Christine B; Yannoukakos, Drakoulis; Kabisch, Maria; Torres, Diana; Neuhausen, Susan L; Anton-Culver, Hoda; Luccarini, Craig; Baynes, Caroline; Ahmed, Shahana; Healey, Catherine S; Tessier, Daniel C; Vincent, Daniel; Bacot, Francois; Pita, Guillermo; Alonso, M Rosario; Álvarez, Nuria; Herrero, Daniel; Simard, Jacques; Pharoah, Paul PDP; Kraft, Peter; Dunning, Alison M; Chenevix-Trench, Georgia; Hall, Per; Easton, Douglas F

    2015-01-01

    Genome wide association studies (GWAS) and large scale replication studies have identified common variants in 79 loci associated with breast cancer, explaining ~14% of the familial risk of the disease. To identify new susceptibility loci, we performed a meta-analysis of 11 GWAS comprising of 15,748 breast cancer cases and 18,084 controls, and 46,785 cases and 42,892 controls from 41 studies genotyped on a 200K custom array (iCOGS). Analyses were restricted to women of European ancestry. Genotypes for more than 11M SNPs were generated by imputation using the 1000 Genomes Project reference panel. We identified 15 novel loci associated with breast cancer at P<5×10−8. Combining association analysis with ChIP-Seq data in mammary cell lines and ChIA-PET chromatin interaction data in ENCODE, we identified likely target genes in two regions: SETBP1 on 18q12.3 and RNF115 and PDZK1 on 1q21.1. One association appears to be driven by an amino-acid substitution in EXO1. PMID:25751625

  17. Signatures of selection in the three-spined stickleback along a small-scale brackish water - freshwater transition zone.

    PubMed

    Konijnendijk, Nellie; Shikano, Takahito; Daneels, Dorien; Volckaert, Filip A M; Raeymaekers, Joost A M

    2015-09-01

    Local adaptation is often obvious when gene flow is impeded, such as observed at large spatial scales and across strong ecological contrasts. However, it becomes less certain at small scales such as between adjacent populations or across weak ecological contrasts, when gene flow is strong. While studies on genomic adaptation tend to focus on the former, less is known about the genomic targets of natural selection in the latter situation. In this study, we investigate genomic adaptation in populations of the three-spined stickleback Gasterosteus aculeatus L. across a small-scale ecological transition with salinities ranging from brackish to fresh. Adaptation to salinity has been repeatedly demonstrated in this species. A genome scan based on 87 microsatellite markers revealed only few signatures of selection, likely owing to the constraints that homogenizing gene flow puts on adaptive divergence. However, the detected loci appear repeatedly as targets of selection in similar studies of genomic adaptation in the three-spined stickleback. We conclude that the signature of genomic selection in the face of strong gene flow is weak, yet detectable. We argue that the range of studies of genomic divergence should be extended to include more systems characterized by limited geographical and ecological isolation, which is often a realistic setting in nature.

  18. When Whole-Genome Alignments Just Won't Work: kSNP v2 Software for Alignment-Free SNP Discovery and Phylogenetics of Hundreds of Microbial Genomes

    PubMed Central

    Gardner, Shea N.; Hall, Barry G.

    2013-01-01

    Effective use of rapid and inexpensive whole genome sequencing for microbes requires fast, memory efficient bioinformatics tools for sequence comparison. The kSNP v2 software finds single nucleotide polymorphisms (SNPs) in whole genome data. kSNP v2 has numerous improvements over kSNP v1 including SNP gene annotation; better scaling for draft genomes available as assembled contigs or raw, unassembled reads; a tool to identify the optimal value of k; distribution of packages of executables for Linux and Mac OS X for ease of installation and user-friendly use; and a detailed User Guide. SNP discovery is based on k-mer analysis, and requires no multiple sequence alignment or the selection of a single reference genome. Most target sets with hundreds of genomes complete in minutes to hours. SNP phylogenies are built by maximum likelihood, parsimony, and distance, based on all SNPs, only core SNPs, or SNPs present in some intermediate user-specified fraction of targets. The SNP-based trees that result are consistent with known taxonomy. kSNP v2 can handle many gigabases of sequence in a single run, and if one or more annotated genomes are included in the target set, SNPs are annotated with protein coding and other information (UTRs, etc.) from Genbank file(s). We demonstrate application of kSNP v2 on sets of viral and bacterial genomes, and discuss in detail analysis of a set of 68 finished E. coli and Shigella genomes and a set of the same genomes to which have been added 47 assemblies and four “raw read” genomes of H104:H4 strains from the recent European E. coli outbreak that resulted in both bloody diarrhea and hemolytic uremic syndrome (HUS), and caused at least 50 deaths. PMID:24349125

  19. When whole-genome alignments just won't work: kSNP v2 software for alignment-free SNP discovery and phylogenetics of hundreds of microbial genomes.

    PubMed

    Gardner, Shea N; Hall, Barry G

    2013-01-01

    Effective use of rapid and inexpensive whole genome sequencing for microbes requires fast, memory efficient bioinformatics tools for sequence comparison. The kSNP v2 software finds single nucleotide polymorphisms (SNPs) in whole genome data. kSNP v2 has numerous improvements over kSNP v1 including SNP gene annotation; better scaling for draft genomes available as assembled contigs or raw, unassembled reads; a tool to identify the optimal value of k; distribution of packages of executables for Linux and Mac OS X for ease of installation and user-friendly use; and a detailed User Guide. SNP discovery is based on k-mer analysis, and requires no multiple sequence alignment or the selection of a single reference genome. Most target sets with hundreds of genomes complete in minutes to hours. SNP phylogenies are built by maximum likelihood, parsimony, and distance, based on all SNPs, only core SNPs, or SNPs present in some intermediate user-specified fraction of targets. The SNP-based trees that result are consistent with known taxonomy. kSNP v2 can handle many gigabases of sequence in a single run, and if one or more annotated genomes are included in the target set, SNPs are annotated with protein coding and other information (UTRs, etc.) from Genbank file(s). We demonstrate application of kSNP v2 on sets of viral and bacterial genomes, and discuss in detail analysis of a set of 68 finished E. coli and Shigella genomes and a set of the same genomes to which have been added 47 assemblies and four "raw read" genomes of H104:H4 strains from the recent European E. coli outbreak that resulted in both bloody diarrhea and hemolytic uremic syndrome (HUS), and caused at least 50 deaths.

  20. Characterization of canine osteosarcoma by array comparative genomic hybridization and RT-qPCR: signatures of genomic imbalance in canine osteosarcoma parallel the human counterpart.

    PubMed

    Angstadt, Andrea Y; Motsinger-Reif, Alison; Thomas, Rachael; Kisseberth, William C; Guillermo Couto, C; Duval, Dawn L; Nielsen, Dahlia M; Modiano, Jaime F; Breen, Matthew

    2011-11-01

    Osteosarcoma (OS) is the most commonly diagnosed malignant bone tumor in humans and dogs, characterized in both species by extremely complex karyotypes exhibiting high frequencies of genomic imbalance. Evaluation of genomic signatures in human OS using array comparative genomic hybridization (aCGH) has assisted in uncovering genetic mechanisms that result in disease phenotype. Previous low-resolution (10-20 Mb) aCGH analysis of canine OS identified a wide range of recurrent DNA copy number aberrations, indicating extensive genomic instability. In this study, we profiled 123 canine OS tumors by 1 Mb-resolution aCGH to generate a dataset for direct comparison with current data for human OS, concluding that several high frequency aberrations in canine and human OS are orthologous. To ensure complete coverage of gene annotation, we identified the human refseq genes that map to these orthologous aberrant dog regions and found several candidate genes warranting evaluation for OS involvement. Specifically, subsequenct FISH and qRT-PCR analysis of RUNX2, TUSC3, and PTEN indicated that expression levels correlated with genomic copy number status, showcasing RUNX2 as an OS associated gene and TUSC3 as a possible tumor suppressor candidate. Together these data demonstrate the ability of genomic comparative oncology to identify genetic abberations which may be important for OS progression. Large scale screening of genomic imbalance in canine OS further validates the use of the dog as a suitable model for human cancers, supporting the idea that dysregulation discovered in canine cancers will provide an avenue for complementary study in human counterparts. Copyright © 2011 Wiley-Liss, Inc.

  1. DNA confinement in nanochannels: physics and biological applications

    NASA Astrophysics Data System (ADS)

    Reisner, Walter; Pedersen, Jonas N.; Austin, Robert H.

    2012-10-01

    DNA is the central storage molecule of genetic information in the cell, and reading that information is a central problem in biology. While sequencing technology has made enormous advances over the past decade, there is growing interest in platforms that can readout genetic information directly from long single DNA molecules, with the ultimate goal of single-cell, single-genome analysis. Such a capability would obviate the need for ensemble averaging over heterogeneous cellular populations and eliminate uncertainties introduced by cloning and molecular amplification steps (thus enabling direct assessment of the genome in its native state). In this review, we will discuss how the information contained in genomic-length single DNA molecules can be accessed via physical confinement in nanochannels. Due to self-avoidance interactions, DNA molecules will stretch out when confined in nanochannels, creating a linear unscrolling of the genome along the channel for analysis. We will first review the fundamental physics of DNA nanochannel confinement—including the effect of varying ionic strength—and then discuss recent applications of these systems to genomic mapping. Apart from the intense biological interest in extracting linear sequence information from elongated DNA molecules, from a physics view these systems are fascinating as they enable probing of single-molecule conformation in environments with dimensions that intersect key physical length-scales in the 1 nm to 100 µm range.

  2. Smooth Muscle Cell Genome Browser: Enabling the Identification of Novel Serum Response Factor Target Genes

    PubMed Central

    Lee, Moon Young; Park, Chanjae; Berent, Robyn M.; Park, Paul J.; Fuchs, Robert; Syn, Hannah; Chin, Albert; Townsend, Jared; Benson, Craig C.; Redelman, Doug; Shen, Tsai-wei; Park, Jong Kun; Miano, Joseph M.; Sanders, Kenton M.; Ro, Seungil

    2015-01-01

    Genome-scale expression data on the absolute numbers of gene isoforms offers essential clues in cellular functions and biological processes. Smooth muscle cells (SMCs) perform a unique contractile function through expression of specific genes controlled by serum response factor (SRF), a transcription factor that binds to DNA sites known as the CArG boxes. To identify SRF-regulated genes specifically expressed in SMCs, we isolated SMC populations from mouse small intestine and colon, obtained their transcriptomes, and constructed an interactive SMC genome and CArGome browser. To our knowledge, this is the first online resource that provides a comprehensive library of all genetic transcripts expressed in primary SMCs. The browser also serves as the first genome-wide map of SRF binding sites. The browser analysis revealed novel SMC-specific transcriptional variants and SRF target genes, which provided new and unique insights into the cellular and biological functions of the cells in gastrointestinal (GI) physiology. The SRF target genes in SMCs, which were discovered in silico, were confirmed by proteomic analysis of SMC-specific Srf knockout mice. Our genome browser offers a new perspective into the alternative expression of genes in the context of SRF binding sites in SMCs and provides a valuable reference for future functional studies. PMID:26241044

  3. DNA confinement in nanochannels: physics and biological applications.

    PubMed

    Reisner, Walter; Pedersen, Jonas N; Austin, Robert H

    2012-10-01

    DNA is the central storage molecule of genetic information in the cell, and reading that information is a central problem in biology. While sequencing technology has made enormous advances over the past decade, there is growing interest in platforms that can readout genetic information directly from long single DNA molecules, with the ultimate goal of single-cell, single-genome analysis. Such a capability would obviate the need for ensemble averaging over heterogeneous cellular populations and eliminate uncertainties introduced by cloning and molecular amplification steps (thus enabling direct assessment of the genome in its native state). In this review, we will discuss how the information contained in genomic-length single DNA molecules can be accessed via physical confinement in nanochannels. Due to self-avoidance interactions, DNA molecules will stretch out when confined in nanochannels, creating a linear unscrolling of the genome along the channel for analysis. We will first review the fundamental physics of DNA nanochannel confinement--including the effect of varying ionic strength--and then discuss recent applications of these systems to genomic mapping. Apart from the intense biological interest in extracting linear sequence information from elongated DNA molecules, from a physics view these systems are fascinating as they enable probing of single-molecule conformation in environments with dimensions that intersect key physical length-scales in the 1 nm to 100 µm range.

  4. Integrative Analysis of Many RNA-Seq Datasets to Study Alternative Splicing

    PubMed Central

    Li, Wenyuan; Dai, Chao; Kang, Shuli; Zhou, Xianghong Jasmine

    2014-01-01

    Alternative splicing is an important gene regulatory mechanism that dramatically increases the complexity of the proteome. However, how alternative splicing is regulated and how transcription and splicing are coordinated are still poorly understood, and functions of transcript isoforms have been studied only in a few limited cases. Nowadays, RNA-seq technology provides an exceptional opportunity to study alternative splicing on genome-wide scales and in an unbiased manner. With the rapid accumulation of data in public repositories, new challenges arise from the urgent need to effectively integrate many different RNA-seq datasets for study alterative splicing. This paper discusses a set of advanced computational methods that can integrate and analyze many RNA-seq datasets to systematically identify splicing modules, unravel the coupling of transcription and splicing, and predict the functions of splicing isoforms on a genome-wide scale. PMID:24583115

  5. LOD significance thresholds for QTL analysis in experimental populations of diploid species

    PubMed

    Van Ooijen JW

    1999-11-01

    Linkage analysis with molecular genetic markers is a very powerful tool in the biological research of quantitative traits. The lack of an easy way to know what areas of the genome can be designated as statistically significant for containing a gene affecting the quantitative trait of interest hampers the important prediction of the rate of false positives. In this paper four tables, obtained by large-scale simulations, are presented that can be used with a simple formula to get the false-positives rate for analyses of the standard types of experimental populations with diploid species with any size of genome. A new definition of the term 'suggestive linkage' is proposed that allows a more objective comparison of results across species.

  6. Analysis of Environmental Stress Factors Using an Artificial Growth System and Plant Fitness Optimization

    PubMed Central

    Lee, Meonghun; Yoe, Hyun

    2015-01-01

    The environment promotes evolution. Evolutionary processes represent environmental adaptations over long time scales; evolution of crop genomes is not inducible within the relatively short time span of a human generation. Extreme environmental conditions can accelerate evolution, but such conditions are often stress inducing and disruptive. Artificial growth systems can be used to induce and select genomic variation by changing external environmental conditions, thus, accelerating evolution. By using cloud computing and big-data analysis, we analyzed environmental stress factors for Pleurotus ostreatus by assessing, evaluating, and predicting information of the growth environment. Through the indexing of environmental stress, the growth environment can be precisely controlled and developed into a technology for improving crop quality and production. PMID:25874206

  7. A phasing and imputation method for pedigreed populations that results in a single-stage genomic evaluation

    PubMed Central

    2012-01-01

    Background Efficient, robust, and accurate genotype imputation algorithms make large-scale application of genomic selection cost effective. An algorithm that imputes alleles or allele probabilities for all animals in the pedigree and for all genotyped single nucleotide polymorphisms (SNP) provides a framework to combine all pedigree, genomic, and phenotypic information into a single-stage genomic evaluation. Methods An algorithm was developed for imputation of genotypes in pedigreed populations that allows imputation for completely ungenotyped animals and for low-density genotyped animals, accommodates a wide variety of pedigree structures for genotyped animals, imputes unmapped SNP, and works for large datasets. The method involves simple phasing rules, long-range phasing and haplotype library imputation and segregation analysis. Results Imputation accuracy was high and computational cost was feasible for datasets with pedigrees of up to 25 000 animals. The resulting single-stage genomic evaluation increased the accuracy of estimated genomic breeding values compared to a scenario in which phenotypes on relatives that were not genotyped were ignored. Conclusions The developed imputation algorithm and software and the resulting single-stage genomic evaluation method provide powerful new ways to exploit imputation and to obtain more accurate genetic evaluations. PMID:22462519

  8. Using Galaxy to Perform Large-Scale Interactive Data Analyses

    PubMed Central

    Hillman-Jackson, Jennifer; Clements, Dave; Blankenberg, Daniel; Taylor, James; Nekrutenko, Anton

    2012-01-01

    Innovations in biomedical research technologies continue to provide experimental biologists with novel and increasingly large genomic and high-throughput data resources to be analyzed. As creating and obtaining data has become easier, the key decision faced by many researchers is a practical one: where and how should an analysis be performed? Datasets are large and analysis tool set-up and use is riddled with complexities outside of the scope of core research activities. The authors believe that Galaxy (galaxyproject.org) provides a powerful solution that simplifies data acquisition and analysis in an intuitive web-application, granting all researchers access to key informatics tools previously only available to computational specialists working in Unix-based environments. We will demonstrate through a series of biomedically relevant protocols how Galaxy specifically brings together 1) data retrieval from public and private sources, for example, UCSC’s Eukaryote and Microbial Genome Browsers (genome.ucsc.edu), 2) custom tools (wrapped Unix functions, format standardization/conversions, interval operations) and 3rd party analysis tools, for example, Bowtie/Tuxedo Suite (bowtie-bio.sourceforge.net), Lastz (www.bx.psu.edu/~rsharris/lastz/), SAMTools (samtools.sourceforge.net), FASTX-toolkit (hannonlab.cshl.edu/fastx_toolkit), and MACS (liulab.dfci.harvard.edu/MACS), and creates results formatted for visualization in tools such as the Galaxy Track Browser (GTB, galaxyproject.org/wiki/Learn/Visualization), UCSC Genome Browser (genome.ucsc.edu), Ensembl (www.ensembl.org), and GeneTrack (genetrack.bx.psu.edu). Galaxy rapidly has become the most popular choice for integrated next generation sequencing (NGS) analytics and collaboration, where users can perform, document, and share complex analysis within a single interface in an unprecedented number of ways. PMID:18428782

  9. Development of a targeted transgenesis strategy in highly differentiated cells: a powerful tool for functional genomic analysis.

    PubMed

    Puttini, Stefania; Ouvrard-Pascaud, Antoine; Palais, Gael; Beggah, Ahmed T; Gascard, Philippe; Cohen-Tannoudji, Michel; Babinet, Charles; Blot-Chabaud, Marcel; Jaisser, Frederic

    2005-03-16

    Functional genomic analysis is a challenging step in the so-called post-genomic field. Identification of potential targets using large-scale gene expression analysis requires functional validation to identify those that are physiologically relevant. Genetically modified cell models are often used for this purpose allowing up- or down-expression of selected targets in a well-defined and if possible highly differentiated cell type. However, the generation of such models remains time-consuming and expensive. In order to alleviate this step, we developed a strategy aimed at the rapid and efficient generation of genetically modified cell lines with conditional, inducible expression of various target genes. Efficient knock-in of various constructs, called targeted transgenesis, in a locus selected for its permissibility to the tet inducible system, was obtained through the stimulation of site-specific homologous recombination by the meganuclease I-SceI. Our results demonstrate that targeted transgenesis in a reference inducible locus greatly facilitated the functional analysis of the selected recombinant cells. The efficient screening strategy we have designed makes possible automation of the transfection and selection steps. Furthermore, this strategy could be applied to a variety of highly differentiated cells.

  10. RNA-Seq analysis of yak ovary: improving yak gene structure information and mining reproduction-related genes.

    PubMed

    Lan, DaoLiang; Xiong, XianRong; Wei, YanLi; Xu, Tong; Zhong, JinCheng; Zhi, XiangDong; Wang, Yong; Li, Jian

    2014-09-01

    RNA-Seq, a high-throughput (HT) sequencing technique, has been used effectively in large-scale transcriptomic studies, and is particularly useful for improving gene structure information and mining of new genes. In this study, RNA-Seq HT technology was employed to analyze the transcriptome of yak ovary. After Illumina-Solexa deep sequencing, 26826516 clean reads with a total of 4828772880 bp were obtained from the ovary library. Alignment analysis showed that 16992 yak genes mapped to the yak genome and 3734 of these genes were involved in alternative splicing. Gene structure refinement analysis showed that 7340 genes that were annotated in the yak genome could be extended at the 5' or 3' ends based on the alignments been the transcripts and the genome sequence. Novel transcript prediction analysis identified 6321 new transcripts with lengths ranging from 180 to 14884 bp, and 2267 of them were predicted to code proteins. BLAST analysis of the new transcripts showed that 1200?4933 mapped to the non-redundant (nr), nucleotide (nt) and/or SwissProt sequence databases. Comparative statistical analysis of the new mapped transcripts showed that the majority of them were similar to genes in Bos taurus (41.4%), Bos grunniens mutus (33.0%), Ovis aries (6.3%), Homo sapiens (2.8%), Mus musculus (1.6%) and other species. Functional analysis showed that these expressed genes were involved in various Gene Ontology (GO) categories and Kyoto Encyclopedia of Genes and Genomes pathways. GO analysis of the new transcripts found that the largest proportion of them was associated with reproduction. The results of this study will provide a basis for describing the normal transcriptome map of yak ovary and for future studies on yak breeding performance. Moreover, the results confirmed that RNA-Seq HT technology is highly advantageous in improving gene structure information and mining of new genes, as well as in providing valuable data to expand the yak genome information.

  11. Discovery, genotyping and characterization of structural variation and novel sequence at single nucleotide resolution from de novo genome assemblies on a population scale.

    PubMed

    Liu, Siyang; Huang, Shujia; Rao, Junhua; Ye, Weijian; Krogh, Anders; Wang, Jun

    2015-01-01

    Comprehensive recognition of genomic variation in one individual is important for understanding disease and developing personalized medication and treatment. Many tools based on DNA re-sequencing exist for identification of single nucleotide polymorphisms, small insertions and deletions (indels) as well as large deletions. However, these approaches consistently display a substantial bias against the recovery of complex structural variants and novel sequence in individual genomes and do not provide interpretation information such as the annotation of ancestral state and formation mechanism. We present a novel approach implemented in a single software package, AsmVar, to discover, genotype and characterize different forms of structural variation and novel sequence from population-scale de novo genome assemblies up to nucleotide resolution. Application of AsmVar to several human de novo genome assemblies captures a wide spectrum of structural variants and novel sequences present in the human population in high sensitivity and specificity. Our method provides a direct solution for investigating structural variants and novel sequences from de novo genome assemblies, facilitating the construction of population-scale pan-genomes. Our study also highlights the usefulness of the de novo assembly strategy for definition of genome structure.

  12. Ensembl genomes 2016: more genomes, more complexity

    USDA-ARS?s Scientific Manuscript database

    Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species, complementing the resources for vertebrate genomics developed in the context of the Ensembl project (http://www.ensembl.org). Together, the two resources provide a consistent...

  13. Integrating genome-wide association studies and gene expression data highlights dysregulated multiple sclerosis risk pathways.

    PubMed

    Liu, Guiyou; Zhang, Fang; Jiang, Yongshuai; Hu, Yang; Gong, Zhongying; Liu, Shoufeng; Chen, Xiuju; Jiang, Qinghua; Hao, Junwei

    2017-02-01

    Much effort has been expended on identifying the genetic determinants of multiple sclerosis (MS). Existing large-scale genome-wide association study (GWAS) datasets provide strong support for using pathway and network-based analysis methods to investigate the mechanisms underlying MS. However, no shared genetic pathways have been identified to date. We hypothesize that shared genetic pathways may indeed exist in different MS-GWAS datasets. Here, we report results from a three-stage analysis of GWAS and expression datasets. In stage 1, we conducted multiple pathway analyses of two MS-GWAS datasets. In stage 2, we performed a candidate pathway analysis of the large-scale MS-GWAS dataset. In stage 3, we performed a pathway analysis using the dysregulated MS gene list from seven human MS case-control expression datasets. In stage 1, we identified 15 shared pathways. In stage 2, we successfully replicated 14 of these 15 significant pathways. In stage 3, we found that dysregulated MS genes were significantly enriched in 10 of 15 MS risk pathways identified in stages 1 and 2. We report shared genetic pathways in different MS-GWAS datasets and highlight some new MS risk pathways. Our findings provide new insights on the genetic determinants of MS.

  14. D-Light on promoters: a client-server system for the analysis and visualization of cis-regulatory elements

    PubMed Central

    2013-01-01

    Background The binding of transcription factors to DNA plays an essential role in the regulation of gene expression. Numerous experiments elucidated binding sequences which subsequently have been used to derive statistical models for predicting potential transcription factor binding sites (TFBS). The rapidly increasing number of genome sequence data requires sophisticated computational approaches to manage and query experimental and predicted TFBS data in the context of other epigenetic factors and across different organisms. Results We have developed D-Light, a novel client-server software package to store and query large amounts of TFBS data for any number of genomes. Users can add small-scale data to the server database and query them in a large scale, genome-wide promoter context. The client is implemented in Java and provides simple graphical user interfaces and data visualization. Here we also performed a statistical analysis showing what a user can expect for certain parameter settings and we illustrate the usage of D-Light with the help of a microarray data set. Conclusions D-Light is an easy to use software tool to integrate, store and query annotation data for promoters. A public D-Light server, the client and server software for local installation and the source code under GNU GPL license are available at http://biwww.che.sbg.ac.at/dlight. PMID:23617301

  15. Whole Genome Analysis of 132 Clinical Saccharomyces cerevisiae Strains Reveals Extensive Ploidy Variation

    PubMed Central

    Zhu, Yuan O.; Sherlock, Gavin; Petrov, Dmitri A.

    2016-01-01

    Budding yeast has undergone several independent transitions from commercial to clinical lifestyles. The frequency of such transitions suggests that clinical yeast strains are derived from environmentally available yeast populations, including commercial sources. However, despite their important role in adaptive evolution, the prevalence of polyploidy and aneuploidy has not been extensively analyzed in clinical strains. In this study, we have looked for patterns governing the transition to clinical invasion in the largest screen of clinical yeast isolates to date. In particular, we have focused on the hypothesis that ploidy changes have influenced adaptive processes. We sequenced 144 yeast strains, 132 of which are clinical isolates. We found pervasive large-scale genomic variation in both overall ploidy (34% of strains identified as 3n/4n) and individual chromosomal copy numbers (36% of strains identified as aneuploid). We also found evidence for the highly dynamic nature of yeast genomes, with 35 strains showing partial chromosomal copy number changes and eight strains showing multiple independent chromosomal events. Intriguingly, a lineage identified to be baker’s/commercial derived with a unique damaging mutation in NDC80 was particularly prone to polyploidy, with 83% of its members being triploid or tetraploid. Polyploidy was in turn associated with a >2× increase in aneuploidy rates as compared to other lineages. This dataset provides a rich source of information on the genomics of clinical yeast strains and highlights the potential importance of large-scale genomic copy variation in yeast adaptation. PMID:27317778

  16. GenomeDiagram: a python package for the visualization of large-scale genomic data.

    PubMed

    Pritchard, Leighton; White, Jennifer A; Birch, Paul R J; Toth, Ian K

    2006-03-01

    We present GenomeDiagram, a flexible, open-source Python module for the visualization of large-scale genomic, comparative genomic and other data with reference to a single chromosome or other biological sequence. GenomeDiagram may be used to generate publication-quality vector graphics, rastered images and in-line streamed graphics for webpages. The package integrates with datatypes from the BioPython project, and is available for Windows, Linux and Mac OS X systems. GenomeDiagram is freely available as source code (under GNU Public License) at http://bioinf.scri.ac.uk/lp/programs.html, and requires Python 2.3 or higher, and recent versions of the ReportLab and BioPython packages. A user manual, example code and images are available at http://bioinf.scri.ac.uk/lp/programs.html.

  17. Constraints on genes shape long-term conservation of macro-synteny in metazoan genomes.

    PubMed

    Lv, Jie; Havlak, Paul; Putnam, Nicholas H

    2011-10-05

    Many metazoan genomes conserve chromosome-scale gene linkage relationships ("macro-synteny") from the common ancestor of multicellular animal life 1234, but the biological explanation for this conservation is still unknown. Double cut and join (DCJ) is a simple, well-studied model of neutral genome evolution amenable to both simulation and mathematical analysis 5, but as we show here, it is not sufficent to explain long-term macro-synteny conservation. We examine a family of simple (one-parameter) extensions of DCJ to identify models and choices of parameters consistent with the levels of macro- and micro-synteny conservation observed among animal genomes. Our software implements a flexible strategy for incorporating genomic context into the DCJ model to incorporate various types of genomic context ("DCJ-[C]"), and is available as open source software from http://github.com/putnamlab/dcj-c. A simple model of genome evolution, in which DCJ moves are allowed only if they maintain chromosomal linkage among a set of constrained genes, can simultaneously account for the level of macro-synteny conservation and for correlated conservation among multiple pairs of species. Simulations under this model indicate that a constraint on approximately 7% of metazoan genes is sufficient to constrain genome rearrangement to an average rate of 25 inversions and 1.7 translocations per million years.

  18. Disease Management in the Genomics Era-Summaries of Focus Issue Papers.

    PubMed

    Klosterman, S J; Rollins, J R; Sudarshana, M R; Vinatzer, B A

    2016-10-01

    The genomics revolution has contributed enormously to research and disease management applications in plant pathology. This development has rapidly increased our understanding of the molecular mechanisms underpinning pathogenesis and resistance, contributed novel markers for rapid pathogen detection and diagnosis, and offered further insights into the genetics of pathogen populations on a larger scale. The availability of whole genome resources coupled with next-generation sequencing (NGS) technologies has helped fuel genomics-based approaches to improve disease resistance in crops. NGS technologies have accelerated the pace at which whole plant and pathogen genomes have become available, and made possible the metagenomic analysis of plant-associated microbial communities. Furthermore, NGS technologies can now be applied routinely and cost effectively to rapidly generate plant and/or pathogen genome or transcriptome marker sequences associated with virulence phenotypes in the pathogen or resistance phenotypes in the plant, potentially leading to improvements in plant disease management. In some systems, investments in plant and pathogen genomics have led to immediate, tangible benefits. This focus issue covers some of the systems. The articles in this focus issue range from overall perspective articles to research articles describing specific genomics applications for detection and control of diseases caused by nematode, viral, bacterial, fungal, and oomycete pathogens. The following are representative short summaries of the articles that appear in this Focus Issue .

  19. Construction of Red Fox Chromosomal Fragments from the Short-Read Genome Assembly.

    PubMed

    Rando, Halie M; Farré, Marta; Robson, Michael P; Won, Naomi B; Johnson, Jennifer L; Buch, Ronak; Bastounes, Estelle R; Xiang, Xueyan; Feng, Shaohong; Liu, Shiping; Xiong, Zijun; Kim, Jaebum; Zhang, Guojie; Trut, Lyudmila N; Larkin, Denis M; Kukekova, Anna V

    2018-06-20

    The genome of a red fox ( Vulpes vulpes ) was recently sequenced and assembled using next-generation sequencing (NGS). The assembly is of high quality, with 94X coverage and a scaffold N50 of 11.8 Mbp, but is split into 676,878 scaffolds, some of which are likely to contain assembly errors. Fragmentation and misassembly hinder accurate gene prediction and downstream analysis such as the identification of loci under selection. Therefore, assembly of the genome into chromosome-scale fragments was an important step towards developing this genomic model. Scaffolds from the assembly were aligned to the dog reference genome and compared to the alignment of an outgroup genome (cat) against the dog to identify syntenic sequences among species. The program Reference-Assisted Chromosome Assembly (RACA) then integrated the comparative alignment with the mapping of the raw sequencing reads generated during assembly against the fox scaffolds. The 128 sequence fragments RACA assembled were compared to the fox meiotic linkage map to guide the construction of 40 chromosomal fragments. This computational approach to assembly was facilitated by prior research in comparative mammalian genomics, and the continued improvement of the red fox genome can in turn offer insight into canid and carnivore chromosome evolution. This assembly is also necessary for advancing genetic research in foxes and other canids.

  20. Genome Calligrapher: A Web Tool for Refactoring Bacterial Genome Sequences for de Novo DNA Synthesis.

    PubMed

    Christen, Matthias; Deutsch, Samuel; Christen, Beat

    2015-08-21

    Recent advances in synthetic biology have resulted in an increasing demand for the de novo synthesis of large-scale DNA constructs. Any process improvement that enables fast and cost-effective streamlining of digitized genetic information into fabricable DNA sequences holds great promise to study, mine, and engineer genomes. Here, we present Genome Calligrapher, a computer-aided design web tool intended for whole genome refactoring of bacterial chromosomes for de novo DNA synthesis. By applying a neutral recoding algorithm, Genome Calligrapher optimizes GC content and removes obstructive DNA features known to interfere with the synthesis of double-stranded DNA and the higher order assembly into large DNA constructs. Subsequent bioinformatics analysis revealed that synthesis constraints are prevalent among bacterial genomes. However, a low level of codon replacement is sufficient for refactoring bacterial genomes into easy-to-synthesize DNA sequences. To test the algorithm, 168 kb of synthetic DNA comprising approximately 20 percent of the synthetic essential genome of the cell-cycle bacterium Caulobacter crescentus was streamlined and then ordered from a commercial supplier of low-cost de novo DNA synthesis. The successful assembly into eight 20 kb segments indicates that Genome Calligrapher algorithm can be efficiently used to refactor difficult-to-synthesize DNA. Genome Calligrapher is broadly applicable to recode biosynthetic pathways, DNA sequences, and whole bacterial genomes, thus offering new opportunities to use synthetic biology tools to explore the functionality of microbial diversity. The Genome Calligrapher web tool can be accessed at https://christenlab.ethz.ch/GenomeCalligrapher  .

Top