Science.gov

Sample records for inferring genomic structural

  1. Structure-based inference of molecular functions of proteins of unknown function from Berkeley Structural Genomics Center

    SciTech Connect

    Kim, Sung-Hou; Shin, Dong Hae; Hou, Jingtong; Chandonia, John-Marc; Das, Debanu; Choi, In-Geol; Kim, Rosalind; Kim, Sung-Hou

    2007-09-02

    Advances in sequence genomics have resulted in an accumulation of a huge number of protein sequences derived from genome sequences. However, the functions of a large portion of them cannot be inferred based on the current methods of sequence homology detection to proteins of known functions. Three-dimensional structure can have an important impact in providing inference of molecular function (physical and chemical function) of a protein of unknown function. Structural genomics centers worldwide have been determining many 3-D structures of the proteins of unknown functions, and possible molecular functions of them have been inferred based on their structures. Combined with bioinformatics and enzymatic assay tools, the successful acceleration of the process of protein structure determination through high throughput pipelines enables the rapid functional annotation of a large fraction of hypothetical proteins. We present a brief summary of the process we used at the Berkeley Structural Genomics Center to infer molecular functions of proteins of unknown function.

  2. Structure-based inference of molecular functions of proteins of unknown function from Berkeley Structural Genomics Center.

    PubMed

    Shin, Dong Hae; Hou, Jingtong; Chandonia, John-Marc; Das, Debanu; Choi, In-Geol; Kim, Rosalind; Kim, Sung-Hou

    2007-09-01

    Advances in sequence genomics have resulted in an accumulation of a huge number of protein sequences derived from genome sequences. However, the functions of a large portion of them cannot be inferred based on the current methods of sequence homology detection to proteins of known functions. Three-dimensional structure can have an important impact in providing inference of molecular function (physical and chemical function) of a protein of unknown function. Structural genomics centers worldwide have been determining many 3-D structures of the proteins of unknown functions, and possible molecular functions of them have been inferred based on their structures. Combined with bioinformatics and enzymatic assay tools, the successful acceleration of the process of protein structure determination through high throughput pipelines enables the rapid functional annotation of a large fraction of hypothetical proteins. We present a brief summary of the process we used at the Berkeley Structural Genomics Center to infer molecular functions of proteins of unknown function.

  3. Gene network inference via structural equation modeling in genetical genomics experiments.

    PubMed

    Liu, Bing; de la Fuente, Alberto; Hoeschele, Ina

    2008-03-01

    Our goal is gene network inference in genetical genomics or systems genetics experiments. For species where sequence information is available, we first perform expression quantitative trait locus (eQTL) mapping by jointly utilizing cis-, cis-trans-, and trans-regulation. After using local structural models to identify regulator-target pairs for each eQTL, we construct an encompassing directed network (EDN) by assembling all retained regulator-target relationships. The EDN has nodes corresponding to expressed genes and eQTL and directed edges from eQTL to cis-regulated target genes, from cis-regulated genes to cis-trans-regulated target genes, from trans-regulator genes to target genes, and from trans-eQTL to target genes. For network inference within the strongly constrained search space defined by the EDN, we propose structural equation modeling (SEM), because it can model cyclic networks and the EDN indeed contains feedback relationships. On the basis of a factorization of the likelihood and the constrained search space, our SEM algorithm infers networks involving several hundred genes and eQTL. Structure inference is based on a penalized likelihood ratio and an adaptation of Occam's window model selection. The SEM algorithm was evaluated using data simulated with nonlinear ordinary differential equations and known cyclic network topologies and was applied to a real yeast data set.

  4. Inferring network structure in non-normal and mixed discrete-continuous genomic data.

    PubMed

    Bhadra, Anindya; Rao, Arvind; Baladandayuthapani, Veerabhadran

    2017-04-24

    Inferring dependence structure through undirected graphs is crucial for uncovering the major modes of multivariate interaction among high-dimensional genomic markers that are potentially associated with cancer. Traditionally, conditional independence has been studied using sparse Gaussian graphical models for continuous data and sparse Ising models for discrete data. However, there are two clear situations when these approaches are inadequate. The first occurs when the data are continuous but display non-normal marginal behavior such as heavy tails or skewness, rendering an assumption of normality inappropriate. The second occurs when a part of the data is ordinal or discrete (e.g., presence or absence of a mutation) and the other part is continuous (e.g., expression levels of genes or proteins). In this case, the existing Bayesian approaches typically employ a latent variable framework for the discrete part that precludes inferring conditional independence among the data that are actually observed. The current article overcomes these two challenges in a unified framework using Gaussian scale mixtures. Our framework is able to handle continuous data that are not normal and data that are of mixed continuous and discrete nature, while still being able to infer a sparse conditional sign independence structure among the observed data. Extensive performance comparison in simulations with alternative techniques and an analysis of a real cancer genomics data set demonstrate the effectiveness of the proposed approach. © 2017, The International Biometric Society.

  5. Inferring gene structures in genomic sequences using pattern recognition and expressed sequence tags

    SciTech Connect

    Xu, Y.; Mural, R.; Uberbacher, E.

    1997-02-01

    Computational methods for gene identification in genomic sequences typically have two phases: coding region prediction and gene parsing. While there are many effective methods for predicting coding regions (exons), parsing the predicted exons into proper gene structures, to a large extent, remains an unsolved problem. This paper presents an algorithm for inferring gene structures from predicted exon candidates, based on Expressed Sequence Tags (ESTs) and biological intuition/rules. The algorithm first finds all the related ESTs in the EST database (dbEST) for each predicted exon, and infers the boundaries of one or a series of genes based on the available EST information and biological rules. Then it constructs gene models within each pair of gene boundaries, that are most consistent with the EST information. By exploiting EST information and biological rules, the algorithm can (1) model complicated multiple gene structures, including embedded genes, (2) identify falsely-predicted exons and locate missed exons, and (3) make more accurate exon boundary predictions. The algorithm has been implemented and tested on long genomic sequences with a number of genes. Test results show that very accurate (predicted) gene models can be expected when related ESTs exist for the predicted exons.
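    The EST-based grouping step described in this abstract can be illustrated with a small sketch. It is not the authors' algorithm: it only assumes that predicted exons matching the same EST belong to the same gene candidate, and it groups exons by connected components over shared ESTs. The function name `group_exons_by_est` and the exon and EST identifiers are hypothetical.

```python
from collections import defaultdict


def group_exons_by_est(exon_to_ests):
    """Group predicted exons into gene candidates.

    Assumption (not from the abstract): two exons belong to the same
    candidate gene if they match a common EST, directly or through a
    chain of shared ESTs. Exons without EST hits stay on their own.
    """
    parent = {exon: exon for exon in exon_to_ests}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_exon_for_est = {}
    for exon, ests in exon_to_ests.items():
        for est in ests:
            if est in first_exon_for_est:
                # Merge this exon's group with the group that already holds the EST.
                parent[find(exon)] = find(first_exon_for_est[est])
            else:
                first_exon_for_est[est] = exon

    groups = defaultdict(list)
    for exon in exon_to_ests:
        groups[find(exon)].append(exon)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical EST hits for five predicted exons.
est_hits = {
    "exon1": {"EST_A"},
    "exon2": {"EST_A", "EST_B"},
    "exon3": {"EST_B"},
    "exon4": {"EST_C"},
    "exon5": set(),
}
gene_candidates = group_exons_by_est(est_hits)
```

    The grouping yields only candidate gene boundaries. The biological rules and the gene-model construction the abstract mentions are outside this sketch.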

  6. Inferring gene structures in genomic sequences using pattern recognition and expressed sequence tags.

    PubMed

    Xu, Y; Mural, R J; Uberbacher, E C

    1997-01-01

    Computational methods for gene identification in genomic sequences typically have two phases: coding region prediction and gene parsing. While there are many effective methods for predicting coding regions (exons), parsing the predicted exons into proper gene structures, to a large extent, remains an unsolved problem. This paper presents an algorithm for inferring gene structures from predicted exon candidates, based on Expressed Sequence Tags (ESTs) and biological intuition/rules. The algorithm first finds all the related ESTs in the EST database (dbEST) for each predicted exon, and infers the boundaries of one or a series of genes based on the available EST information and biological rules. Then it constructs gene models within each pair of gene boundaries, that are most consistent with the EST information. By exploiting EST information and biological rules, the algorithm can (1) model complicated multiple gene structures, including embedded genes, (2) identify falsely-predicted exons and locate missed exons, and (3) make more accurate exon boundary predictions. The algorithm has been implemented and tested on long genomic sequences with a number of genes. Test results show that very accurate (predicted) gene models can be expected when related ESTs exist for the predicted exons.

  7. Adaptive evolution of chloroplast genome structure inferred using a parametric bootstrap approach

    PubMed Central

    Cui, Liying; Leebens-Mack, Jim; Wang, Li-San; Tang, Jijun; Rymarquis, Linda; Stern, David B; dePamphilis, Claude W

    2006-01-01

    Background: Genome rearrangements influence gene order and configuration of gene clusters in all genomes. Most land plant chloroplast DNAs (cpDNAs) share a highly conserved gene content and, with notable exceptions, a largely co-linear gene order. Conserved gene orders may reflect a slow intrinsic rate of neutral chromosomal rearrangements, or selective constraint. It is unknown to what extent observed changes in gene order are random or adaptive. We investigate the influence of natural selection on gene order in association with increased rate of chromosomal rearrangement. We use a novel parametric bootstrap approach to test if directional selection is responsible for the clustering of functionally related genes observed in the highly rearranged chloroplast genome of the unicellular green alga Chlamydomonas reinhardtii, relative to ancestral chloroplast genomes. Results: Ancestral gene orders were inferred and then subjected to simulated rearrangement events under the random breakage model with varying ratios of inversions and transpositions. We found that adjacent chloroplast genes in C. reinhardtii were located on the same strand much more frequently than in simulated genomes that were generated under a random rearrangement process (increased sidedness; p < 0.0001). In addition, functionally related genes were found to be more clustered than those evolved under random rearrangements (p < 0.0001). We report evidence of co-transcription of neighboring genes, which may be responsible for the observed gene clusters in C. reinhardtii cpDNA. Conclusion: Simulations and experimental evidence suggest that both selective maintenance and directional selection for gene clusters are determinants of chloroplast gene order. PMID:16469102
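    The sidedness test in this abstract can be sketched as a simulation. This is a simplified illustration, not the authors' procedure. It assumes a linear (not circular) gene order, uses inversions only (no transpositions), takes the number of rearrangements as a fixed input, and defines sidedness as the fraction of adjacent gene pairs on the same strand. All function names are hypothetical.

```python
import random


def sidedness(strands):
    """Fraction of adjacent gene pairs that lie on the same strand."""
    pairs = list(zip(strands, strands[1:]))
    return sum(a == b for a, b in pairs) / len(pairs)


def random_inversion(genome, rng):
    """Invert a random segment: reverse its gene order and flip each strand.

    `genome` is a list of (gene, strand) tuples with strand in {+1, -1}.
    """
    i, j = sorted(rng.sample(range(len(genome) + 1), 2))
    flipped = [(gene, -strand) for gene, strand in reversed(genome[i:j])]
    return genome[:i] + flipped + genome[j:]


def sidedness_p_value(ancestral, observed, n_inversions, n_sims=1000, seed=0):
    """Share of simulated genomes whose sidedness reaches the observed value."""
    rng = random.Random(seed)
    observed_score = sidedness([strand for _, strand in observed])
    at_least_as_sided = 0
    for _ in range(n_sims):
        genome = list(ancestral)
        for _ in range(n_inversions):
            genome = random_inversion(genome, rng)
        if sidedness([strand for _, strand in genome]) >= observed_score:
            at_least_as_sided += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_sided + 1) / (n_sims + 1)


# Hypothetical 12-gene ancestor with alternating strands, and a derived
# genome in which all genes sit on one strand.
ancestor = [(k, 1 if k % 2 == 0 else -1) for k in range(12)]
derived = [(k, 1) for k in range(12)]
p = sidedness_p_value(ancestor, derived, n_inversions=8, n_sims=500)
```

    A small `p` would indicate that random inversions alone rarely produce as much sidedness as the observed genome shows, which is the logic behind the reported p < 0.0001.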

  8. Genetic Structure and Phylogeography of the Leopard Cat (Prionailurus bengalensis) Inferred from Mitochondrial Genomes.

    PubMed

    Patel, Riddhi P; Wutke, Saskia; Lenz, Dorina; Mukherjee, Shomita; Ramakrishnan, Uma; Veron, Géraldine; Fickel, Jörns; Wilting, Andreas; Förster, Daniel W

    2017-06-01

    The Leopard cat Prionailurus bengalensis is a habitat generalist that is widely distributed across Southeast Asia. Based on morphological traits, this species has been subdivided into 12 subspecies. Thus far, there have been few molecular studies investigating intraspecific variation, and those have been limited in geographic scope. For this reason, we aimed to study the genetic structure and evolutionary history of this species across its very large distribution range in Asia. We employed both PCR-based (short mtDNA fragments, 94 samples) and high throughput sequencing based methods (whole mitochondrial genomes, 52 samples) on archival, noninvasively collected and fresh samples to investigate the distribution of intraspecific genetic variation. Our comprehensive sampling coupled with the improved resolution of the mitochondrial genome analyses provided strong support for a deep split between Mainland and Sundaic Leopard cats. Although we identified multiple haplogroups within the species' distribution, we found no matrilineal evidence for the distinction of 12 subspecies. In the context of Leopard cat biogeography, we cautiously recommend a revision of the Prionailurus bengalensis subspecific taxonomy: namely, a reduction to 4 subspecies (2 mainland and 2 Sundaic forms). © The American Genetic Association 2017. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

  9. Remarkable variation in maize genome structure inferred from haplotype diversity at the bz locus.

    PubMed

    Wang, Qinghua; Dooner, Hugo K

    2006-11-21

    Maize is probably the most diverse of all crop species. Unexpectedly large differences among haplotypes were first revealed in a comparison of the bz genomic regions of two different inbred lines, McC and B73. Retrotransposon clusters, which comprise most of the repetitive DNA in maize, varied markedly in makeup and location relative to the genes in the region, and genic sequences, later shown to be carried by two helitron transposons, also differed between the inbreds. Thus, the allelic bz regions of these Corn Belt inbreds shared only a minority of the total sequence. To investigate further the variation caused by retrotransposons, helitrons, and other insertions, we have analyzed the organization of the bz genomic region in five additional cultivars selected because of their geographic and genetic diversity: the inbreds A188, CML258, and I137TN, and the land races Coroico and NalTel. This vertical comparison has revealed the existence of several new helitrons, new retrotransposons, members of every superfamily of DNA transposons, numerous miniature elements, and novel insertions flanked at either end by TA repeats, which we call TAFTs (TA-flanked transposons). The extent of variation in the region is remarkable. In pairwise comparisons of eight bz haplotypes, the percentage of shared sequences ranges from 25% to 84%. Chimeric haplotypes were identified that combine retrotransposon clusters found in different haplotypes. We propose that recombination in the common gene space greatly amplifies the variability produced by the retrotransposition explosion in the maize ancestry, creating the heterogeneity in genome organization found in modern maize.

  10. Remarkable variation in maize genome structure inferred from haplotype diversity at the bz locus

    PubMed Central

    Wang, Qinghua; Dooner, Hugo K.

    2006-01-01

    Maize is probably the most diverse of all crop species. Unexpectedly large differences among haplotypes were first revealed in a comparison of the bz genomic regions of two different inbred lines, McC and B73. Retrotransposon clusters, which comprise most of the repetitive DNA in maize, varied markedly in makeup and location relative to the genes in the region, and genic sequences, later shown to be carried by two helitron transposons, also differed between the inbreds. Thus, the allelic bz regions of these Corn Belt inbreds shared only a minority of the total sequence. To investigate further the variation caused by retrotransposons, helitrons, and other insertions, we have analyzed the organization of the bz genomic region in five additional cultivars selected because of their geographic and genetic diversity: the inbreds A188, CML258, and I137TN, and the land races Coroico and NalTel. This vertical comparison has revealed the existence of several new helitrons, new retrotransposons, members of every superfamily of DNA transposons, numerous miniature elements, and novel insertions flanked at either end by TA repeats, which we call TAFTs (TA-flanked transposons). The extent of variation in the region is remarkable. In pairwise comparisons of eight bz haplotypes, the percentage of shared sequences ranges from 25% to 84%. Chimeric haplotypes were identified that combine retrotransposon clusters found in different haplotypes. We propose that recombination in the common gene space greatly amplifies the variability produced by the retrotransposition explosion in the maize ancestry, creating the heterogeneity in genome organization found in modern maize. PMID:17101975

  11. Pseudoscorpion mitochondria show rearranged genes and genome-wide reductions of RNA gene sizes and inferred structures, yet typical nucleotide composition bias

    PubMed Central

    2012-01-01

    Background Pseudoscorpions are chelicerates and have historically been viewed as being most closely related to solifuges, harvestmen, and scorpions. No mitochondrial genomes of pseudoscorpions have been published, but the mitochondrial genomes of some lineages of Chelicerata possess unusual features, including short rRNA genes and tRNA genes that lack sequence to encode arms of the canonical cloverleaf-shaped tRNA. Additionally, some chelicerates possess an atypical guanine-thymine nucleotide bias on the major coding strand of their mitochondrial genomes. Results We sequenced the mitochondrial genomes of two divergent taxa from the chelicerate order Pseudoscorpiones. We find that these genomes possess unusually short tRNA genes that do not encode cloverleaf-shaped tRNA structures. Indeed, in one genome, all 22 tRNA genes lack sequence to encode canonical cloverleaf structures. We also find that the large ribosomal RNA genes are substantially shorter than those of most arthropods. We inferred secondary structures of the LSU rRNAs from both pseudoscorpions, and find that they have lost multiple helices. Based on comparisons with the crystal structure of the bacterial ribosome, two of these helices were likely contact points with tRNA T-arms or D-arms as they pass through the ribosome during protein synthesis. The mitochondrial gene arrangements of both pseudoscorpions differ from the ancestral chelicerate gene arrangement. One genome is rearranged with respect to the location of protein-coding genes, the small rRNA gene, and at least 8 tRNA genes. The other genome contains 6 tRNA genes in novel locations. Most chelicerates with rearranged mitochondrial genes show a genome-wide reversal of the CA nucleotide bias typical for arthropods on their major coding strand, and instead possess a GT bias. Yet despite their extensive rearrangement, these pseudoscorpion mitochondrial genomes possess a CA bias on the major coding strand. Phylogenetic analyses of all 13

  12. Minimal-assumption inference from population-genomic data

    PubMed Central

    Weissman, Daniel B; Hallatschek, Oskar

    2017-01-01

    Samples of multiple complete genome sequences contain vast amounts of information about the evolutionary history of populations, much of it in the associations among polymorphisms at different loci. We introduce a method, Minimal-Assumption Genomic Inference of Coalescence (MAGIC), that reconstructs key features of the evolutionary history, including the distribution of coalescence times, by integrating information across genomic length scales without using an explicit model of coalescence or recombination, allowing it to analyze arbitrarily large samples without phasing while making no assumptions about ancestral structure, linked selection, or gene conversion. Using simulated data, we show that the performance of MAGIC is comparable to that of PSMC’ even on single diploid samples generated with standard coalescent and recombination models. Applying MAGIC to a sample of human genomes reveals evidence of non-demographic factors driving coalescence. DOI: http://dx.doi.org/10.7554/eLife.24836.001 PMID:28671549

  13. Structure, expression profile and phylogenetic inference of chalcone isomerase-like genes from the narrow-leafed lupin (Lupinus angustifolius L.) genome

    PubMed Central

    Przysiecka, Łucja; Książkiewicz, Michał; Wolko, Bogdan; Naganowska, Barbara

    2015-01-01

    Lupins, like other legumes, have a unique biosynthesis scheme of 5-deoxy-type flavonoids and isoflavonoids. A key enzyme in this pathway is chalcone isomerase (CHI), a member of CHI-fold protein family, encompassing subfamilies of CHI1, CHI2, CHI-like (CHIL), and fatty acid-binding (FAP) proteins. Here, two Lupinus angustifolius (narrow-leafed lupin) CHILs, LangCHIL1 and LangCHIL2, were identified and characterized using DNA fingerprinting, cytogenetic and linkage mapping, sequencing and expression profiling. Clones carrying CHIL sequences were assembled into two contigs. Full gene sequences were obtained from these contigs, and mapped in two L. angustifolius linkage groups by gene-specific markers. Bacterial artificial chromosome fluorescence in situ hybridization approach confirmed the localization of two LangCHIL genes in distinct chromosomes. The expression profiles of both LangCHIL isoforms were very similar. The highest level of transcription was in the roots of the third week of plant growth; thereafter, expression declined. The expression of both LangCHIL genes in leaves and stems was similar and low. Comparative mapping to reference legume genome sequences revealed strong syntenic links; however, LangCHIL2 contig had a much more conserved structure than LangCHIL1. LangCHIL2 is assumed to be an ancestor gene, whereas LangCHIL1 probably appeared as a result of duplication. As both copies are transcriptionally active, questions arise concerning their hypothetical functional divergence. Screening of the narrow-leafed lupin genome and transcriptome with CHI-fold protein sequences, followed by Bayesian inference of phylogeny and cross-genera synteny survey, identified representatives of all but one (CHI1) main subfamilies. They are as follows: two copies of CHI2, FAPa2 and CHIL, and single copies of FAPb and FAPa1. Duplicated genes are remnants of whole genome duplication which is assumed to have occurred after the divergence of Lupinus, Arachis, and Glycine

  14. Genome-Wide Inference of Ancestral Recombination Graphs

    PubMed Central

    Rasmussen, Matthew D.; Hubisz, Melissa J.; Gronau, Ilan; Siepel, Adam

    2014-01-01

    The complex correlation structure of a collection of orthologous DNA sequences is uniquely captured by the “ancestral recombination graph” (ARG), a complete record of coalescence and recombination events in the history of the sample. However, existing methods for ARG inference are computationally intensive, highly approximate, or limited to small numbers of sequences, and, as a consequence, explicit ARG inference is rarely used in applied population genomics. Here, we introduce a new algorithm for ARG inference that is efficient enough to apply to dozens of complete mammalian genomes. The key idea of our approach is to sample an ARG of n chromosomes conditional on an ARG of n−1 chromosomes, an operation we call “threading.” Using techniques based on hidden Markov models, we can perform this threading operation exactly, up to the assumptions of the sequentially Markov coalescent and a discretization of time. An extension allows for threading of subtrees instead of individual sequences. Repeated application of these threading operations results in highly efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these methods in a computer program called ARGweaver. Experiments with simulated data indicate that ARGweaver converges rapidly to the posterior distribution over ARGs and is effective in recovering various features of the ARG for dozens of sequences generated under realistic parameters for human populations. In applications of ARGweaver to 54 human genome sequences from Complete Genomics, we find clear signatures of natural selection, including regions of unusually ancient ancestry associated with balancing selection and reductions in allele age in sites under directional selection. The patterns we observe near protein-coding genes are consistent with a primary influence from background selection rather than hitchhiking, although we cannot rule out a contribution from recurrent selective sweeps. PMID:24831947

  15. Inferring Heterozygosity from Ancient and Low Coverage Genomes

    PubMed Central

    Kousathanas, Athanasios; Leuenberger, Christoph; Link, Vivian; Sell, Christian; Burger, Joachim; Wegmann, Daniel

    2017-01-01

    While genetic diversity can be quantified accurately from high coverage sequencing data, it is often desirable to obtain such estimates from data with low coverage, either to save costs or because of low DNA quality, as is observed for ancient samples. Here, we introduce a method to accurately infer heterozygosity probabilistically from sequences with average coverage <1× of a single individual. The method relaxes the infinite sites assumption of previous methods, does not require a reference sequence, except for the initial alignment of the sequencing data, and takes into account both variable sequencing errors and potential postmortem damage. It is thus also applicable to nonmodel organisms and ancient genomes. Since error rates as reported by sequencing machines are generally distorted and require recalibration, we also introduce a method to accurately infer recalibration parameters in the presence of postmortem damage. This method does not require knowledge about the underlying genome sequence, but instead works with haploid data (e.g., from the X-chromosome from mammalian males) and integrates over the unknown genotypes. Using extensive simulations we show that a few megabasepairs of haploid data are sufficient for accurate recalibration, even at average coverages as low as 1×. At similar coverages, our method also produces very accurate estimates of heterozygosity down to 10⁻⁴ within windows of about 1 Mbp. We further illustrate the usefulness of our approach by inferring genome-wide patterns of diversity for several ancient human samples, and we found that 3000–5000-year-old samples showed diversity patterns comparable to those of modern humans. In contrast, two European hunter-gatherer samples exhibited not only considerably lower levels of diversity than modern samples, but also highly distinct distributions of diversity along their genomes. Interestingly, these distributions were also very different between the two samples, supporting earlier

  16. Inferring network structure from cascades

    NASA Astrophysics Data System (ADS)

    Ghonge, Sushrut; Vural, Dervis Can

    2017-07-01

    Many physical, biological, and social phenomena can be described by cascades taking place on a network. Often, the activity can be empirically observed, but not the underlying network of interactions. In this paper we offer three topological methods to infer the structure of any directed network given a set of cascade arrival times. Our formulas hold for a very general class of models where the activation probability of a node is a generic function of its degree and the number of its active neighbors. We report high success rates for synthetic and real networks, for several different cascade models.

  17. Freshwater bacterial lifestyles inferred from comparative genomics.

    PubMed

    Livermore, Joshua A; Emrich, Scott J; Tan, John; Jones, Stuart E

    2014-03-01

    While micro-organisms actively mediate and participate in freshwater ecosystem services, we know little about freshwater microbial genetic diversity. Genome sequences are available for many bacteria from the human microbiome and the ocean (over 800 and 200, respectively), but only two freshwater genomes are currently available: the streamlined genomes of Polynucleobacter necessarius ssp. asymbioticus and the Actinobacterium AcI-B1. Here, we sequenced and analysed draft genomes of eight phylogenetically diverse freshwater bacteria exhibiting a range of lifestyle characteristics. Comparative genomics of these bacteria reveals putative freshwater bacterial lifestyles based on differences in predicted growth rate, capability to respond to environmental stimuli and diversity of useable carbon substrates. Our conceptual model based on these genomic characteristics provides a foundation on which further ecophysiological and genomic studies can be built. In addition, these genomes greatly expand the diversity of existing genomic context for future studies on the ecology and genetics of freshwater bacteria.

  18. Inferring parental genomic ancestries using pooled semi-Markov processes.

    PubMed

    Zou, James Y; Halperin, Eran; Burchard, Esteban; Sankararaman, Sriram

    2015-06-15

    A basic problem of broad public and scientific interest is to use the DNA of an individual to infer the genomic ancestries of the parents. In particular, we are often interested in the fraction of each parent's genome that comes from specific ancestries (e.g. European, African, Native American, etc). This has many applications ranging from understanding the inheritance of ancestry-related risks and traits to quantifying human assortative mating patterns. We model the problem of parental genomic ancestry inference as a pooled semi-Markov process. We develop a general mathematical framework for pooled semi-Markov processes and construct efficient inference algorithms for these models. Applying our inference algorithm to genotype data from 231 Mexican trios and 258 Puerto Rican trios where we have the true genomic ancestry of each parent, we demonstrate that our method accurately infers parameters of the semi-Markov processes and parents' genomic ancestries. We additionally validated the method on simulations. Our model of pooled semi-Markov process and inference algorithms may be of independent interest in other settings in genomics and machine learning. © The Author 2015. Published by Oxford University Press.

  19. Genetic Network Inference Using Hierarchical Structure

    PubMed Central

    Kimura, Shuhei; Tokuhisa, Masato; Okada-Hatakeyama, Mariko

    2016-01-01

    Many methods for inferring genetic networks have been proposed, but the regulations they infer often include false-positives. Several researchers have attempted to reduce these erroneous regulations by proposing the use of a priori knowledge about the properties of genetic networks such as their sparseness, scale-free structure, and so on. This study focuses on another piece of a priori knowledge, namely, that biochemical networks exhibit hierarchical structures. Based on this idea, we propose an inference approach that uses the hierarchical structure in a target genetic network. To obtain a reasonable hierarchical structure, the first step of the proposed approach is to infer multiple genetic networks from the observed gene expression data. We take this step using an existing method that combines a genetic network inference method with a bootstrap method. The next step is to extract a hierarchical structure from the inferred networks that is consistent with most of the networks. Third, we use the hierarchical structure obtained to assign confidence values to all candidate regulations. Numerical experiments are also performed to demonstrate the effectiveness of using the hierarchical structure in the genetic network inference. The improvement accomplished by the use of the hierarchical structure is small. However, the hierarchical structure could be used to improve the performances of many existing inference methods. PMID:26941653

  20. Use of Whole Genome Sequence Data To Infer Baculovirus Phylogeny

    PubMed Central

    Herniou, Elisabeth A.; Luque, Teresa; Chen, Xinwen; Vlak, Just M.; Winstanley, Doreen; Cory, Jennifer S.; O'Reilly, David R.

    2001-01-01

    Several phylogenetic methods based on whole genome sequence data were evaluated using data from nine complete baculovirus genomes. The utility of three independent character sets was assessed. The first data set comprised the sequences of the 63 genes common to these viruses. The second set of characters was based on gene order, and phylogenies were inferred using both breakpoint distance analysis and a novel method developed here, termed neighbor pair analysis. The third set recorded gene content by scoring gene presence or absence in each genome. All three data sets yielded phylogenies supporting the separation of the Nucleopolyhedrovirus (NPV) and Granulovirus (GV) genera, the division of the NPVs into groups I and II, and species relationships within group I NPVs. Generation of phylogenies based on the combined sequences of all 63 shared genes proved to be the most effective approach to resolving the relationships among the group II NPVs and the GVs. The history of gene acquisitions and losses that have accompanied baculovirus diversification was visualized by mapping the gene content data onto the phylogenetic tree. This analysis highlighted the fluid nature of baculovirus genomes, with evidence of frequent genome rearrangements and multiple gene content changes during their evolution. Of more than 416 genes identified in the genomes analyzed, only 63 are present in all nine genomes, and 200 genes are found only in a single genome. Despite this fluidity, the whole genome-based methods we describe are sufficiently powerful to recover the underlying phylogeny of the viruses. PMID:11483757
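    Two of the character sets named in this abstract, gene order and gene content, can be illustrated with a short sketch. It is not the authors' code. The breakpoint distance below uses one common unsigned definition (adjacencies of one circular gene order that are absent from the other), which is an assumption here, and the neighbor pair analysis introduced in the paper is not reproduced. Genome and gene names are hypothetical.

```python
def adjacencies(order):
    """Unordered neighbor pairs of a circular gene order."""
    n = len(order)
    return {frozenset((order[i], order[(i + 1) % n])) for i in range(n)}


def breakpoint_distance(order_a, order_b):
    """Number of adjacencies in order_a that do not occur in order_b.

    Assumes both orders contain exactly the same genes (for example the
    shared gene set), ignores strand, and treats genomes as circular.
    """
    if set(order_a) != set(order_b):
        raise ValueError("gene orders must contain the same genes")
    return len(adjacencies(order_a) - adjacencies(order_b))


def gene_content_matrix(genomes):
    """Score presence (1) or absence (0) of every gene in every genome."""
    all_genes = sorted(set().union(*genomes.values()))
    matrix = {
        name: [int(gene in genes) for gene in all_genes]
        for name, genes in genomes.items()
    }
    return all_genes, matrix


# Hypothetical gene orders over five shared genes.
distance = breakpoint_distance([1, 2, 3, 4, 5], [1, 2, 4, 3, 5])

# Hypothetical gene content for two genomes.
genes, content = gene_content_matrix(
    {"virusA": {"g1", "g2"}, "virusB": {"g2", "g3"}}
)
```

    Pairwise breakpoint distances can feed a distance-based tree method, and the presence/absence rows can be used as binary characters. Both uses are only outlined here.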

  1. Inference of self-regulated transcriptional networks by comparative genomics.

    PubMed

    Cornish, Joseph P; Matthews, Fialelei; Thomas, Julien R; Erill, Ivan

    2012-01-01

    The assumption of basic properties, like self-regulation, in simple transcriptional regulatory networks can be exploited to infer regulatory motifs from the growing amounts of genomic and meta-genomic data. These motifs can in principle be used to elucidate the nature and scope of transcriptional networks through comparative genomics. Here we assess the feasibility of this approach using the SOS regulatory network of Gram-positive bacteria as a test case. Using experimentally validated data, we show that the known regulatory motif can be inferred through the assumption of self-regulation. Furthermore, the inferred motif provides a more robust search pattern for comparative genomics than the experimental motifs defined in reference organisms. We take advantage of this robustness to generate a functional map of the SOS response in Gram-positive bacteria. Our results reveal definite differences in the composition of the LexA regulon between Firmicutes and Actinobacteria, and confirm that regulation of cell-division inhibition is a widespread characteristic of this network among Gram-positive bacteria.

  2. SOP for pathway inference in Integrated Microbial Genomes (IMG).

    PubMed

    Anderson, Iain; Chen, Amy; Markowitz, Victor; Kyrpides, Nikos; Ivanova, Natalia

    2011-12-31

    One of the most important aspects of genomic analysis is the prediction of which pathways, both metabolic and non-metabolic, are present in an organism. In IMG, this is carried out by the assignment of IMG terms, which are organized into IMG pathways. Based on manual and automatic assignment of IMG terms, the presence or absence of IMG pathways is automatically inferred. The three categories of pathway assertion are asserted (likely present), not asserted (likely absent), and unknown. In the unknown category, at least one term necessary for the pathway is missing, but an ortholog in another organism has the corresponding term assigned to it. Automatic pathway inference is an important initial step in genome analysis.
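
    The three assertion categories can be sketched as a small decision function. This is one possible reading, not IMG's actual implementation: it assumes a pathway is "asserted" when every required term is assigned, "unknown" when every missing term is covered by ortholog evidence from another organism, and "not asserted" otherwise. The function name, argument names and term identifiers are hypothetical.

```python
def assert_pathway(required_terms, assigned_terms, ortholog_terms):
    """Classify a pathway as 'asserted', 'unknown' or 'not asserted'."""
    missing = set(required_terms) - set(assigned_terms)
    if not missing:
        return "asserted"          # all required terms are assigned
    if missing <= set(ortholog_terms):
        return "unknown"           # gaps might be filled via orthologs
    return "not asserted"          # at least one gap has no supporting evidence


# Hypothetical term identifiers.
required = {"T1", "T2", "T3"}
print(assert_pathway(required, {"T1", "T2", "T3"}, set()))
print(assert_pathway(required, {"T1", "T2"}, {"T3"}))
print(assert_pathway(required, {"T1"}, {"T3"}))
```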

  3. Genomic inferences from Afrotheria and the evolution of elephants.

    PubMed

    Roca, Alfred L; O'Brien, Stephen J

    2005-12-01

    Recent genetic studies have established that African forest and savanna elephants are distinct species with dissociated cytonuclear genomic patterns, and have identified Asian elephants from Borneo and Sumatra as conservation priorities. Representative of Afrotheria, a superordinal clade encompassing six eutherian orders, the African savanna elephant was among the first mammals chosen for whole-genome sequencing to provide a comparative understanding of the human genome. Elephants have large and complex brains and display advanced levels of social structure, communication, learning and intelligence. The elephant genome sequence might prove useful for comparative genomic studies of these advanced traits, which have appeared independently in only three mammalian orders: primates, cetaceans and proboscideans.

  4. GARLIC: Genomic Autozygosity Regions Likelihood-based Inference and Classification.

    PubMed

    Szpiech, Zachary A; Blant, Alexandra; Pemberton, Trevor J

    2017-07-01

    Runs of homozygosity (ROH) are important genomic features that manifest when identical-by-descent haplotypes are inherited from parents. Their length distributions and genomic locations are informative about population history and they are useful for mapping recessive loci contributing to both Mendelian and complex disease risk. Here, we present software implementing a model-based method (Pemberton et al., 2012) for inferring ROH in genome-wide SNP datasets that incorporates population-specific parameters and a genotyping error rate as well as provides a length-based classification module to identify biologically interesting classes of ROH. Using simulations, we evaluate the performance of this method. GARLIC is written in C++. Source code and pre-compiled binaries (Windows, OSX and Linux) are hosted on GitHub (https://github.com/szpiech/garlic) under the GNU General Public License version 3. Contact: zachary.szpiech@ucsf.edu. Supplementary data are available at Bioinformatics online.

  5. Bayesian structural inference for hidden processes.

    PubMed

    Strelioff, Christopher C; Crutchfield, James P

    2014-04-01

    We introduce a Bayesian approach to discovering patterns in structurally complex processes. The proposed method of Bayesian structural inference (BSI) relies on a set of candidate unifilar hidden Markov model (uHMM) topologies for inference of process structure from a data series. We employ a recently developed exact enumeration of topological ε-machines. (A sequel then removes the topological restriction.) This subset of the uHMM topologies has the added benefit that inferred models are guaranteed to be ε-machines, irrespective of estimated transition probabilities. Properties of ε-machines and uHMMs allow for the derivation of analytic expressions for estimating transition probabilities, inferring start states, and comparing the posterior probability of candidate model topologies, despite process internal structure being only indirectly present in data. We demonstrate BSI's effectiveness in estimating a process's randomness, as reflected by the Shannon entropy rate, and its structure, as quantified by the statistical complexity. We also compare using the posterior distribution over candidate models and the single, maximum a posteriori model for point estimation and show that the former more accurately reflects uncertainty in estimated values. We apply BSI to in-class examples of finite- and infinite-order Markov processes, as well as to an out-of-class, infinite-state hidden process.
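
    The two quantities named above, the Shannon entropy rate and the statistical complexity, can be computed directly once a unifilar HMM is fixed. The sketch below does only that forward calculation for a known model; it is not the Bayesian structural inference itself. The example machine (the two-state Golden Mean process) and all function names are choices made for illustration.

```python
import math


def stationary(machine, iters=1000):
    """Stationary state distribution by repeated application of the transitions."""
    pi = {s: 1.0 / len(machine) for s in machine}
    for _ in range(iters):
        nxt = {s: 0.0 for s in machine}
        for s, edges in machine.items():
            for _symbol, prob, target in edges:
                nxt[target] += pi[s] * prob
        pi = nxt
    return pi


def entropy_rate(machine, pi):
    """h = -sum_s pi(s) sum_x p(x|s) log2 p(x|s); valid for unifilar machines."""
    return -sum(
        pi[s] * p * math.log2(p)
        for s, edges in machine.items()
        for _symbol, p, _target in edges
        if p > 0
    )


def statistical_complexity(pi):
    """C = Shannon entropy (bits) of the stationary state distribution."""
    return -sum(p * math.log2(p) for p in pi.values() if p > 0)


# Golden Mean process: state -> list of (symbol, probability, next state).
golden_mean = {
    "A": [("1", 0.5, "A"), ("0", 0.5, "B")],
    "B": [("1", 1.0, "A")],
}

pi = stationary(golden_mean)
h = entropy_rate(golden_mean, pi)
c = statistical_complexity(pi)
print(pi, h, c)
```

    For this machine the stationary distribution is (2/3, 1/3), so the entropy rate is 2/3 bit per symbol and the statistical complexity is about 0.918 bits.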

  6. Inference of distant genetic relations in humans using "1000 genomes".

    PubMed

    Al-Khudhair, Ahmed; Qiu, Shuhao; Wyse, Meghan; Chowdhury, Shilpi; Cheng, Xi; Bekbolsynov, Dulat; Saha-Mandal, Arnab; Dutta, Rajib; Fedorova, Larisa; Fedorov, Alexei

    2015-01-07

    Nucleotide sequence differences on the whole-genome scale have been computed for 1,092 people from 14 populations made publicly available by the 1000 Genomes Project. The total number of differences in genetic variants between 96,464 human pairs has been calculated. The distributions of these differences for individuals of European, Asian, or African origin were characterized by narrow unimodal peaks with mean values of 3.8, 3.5, and 5.1 million, respectively, and standard deviations of 0.1-0.03 million. The total numbers of genomic differences between pairs of all known relatives were found to be significantly lower than their respective population means and in reverse proportion to the distance of their consanguinity. By counting the total number of genomic differences it is possible to infer familial relations for people who share down to 6% of common loci identical-by-descent. Detection of familial relations can be radically improved when only very rare genetic variants are taken into account. Counting the total number of shared very rare single nucleotide polymorphisms (SNPs) from whole-genome sequences allows establishing distant familial relations for persons with eighth and ninth degrees of relationship. Using this analysis we predicted 271 distant familial pairwise relations among 1,092 individuals that have not been declared by the 1000 Genomes Project. In particular, among 89 British and 97 Chinese individuals we found three British-Chinese pairs with distant genetic relationships. Individuals from these pairs share identical-by-descent DNA fragments that represent 0.001%, 0.004%, and 0.01% of their genomes. With affordable whole-genome sequencing techniques, very rare SNPs should become important genetic markers for familial relationships and population stratification. © The Author(s) 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
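
    The core counting step can be sketched as follows. The abstract does not say exactly how a "difference" is scored, so this sketch assumes genotypes are coded as alternate-allele counts (0, 1 or 2) and counts every site where two individuals' genotypes differ. The individuals, genotype vectors and the `pairwise_differences` helper are hypothetical.

```python
from itertools import combinations


def pairwise_differences(genotypes):
    """Map each pair of individuals to the number of sites where they differ."""
    diffs = {}
    for (a, ga), (b, gb) in combinations(genotypes.items(), 2):
        diffs[(a, b)] = sum(x != y for x, y in zip(ga, gb))
    return diffs


# Hypothetical genotypes: alternate-allele counts at five variant sites.
toy_genotypes = {
    "ind1": [0, 1, 2, 0, 1],
    "ind2": [0, 1, 2, 1, 1],
    "ind3": [2, 0, 0, 0, 1],
}

differences = pairwise_differences(toy_genotypes)
print(differences)
```

    Under this scoring, a pair whose count falls well below the population mean (here `ind1` and `ind2`) would be the kind of pair flagged as potentially related.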

  7. A human genome-wide library of local phylogeny predictions for whole-genome inference problems

    PubMed Central

    Sridhar, Srinath; Schwartz, Russell

    2008-01-01

    Background: Many common inference problems in computational genetics depend on inferring aspects of the evolutionary history of a data set given a set of observed modern sequences. Detailed predictions of the full phylogenies are therefore of value in improving our ability to make further inferences about population history and sources of genetic variation. Making phylogenetic predictions on the scale needed for whole-genome analysis is, however, extremely computationally demanding. Results: In order to facilitate phylogeny-based predictions on a genomic scale, we develop a library of maximum parsimony phylogenies within local regions spanning all autosomal human chromosomes based on Haplotype Map variation data. We demonstrate the utility of this library for population genetic inferences by examining a tree statistic we call 'imperfection,' which measures the reuse of variant sites within a phylogeny. This statistic is significantly predictive of recombination rate, shows additional regional and population-specific conservation, and allows us to identify outlier genes likely to have experienced unusual amounts of variation in recent human history. Conclusion: Recent theoretical advances in algorithms for phylogenetic tree reconstruction have made it possible to perform large-scale inferences of local maximum parsimony phylogenies from single nucleotide polymorphism (SNP) data. As results from the imperfection statistic demonstrate, phylogeny predictions encode substantial information useful for detecting genomic features and population history. This data set should serve as a platform for many kinds of inferences one may wish to make about human population history and genetic variation. PMID:18710563

  8. Inferring Strain Mixture within Clinical Plasmodium falciparum Isolates from Genomic Sequence Data

    PubMed Central

    O’Brien, John D.; Amenga-Etego, Lucas

    2016-01-01

    We present a rigorous statistical model that infers the structure of P. falciparum mixtures—including the number of strains present, their proportion within the samples, and the amount of unexplained mixture—using whole genome sequence (WGS) data. Applied to simulation data, artificial laboratory mixtures, and field samples, the model provides reasonable inference with as few as 10 reads or 50 SNPs and works efficiently even with much larger data sets. Source code and example data for the model are provided in an open source fashion. We discuss the possible uses of this model as a window into within-host selection for clinical and epidemiological studies. PMID:27362949

  9. Population genetic inference from personal genome data: impact of ancestry and admixture on human genomic variation.

    PubMed

    Kidd, Jeffrey M; Gravel, Simon; Byrnes, Jake; Moreno-Estrada, Andres; Musharoff, Shaila; Bryc, Katarzyna; Degenhardt, Jeremiah D; Brisbin, Abra; Sheth, Vrunda; Chen, Rong; McLaughlin, Stephen F; Peckham, Heather E; Omberg, Larsson; Bormann Chung, Christina A; Stanley, Sarah; Pearlstein, Kevin; Levandowsky, Elizabeth; Acevedo-Acevedo, Suehelay; Auton, Adam; Keinan, Alon; Acuña-Alonzo, Victor; Barquera-Lozano, Rodrigo; Canizales-Quinteros, Samuel; Eng, Celeste; Burchard, Esteban G; Russell, Archie; Reynolds, Andy; Clark, Andrew G; Reese, Martin G; Lincoln, Stephen E; Butte, Atul J; De La Vega, Francisco M; Bustamante, Carlos D

    2012-10-05

    Full sequencing of individual human genomes has greatly expanded our understanding of human genetic variation and population history. Here, we present a systematic analysis of 50 human genomes from 11 diverse global populations sequenced at high coverage. Our sample includes 12 individuals who have admixed ancestry and who have varying degrees of recent (within the last 500 years) African, Native American, and European ancestry. We found over 21 million single-nucleotide variants that contribute to a 1.75-fold range in nucleotide heterozygosity across diverse human genomes. This heterozygosity ranged from a high of one heterozygous site per kilobase in west African genomes to a low of 0.57 heterozygous sites per kilobase in segments inferred to have diploid Native American ancestry from the genomes of Mexican and Puerto Rican individuals. We show evidence of all three continental ancestries in the genomes of Mexican, Puerto Rican, and African American populations, and the genome-wide statistics are highly consistent across individuals from a population once ancestry proportions have been accounted for. Using a generalized linear model, we identified subtle variations across populations in the proportion of neutral versus deleterious variation and found that genome-wide statistics vary in admixed populations even once ancestry proportions have been factored in. We further infer that multiple periods of gene flow shaped the diversity of admixed populations in the Americas: 70% of the European ancestry in today's African Americans dates back to European gene flow happening only 7-8 generations ago.
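
    The reported 1.75-fold range appears consistent with the two heterozygosity extremes quoted in the abstract, as a quick check suggests. The variable names below are illustrative, and treating the fold range as the simple ratio of the two extremes is an assumption about how the figure was derived.

```python
# Heterozygous sites per kilobase, as quoted in the abstract.
high_het_per_kb = 1.0    # west African genomes
low_het_per_kb = 0.57    # segments with diploid Native American ancestry

# Assumption: "fold range" is the ratio of the highest to the lowest value.
fold_range = high_het_per_kb / low_het_per_kb
print(round(fold_range, 2))
```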

  10. Population Genetic Inference from Personal Genome Data: Impact of Ancestry and Admixture on Human Genomic Variation

    PubMed Central

    Kidd, Jeffrey M.; Gravel, Simon; Byrnes, Jake; Moreno-Estrada, Andres; Musharoff, Shaila; Bryc, Katarzyna; Degenhardt, Jeremiah D.; Brisbin, Abra; Sheth, Vrunda; Chen, Rong; McLaughlin, Stephen F.; Peckham, Heather E.; Omberg, Larsson; Bormann Chung, Christina A.; Stanley, Sarah; Pearlstein, Kevin; Levandowsky, Elizabeth; Acevedo-Acevedo, Suehelay; Auton, Adam; Keinan, Alon; Acuña-Alonzo, Victor; Barquera-Lozano, Rodrigo; Canizales-Quinteros, Samuel; Eng, Celeste; Burchard, Esteban G.; Russell, Archie; Reynolds, Andy; Clark, Andrew G.; Reese, Martin G.; Lincoln, Stephen E.; Butte, Atul J.; De La Vega, Francisco M.; Bustamante, Carlos D.

    2012-01-01

    Full sequencing of individual human genomes has greatly expanded our understanding of human genetic variation and population history. Here, we present a systematic analysis of 50 human genomes from 11 diverse global populations sequenced at high coverage. Our sample includes 12 individuals who have admixed ancestry and who have varying degrees of recent (within the last 500 years) African, Native American, and European ancestry. We found over 21 million single-nucleotide variants that contribute to a 1.75-fold range in nucleotide heterozygosity across diverse human genomes. This heterozygosity ranged from a high of one heterozygous site per kilobase in west African genomes to a low of 0.57 heterozygous sites per kilobase in segments inferred to have diploid Native American ancestry from the genomes of Mexican and Puerto Rican individuals. We show evidence of all three continental ancestries in the genomes of Mexican, Puerto Rican, and African American populations, and the genome-wide statistics are highly consistent across individuals from a population once ancestry proportions have been accounted for. Using a generalized linear model, we identified subtle variations across populations in the proportion of neutral versus deleterious variation and found that genome-wide statistics vary in admixed populations even once ancestry proportions have been factored in. We further infer that multiple periods of gene flow shaped the diversity of admixed populations in the Americas—70% of the European ancestry in today’s African Americans dates back to European gene flow happening only 7–8 generations ago. PMID:23040495

  11. Genomic inference of the metabolism of cosmopolitan subsurface Archaea, Hadesarchaea.

    PubMed

    Baker, Brett J; Saw, Jimmy H; Lind, Anders E; Lazar, Cassandre Sara; Hinrichs, Kai-Uwe; Teske, Andreas P; Ettema, Thijs J G

    2016-02-15

    The subsurface biosphere is largely unexplored and contains a broad diversity of uncultured microbes(1). Despite being one of the few prokaryotic lineages that is cosmopolitan in both the terrestrial and marine subsurface(2-4), the physiological and ecological roles of SAGMEG (South-African Gold Mine Miscellaneous Euryarchaeal Group) Archaea are unknown. Here, we report the metabolic capabilities of this enigmatic group as inferred from genomic reconstructions. Four high-quality (63-90% complete) genomes were obtained from White Oak River estuary and Yellowstone National Park hot spring sediment metagenomes. Phylogenomic analyses place SAGMEG Archaea as a deeply rooting sister clade of the Thermococci, leading us to propose the name Hadesarchaea for this new Archaeal class. With an estimated genome size of around 1.5 Mbp, the genomes of Hadesarchaea are distinctly streamlined, yet metabolically versatile. They share several physiological mechanisms with strict anaerobic Euryarchaeota. Several metabolic characteristics make them successful in the subsurface, including genes involved in CO and H2 oxidation (or H2 production), with potential coupling to nitrite reduction to ammonia (DNRA). This first glimpse into the metabolic capabilities of these cosmopolitan Archaea suggests they are mediating key geochemical processes and are specialized for survival in the subsurface biosphere.

  12. Neural Circuit Inference from Function to Structure.

    PubMed

    Real, Esteban; Asari, Hiroki; Gollisch, Tim; Meister, Markus

    2017-01-23

    Advances in technology are opening new windows on the structural connectivity and functional dynamics of brain circuits. Quantitative frameworks are needed that integrate these data from anatomy and physiology. Here, we present a modeling approach that creates such a link. The goal is to infer the structure of a neural circuit from sparse neural recordings, using partial knowledge of its anatomy as a regularizing constraint. We recorded visual responses from the output neurons of the retina, the ganglion cells. We then generated a systematic sequence of circuit models that represents retinal neurons and connections and fitted them to the experimental data. The optimal models faithfully recapitulated the ganglion cell outputs. More importantly, they made predictions about dynamics and connectivity among unobserved neurons internal to the circuit, and these were subsequently confirmed by experiment. This circuit inference framework promises to facilitate the integration and understanding of big data in neuroscience.

  13. The aggregate site frequency spectrum for comparative population genomic inference.

    PubMed

    Xue, Alexander T; Hickerson, Michael J

    2015-12-01

    Understanding how assemblages of species responded to past climate change is a central goal of comparative phylogeography and comparative population genomics, an endeavour that has increasing potential to integrate with community ecology. New sequencing technology now provides the potential to perform complex demographic inference at unprecedented resolution across assemblages of nonmodel species. To this end, we introduce the aggregate site frequency spectrum (aSFS), an expansion of the site frequency spectrum to use single nucleotide polymorphism (SNP) data sets collected from multiple, co-distributed species for assemblage-level demographic inference. We describe how the aSFS is constructed over an arbitrary number of independent population samples and then demonstrate how the aSFS can differentiate various multispecies demographic histories under a wide range of sampling configurations while allowing effective population sizes and expansion magnitudes to vary independently. We subsequently couple the aSFS with a hierarchical approximate Bayesian computation (hABC) framework to estimate degree of temporal synchronicity in expansion times across taxa, including an empirical demonstration with a data set consisting of five populations of the threespine stickleback (Gasterosteus aculeatus). Corroborating what is generally understood about the recent postglacial origins of these populations, the joint aSFS/hABC analysis strongly suggests that the stickleback data are most consistent with synchronous expansion after the Last Glacial Maximum (posterior probability = 0.99). The aSFS will have general application for multilevel statistical frameworks to test models involving assemblages and/or communities, and as large-scale SNP data from nonmodel species become routine, the aSFS expands the potential for powerful next-generation comparative population genomic inference.

  14. Improved genome inference in the MHC using a population reference graph.

    PubMed

    Dilthey, Alexander; Cox, Charles; Iqbal, Zamin; Nelson, Matthew R; McVean, Gil

    2015-06-01

    Although much is known about human genetic variation, such information is typically ignored in assembling new genomes. Instead, reads are mapped to a single reference, which can lead to poor characterization of regions of high sequence or structural diversity. We introduce a population reference graph, which combines multiple reference sequences and catalogs of variation. The genomes of new samples are reconstructed as paths through the graph using an efficient hidden Markov model, allowing for recombination between different haplotypes and additional variants. By applying the method to the 4.5-Mb extended MHC region on human chromosome 6, combining 8 assembled haplotypes, the sequences of known classical HLA alleles and 87,640 SNP variants from the 1000 Genomes Project, we demonstrate using simulations, SNP genotyping, and short-read and long-read data how the method improves the accuracy of genome inference and identified regions where the current set of reference sequences is substantially incomplete.

  15. Nonparametric inference of network structure and dynamics

    NASA Astrophysics Data System (ADS)

    Peixoto, Tiago P.

    The network structure of complex systems determines their function and serves as evidence for the evolutionary mechanisms that lie behind them. Despite considerable effort in recent years, it remains an open challenge to formulate general descriptions of the large-scale structure of network systems and to reliably extract such information from data. Although many approaches have been proposed, few methods attempt to gauge the statistical significance of the uncovered structures, and hence the majority cannot reliably separate actual structure from stochastic fluctuations. Due to the sheer size and high dimensionality of many networks, this represents a major limitation that prevents meaningful interpretations of the results obtained with such nonstatistical methods. In this talk, I will show how these issues can be tackled in a principled and efficient fashion by formulating appropriate generative models of network structure that can have their parameters inferred from data. By employing a Bayesian description of such models, the inference can be performed in a nonparametric fashion that does not require any a priori knowledge or ad hoc assumptions about the data. I will show how this approach can be used to perform model comparison, and how hierarchical models yield the most appropriate trade-off between model complexity and quality of fit based on the statistical evidence present in the data. I will also show how this general approach can be elegantly extended to networks with edge attributes, that are embedded in latent spaces, and that change in time. The latter is obtained via a fully dynamic generative network model, based on arbitrary-order Markov chains, that can also be inferred in a nonparametric fashion. Throughout the talk I will illustrate the application of the methods with many empirical networks such as the internet at the autonomous systems level, the global airport network, the network of actors and films, social networks, citations among

  16. Genome BLAST distance phylogenies inferred from whole plastid and whole mitochondrion genome sequences

    PubMed Central

    Auch, Alexander F; Henz, Stefan R; Holland, Barbara R; Göker, Markus

    2006-01-01

    Background: Phylogenetic methods which do not rely on multiple sequence alignments are important tools in inferring trees directly from completely sequenced genomes. Here, we extend the recently described Genome BLAST Distance Phylogeny (GBDP) strategy to compute phylogenetic trees from all completely sequenced plastid genomes currently available and from a selection of mitochondrial genomes representing the major eukaryotic lineages. BLASTN, TBLASTX, or combinations of both are used to locate high-scoring segment pairs (HSPs) between two sequences from which pairwise similarities and distances are computed in different ways resulting in a total of 96 GBDP variants. The suitability of these distance formulae for phylogeny reconstruction is directly estimated by computing a recently described measure of "treelikeness", the so-called δ value, from the respective distance matrices. Additionally, we compare the trees inferred from these matrices using UPGMA, NJ, BIONJ, FastME, or STC, respectively, with the NCBI taxonomy tree of the taxa under study. Results: Our results indicate that, at this taxonomic level, plastid genomes are much more valuable for inferring phylogenies than are mitochondrial genomes, and that distances based on breakpoints are of little use. Distances based on the proportion of "matched" HSP length to average genome length were best for tree estimation. Additionally we found that using TBLASTX instead of BLASTN and, particularly, combining TBLASTX and BLASTN leads to a small but significant increase in accuracy. Other factors do not significantly affect the phylogenetic outcome. The BIONJ algorithm results in phylogenies most in accordance with the current NCBI taxonomy, with NJ and FastME performing insignificantly worse, and STC performing as well if applied to high quality distance matrices. δ values are found to be a reliable predictor of phylogenetic accuracy. Conclusion: Using the most treelike distance matrices, as judged by their δ values

  17. Baboon phylogeny as inferred from complete mitochondrial genomes

    PubMed Central

    Zinner, Dietmar; Wertheimer, Jenny; Liedigk, Rasmus; Groeneveld, Linn F; Roos, Christian

    2013-01-01

    Baboons (genus Papio) are an interesting phylogeographical primate model for the evolution of savanna species during the Pleistocene. Earlier studies, based on partial mitochondrial sequence information, revealed seven major haplogroups indicating multiple para- and polyphylies among the six baboon species. The most basal splits among baboon lineages remained unresolved and the credibility intervals for divergence time estimates were rather large. Assuming that genetic variation within the two studied mitochondrial loci so far was insufficient to infer the apparently rapid early radiation of baboons we used complete mitochondrial sequence information of ten specimens, representing all major baboon lineages, to reconstruct a baboon phylogeny and to re-estimate divergence times. Our data confirmed the earlier tree topology including the para- and polyphyletic relationships of most baboon species; divergence time estimates are slightly younger and credibility intervals narrowed substantially, thus making the estimates more precise. However, the most basal relationships could not be resolved and it remains open whether (1) the most southern population of baboons diverged first or (2) a major split occurred between southern and northern clades. Our study shows that complete mitochondrial genome sequences are more effective to reconstruct robust phylogenies and to narrow down estimated divergence time intervals than only short portions of the mitochondrial genome, although there are also limitations in resolving phylogenetic relationships. Am J Phys Anthropol, 2013. © 2012 Wiley Periodicals, Inc. PMID:23180628

  18. How to Infer Relative Fitness from a Sample of Genomic Sequences

    PubMed Central

    Dayarian, Adel; Shraiman, Boris I.

    2014-01-01

    Mounting evidence suggests that natural populations can harbor extensive fitness diversity with numerous genomic loci under selection. It is also known that genealogical trees for populations under selection are quantifiably different from those expected under neutral evolution and described statistically by Kingman’s coalescent. While differences in the statistical structure of genealogies have long been used as a test for the presence of selection, the full extent of the information that they contain has not been exploited. Here we demonstrate that the shape of the reconstructed genealogical tree for a moderately large number of random genomic samples taken from a fitness diverse, but otherwise unstructured, asexual population can be used to predict the relative fitness of individuals within the sample. To achieve this we define a heuristic algorithm, which we test in silico, using simulations of a Wright–Fisher model for a realistic range of mutation rates and selection strength. Our inferred fitness ranking is based on a linear discriminator that identifies rapidly coalescing lineages in the reconstructed tree. Inferred fitness ranking correlates strongly with actual fitness, with a genome in the top 10% ranked being in the top 20% fittest with false discovery rate of 0.1–0.3, depending on the mutation/selection parameters. The ranking also enables us to predict the genotypes that future populations inherit from the present one. While the inference accuracy increases monotonically with sample size, samples of 200 nearly saturate the performance. We propose that our approach can be used for inferring relative fitness of genomes obtained in single-cell sequencing of tumors and in monitoring viral outbreaks. PMID:24770330

  19. How to infer relative fitness from a sample of genomic sequences.

    PubMed

    Dayarian, Adel; Shraiman, Boris I

    2014-07-01

    Mounting evidence suggests that natural populations can harbor extensive fitness diversity with numerous genomic loci under selection. It is also known that genealogical trees for populations under selection are quantifiably different from those expected under neutral evolution and described statistically by Kingman's coalescent. While differences in the statistical structure of genealogies have long been used as a test for the presence of selection, the full extent of the information that they contain has not been exploited. Here we demonstrate that the shape of the reconstructed genealogical tree for a moderately large number of random genomic samples taken from a fitness diverse, but otherwise unstructured, asexual population can be used to predict the relative fitness of individuals within the sample. To achieve this we define a heuristic algorithm, which we test in silico, using simulations of a Wright-Fisher model for a realistic range of mutation rates and selection strength. Our inferred fitness ranking is based on a linear discriminator that identifies rapidly coalescing lineages in the reconstructed tree. Inferred fitness ranking correlates strongly with actual fitness, with a genome in the top 10% ranked being in the top 20% fittest with false discovery rate of 0.1-0.3, depending on the mutation/selection parameters. The ranking also enables us to predict the genotypes that future populations inherit from the present one. While the inference accuracy increases monotonically with sample size, samples of 200 nearly saturate the performance. We propose that our approach can be used for inferring relative fitness of genomes obtained in single-cell sequencing of tumors and in monitoring viral outbreaks.

  20. Inferring causal structure: a quantum advantage

    NASA Astrophysics Data System (ADS)

    Ried, Katja; Spekkens, Robert

    2014-03-01

    The problem of inferring causal relations from observed correlations is central to science, and extensive study has yielded both important conceptual insights and widely used practical applications. Yet some of the simplest questions are impossible to answer classically: for instance, if one observes correlations between two variables (such as taking a new medical treatment and the subject's recovery), does this show a direct causal influence, or is it due to some hidden common cause? We develop a framework for quantum causal inference, and show how quantum theory provides a unique advantage in this decision problem. The key insight is that certain quantum correlations can only arise from specific causal structures, whereas pairs of classical variables can exhibit any pattern of correlation regardless of whether they have a common cause or a direct-cause relation. For example, suppose one measures the same Pauli observable on two qubits. If they share a common cause, such as being prepared in an entangled state, then one never finds perfect (positive) correlations in every basis, whereas perfect anticorrelations are possible (if one prepares the singlet state). Conversely, if a channel connects the qubits, hence a direct causal influence, perfect anticorrelations are impossible.
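
    The qubit example in this abstract can be checked numerically for the common-cause case. The sketch below is an illustration, not code from the paper: it evaluates the correlation of the same Pauli observable measured on both qubits for the singlet state and, as an added example not named in the abstract, for the Bell state (|00> + |11>)/sqrt(2). The helper names `kron` and `expectation` are hypothetical.

```python
import math

# Pauli matrices as nested lists (rows of a 2x2 matrix).
X = [[0, 1], [1, 0]]
Y = [[0, -1j], [1j, 0]]
Z = [[1, 0], [0, -1]]


def kron(a, b):
    """Kronecker product of two 2x2 matrices; basis order |00>, |01>, |10>, |11>."""
    return [[a[i][j] * b[k][l] for j in range(2) for l in range(2)]
            for i in range(2) for k in range(2)]


def expectation(op, psi):
    """Real part of <psi|op|psi> for a 4x4 operator and a 4-component state."""
    op_psi = [sum(op[r][c] * psi[c] for c in range(4)) for r in range(4)]
    return sum(complex(psi[r]).conjugate() * op_psi[r] for r in range(4)).real


s = 1 / math.sqrt(2)
singlet = [0, s, -s, 0]    # (|01> - |10>) / sqrt(2)
phi_plus = [s, 0, 0, s]    # (|00> + |11>) / sqrt(2)

for name, state in (("singlet", singlet), ("phi_plus", phi_plus)):
    correlations = [round(expectation(kron(p, p), state), 6) for p in (X, Y, Z)]
    print(name, correlations)
```

    The singlet is perfectly anticorrelated in all three bases, whereas the second state is positively correlated in X and Z but anticorrelated in Y, which is consistent with the claim that a common cause never yields perfect positive correlations in every basis.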

  1. A Robust Method for Inferring Network Structures.

    PubMed

    Yang, Yang; Luo, Tingjin; Li, Zhoujun; Zhang, Xiaoming; Yu, Philip S

    2017-07-12

    Inferring network structure from limited observable data is important in molecular biology, communication and many other areas. It is challenging, primarily because the observable data are sparse, finite and noisy. Advances in machine learning and in the study of network structure offer a promising way to address the problem. In this paper, we propose an iterative smoothing algorithm with structure sparsity (ISSS). The elastic penalty in the model is introduced to obtain a sparse solution, identify group features and avoid over-fitting, and the total variation (TV) penalty can effectively utilize the structure information to identify the neighborhood of the vertices. Because the elastic and structural TV penalties are non-smooth, an efficient algorithm based on Nesterov's smoothing optimization technique is proposed to solve the non-smooth problem. The experimental results on both synthetic and real-world networks show that the proposed model is robust against insufficient data and high noise. In addition, we investigate many factors that play important roles in determining the performance of ISSS.

  2. Inferring divergence of context-dependent substitution rates in Drosophila genomes with applications to comparative genomics.

    PubMed

    Chachick, Ran; Tanay, Amos

    2012-07-01

    Nucleotide substitution is a major evolutionary driving force that can incrementally and stochastically give rise to broad divergence patterns among species. The substitution process at each genomic position is frequently modeled independently of the other positions, although complex interactions between nearby bases are known to significantly affect mutation rates. Here, we study the evolution of 12 fly genomes using new algorithms for accurate inference of parameter-rich substitution models. By comparing models between lineages, we reveal the evolutionary histories of substitution rates at different flanking nucleotide contexts. We demonstrate these driving forces of molecular evolution to be constantly changing, suggesting that neutral drift of mutation rates is an important factor in the evolution of genomes and their sequence composition. This observation is used to develop a scalable approach for parameter-rich comparative genomics. By screening short DNA sequences, we demonstrate how homeoboxes and other transcription factor binding motifs are highly conserved based on our parameter-rich models but not according to standard conservation assays. With the increasing availability of genome sequences, rich substitution models become an attractive and practical approach for evolutionary analysis in general and comparative genomics in particular.

  3. The Structural Genomics Consortium

    PubMed Central

    Jones, Molly Morgan; Castle-Clarke, Sophie; Brooker, Daniel; Nason, Edward; Huzair, Farah; Chataway, Joanna

    2014-01-01

    The Structural Genomics Consortium (SGC) supports drug discovery efforts through a unique, open access model of public-private collaboration. This study presents the results of an independent evaluation of the Structural Genomics Consortium, conducted by RAND Europe with the Institute on Governance. The evaluation aimed to establish the role of the SGC within the wider drug discovery and PPP landscape, assessing the merits of the SGC open access model relative to alternative models of funding R&D in this space, as well as the key trends and opportunities in the external environment that may impact on the future of the SGC. It also established the incentives and disincentives for investment, strengths and weaknesses of the SGC's model, and the opportunities and threats the SGC will face in the future. This enabled us to assess the most convincing arguments for funding the SGC at present; important trade-offs or limitations that should be addressed in moving towards the next funding phase; and whether funders are anticipating changes either to the SGC or the wider PPP landscape. Finally, we undertook a quantitative analysis to ascertain what judgements can be made about the SGC's past and current performance track record, before unpacking the role of the external environment and particular actors within the SGC in developing scenarios for the future. PMID:28560088

  4. Inferring genome-wide patterns of admixture in Qataris using fifty-five ancestral populations.

    PubMed

    Omberg, Larsson; Salit, Jacqueline; Hackett, Neil; Fuller, Jennifer; Matthew, Rebecca; Chouchane, Lotfi; Rodriguez-Flores, Juan L; Bustamante, Carlos; Crystal, Ronald G; Mezey, Jason G

    2012-06-26

    Populations of the Arabian Peninsula have a complex genetic structure that reflects waves of migrations including the earliest human migrations from Africa and eastern Asia, migrations along ancient civilization trading routes and colonization history of recent centuries. Here, we present a study of genome-wide admixture in this region, using 156 genotyped individuals from Qatar, a country located at the crossroads of these migration patterns. Since haplotypes of these individuals could have originated from many different populations across the world, we have developed a machine learning method "SupportMix" to infer loci-specific genomic ancestry when simultaneously analyzing many possible ancestral populations. Simulations show that SupportMix is not only more accurate than other popular admixture discovery tools but is the first admixture inference method that can efficiently scale for simultaneous analysis of 50-100 putative ancestral populations while being independent of prior demographic information. By simultaneously using the 55 world populations from the Human Genome Diversity Panel, SupportMix was able to extract the fine-scale ancestry of the Qatar population, providing many new observations concerning the ancestry of the region. For example, as well as recapitulating the three major sub-populations in Qatar, composed of mainly Arabic, Persian, and African ancestry, SupportMix additionally identifies the specific ancestry of the Persian group to populations sampled in Greater Persia rather than from China and the ancestry of the African group to sub-Saharan origin and not Southern African Bantu origin as previously thought.

  5. Inferring patterns of folktale diffusion using genomic data.

    PubMed

    Bortolini, Eugenio; Pagani, Luca; Crema, Enrico R; Sarno, Stefania; Barbieri, Chiara; Boattini, Alessio; Sazzini, Marco; da Silva, Sara Graça; Martini, Gessica; Metspalu, Mait; Pettener, Davide; Luiselli, Donata; Tehrani, Jamshid J

    2017-08-22

    Observable patterns of cultural variation are consistently intertwined with demic movements, cultural diffusion, and adaptation to different ecological contexts [Cavalli-Sforza and Feldman (1981) Cultural Transmission and Evolution: A Quantitative Approach; Boyd and Richerson (1985) Culture and the Evolutionary Process]. The quantitative study of gene-culture coevolution has focused in particular on the mechanisms responsible for change in frequency and attributes of cultural traits, the spread of cultural information through demic and cultural diffusion, and detecting relationships between genetic and cultural lineages. Here, we make use of worldwide whole-genome sequences [Pagani et al. (2016) Nature 538:238-242] to assess the impact of processes involving population movement and replacement on cultural diversity, focusing on the variability observed in folktale traditions (n = 596) [Uther (2004) The Types of International Folktales: A Classification and Bibliography. Based on the System of Antti Aarne and Stith Thompson] in Eurasia. We find that a model of cultural diffusion predicted by isolation-by-distance alone is not sufficient to explain the observed patterns, especially at small spatial scales (up to ~4,000 km). We also provide an empirical approach to infer presence and impact of ethnolinguistic barriers preventing the unbiased transmission of both genetic and cultural information. After correcting for the effect of ethnolinguistic boundaries, we show that, of the alternative models that we propose, the one entailing cultural diffusion biased by linguistic differences is the most plausible. Additionally, we identify 15 tales that are more likely to be predominantly transmitted through population movement and replacement and locate putative focal areas for a set of tales that are spread worldwide.

  6. Phylogeny Inference of Closely Related Bacterial Genomes: Combining the Features of Both Overlapping Genes and Collinear Genomic Regions

    PubMed Central

    Zhang, Yan-Cong; Lin, Kui

    2015-01-01

    Overlapping genes (OGs) represent one type of widespread genomic feature in bacterial genomes and have been used as rare genomic markers in phylogeny inference of closely related bacterial species. However, the inference may experience a decrease in performance for phylogenomic analysis of too closely or too distantly related genomes. Another drawback of OGs as phylogenetic markers is that they usually take little account of the effects of genomic rearrangement on the similarity estimation, such as intra-chromosome/genome translocations, horizontal gene transfer, and gene losses. To explore such effects on the accuracy of phylogeny reconstruction, we combine phylogenetic signals of OGs with collinear genomic regions, here called locally collinear blocks (LCBs). By putting these together, we refine our previous metric of pairwise similarity between two closely related bacterial genomes. As a case study, we used this new method to reconstruct the phylogenies of 88 Enterobacteriale genomes of the class Gammaproteobacteria. Our results demonstrated that the topological accuracy of the inferred phylogeny was improved when both OGs and LCBs were simultaneously considered, suggesting that combining these two phylogenetic markers may reduce, to some extent, the influence of gene loss on phylogeny inference. Such phylogenomic studies, we believe, will help us to explore a more effective approach to increasing the robustness of phylogeny reconstruction of closely related bacterial organisms. PMID:26715828

  7. Hybrid Origins of Citrus Varieties Inferred from DNA Marker Analysis of Nuclear and Organelle Genomes

    PubMed Central

    Shimizu, Tokurou; Kitajima, Akira; Nonaka, Keisuke; Yoshioka, Terutaka; Ohta, Satoshi; Goto, Shingo; Toyoda, Atsushi; Fujiyama, Asao; Mochizuki, Takako; Nagasaki, Hideki; Kaminuma, Eli; Nakamura, Yasukazu

    2016-01-01

    Most indigenous citrus varieties are assumed to be natural hybrids, but their parentage has so far been determined in only a few cases because of their wide genetic diversity and the low transferability of DNA markers. Here we infer the parentage of indigenous citrus varieties using simple sequence repeat and indel markers developed from various citrus genome sequence resources. Parentage tests with 122 known hybrids using the selected DNA markers certify their transferability among those hybrids. Identity tests confirm that most variant strains are selected mutants, but we find four types of kunenbo (Citrus nobilis) and three types of tachibana (Citrus tachibana) for which we suggest different origins. Structure analysis with DNA markers that are in Hardy–Weinberg equilibrium deduces three basic taxa coinciding with the current understanding of citrus ancestors. Genotyping analysis of 101 indigenous citrus varieties with 123 selected DNA markers infers the parentages of 22 indigenous citrus varieties including Satsuma, Temple, and iyo, and single parents of 45 indigenous citrus varieties, including kunenbo, C. ichangensis, and Ichang lemon by allele-sharing and parentage tests. Genotyping analysis of chloroplast and mitochondrial genomes using 11 DNA markers classifies their cytoplasmic genotypes into 18 categories and deduces the combination of seed and pollen parents. Likelihood ratio analysis verifies the inferred parentages with significant scores. The reconstructed genealogy identifies 12 types of varieties consisting of Kishu, kunenbo, yuzu, koji, sour orange, dancy, kobeni mikan, sweet orange, tachibana, Cleopatra, willowleaf mandarin, and pummelo, which have played pivotal roles in the occurrence of these indigenous varieties. The inferred parentage of the indigenous varieties confirms their hybrid origins, as found by recent studies. PMID:27902727

  8. Hybrid Origins of Citrus Varieties Inferred from DNA Marker Analysis of Nuclear and Organelle Genomes.

    PubMed

    Shimizu, Tokurou; Kitajima, Akira; Nonaka, Keisuke; Yoshioka, Terutaka; Ohta, Satoshi; Goto, Shingo; Toyoda, Atsushi; Fujiyama, Asao; Mochizuki, Takako; Nagasaki, Hideki; Kaminuma, Eli; Nakamura, Yasukazu

    2016-01-01

    Most indigenous citrus varieties are assumed to be natural hybrids, but their parentage has so far been determined in only a few cases because of their wide genetic diversity and the low transferability of DNA markers. Here we infer the parentage of indigenous citrus varieties using simple sequence repeat and indel markers developed from various citrus genome sequence resources. Parentage tests with 122 known hybrids using the selected DNA markers certify their transferability among those hybrids. Identity tests confirm that most variant strains are selected mutants, but we find four types of kunenbo (Citrus nobilis) and three types of tachibana (Citrus tachibana) for which we suggest different origins. Structure analysis with DNA markers that are in Hardy-Weinberg equilibrium deduces three basic taxa coinciding with the current understanding of citrus ancestors. Genotyping analysis of 101 indigenous citrus varieties with 123 selected DNA markers infers the parentages of 22 indigenous citrus varieties including Satsuma, Temple, and iyo, and single parents of 45 indigenous citrus varieties, including kunenbo, C. ichangensis, and Ichang lemon by allele-sharing and parentage tests. Genotyping analysis of chloroplast and mitochondrial genomes using 11 DNA markers classifies their cytoplasmic genotypes into 18 categories and deduces the combination of seed and pollen parents. Likelihood ratio analysis verifies the inferred parentages with significant scores. The reconstructed genealogy identifies 12 types of varieties consisting of Kishu, kunenbo, yuzu, koji, sour orange, dancy, kobeni mikan, sweet orange, tachibana, Cleopatra, willowleaf mandarin, and pummelo, which have played pivotal roles in the occurrence of these indigenous varieties. The inferred parentage of the indigenous varieties confirms their hybrid origins, as found by recent studies.

  9. Inference of gene regulatory networks from genome-wide knockout fitness data

    PubMed Central

    Wang, Liming; Wang, Xiaodong; Arkin, Adam P.; Samoilov, Michael S.

    2013-01-01

    Motivation: Genome-wide fitness is an emerging type of high-throughput biological data generated for individual organisms by creating libraries of knockouts, subjecting them to broad ranges of environmental conditions, and measuring the resulting clone-specific fitnesses. Since fitness is an organism-scale measure of gene regulatory network behaviour, it may offer certain advantages when insights into such phenotypical and functional features are of primary interest over individual gene expression. Previous works have shown that genome-wide fitness data can be used to uncover novel gene regulatory interactions, when compared with results of more conventional gene expression analysis. Yet, to date, few algorithms have been proposed for systematically using genome-wide mutant fitness data for gene regulatory network inference. Results: In this article, we describe a model and propose an inference algorithm for using fitness data from knockout libraries to identify underlying gene regulatory networks. Unlike most prior methods, the presented approach captures not only the structural but also the dynamical and non-linear nature of the biomolecular systems involved. A state–space model with a non-linear basis is used for dynamically describing gene regulatory networks. Network structure is then elucidated by estimating unknown model parameters. An unscented Kalman filter is used to cope with the non-linearities introduced in the model, which also enables the algorithm to run in on-line mode for practical use. Here, we demonstrate that the algorithm provides satisfying results for both synthetic data and empirical measurements of the GAL network in the yeast Saccharomyces cerevisiae and the TyrR–LiuR network in the bacterium Shewanella oneidensis. Availability: MATLAB code and datasets are available to download at http://www.duke.edu/~lw174/Fitness.zip and http://genomics.lbl.gov/supplemental/fitness-bioinf/ Contact: wangx@ee.columbia.edu or mssamoilov@lbl.gov

  10. Inference of Population Structure using Dense Haplotype Data

    PubMed Central

    Lawson, Daniel John; Hellenthal, Garrett; Myers, Simon; Falush, Daniel

    2012-01-01

    The advent of genome-wide dense variation data provides an opportunity to investigate ancestry in unprecedented detail, but presents new statistical challenges. We propose a novel inference framework that aims to efficiently capture information on population structure provided by patterns of haplotype similarity. Each individual in a sample is considered in turn as a recipient, whose chromosomes are reconstructed using chunks of DNA donated by the other individuals. Results of this “chromosome painting” can be summarized as a “coancestry matrix,” which directly reveals key information about ancestral relationships among individuals. If markers are viewed as independent, we show that this matrix almost completely captures the information used by both standard Principal Components Analysis (PCA) and model-based approaches such as STRUCTURE in a unified manner. Furthermore, when markers are in linkage disequilibrium, the matrix combines information across successive markers to increase the ability to discern fine-scale population structure using PCA. In parallel, we have developed an efficient model-based approach to identify discrete populations using this matrix, which offers advantages over PCA in terms of interpretability and over existing clustering algorithms in terms of speed, number of separable populations, and sensitivity to subtle population structure. We analyse Human Genome Diversity Panel data for 938 individuals and 641,000 markers, and we identify 226 populations reflecting differences on continental, regional, local, and family scales. We present multiple lines of evidence that, while many methods capture similar information among strongly differentiated groups, more subtle population structure in human populations is consistently present at a much finer level than currently available geographic labels and is only captured by the haplotype-based approach. The software used for this article, ChromoPainter and fineSTRUCTURE, is available from

  11. Inference of population structure using dense haplotype data.

    PubMed

    Lawson, Daniel John; Hellenthal, Garrett; Myers, Simon; Falush, Daniel

    2012-01-01

    The advent of genome-wide dense variation data provides an opportunity to investigate ancestry in unprecedented detail, but presents new statistical challenges. We propose a novel inference framework that aims to efficiently capture information on population structure provided by patterns of haplotype similarity. Each individual in a sample is considered in turn as a recipient, whose chromosomes are reconstructed using chunks of DNA donated by the other individuals. Results of this "chromosome painting" can be summarized as a "coancestry matrix," which directly reveals key information about ancestral relationships among individuals. If markers are viewed as independent, we show that this matrix almost completely captures the information used by both standard Principal Components Analysis (PCA) and model-based approaches such as STRUCTURE in a unified manner. Furthermore, when markers are in linkage disequilibrium, the matrix combines information across successive markers to increase the ability to discern fine-scale population structure using PCA. In parallel, we have developed an efficient model-based approach to identify discrete populations using this matrix, which offers advantages over PCA in terms of interpretability and over existing clustering algorithms in terms of speed, number of separable populations, and sensitivity to subtle population structure. We analyse Human Genome Diversity Panel data for 938 individuals and 641,000 markers, and we identify 226 populations reflecting differences on continental, regional, local, and family scales. We present multiple lines of evidence that, while many methods capture similar information among strongly differentiated groups, more subtle population structure in human populations is consistently present at a much finer level than currently available geographic labels and is only captured by the haplotype-based approach. The software used for this article, ChromoPainter and fineSTRUCTURE, is available from http://www.paintmychromosomes.com/.
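
    The step from a painted chromosome to a coancestry matrix can be pictured with a toy tally. The sketch below is a simplified illustration under stated assumptions, not the ChromoPainter algorithm: it assumes a hypothetical input in which each recipient already has one definite donor label per marker, and it counts a new chunk whenever the donor changes along the chromosome. The function name and input format are invented for the example.

```python
def coancestry_matrix(paintings, n_individuals):
    """Count, for each recipient (row), how many chunks each donor (column) gave.

    `paintings` maps a recipient index to the list of donor indices that its
    chromosome copies from at successive markers (hypothetical toy format).
    """
    matrix = [[0] * n_individuals for _ in range(n_individuals)]
    for recipient, donors in paintings.items():
        previous = None
        for donor in donors:
            if donor == recipient:
                raise ValueError("an individual cannot donate to itself")
            if donor != previous:      # a change of donor starts a new chunk
                matrix[recipient][donor] += 1
                previous = donor
    return matrix


# Three individuals, five markers each.
paintings = {
    0: [1, 1, 2, 2, 1],
    1: [0, 0, 0, 2, 2],
    2: [0, 1, 1, 1, 1],
}
for row in coancestry_matrix(paintings, 3):
    print(row)
```

    Row 0 of the result records that individual 0 received two chunks from individual 1 and one from individual 2; the diagonal stays at zero because no individual donates to itself.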

  12. fastSTRUCTURE: variational inference of population structure in large SNP data sets.

    PubMed

    Raj, Anil; Stephens, Matthew; Pritchard, Jonathan K

    2014-06-01

    Tools for estimating population structure from genetic data are now used in a wide variety of applications in population genetics. However, inferring population structure in large modern data sets imposes severe computational challenges. Here, we develop efficient algorithms for approximate inference of the model underlying the STRUCTURE program using a variational Bayesian framework. Variational methods pose the problem of computing relevant posterior distributions as an optimization problem, allowing us to build on recent advances in optimization theory to develop fast inference tools. In addition, we propose useful heuristic scores to identify the number of populations represented in a data set and a new hierarchical prior to detect weak population structure in the data. We test the variational algorithms on simulated data and illustrate using genotype data from the CEPH-Human Genome Diversity Panel. The variational algorithms are almost two orders of magnitude faster than STRUCTURE and achieve accuracies comparable to those of ADMIXTURE. Furthermore, our results show that the heuristic scores for choosing model complexity provide a reasonable range of values for the number of populations represented in the data, with minimal bias toward detecting structure when it is very weak. Our algorithm, fastSTRUCTURE, is freely available online at http://pritchardlab.stanford.edu/structure.html. Copyright © 2014 by the Genetics Society of America.

  13. Robust and scalable inference of population history from hundreds of unphased whole genomes.

    PubMed

    Terhorst, Jonathan; Kamm, John A; Song, Yun S

    2017-02-01

    It has recently been demonstrated that inference methods based on genealogical processes with recombination can uncover past population history in unprecedented detail. However, these methods scale poorly with sample size, limiting resolution in the recent past, and they require phased genomes, which contain switch errors that can catastrophically distort the inferred history. Here we present SMC++, a new statistical tool capable of analyzing orders of magnitude more samples than existing methods while requiring only unphased genomes (its results are independent of phasing). SMC++ can jointly infer population size histories and split times in diverged populations, and it employs a novel spline regularization scheme that greatly reduces estimation error. We apply SMC++ to analyze sequence data from over a thousand human genomes in Africa and Eurasia, hundreds of genomes from a Drosophila melanogaster population in Africa, and tens of genomes from zebra finch and long-tailed finch populations in Australia.

  14. A molecular phylogeny of Hemiptera inferred from mitochondrial genome sequences.

    PubMed

    Song, Nan; Liang, Ai-Ping; Bu, Cui-Ping

    2012-01-01

    Classically, Hemiptera is comprised of two suborders: Homoptera and Heteroptera. Homoptera includes Cicadomorpha, Fulgoromorpha and Sternorrhyncha. However, according to previous molecular phylogenetic studies based on 18S rDNA, Fulgoromorpha has a closer relationship to Heteroptera than to other hemipterans, leaving Homoptera as paraphyletic. Therefore, the position of Fulgoromorpha is important for studying phylogenetic structure of Hemiptera. We inferred the evolutionary affiliations of twenty-five superfamilies of Hemiptera using mitochondrial protein-coding genes and rRNAs. We sequenced three mitogenomes, from Pyrops candelaria, Lycorma delicatula and Ricania marginalis, representing two additional families in Fulgoromorpha. Pyrops and Lycorma are representatives of an additional major family Fulgoridae in Fulgoromorpha, whereas Ricania is a second representative of the highly derived clade Ricaniidae. The organization and size of these mitogenomes are similar to those of the sequenced fulgoroid species. Our consensus phylogeny of Hemiptera largely supported the relationships (((Fulgoromorpha,Sternorrhyncha),Cicadomorpha),Heteroptera), and thus supported the classic phylogeny of Hemiptera. Selection of optimal evolutionary models (exclusion and inclusion of two rRNA genes or of third codon positions of protein-coding genes) demonstrated that rapidly evolving and saturated sites should be removed from the analyses.
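
    The relationships reported in this abstract are written as a nested, parenthesised topology. As an illustration only (not part of the study), the sketch below parses that string into nested tuples so the groupings can be inspected programmatically. It handles taxon names and parentheses only; branch lengths and support values are not supported, and the function names are hypothetical.

```python
def parse_topology(text):
    """Parse a parenthesised topology into nested tuples of taxon names."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                       # skip "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1                   # skip ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1                       # skip ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos].strip()     # a leaf (taxon name)

    tree = parse_node()
    if pos != len(text):
        raise ValueError("unexpected trailing text")
    return tree


def leaves(node):
    """List the taxon names under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


tree = parse_topology("(((Fulgoromorpha,Sternorrhyncha),Cicadomorpha),Heteroptera)")
print(tree)
print(leaves(tree[0]))   # taxa grouped apart from Heteroptera
```

    Here `tree[0]` is the group containing Fulgoromorpha, Sternorrhyncha and Cicadomorpha, and `tree[1]` is Heteroptera.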

  15. A Molecular Phylogeny of Hemiptera Inferred from Mitochondrial Genome Sequences

    PubMed Central

    Song, Nan; Liang, Ai-Ping; Bu, Cui-Ping

    2012-01-01

    Classically, Hemiptera is comprised of two suborders: Homoptera and Heteroptera. Homoptera includes Cicadomorpha, Fulgoromorpha and Sternorrhyncha. However, according to previous molecular phylogenetic studies based on 18S rDNA, Fulgoromorpha has a closer relationship to Heteroptera than to other hemipterans, leaving Homoptera as paraphyletic. Therefore, the position of Fulgoromorpha is important for studying phylogenetic structure of Hemiptera. We inferred the evolutionary affiliations of twenty-five superfamilies of Hemiptera using mitochondrial protein-coding genes and rRNAs. We sequenced three mitogenomes, from Pyrops candelaria, Lycorma delicatula and Ricania marginalis, representing two additional families in Fulgoromorpha. Pyrops and Lycorma are representatives of an additional major family Fulgoridae in Fulgoromorpha, whereas Ricania is a second representative of the highly derived clade Ricaniidae. The organization and size of these mitogenomes are similar to those of the sequenced fulgoroid species. Our consensus phylogeny of Hemiptera largely supported the relationships (((Fulgoromorpha,Sternorrhyncha),Cicadomorpha),Heteroptera), and thus supported the classic phylogeny of Hemiptera. Selection of optimal evolutionary models (exclusion and inclusion of two rRNA genes or of third codon positions of protein-coding genes) demonstrated that rapidly evolving and saturated sites should be removed from the analyses. PMID:23144967

  16. [Genomic structure of the autotetraploid oat species Avena macrostachya inferred from comparative analysis of the ITS1 and ITS2 sequences: on the oat karyotype evolution during the early stages of the Avena species divergence].

    PubMed

    Rodionov, A V; Tiupa, N B; Kim, E S; Machs, E M; Loskutov, I G

    2005-05-01

    To examine the genomic structure of Avena macrostachya, internal transcribed spacers, ITS1 and ITS2, as well as nuclear 5.8S rRNA genes from three oat species with AsAs karyotype (A. wiestii, A. hirtula, and A. atlantica), and those from A. longiglumis (AlAl), A. canariensis (AcAc), A. ventricosa (CvCv), A. pilosa, and A. clauda (CpCp) were sequenced. All species of the genus Avena examined represented a monophyletic group (bootstrap index = 98), within which two branches, i.e., species with A- and C-genomes, were distinguished (bootstrap indices = 100). The subject of our study, A. macrostachya, albeit belonging to the phylogenetic branch of C-genome oat species (karyotype with submetacentric and subacrocentric chromosomes), has preserved an isobrachyal karyotype (i.e., one containing metacentric chromosomes), probably typical of the common Avena ancestor. It was suggested to classify the A. macrostachya genome as a specific form of C-genome, Cm-genome. Among the species from other genera studied, Arrhenatherum elatius was found to be the closest to Avena in ITS1 and ITS2 structure. Phylogenetic relationships between Avena and Helictotrichon remain intriguingly uncertain. The HPR389153 sequence from H. pratense genome was closest to the ITS1 sequences specific to the Avena A-genomes (p-distance = 0.0237), while the differences of this sequence from the ITS1 of A. macrostachya reached 0.1221. On the other hand, HAD389117 from H. adsurgens was close to the ITS1 specific to Avena C-genomes (p-distance = 0.0189), while its differences from the A-genome specific ITS1 sequences reached 0.1221. It seems likely that the appearance of highly polyploid (2n = 12-21x) species of H. pratense and H. adsurgens could be associated with interspecific hybridization involving Mediterranean oat species carrying A- and C-genomes. A hypothesis on the pathways of Avena chromosome evolution during the early stages of the oat species divergence is proposed.
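
    The p-distances quoted in this abstract are proportions of differing sites between two aligned sequences. A minimal sketch of that calculation follows; it is not the authors' pipeline. The function name is hypothetical, the toy sequences are invented, and skipping sites with a gap or ambiguous base in either sequence (pairwise deletion) is an assumption about how such sites are treated.

```python
def p_distance(seq_a, seq_b, skip_chars="-?N"):
    """Proportion of compared alignment columns at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in skip_chars or b in skip_chars:
            continue                   # pairwise deletion of gapped/ambiguous sites
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


# Invented 8-site alignment with a single mismatch at the last site.
print(p_distance("ACGTACGT", "ACGTACGA"))   # 1 difference / 8 sites = 0.125
```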

  17. On the importance of being structured: instantaneous coalescence rates and human evolution—lessons for ancestral population size inference?

    PubMed Central

    Mazet, O; Rodríguez, W; Grusea, S; Boitard, S; Chikhi, L

    2016-01-01

    Most species are structured and influenced by processes that either increased or reduced gene flow between populations. However, most population genetic inference methods assume panmixia and reconstruct a history characterized by population size changes. This is potentially problematic as population structure can generate spurious signals of population size change through time. Moreover, when the model assumed for demographic inference is misspecified, genomic data will likely increase the precision of misleading if not meaningless parameters. For instance, if data were generated under an n-island model (characterized by the number of islands and migrants exchanged) inference based on a model of population size change would produce precise estimates of a bottleneck that would be meaningless. In addition, archaeological or climatic events around the bottleneck's timing might provide a reasonable but potentially misleading scenario. In a context of model uncertainty (panmixia versus structure) genomic data may thus not necessarily lead to improved statistical inference. We consider two haploid genomes and develop a theory that explains why any demographic model with structure will necessarily be interpreted as a series of changes in population size by inference methods ignoring structure. We formalize a parameter, the inverse instantaneous coalescence rate, and show that it is equivalent to a population size only in panmictic models, and is mostly misleading for structured models. We argue that this issue affects all population genetics methods ignoring population structure which may thus infer population size changes that never took place. We apply our approach to human genomic data. PMID:26647653
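
    The central quantity of this abstract can be illustrated numerically. Reading the inverse instantaneous coalescence rate (IICR) as P(T2 > t) divided by the density of the pairwise coalescence time T2, the sketch below evaluates it for two lineages sampled in the same deme of a symmetric n-island model. The parametrization is an assumption made for the example and may differ from the paper's: time is scaled so that two lineages in the same deme coalesce at rate 1, `mig_rate` is the total rate at which the pair leaves the same-deme state, and the function name and step size are hypothetical. The two-state system is integrated with a fourth-order Runge-Kutta scheme.

```python
def iicr_n_island(n_islands, mig_rate, times, dt=1e-3):
    """IICR at each time (returned in ascending time order) for a pair of
    lineages that start in the same deme of a symmetric island model."""
    back_rate = mig_rate / (n_islands - 1)   # rate of returning to the same deme

    def deriv(p_same, p_diff):
        # Same deme: lost to coalescence (rate 1) and to migration (mig_rate).
        d_same = -(1.0 + mig_rate) * p_same + back_rate * p_diff
        d_diff = mig_rate * p_same - back_rate * p_diff
        return d_same, d_diff

    p_same, p_diff, t = 1.0, 0.0, 0.0
    result = []
    for target in sorted(times):
        while t < target - 1e-12:
            h = min(dt, target - t)
            k1 = deriv(p_same, p_diff)
            k2 = deriv(p_same + 0.5 * h * k1[0], p_diff + 0.5 * h * k1[1])
            k3 = deriv(p_same + 0.5 * h * k2[0], p_diff + 0.5 * h * k2[1])
            k4 = deriv(p_same + h * k3[0], p_diff + h * k3[1])
            p_same += h * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0
            p_diff += h * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0
            t += h
        # Survival P(T2 > t) = p_same + p_diff; density f(t) = 1 * p_same.
        result.append((p_same + p_diff) / p_same)
    return result


print(iicr_n_island(n_islands=10, mig_rate=1.0, times=[0.0, 0.5, 1.0, 5.0]))
```

    Under these assumptions the curve starts at 1 and rises with time even though no deme ever changes size, which is the kind of spurious "population size change" the abstract warns about. In a panmictic population of constant size the same ratio would stay at 1.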

  18. On the importance of being structured: instantaneous coalescence rates and human evolution--lessons for ancestral population size inference?

    PubMed

    Mazet, O; Rodríguez, W; Grusea, S; Boitard, S; Chikhi, L

    2016-04-01

    Most species are structured and influenced by processes that either increased or reduced gene flow between populations. However, most population genetic inference methods assume panmixia and reconstruct a history characterized by population size changes. This is potentially problematic as population structure can generate spurious signals of population size change through time. Moreover, when the model assumed for demographic inference is misspecified, genomic data will likely increase the precision of misleading if not meaningless parameters. For instance, if data were generated under an n-island model (characterized by the number of islands and migrants exchanged) inference based on a model of population size change would produce precise estimates of a bottleneck that would be meaningless. In addition, archaeological or climatic events around the bottleneck's timing might provide a reasonable but potentially misleading scenario. In a context of model uncertainty (panmixia versus structure) genomic data may thus not necessarily lead to improved statistical inference. We consider two haploid genomes and develop a theory that explains why any demographic model with structure will necessarily be interpreted as a series of changes in population size by inference methods ignoring structure. We formalize a parameter, the inverse instantaneous coalescence rate, and show that it is equivalent to a population size only in panmictic models, and is mostly misleading for structured models. We argue that this issue affects all population genetics methods ignoring population structure which may thus infer population size changes that never took place. We apply our approach to human genomic data.

  19. AD-LIBS: inferring ancestry across hybrid genomes using low-coverage sequence data.

    PubMed

    Schaefer, Nathan K; Shapiro, Beth; Green, Richard E

    2017-04-04

    Inferring the ancestry of each region of admixed individuals' genomes is useful in studies ranging from disease gene mapping to speciation genetics. Current methods require high-coverage genotype data and phased reference panels, and are therefore inappropriate for many data sets. We present a software application, AD-LIBS, that uses a hidden Markov model to infer ancestry across hybrid genomes without requiring variant calling or phasing. This approach is useful for non-model organisms and in cases of low-coverage data, such as ancient DNA. We demonstrate the utility of AD-LIBS with synthetic data. We then use AD-LIBS to infer ancestry in two published data sets: European human genomes with Neanderthal ancestry and brown bear genomes with polar bear ancestry. AD-LIBS correctly infers 87-91% of ancestry in simulations and produces ancestry maps that agree with published results and global ancestry estimates in humans. In brown bears, we find more polar bear ancestry than has been published previously, using both AD-LIBS and an existing software application for local ancestry inference, HAPMIX. We validate AD-LIBS polar bear ancestry maps by recovering a geographic signal within bears that mirrors what is seen in SNP data. Finally, we demonstrate that AD-LIBS is more effective than HAPMIX at inferring ancestry when preexisting phased reference data are unavailable and genomes are sequenced to low coverage. AD-LIBS is an effective tool for ancestry inference that can be used even when few individuals are available for comparison or when genomes are sequenced to low coverage. AD-LIBS is therefore likely to be useful in studies of non-model or ancient organisms that lack large amounts of genomic DNA. AD-LIBS can therefore expand the range of studies in which admixture mapping is a viable tool.
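
    The abstract states that AD-LIBS uses a hidden Markov model but gives no details of its states or emissions. As a generic illustration only, the sketch below decodes a two-state ancestry path along genomic windows with the Viterbi algorithm. The state names, the per-window observations ("near1"/"near2") and all probabilities are hypothetical and are not the AD-LIBS model.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Most probable hidden-state path for a discrete HMM (log space)."""
    # best[s] = log-probability of the best path ending in state s
    best = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []
    for obs in observations[1:]:
        step, new_best = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + log_trans[p][s])
            step[s] = prev
            new_best[s] = best[prev] + log_trans[prev][s] + log_emit[s][obs]
        backpointers.append(step)
        best = new_best
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[s])
    path = [state]
    for step in reversed(backpointers):
        state = step[state]
        path.append(state)
    return path[::-1]


def logs(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {j: math.log(v) for j, v in row.items()} for k, row in table.items()}


states = ["pop1", "pop2"]
log_start = {s: math.log(0.5) for s in states}
log_trans = logs({"pop1": {"pop1": 0.9, "pop2": 0.1},
                  "pop2": {"pop1": 0.1, "pop2": 0.9}})
log_emit = logs({"pop1": {"near1": 0.8, "near2": 0.2},
                 "pop2": {"near1": 0.2, "near2": 0.8}})

# One observation per genomic window: which reference population it resembles more.
windows = ["near1", "near1", "near2", "near1", "near1", "near2", "near2", "near2"]
ancestry_path = viterbi(windows, states, log_start, log_trans, log_emit)
print(ancestry_path)
```

    With sticky transitions, the isolated "near2" window inside the first block is treated as noise rather than as a short ancestry tract, which is the usual reason for preferring an HMM over window-by-window classification.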

  20. An Algebraic Approach to Inference in Complex Networked Structures

    DTIC Science & Technology

    2015-07-09

    AFRL-AFOSR-VA-TR-2015-0265. Final Report: An Algebraic Approach to Inference in Complex Networked Structures. José M F Moura, Carnegie Mellon University. Grant FA9550-12-1-0087, 4/1/2012-3/31/2015.

  1. Informational laws of genome structures

    PubMed Central

    Bonnici, Vincenzo; Manca, Vincenzo

    2016-01-01

    In recent years, the analysis of genomes by means of strings of length k occurring in the genomes, called k-mers, has provided important insights into the basic mechanisms and design principles of genome structures. In the present study, we focus on the proper choice of the value of k for applying information theoretic concepts that express intrinsic aspects of genomes. The value k = lg2(n), where n is the genome length, is determined to be the best choice in the definition of some genomic informational indexes that are studied and computed for seventy genomes. These indexes, which are based on information entropies and on suitable comparisons with random genomes, suggest five informational laws, which all of the considered genomes obey. Moreover, an informational genome complexity measure is proposed, which is a generalized logistic map that balances entropic and anti-entropic components of genomes and is related to their evolutionary dynamics. Finally, applications to computational synthetic biology are briefly outlined. PMID:27354155
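
    As a rough illustration of the k-mer entropy idea, the sketch below picks k from the sequence length and computes the Shannon entropy of the empirical k-mer distribution. It reads lg2(n) as the base-2 logarithm and rounds it to an integer; both the rounding and the function names are assumptions, and the paper's own indexes are not reproduced here.

```python
import math
from collections import Counter


def kmer_entropy(genome: str, k: int) -> float:
    """Shannon entropy (bits) of the empirical k-mer distribution."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def suggested_k(genome_length: int) -> int:
    """k = log2(n), rounded to the nearest integer (the rounding is an assumption)."""
    return max(1, round(math.log2(genome_length)))


toy = "ACGT" * 16            # 64-base toy string, not a real genome
k = suggested_k(len(toy))    # log2(64) = 6
h = kmer_entropy(toy, k)
print(k, h)
```

    A highly repetitive string such as this toy example has only four distinct 6-mers, so its entropy stays near 2 bits, far below the roughly log2(59) bits that 59 distinct windows would give. Comparisons of this kind against random sequences of the same length are what the abstract refers to.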

  2. Informational laws of genome structures

    NASA Astrophysics Data System (ADS)

    Bonnici, Vincenzo; Manca, Vincenzo

    2016-06-01

    In recent years, the analysis of genomes by means of strings of length k occurring in the genomes, called k-mers, has provided important insights into the basic mechanisms and design principles of genome structures. In the present study, we focus on the proper choice of the value of k for applying information theoretic concepts that express intrinsic aspects of genomes. The value k = lg2(n), where n is the genome length, is determined to be the best choice in the definition of some genomic informational indexes that are studied and computed for seventy genomes. These indexes, which are based on information entropies and on suitable comparisons with random genomes, suggest five informational laws, which all of the considered genomes obey. Moreover, an informational genome complexity measure is proposed, which is a generalized logistic map that balances entropic and anti-entropic components of genomes and is related to their evolutionary dynamics. Finally, applications to computational synthetic biology are briefly outlined.

  3. Inferring genome-wide patterns of admixture in Qataris using fifty-five ancestral populations

    PubMed Central

    2012-01-01

    Background: Populations of the Arabian Peninsula have a complex genetic structure that reflects waves of migrations including the earliest human migrations from Africa and eastern Asia, migrations along ancient civilization trading routes and colonization history of recent centuries. Results: Here, we present a study of genome-wide admixture in this region, using 156 genotyped individuals from Qatar, a country located at the crossroads of these migration patterns. Since haplotypes of these individuals could have originated from many different populations across the world, we have developed a machine learning method "SupportMix" to infer loci-specific genomic ancestry when simultaneously analyzing many possible ancestral populations. Simulations show that SupportMix is not only more accurate than other popular admixture discovery tools but is the first admixture inference method that can efficiently scale for simultaneous analysis of 50-100 putative ancestral populations while being independent of prior demographic information. Conclusions: By simultaneously using the 55 world populations from the Human Genome Diversity Panel, SupportMix was able to extract the fine-scale ancestry of the Qatar population, providing many new observations concerning the ancestry of the region. For example, as well as recapitulating the three major sub-populations in Qatar, composed of mainly Arabic, Persian, and African ancestry, SupportMix additionally identifies the specific ancestry of the Persian group to populations sampled in Greater Persia rather than from China and the ancestry of the African group to sub-Saharan origin and not Southern African Bantu origin as previously thought. PMID:22734698

  4. Structural Genomics: Correlation Blocks, Population Structure, and Genome Architecture

    PubMed Central

    Hu, Xin-Sheng; Yeh, Francis C.; Wang, Zhiquan

    2011-01-01

    An integration of the pattern of genome-wide inter-site associations with evolutionary forces is important for gaining insights into the genomic evolution in natural or artificial populations. Here, we assess the inter-site correlation blocks and their distributions along chromosomes. A correlation block is broadly defined as a DNA segment within which strong correlations exist between genetic diversities at any two sites. We bring together the population genetic structure and the genomic diversity structure that have been independently built on different scales and synthesize the existing theories and methods for characterizing genomic structure at the population level. We discuss how population structure could shape correlation blocks and their patterns within and between populations. Effects of evolutionary forces (selection, migration, genetic drift, and mutation) on the pattern of genome-wide correlation blocks are discussed. We briefly discuss the associations between the pattern of correlation blocks and genome assembly features in eukaryote organisms, including the impacts of multigene family, the perturbation of transposable elements, and the repetitive nongenic sequences and GC-rich isochores. Our reviews suggest that the observable pattern of correlation blocks can refine our understanding of the ecological and evolutionary processes underlying the genomic evolution at the population level. PMID:21886455

  5. Ecophysiology of Freshwater Verrucomicrobia Inferred from Metagenome-Assembled Genomes

    PubMed Central

    He, Shaomei; Stevens, Sarah L. R.; Chan, Leong-Keat; Bertilsson, Stefan; Glavina del Rio, Tijana; Tringe, Susannah G.; Malmstrom, Rex R.

    2017-01-01

    ABSTRACT: Microbes are critical in carbon and nutrient cycling in freshwater ecosystems. Members of the Verrucomicrobia are ubiquitous in such systems, and yet their roles and ecophysiology are not well understood. In this study, we recovered 19 Verrucomicrobia draft genomes by sequencing 184 time-series metagenomes from a eutrophic lake and a humic bog that differ in carbon source and nutrient availabilities. These genomes span four of the seven previously defined Verrucomicrobia subdivisions and greatly expand knowledge of the genomic diversity of freshwater Verrucomicrobia. Genome analysis revealed their potential role as (poly)saccharide degraders in freshwater, uncovered interesting genomic features for this lifestyle, and suggested their adaptation to nutrient availabilities in their environments. Verrucomicrobia populations differ significantly between the two lakes in glycoside hydrolase gene abundance and functional profiles, reflecting the autochthonous and terrestrially derived allochthonous carbon sources of the two ecosystems, respectively. Interestingly, a number of genomes recovered from the bog contained gene clusters that potentially encode a novel porin-multiheme cytochrome c complex and might be involved in extracellular electron transfer in the anoxic humus-rich environment. Notably, most epilimnion genomes have large numbers of so-called “Planctomycete-specific” cytochrome c-encoding genes, which exhibited distribution patterns nearly opposite to those seen with glycoside hydrolase genes, probably associated with the different levels of environmental oxygen availability and carbohydrate complexity between lakes/layers. Overall, the recovered genomes represent a major step toward understanding the role, ecophysiology, and distribution of Verrucomicrobia in freshwater. IMPORTANCE: Freshwater Verrucomicrobia spp. are cosmopolitan in lakes and rivers, and yet their roles and ecophysiology are not well understood, as cultured freshwater

  6. Efficient Exact Inference With Loss Augmented Objective in Structured Learning.

    PubMed

    Bauer, Alexander; Nakajima, Shinichi; Muller, Klaus-Robert

    2016-08-19

    Structural support vector machine (SVM) is an elegant approach for building complex and accurate models with structured outputs. However, its applicability relies on the availability of efficient inference algorithms: the state-of-the-art training algorithms repeatedly perform inference to compute a subgradient or to find the most violating configuration. In this paper, we propose an exact inference algorithm for maximizing nondecomposable objectives that arise from a special type of high-order potential having a decomposable internal structure. As an important application, our method covers the loss augmented inference, which enables the slack and margin scaling formulations of structural SVM with a variety of dissimilarity measures, e.g., Hamming loss, precision and recall, Fβ-loss, intersection over union, and many other functions that can be efficiently computed from the contingency table. We demonstrate the advantages of our approach in natural language parsing and sequence segmentation applications.
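
    For orientation, loss-augmented inference asks for the labeling that maximizes the model score plus the loss against the true labeling. The sketch below covers only the simplest case named in the abstract, Hamming loss on a chain model, where the loss decomposes over positions and can be folded into the unary scores before ordinary Viterbi decoding. It is not the paper's algorithm, which targets nondecomposable losses; the function name and the toy scores are assumptions.

```python
def loss_augmented_decode(unary, pairwise, gold):
    """argmax over y of score(y) + Hamming(y, gold) for a chain model.

    unary[t][y]    : score of label y at position t
    pairwise[a][b] : score of label a followed by label b
    gold[t]        : true label at position t
    Returns (best labeling, its loss-augmented score).
    """
    n_labels = len(unary[0])
    # Hamming loss adds 1 at every position whose label differs from gold,
    # so it can be added to the unary scores before running Viterbi.
    aug = [[u + (y != gold[t]) for y, u in enumerate(row)]
           for t, row in enumerate(unary)]
    best = list(aug[0])
    back = []
    for t in range(1, len(aug)):
        ptr, new_best = [], []
        for y in range(n_labels):
            prev = max(range(n_labels), key=lambda p: best[p] + pairwise[p][y])
            ptr.append(prev)
            new_best.append(best[prev] + pairwise[prev][y] + aug[t][y])
        back.append(ptr)
        best = new_best
    # Trace back from the best final label.
    y = max(range(n_labels), key=lambda l: best[l])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return path[::-1], max(best)


# Toy scores for a 3-position, 2-label problem (made up for illustration).
unary = [[1.0, 0.2], [0.4, 0.5], [0.3, 0.9]]
pairwise = [[0.5, 0.0], [0.0, 0.5]]
gold = [0, 1, 1]
path, value = loss_augmented_decode(unary, pairwise, gold)
print(path, value)
```

    Losses such as the Fβ-loss or intersection over union depend on counts over the whole output and cannot be folded into per-position scores this way, which is the gap the abstract's method is aimed at.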

  7. Inferring Ancestral Recombination Graphs from Bacterial Genomic Data.

    PubMed

    Vaughan, Timothy G; Welch, David; Drummond, Alexei J; Biggs, Patrick J; George, Tessy; French, Nigel P

    2017-02-01

    Homologous recombination is a central feature of bacterial evolution, yet it confounds traditional phylogenetic methods. While a number of methods specific to bacterial evolution have been developed, none of these permit joint inference of a bacterial recombination graph and associated parameters. In this article, we present a new method which addresses this shortcoming. Our method uses a novel Markov chain Monte Carlo algorithm to perform phylogenetic inference under the ClonalOrigin model. We demonstrate the utility of our method by applying it to ribosomal multilocus sequence typing data sequenced from pathogenic and nonpathogenic Escherichia coli serotype O157 and O26 isolates collected in rural New Zealand. The method is implemented as an open source BEAST 2 package, Bacter, which is available via the project web page at http://tgvaughan.github.io/bacter. Copyright © 2017 Vaughan et al.

  8. Inferring Ancestral Recombination Graphs from Bacterial Genomic Data

    PubMed Central

    Vaughan, Timothy G.; Welch, David; Drummond, Alexei J.; Biggs, Patrick J.; George, Tessy; French, Nigel P.

    2017-01-01

    Homologous recombination is a central feature of bacterial evolution, yet it confounds traditional phylogenetic methods. While a number of methods specific to bacterial evolution have been developed, none of these permit joint inference of a bacterial recombination graph and associated parameters. In this article, we present a new method which addresses this shortcoming. Our method uses a novel Markov chain Monte Carlo algorithm to perform phylogenetic inference under the ClonalOrigin model. We demonstrate the utility of our method by applying it to ribosomal multilocus sequence typing data sequenced from pathogenic and nonpathogenic Escherichia coli serotype O157 and O26 isolates collected in rural New Zealand. The method is implemented as an open source BEAST 2 package, Bacter, which is available via the project web page at http://tgvaughan.github.io/bacter. PMID:28007885

  9. OMA 2011: orthology inference among 1000 complete genomes.

    PubMed

    Altenhoff, Adrian M; Schneider, Adrian; Gonnet, Gaston H; Dessimoz, Christophe

    2011-01-01

    OMA (Orthologous MAtrix) is a database that identifies orthologs among publicly available, complete genomes. Initiated in 2004, the project is at its 11th release. It now includes 1000 genomes, making it one of the largest resources of its kind. Here, we describe recent developments in terms of species covered; the algorithmic pipeline--in particular regarding the treatment of alternative splicing, and new features of the web (OMA Browser) and programming interface (SOAP API). In the second part, we review the various representations provided by OMA and their typical applications. The database is publicly accessible at http://omabrowser.org.

  10. Using Genetic Distance to Infer the Accuracy of Genomic Prediction

    PubMed Central

    Scutari, Marco; Mackay, Ian

    2016-01-01

    The prediction of phenotypic traits using high-density genomic data has many applications such as the selection of plants and animals of commercial interest; and it is expected to play an increasing role in medical diagnostics. Statistical models used for this task are usually tested using cross-validation, which implicitly assumes that new individuals (whose phenotypes we would like to predict) originate from the same population the genomic prediction model is trained on. In this paper we propose an approach based on clustering and resampling to investigate the effect of increasing genetic distance between training and target populations when predicting quantitative traits. This is important for plant and animal genetics, where genomic selection programs rely on the precision of predictions in future rounds of breeding. Therefore, estimating how quickly predictive accuracy decays is important in deciding which training population to use and how often the model has to be recalibrated. We find that the correlation between true and predicted values decays approximately linearly with respect to either FST or mean kinship between the training and the target populations. We illustrate this relationship using simulations and a collection of data sets from mice, wheat and human genetics. PMID:27589268

  11. Inferring Epidemic Contact Structure from Phylogenetic Trees

    PubMed Central

    Leventhal, Gabriel E.; Kouyos, Roger; Stadler, Tanja; von Wyl, Viktor; Yerly, Sabine; Böni, Jürg; Cellerai, Cristina; Klimkait, Thomas; Günthard, Huldrych F.; Bonhoeffer, Sebastian

    2012-01-01

    Contact structure is believed to have a large impact on epidemic spreading and consequently using networks to model such contact structure continues to gain interest in epidemiology. However, detailed knowledge of the exact contact structure underlying real epidemics is limited. Here we address the question whether the structure of the contact network leaves a detectable genetic fingerprint in the pathogen population. To this end we compare phylogenies generated by disease outbreaks in simulated populations with different types of contact networks. We find that the shape of these phylogenies strongly depends on contact structure. In particular, measures of tree imbalance allow us to quantify to what extent the contact structure underlying an epidemic deviates from a null model contact network and illustrate this in the case of random mixing. Using a phylogeny from the Swiss HIV epidemic, we show that this epidemic has a significantly more unbalanced tree than would be expected from random mixing. PMID:22412361
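
    Tree imbalance can be quantified in several ways. The sketch below computes the Colless index, one standard measure, for a binary tree written as nested pairs. Which imbalance statistics the study actually used is not stated in this abstract, so the choice of index and the tree encoding are assumptions.

```python
def colless(tree):
    """Return (number of leaves, Colless index) of a binary tree.

    A tree is either a leaf (any non-tuple value) or a pair (left, right).
    The Colless index sums |leaves(left) - leaves(right)| over internal nodes.
    """
    if not isinstance(tree, tuple):
        return 1, 0
    left, right = tree
    n_left, c_left = colless(left)
    n_right, c_right = colless(right)
    return n_left + n_right, c_left + c_right + abs(n_left - n_right)


balanced = (("a", "b"), ("c", "d"))        # fully symmetric four-leaf tree
caterpillar = ((("a", "b"), "c"), "d")     # maximally unbalanced four-leaf tree
print(colless(balanced), colless(caterpillar))
```

    A phylogeny whose index is larger than expected under a null model of random mixing would count as more unbalanced in the sense used above.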

  12. Inferring epidemic contact structure from phylogenetic trees.

    PubMed

    Leventhal, Gabriel E; Kouyos, Roger; Stadler, Tanja; Wyl, Viktor von; Yerly, Sabine; Böni, Jürg; Cellerai, Cristina; Klimkait, Thomas; Günthard, Huldrych F; Bonhoeffer, Sebastian

    2012-01-01

    Contact structure is believed to have a large impact on epidemic spreading and consequently using networks to model such contact structure continues to gain interest in epidemiology. However, detailed knowledge of the exact contact structure underlying real epidemics is limited. Here we address the question whether the structure of the contact network leaves a detectable genetic fingerprint in the pathogen population. To this end we compare phylogenies generated by disease outbreaks in simulated populations with different types of contact networks. We find that the shape of these phylogenies strongly depends on contact structure. In particular, measures of tree imbalance allow us to quantify to what extent the contact structure underlying an epidemic deviates from a null model contact network and illustrate this in the case of random mixing. Using a phylogeny from the Swiss HIV epidemic, we show that this epidemic has a significantly more unbalanced tree than would be expected from random mixing.

  13. Structural genomics in North America.

    PubMed

    Terwilliger, T C

    2000-11-01

    Structural genomics in North America has moved remarkably quickly from ideas to pilot projects. Just three years ago, the field was only a concept, independently being discussed by its many inventors. Now it is already a well-organized, increasingly-funded, consortium-based effort to determine protein structures on a large scale.

  14. GWIS: Genome-Wide Inferred Statistics for Functions of Multiple Phenotypes.

    PubMed

    Nieuwboer, Harold A; Pool, René; Dolan, Conor V; Boomsma, Dorret I; Nivard, Michel G

    2016-10-06

    Here we present a method of genome-wide inferred study (GWIS) that provides an approximation of genome-wide association study (GWAS) summary statistics for a variable that is a function of phenotypes for which GWAS summary statistics, phenotypic means, and covariances are available. A GWIS can be performed regardless of sample overlap between the GWAS of the phenotypes on which the function depends. Because a GWIS provides association estimates and their standard errors for each SNP, a GWIS can form the basis for polygenic risk scoring, LD score regression, Mendelian randomization studies, biological annotation, and other analyses. GWISs can also be used to boost power of a GWAS meta-analysis where cohorts have not measured all constituent phenotypes in the function. We demonstrate the accuracy of a BMI GWIS by performing power simulations and type I error simulations under varying circumstances, and we apply a GWIS by reconstructing a body mass index (BMI) GWAS based on a weight GWAS and a height GWAS. Furthermore, we apply a GWIS to further our understanding of the underlying genetic structure of bipolar disorder and schizophrenia and their relation to educational attainment. Our analyses suggest that the previously reported genetic correlation between schizophrenia and educational attainment is probably induced by the observed genetic correlation between schizophrenia and bipolar disorder and the previously reported genetic correlation between bipolar disorder and educational attainment. Copyright © 2016 American Society of Human Genetics. Published by Elsevier Inc. All rights reserved.

  15. Species Delimitation and Interspecific Relationships of the Genus Orychophragmus (Brassicaceae) Inferred from Whole Chloroplast Genomes

    PubMed Central

    Hu, Huan; Hu, Quanjun; Al-Shehbaz, Ihsan A.; Luo, Xin; Zeng, Tingting; Guo, Xinyi; Liu, Jianquan

    2016-01-01

    Genetic variations from few chloroplast DNA fragments show lower discriminatory power in the delimitation of closely related species and less resolution ability in discerning interspecific relationships than from nrITS. Here we use Orychophragmus (Brassicaceae) as a model system to test the hypothesis that the whole chloroplast genomes (plastomes), with accumulation of more variations despite the slow evolution, can overcome these weaknesses. We used Illumina sequencing technology via a reference-guided assembly to construct complete plastomes of 17 individuals from six putatively assumed species in the genus. All plastomes are highly conserved in genome structure, gene order, and orientation, and they are around 153 kb in length and contain 113 unique genes. However, nucleotide variations are quite substantial to support the delimitation of all sampled species and to resolve interspecific relationships with high statistical supports. As expected, the estimated divergences between major clades and species are lower than those estimated from nrITS probably due to the slow substitution rate of the plastomes. However, the plastome and nrITS phylogenies were contradictory in the placements of most species, thus suggesting that these species may have experienced complex non-bifurcating evolutions with incomplete lineage sorting and/or hybrid introgressions. Overall, our case study highlights the importance of using plastomes to examine species boundaries and establish an independent phylogeny to infer the speciation history of plants. PMID:27999584

  16. Structural Genomics of Protein Phosphatases

    SciTech Connect

    Almo, S.; Bonanno, J.; Sauder, J.; Emtage, S.; Dilorenzo, T.; Malashkevich, V.; Wasserman, S.; Swaminathan, S.; Eswaramoorthy, S.; et al

    2007-01-01

    The New York SGX Research Center for Structural Genomics (NYSGXRC) of the NIGMS Protein Structure Initiative (PSI) has applied its high-throughput X-ray crystallographic structure determination platform to systematic studies of all human protein phosphatases and protein phosphatases from biomedically-relevant pathogens. To date, the NYSGXRC has determined structures of 21 distinct protein phosphatases: 14 from human, 2 from mouse, 2 from the pathogen Toxoplasma gondii, 1 from Trypanosoma brucei, the parasite responsible for African sleeping sickness, and 2 from the principal mosquito vector of malaria in Africa, Anopheles gambiae. These structures provide insights into both normal and pathophysiologic processes, including transcriptional regulation, regulation of major signaling pathways, neural development, and type 1 diabetes. In conjunction with the contributions of other international structural genomics consortia, these efforts promise to provide an unprecedented database and materials repository for structure-guided experimental and computational discovery of inhibitors for all classes of protein phosphatases.

  17. Quantum inferring acausal structures and the Monty Hall problem

    NASA Astrophysics Data System (ADS)

    Kurzyk, Dariusz; Glos, Adam

    2016-12-01

    This paper presents a quantum version of the Monty Hall problem based on quantum inferring acausal structures, which can be identified with a generalization of Bayesian networks. The structures considered are expressed in the formalism of quantum information theory, where density operators are identified with a quantum generalization of probability distributions. Conditional relations between the quantum counterparts of random variables are described by quantum conditional operators. The quantum inferring structures presented are used to construct a model inspired by the scenario of the well-known Monty Hall game, where we show the differences between classical and quantum Bayesian reasoning.
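
    For reference, the classical side of the comparison can be checked by exact enumeration. The sketch below assumes the standard rules: the prize and the first pick are uniform, and the host always opens a door that is neither the pick nor the prize, choosing uniformly when two such doors exist. The quantum model of the paper is not reproduced.

```python
from fractions import Fraction
from itertools import product


def monty_hall():
    """Exact win probabilities for 'stay' and 'switch' by enumeration."""
    doors = range(3)
    stay = switch = Fraction(0)
    for prize, pick in product(doors, doors):
        weight = Fraction(1, 9)            # uniform prize, uniform first pick
        options = [d for d in doors if d != pick and d != prize]
        for opened in options:             # host picks uniformly among options
            p = weight / len(options)
            other = next(d for d in doors if d not in (pick, opened))
            stay += p * (pick == prize)
            switch += p * (other == prize)
    return stay, switch


stay_prob, switch_prob = monty_hall()
print(stay_prob, switch_prob)
```

    Using exact fractions avoids any simulation noise, so the classical baseline is unambiguous.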

  18. Accelerated probabilistic inference of RNA structure evolution

    PubMed Central

    Holmes, Ian

    2005-01-01

    Background: Pairwise stochastic context-free grammars (Pair SCFGs) are powerful tools for evolutionary analysis of RNA, including simultaneous RNA sequence alignment and secondary structure prediction, but the associated algorithms are intensive in both CPU and memory usage. The same problem is faced by other RNA alignment-and-folding algorithms based on Sankoff's 1985 algorithm. It is therefore desirable to constrain such algorithms, by pre-processing the sequences and using this first pass to limit the range of structures and/or alignments that can be considered. Results: We demonstrate how flexible classes of constraint can be imposed, greatly reducing the computational costs while maintaining a high quality of structural homology prediction. Any score-attributed context-free grammar (e.g. energy-based scoring schemes, or conditionally normalized Pair SCFGs) is amenable to this treatment. It is now possible to combine independent structural and alignment constraints of unprecedented general flexibility in Pair SCFG alignment algorithms. We outline several applications to the bioinformatics of RNA sequence and structure, including Waterman-Eggert N-best alignments and progressive multiple alignment. We evaluate the performance of the algorithm on test examples from the RFAM database. Conclusion: A program, Stemloc, that implements these algorithms for efficient RNA sequence alignment and structure prediction is available under the GNU General Public License. PMID:15790387

  19. Genome Size Variation and Species Relationships in Hieracium Sub-genus Pilosella (Asteraceae) as Inferred by Flow Cytometry

    PubMed Central

    Suda, Jan; Krahulcová, Anna; Trávníček, Pavel; Rosenbaumová, Radka; Peckert, Tomáš; Krahulec, František

    2007-01-01

    Background and Aims: Hieracium sub-genus Pilosella (hawkweeds) is a taxonomically complicated group of vascular plants, the structure of which is substantially influenced by frequent interspecific hybridization and polyploidization. Two kinds of species, ‘basic’ and ‘intermediate’ (i.e. hybridogenous), are usually recognized. In this study, genome size variation was investigated in a representative set of Central European hawkweeds in order to assess the value of such a data set for species delineation and inference of evolutionary relationships. Methods: Holoploid and monoploid genome sizes (C- and Cx-values) were determined using propidium iodide flow cytometry for 376 homogeneously cultivated individuals of Hieracium sub-genus Pilosella, including 24 species (271 individuals), five recent natural hybrids (seven individuals) and experimental F1 hybrids from four parental combinations (98 individuals). Chromosome counts were available for more than half of the plant accessions. Base composition (proportion of AT/GC bases) was cytometrically estimated in 73 individuals. Key Results: Seven different ploidy levels (2x–8x) were detected, with intraspecific ploidy polymorphism (up to four different cytotypes) occurring in 11 wild species. Mean 2C-values varied approx. 4·3-fold from 3·53 pg in diploid H. hoppeanum to 15·30 pg in octoploid H. brachiatum. 1Cx-values ranged from 1·72 pg in H. pilosella to 2·16 pg in H. echioides (1·26-fold). The DNA content of (high) polyploids was usually proportional to the DNA values of their diploid/low polyploid counterparts, indicating lack of processes altering genome size (i.e. genome down-sizing). Most species showed constant nuclear DNA amounts, exceptions being three hybridogenous taxa, in which introgressive hybridization was suggested as a presumable trigger for genome size variation. Monoploid genome sizes of hybridogenous species were always between the corresponding values of their putative parents. In addition
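
    The fold ranges quoted in this abstract can be checked with simple arithmetic, using the usual definition of the monoploid genome size as the 2C-value divided by the ploidy level. The sketch below applies that definition to the two extreme mean 2C-values reported. Dividing a species mean by its ploidy only approximates the per-individual 1Cx values measured in the study, and the function name is an assumption.

```python
def monoploid_size(two_c_pg: float, ploidy: int) -> float:
    """1Cx (pg): DNA content of one basic chromosome set, taken as 2C / ploidy."""
    return two_c_pg / ploidy


# Values quoted in the abstract (pg of DNA).
fold_2c = 15.30 / 3.53      # octoploid H. brachiatum vs diploid H. hoppeanum
fold_1cx = 2.16 / 1.72      # H. echioides vs H. pilosella

cx_hoppeanum = monoploid_size(3.53, 2)     # diploid, 2x
cx_brachiatum = monoploid_size(15.30, 8)   # octoploid, 8x
print(round(fold_2c, 1), round(fold_1cx, 2), cx_hoppeanum, cx_brachiatum)
```

    Both derived 1Cx values fall inside the reported 1·72 to 2·16 pg range, consistent with the statement that polyploid DNA content is roughly proportional to that of lower-ploidy relatives.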

  20. EMu: probabilistic inference of mutational processes and their localization in the cancer genome

    PubMed Central

    2013-01-01

    The spectrum of mutations discovered in cancer genomes can be explained by the activity of a few elementary mutational processes. We present a novel probabilistic method, EMu, to infer the mutational signatures of these processes from a collection of sequenced tumors. EMu naturally incorporates the tumor-specific opportunity for different mutation types according to sequence composition. Applying EMu to breast cancer data, we derive detailed maps of the activity of each process, both genome-wide and within specific local regions of the genome. Our work provides new opportunities to study the mutational processes underlying cancer development. EMu is available at http://www.sanger.ac.uk/resources/software/emu/. PMID:23628380

  1. The History of Slavs Inferred from Complete Mitochondrial Genome Sequences

    PubMed Central

    Mielnik-Sikorska, Marta; Daca, Patrycja; Malyarchuk, Boris; Derenko, Miroslava; Skonieczna, Katarzyna; Perkova, Maria; Dobosz, Tadeusz; Grzybowski, Tomasz

    2013-01-01

    To shed more light on the processes leading to crystallization of a Slavic identity, we investigated variability of complete mitochondrial genomes belonging to haplogroups H5 and H6 (63 mtDNA genomes) from the populations of Eastern and Western Slavs, including new samples of Poles, Ukrainians and Czechs presented here. Molecular dating implies formation of H5 approximately 11.5–16 thousand years ago (kya) in the areas of southern Europe. Within ancient haplogroup H6, dated at around 15–28 kya, there is a subhaplogroup H6c, which probably survived the last glaciation in Europe and has undergone expansion only 3–4 kya, together with the ancestors of some European groups, including the Slavs, because H6c has been detected in Czechs, Poles and Slovaks. Detailed analysis of complete mtDNAs allowed us to identify a number of lineages that seem specific for Central and Eastern Europe (H5a1f, H5a2, H5a1r, H5a1s, H5b4, H5e1a, H5u1, some subbranches of H5a1a and H6a1a9). Some of them could possibly be traced back to at least ∼4 kya, which indicates that some of the ancestors of today's Slavs (Poles, Czechs, Slovaks, Ukrainians and Russians) inhabited areas of Central and Eastern Europe much earlier than it was estimated on the basis of archaeological and historical data. We also sequenced entire mitochondrial genomes of several non-European lineages (A, C, D, G, L) found in contemporary populations of Poland and Ukraine. The analysis of these haplogroups confirms the presence of Siberian (C5c1, A8a1) and Ashkenazi-specific (L2a1l2a) mtDNA lineages in Slavic populations. Moreover, we were able to pinpoint some lineages which could possibly reflect the relatively recent contacts of Slavs with nomadic Altaic peoples (C4a1a, G2a, D5a2a1a1). PMID:23342138

  2. The history of Slavs inferred from complete mitochondrial genome sequences.

    PubMed

    Mielnik-Sikorska, Marta; Daca, Patrycja; Malyarchuk, Boris; Derenko, Miroslava; Skonieczna, Katarzyna; Perkova, Maria; Dobosz, Tadeusz; Grzybowski, Tomasz

    2013-01-01

    To shed more light on the processes leading to crystallization of a Slavic identity, we investigated variability of complete mitochondrial genomes belonging to haplogroups H5 and H6 (63 mtDNA genomes) from the populations of Eastern and Western Slavs, including new samples of Poles, Ukrainians and Czechs presented here. Molecular dating implies formation of H5 approximately 11.5-16 thousand years ago (kya) in the areas of southern Europe. Within ancient haplogroup H6, dated at around 15-28 kya, there is a subhaplogroup H6c, which probably survived the last glaciation in Europe and has undergone expansion only 3-4 kya, together with the ancestors of some European groups, including the Slavs, because H6c has been detected in Czechs, Poles and Slovaks. Detailed analysis of complete mtDNAs allowed us to identify a number of lineages that seem specific for Central and Eastern Europe (H5a1f, H5a2, H5a1r, H5a1s, H5b4, H5e1a, H5u1, some subbranches of H5a1a and H6a1a9). Some of them could possibly be traced back to at least ∼4 kya, which indicates that some of the ancestors of today's Slavs (Poles, Czechs, Slovaks, Ukrainians and Russians) inhabited areas of Central and Eastern Europe much earlier than it was estimated on the basis of archaeological and historical data. We also sequenced entire mitochondrial genomes of several non-European lineages (A, C, D, G, L) found in contemporary populations of Poland and Ukraine. The analysis of these haplogroups confirms the presence of Siberian (C5c1, A8a1) and Ashkenazi-specific (L2a1l2a) mtDNA lineages in Slavic populations. Moreover, we were able to pinpoint some lineages which could possibly reflect the relatively recent contacts of Slavs with nomadic Altaic peoples (C4a1a, G2a, D5a2a1a1).

  3. LASSIM-A network inference toolbox for genome-wide mechanistic modeling.

    PubMed

    Magnusson, Rasmus; Mariotti, Guido Pio; Köpsén, Mattias; Lövfors, William; Gawel, Danuta R; Jörnsten, Rebecka; Linde, Jörg; Nordling, Torbjörn E M; Nyman, Elin; Schulze, Sylvie; Nestor, Colm E; Zhang, Huan; Cedersund, Gunnar; Benson, Mikael; Tjärnberg, Andreas; Gustafsson, Mika

    2017-06-01

    Recent technological advancements have made time-resolved, quantitative, multi-omics data available for many model systems, which could be integrated for systems pharmacokinetic use. Here, we present large-scale simulation modeling (LASSIM), which is a novel mathematical tool for performing large-scale inference using mechanistically defined ordinary differential equations (ODE) for gene regulatory networks (GRNs). LASSIM integrates structural knowledge about regulatory interactions and non-linear equations with multiple steady state and dynamic response expression datasets. The rationale behind LASSIM is that biological GRNs can be simplified using a limited subset of core genes that are assumed to regulate all other gene transcription events in the network. The LASSIM method is implemented as a general-purpose toolbox using the PyGMO Python package to make the most of multicore computers and high performance clusters, and is available at https://gitlab.com/Gustafsson-lab/lassim. As a method, LASSIM works in two steps, where it first infers a non-linear ODE system of the pre-specified core gene expression. Second, LASSIM in parallel optimizes the parameters that model the regulation of peripheral genes by core system genes. We showed the usefulness of this method by applying LASSIM to infer a large-scale non-linear model of naïve Th2 cell differentiation, made possible by integrating Th2 specific bindings, time-series together with six public and six novel siRNA-mediated knock-down experiments. ChIP-seq showed significant overlap for all tested transcription factors. Next, we performed novel time-series measurements of total T-cells during differentiation towards Th2 and verified that our LASSIM model could monitor those data significantly better than comparable models that used the same Th2 bindings. In summary, the LASSIM toolbox opens the door to a new type of model-based data analysis that combines the strengths of reliable mechanistic models with truly

  4. LASSIM—A network inference toolbox for genome-wide mechanistic modeling

    PubMed Central

    Mariotti, Guido Pio; Lövfors, William; Gawel, Danuta R.; Jörnsten, Rebecka; Linde, Jörg; Schulze, Sylvie; Nestor, Colm E.; Zhang, Huan; Cedersund, Gunnar; Benson, Mikael

    2017-01-01

    Recent technological advancements have made time-resolved, quantitative, multi-omics data available for many model systems, which could be integrated for systems pharmacokinetic use. Here, we present large-scale simulation modeling (LASSIM), which is a novel mathematical tool for performing large-scale inference using mechanistically defined ordinary differential equations (ODE) for gene regulatory networks (GRNs). LASSIM integrates structural knowledge about regulatory interactions and non-linear equations with multiple steady state and dynamic response expression datasets. The rationale behind LASSIM is that biological GRNs can be simplified using a limited subset of core genes that are assumed to regulate all other gene transcription events in the network. The LASSIM method is implemented as a general-purpose toolbox using the PyGMO Python package to make the most of multicore computers and high performance clusters, and is available at https://gitlab.com/Gustafsson-lab/lassim. As a method, LASSIM works in two steps, where it first infers a non-linear ODE system of the pre-specified core gene expression. Second, LASSIM in parallel optimizes the parameters that model the regulation of peripheral genes by core system genes. We showed the usefulness of this method by applying LASSIM to infer a large-scale non-linear model of naïve Th2 cell differentiation, made possible by integrating Th2 specific bindings, time-series together with six public and six novel siRNA-mediated knock-down experiments. ChIP-seq showed significant overlap for all tested transcription factors. Next, we performed novel time-series measurements of total T-cells during differentiation towards Th2 and verified that our LASSIM model could monitor those data significantly better than comparable models that used the same Th2 bindings. In summary, the LASSIM toolbox opens the door to a new type of model-based data analysis that combines the strengths of reliable mechanistic models with truly

  5. Covariance Between Genotypic Effects and its Use for Genomic Inference in Half-Sib Families

    PubMed Central

    Wittenburg, Dörte; Teuscher, Friedrich; Klosa, Jan; Reinsch, Norbert

    2016-01-01

    In livestock, current statistical approaches utilize extensive molecular data, e.g., single nucleotide polymorphisms (SNPs), to improve the genetic evaluation of individuals. The number of model parameters increases with the number of SNPs, so the multicollinearity between covariates can affect the results obtained using whole genome regression methods. In this study, dependencies between SNPs due to linkage and linkage disequilibrium among the chromosome segments were explicitly considered in methods used to estimate the effects of SNPs. The population structure affects the extent of such dependencies, so the covariance among SNP genotypes was derived for half-sib families, which are typical in livestock populations. Conditional on the SNP haplotypes of the common parent (sire), the theoretical covariance was determined using the haplotype frequencies of the population from which the individual parent (dam) was derived. The resulting covariance matrix was included in a statistical model for a trait of interest, and this covariance matrix was then used to specify prior assumptions for SNP effects in a Bayesian framework. The approach was applied to one family in simulated scenarios (few and many quantitative trait loci) and using semireal data obtained from dairy cattle to identify genome segments that affect performance traits, as well as to investigate the impact on predictive ability. Compared with a method that does not explicitly consider any of the relationship among predictor variables, the accuracy of genetic value prediction was improved by 10–22%. The results show that the inclusion of dependence is particularly important for genomic inference based on small sample sizes. PMID:27402363

  6. Evolutionary History of Chimpanzees Inferred from Complete Mitochondrial Genomes

    PubMed Central

    Bjork, Adam; Liu, Weimin; Wertheim, Joel O.; Hahn, Beatrice H.; Worobey, Michael

    2011-01-01

    Investigations into the evolutionary history of the common chimpanzee, Pan troglodytes, have produced inconsistent results due to differences in the types of molecular data considered, the model assumptions employed, and the quantity and geographical range of samples used. We amplified and sequenced 24 complete P. troglodytes mitochondrial genomes from fecal samples collected at multiple study sites throughout sub-Saharan Africa. Using a “relaxed molecular clock,” fossil calibrations, and 12 additional complete primate mitochondrial genomes, we analyzed the pattern and timing of primate diversification in a Bayesian framework. Our results support the recognition of four chimpanzee subspecies. Within P. troglodytes, we report a mean (95% highest posterior density [HPD]) time since most recent common ancestor (tMRCA) of 1.026 (0.811–1.263) Ma for the four proposed subspecies, with two major lineages. One of these lineages (tMRCA = 0.510 [0.387–0.650] Ma) contains P. t. verus (tMRCA = 0.155 [0.101–0.213] Ma) and P. t. ellioti (formerly P. t. vellerosus; tMRCA = 0.157 [0.102–0.215] Ma), both of which are monophyletic. The other major lineage contains P. t. schweinfurthii (tMRCA = 0.111 [0.077–0.146] Ma), a monophyletic clade nested within the P. t. troglodytes lineage (tMRCA = 0.380 [0.296–0.476] Ma). We utilized two analysis techniques that may be of widespread interest. First, we implemented a Yule speciation prior across the entire primate tree with separate coalescent priors on each of the chimpanzee subspecies. The validity of this approach was confirmed by estimates based on more traditional techniques. We also suggest that accurate tMRCA estimates from large computationally difficult sequence alignments may be obtained by implementing our novel method of bootstrapping smaller randomly subsampled alignments. PMID:20802239
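
    The abstract does not spell out the subsampling procedure, so the following is only a sketch of one plausible reading: draw a fixed number of alignment columns at random, repeat to obtain several smaller alignments, and analyze each separately. The function names, the choice to sample columns without replacement, and the toy sequences are all assumptions.

```python
import random


def subsample_alignment(alignment, n_sites, rng):
    """Keep n_sites randomly chosen columns (without replacement) from an
    alignment given as {taxon name: aligned sequence}."""
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all sequences must have the same aligned length")
    columns = sorted(rng.sample(range(length), n_sites))
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def subsampled_replicates(alignment, n_sites, n_replicates, seed=0):
    """Build several independently subsampled alignments from one seed."""
    rng = random.Random(seed)
    return [subsample_alignment(alignment, n_sites, rng) for _ in range(n_replicates)]


# Hypothetical toy alignment; real mitochondrial genomes are about 16.5 kb long.
toy = {
    "taxon_a": "ACGTACGTACGTACGTACGT",
    "taxon_b": "ACGTACGAACGTACGTACTT",
    "taxon_c": "ACGAACGTACGTTCGTACGT",
}
replicates = subsampled_replicates(toy, n_sites=8, n_replicates=5)
print(replicates[0])
```

    Each replicate would then be passed to the dating analysis, and the spread of the resulting tMRCA estimates summarized across replicates.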

  7. Genealogical lineage sorting leads to significant, but incorrect Bayesian multilocus inference of population structure

    PubMed Central

    Orozco-terWengel, Pablo; Corander, Jukka; Schlötterer, Christian

    2011-01-01

    Over the past decades, the use of molecular markers has revolutionized biology and led to the foundation of a new research discipline—phylogeography. Of particular interest has been the inference of population structure and biogeography. While initial studies focused on mtDNA as a molecular marker, it has become apparent that selection and genealogical lineage sorting could lead to erroneous inferences. As it is not clear to what extent these forces affect a given marker, it has become common practice to use the combined evidence from a set of molecular markers as an attempt to recover the signals that approximate the true underlying demography. Typically, the number of markers used is determined either by budget constraints or by the statistical power required to recognize significant population differentiation. Using microsatellite markers from Drosophila and humans, we show that even large numbers of loci (>50) can frequently result in statistically well-supported, but incorrect inference of population structure using the software BAPS. Most importantly, genomic features, such as chromosomal location, variability of the markers, or recombination rate, cannot explain this observation. Instead, it can be attributed to sampling variation among loci with different realizations of the stochastic lineage sorting. This phenomenon is particularly pronounced for low levels of population differentiation. Our results have important implications for ongoing studies of population differentiation, as we unambiguously demonstrate that statistical significance of population structure inferred from a random set of genetic markers cannot necessarily be taken as evidence for a reliable demographic inference. PMID:21244537

  8. Structure and inference in annotated networks

    NASA Astrophysics Data System (ADS)

    Newman, M. E. J.; Clauset, Aaron

    2016-06-01

    For many networks of scientific interest we know both the connections of the network and information about the network nodes, such as the age or gender of individuals in a social network. Here we demonstrate how this 'metadata' can be used to improve our understanding of network structure. We focus in particular on the problem of community detection in networks and develop a mathematically principled approach that combines a network and its metadata to detect communities more accurately than can be done with either alone. Crucially, the method does not assume that the metadata are correlated with the communities we are trying to find. Instead, the method learns whether a correlation exists and correctly uses or ignores the metadata depending on whether they contain useful information. We demonstrate our method on synthetic networks with known structure and on real-world networks, large and small, drawn from social, biological and technological domains.

  9. Structure and inference in annotated networks

    PubMed Central

    Newman, M. E. J.; Clauset, Aaron

    2016-01-01

    For many networks of scientific interest we know both the connections of the network and information about the network nodes, such as the age or gender of individuals in a social network. Here we demonstrate how this ‘metadata’ can be used to improve our understanding of network structure. We focus in particular on the problem of community detection in networks and develop a mathematically principled approach that combines a network and its metadata to detect communities more accurately than can be done with either alone. Crucially, the method does not assume that the metadata are correlated with the communities we are trying to find. Instead, the method learns whether a correlation exists and correctly uses or ignores the metadata depending on whether they contain useful information. We demonstrate our method on synthetic networks with known structure and on real-world networks, large and small, drawn from social, biological and technological domains. PMID:27306566

  10. Inference of homologous recombination in bacteria using whole-genome sequences.

    PubMed

    Didelot, Xavier; Lawson, Daniel; Darling, Aaron; Falush, Daniel

    2010-12-01

    Bacteria and archaea reproduce clonally, but sporadically import DNA into their chromosomes from other organisms. In many of these events, the imported DNA replaces an homologous segment in the recipient genome. Here we present a new method to reconstruct the history of recombination events that affected a given sample of bacterial genomes. We introduce a mathematical model that represents both the donor and the recipient of each DNA import as an ancestor of the genomes in the sample. The model represents a simplification of the previously described coalescent with gene conversion. We implement a Monte Carlo Markov chain algorithm to perform inference under this model from sequence data alignments and show that inference is feasible for whole-genome alignments through parallelization. Using simulated data, we demonstrate accurate and reliable identification of individual recombination events and global recombination rate parameters. We applied our approach to an alignment of 13 whole genomes from the Bacillus cereus group. We find, as expected from laboratory experiments, that the recombination rate is higher between closely related organisms and also that the genome contains several broad regions of elevated levels of recombination. Application of the method to the genomic data sets that are becoming available should reveal the evolutionary history and private lives of populations of bacteria and archaea. The methods described in this article have been implemented in a computer software package, ClonalOrigin, which is freely available from http://code.google.com/p/clonalorigin/.

  11. Sigma: Strain-level inference of genomes from metagenomic analysis for biosurveillance

    SciTech Connect

    Ahn, Tae-Hyuk; Chai, Juanjuan; Pan, Chongle

    2014-09-29

    Motivation: Metagenomic sequencing of clinical samples provides a promising technique for direct pathogen detection and characterization in biosurveillance. Taxonomic analysis at the strain level can be used to resolve serotypes of a pathogen in biosurveillance. Sigma was developed for strain-level identification and quantification of pathogens using their reference genomes based on metagenomic analysis. Results: Sigma provides not only accurate strain-level inferences, but also three unique capabilities: (i) Sigma quantifies the statistical uncertainty of its inferences, which includes hypothesis testing of identified genomes and confidence interval estimation of their relative abundances; (ii) Sigma enables strain variant calling by assigning metagenomic reads to their most likely reference genomes; and (iii) Sigma supports parallel computing for fast analysis of large datasets. In conclusion, the algorithm performance was evaluated using simulated mock communities and fecal samples with spike-in pathogen strains. Availability and Implementation: Sigma was implemented in C++ with source codes and binaries freely available at http://sigma.omicsbio.org.
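
    Estimating relative abundances and assigning reads to their most likely reference genomes can be illustrated with a generic mixture-model EM. The sketch below is not Sigma's code: the input format (a per-read list of likelihoods, one per genome), the function names, and the toy numbers are assumptions chosen for illustration.

```python
def estimate_abundances(read_likelihoods, n_iter=200):
    """EM for mixture proportions, where read_likelihoods[r][g] is the
    probability of observing read r if it came from genome g."""
    n_genomes = len(read_likelihoods[0])
    abundances = [1.0 / n_genomes] * n_genomes
    for _ in range(n_iter):
        totals = [0.0] * n_genomes
        for row in read_likelihoods:
            weighted = [a * p for a, p in zip(abundances, row)]
            norm = sum(weighted)
            if norm == 0.0:
                continue  # read is incompatible with every genome; skip it
            for g, w in enumerate(weighted):
                totals[g] += w / norm  # E-step: fractional read assignment
        total = sum(totals)
        if total == 0.0:
            raise ValueError("no read is compatible with any genome")
        abundances = [t / total for t in totals]  # M-step: renormalize
    return abundances


def assign_reads(read_likelihoods, abundances):
    """Assign each read to the genome with the highest abundance-weighted likelihood."""
    return [
        max(range(len(abundances)), key=lambda g: abundances[g] * row[g])
        for row in read_likelihoods
    ]


# Hypothetical toy input: three reads match only genome 0, one matches only genome 1.
toy_reads = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
toy_abundances = estimate_abundances(toy_reads)
print(toy_abundances, assign_reads(toy_reads, toy_abundances))
```

    With reads that each match a single genome, the estimate reduces to simple read counting; the EM step matters when reads are shared between closely related strains.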

  12. Coronal structure inferred from remote sensing observations

    SciTech Connect

    Feldman, W.C.

    1996-09-01

    Remote-sensing observations of the Sun and inner heliosphere are reviewed to appraise our understanding of the mix of the mechanisms that heat the corona and accelerate the solar wind. An assessment of experimental uncertainties and the basic assumptions needed to translate measurables into physical models, reveals very large fundamental uncertainties in our knowledge of coronal structure near the Sun. We develop a time-dependent, filamentary model of the extended corona that is consistent with a large number of remote sensing observations of the solar atmosphere and the solar wind.

  13. Insights and inferences about integron evolution from genomic data

    PubMed Central

    Nemergut, Diana R; Robeson, Michael S; Kysela, Robert F; Martin, Andrew P; Schmidt, Steven K; Knight, Rob

    2008-01-01

    Background: Integrons are mechanisms that facilitate horizontal gene transfer, allowing bacteria to integrate and express foreign DNA. These are important in the exchange of antibiotic resistance determinants, but can also transfer a diverse suite of genes unrelated to pathogenicity. Here, we provide a systematic analysis of the distribution and diversity of integron intI genes and integron-containing bacteria. Results: We found integrons in 103 different pathogenic and non-pathogenic bacteria, in six major phyla. Integrons were widely scattered, and their presence was not confined to specific clades within bacterial orders. Nearly 1/3 of the intI genes that we identified were pseudogenes, containing either an internal stop codon or a frameshift mutation that would render the protein product non-functional. Additionally, 20% of bacteria contained more than one integrase gene. dN/dS ratios revealed mutational hotspots in clades of Vibrio and Shewanella intI genes. Finally, we characterized the gene cassettes associated with integrons in Methylobacillus flagellatus KT and Dechloromonas aromatica RCB, and found a heavy metal efflux gene as well as genes involved in protein folding and stability. Conclusion: Our analysis suggests that the present distribution of integrons is due to multiple losses and gene transfer events. While, in some cases, the ability to integrate and excise foreign DNA may be selectively advantageous, the gain, loss, or rearrangement of gene cassettes could also be deleterious, selecting against functional integrases. Thus, such a high fraction of pseudogenes may suggest that the selective impact of integrons on genomes is variable, oscillating between beneficial and deleterious, possibly depending on environmental conditions. PMID:18513439

  14. Structural variations in plant genomes

    PubMed Central

    Edwards, David; Varshney, Rajeev K.

    2014-01-01

    Differences between plant genomes range from single nucleotide polymorphisms to large-scale duplications, deletions and rearrangements. The large polymorphisms are termed structural variants (SVs). SVs have received significant attention in human genetics and were found to be responsible for various chronic diseases. However, little effort has been directed towards understanding the role of SVs in plants. Many recent advances in plant genetics have resulted from improvements in high-resolution technologies for measuring SVs, including microarray-based techniques, and more recently, high-throughput DNA sequencing. In this review we describe recent reports of SV in plants and describe the genomic technologies currently used to measure these SVs. PMID:24907366

  15. Untangling statistical and biological models to understand network inference: the need for a genomics network ontology.

    PubMed

    Emmert-Streib, Frank; Dehmer, Matthias; Haibe-Kains, Benjamin

    2014-01-01

    In this paper, we shed light on approaches that are currently used to infer networks from gene expression data with respect to their biological meaning. As we will show, the biological interpretation of these networks depends on the chosen theoretical perspective. For this reason, we distinguish a statistical perspective from a mathematical modeling perspective and elaborate their differences and implications. Our results indicate the imperative need for a genomic network ontology in order to avoid increasing confusion about the biological interpretation of inferred networks, which can be even enhanced by approaches that integrate multiple data sets, respectively, data types.

  16. Systematic Inference of Copy-Number Genotypes from Personal Genome Sequencing Data Reveals Extensive Olfactory Receptor Gene Content Diversity

    PubMed Central

    Waszak, Sebastian M.; Hasin, Yehudit; Zichner, Thomas; Olender, Tsviya; Keydar, Ifat; Khen, Miriam; Stütz, Adrian M.; Schlattl, Andreas; Lancet, Doron; Korbel, Jan O.

    2010-01-01

    Copy-number variations (CNVs) are widespread in the human genome, but comprehensive assignments of integer locus copy-numbers (i.e., copy-number genotypes) that, for example, enable discrimination of homozygous from heterozygous CNVs, have remained challenging. Here we present CopySeq, a novel computational approach with an underlying statistical framework that analyzes the depth-of-coverage of high-throughput DNA sequencing reads, and can incorporate paired-end and breakpoint junction analysis based CNV-analysis approaches, to infer locus copy-number genotypes. We benchmarked CopySeq by genotyping 500 chromosome 1 CNV regions in 150 personal genomes sequenced at low-coverage. The assessed copy-number genotypes were highly concordant with our performed qPCR experiments (Pearson correlation coefficient 0.94), and with the published results of two microarray platforms (95–99% concordance). We further demonstrated the utility of CopySeq for analyzing gene regions enriched for segmental duplications by comprehensively inferring copy-number genotypes in the CNV-enriched >800 olfactory receptor (OR) human gene and pseudogene loci. CopySeq revealed that OR loci display an extensive range of locus copy-numbers across individuals, with zero to two copies in some OR loci, and two to nine copies in others. Among genetic variants affecting OR loci we identified deleterious variants including CNVs and SNPs affecting ∼15% and ∼20% of the human OR gene repertoire, respectively, implying that genetic variants with a possible impact on smell perception are widespread. Finally, we found that for several OR loci the reference genome appears to represent a minor-frequency variant, implying a necessary revision of the OR repertoire for future functional studies. CopySeq can ascertain genomic structural variation in specific gene families as well as at a genome-wide scale, where it may enable the quantitative evaluation of CNVs in genome-wide association studies involving high
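
    The idea behind depth-of-coverage genotyping can be shown with a deliberately simplified rule: scale the read depth in a region by the depth expected for two copies and round to the nearest integer. This sketch omits CopySeq's statistical framework and its paired-end and breakpoint evidence; the function name, the cap on copy number, and the depths are hypothetical.

```python
def copy_number_genotype(region_depth, diploid_depth, max_copies=10):
    """Return an integer copy-number estimate from mean read depth.

    diploid_depth is the mean depth expected for a two-copy region,
    for example the genome-wide average of a low-coverage sample.
    """
    if diploid_depth <= 0:
        raise ValueError("diploid_depth must be positive")
    estimate = 2.0 * region_depth / diploid_depth
    return min(max_copies, max(0, round(estimate)))


# Hypothetical depths for a sample whose two-copy regions average 30x.
for label, depth in [("homozygous deletion", 0.4),
                     ("heterozygous deletion", 15.2),
                     ("reference", 29.0),
                     ("duplication", 46.0)]:
    print(label, copy_number_genotype(depth, 30.0))
```

    Integer genotypes of this kind are what allow a homozygous deletion (zero copies) to be told apart from a heterozygous one (one copy), which a simple gain/loss call cannot do.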

  17. Comparative Analysis of Mitochondrial Genomes in Diplura (Hexapoda, Arthropoda): Taxon Sampling Is Crucial for Phylogenetic Inferences

    PubMed Central

    Chen, Wan-Jun; Koch, Markus; Mallatt, Jon M.; Luan, Yun-Xia

    2014-01-01

    Two-pronged bristletails (Diplura) are traditionally classified into three major superfamilies: Campodeoidea, Projapygoidea, and Japygoidea. The interrelationships of these three superfamilies and the monophyly of Diplura have been much debated. Few previous studies included Projapygoidea in their phylogenetic considerations, and its position within Diplura still is a puzzle from both morphological and molecular points of view. Until now, no mitochondrial genome has been sequenced for any projapygoid species. To fill in this gap, we determined and annotated the complete mitochondrial genome of Octostigma sinensis (Octostigmatidae, Projapygoidea), and of three more dipluran species, one each from the Campodeidae, Parajapygidae, and Japygidae. All four newly sequenced dipluran mtDNAs encode the same set of genes in the same gene order as shared by most crustaceans and hexapods. Secondary structure truncations have occurred in trnR, trnC, trnS1, and trnS2, and the reduction of transfer RNA D-arms was found to be taxonomically correlated, with Campodeoidea having experienced the most reduction. Partitioned phylogenetic analyses, based on both amino acids and nucleotides of the protein-coding genes plus the ribosomal RNA genes, retrieve significant support for a monophyletic Diplura within Pancrustacea, with Projapygoidea more closely related to Campodeoidea than to Japygoidea. Another key finding is that monophyly of Diplura cannot be recovered unless Projapygoidea is included in the phylogenetic analyses; this explains the dipluran polyphyly found by past mitogenomic studies. Including Projapygoidea increased the sampling density within Diplura and probably helped by breaking up a long-branch-attraction artifact. This finding provides an example of how proper sampling is significant for phylogenetic inference. PMID:24391151

  18. Genome-wide copy number analysis using copy number inferring tool (CNIT) and DNA pooling.

    PubMed

    Lin, Chien-hsing; Huang, Mei-chu; Li, Ling-hui; Wu, Jer-yuarn; Chen, Yuan-tsong; Fann, Cathy S J

    2008-08-01

    Copy number variation (CNV) has become an important genomic structure element in the human population, and some CNVs are related to specific traits and diseases. Moreover, analysis of human genomes has been potentiated by the use of high-resolution microarrays that assess single nucleotide polymorphisms (SNPs). Although many programs have been designed to analyze data from Affymetrix SNP microarrays, they all have high false-positive rates (FPRs) in copy number (CN) analyses. Copy number analysis tool (CNAT) 4.0 is a recently developed program that offers improved CN estimation, but small amplifications and deletions are lost when using the smoothing procedure. Here, we propose a copy number inferring tool (CNIT) algorithm for the 100K SNP microarray to investigate CNVs at 29.6-kb resolution. CNIT estimated SNP allelic and total CN with reliable P values based on intensity data. In addition, the hidden Markov model (HMM) method was applied to predict regions having altered CN by considering contiguous SNPs. Based on a CN analysis of 23 unrelated Taiwanese and 30 HapMap Centre d'Etude du Polymorphisme Humain (CEPH) trios, CNIT showed higher accuracy and power than other programs. The FPRs and false-negative rates (FNRs) of CNIT were 0.1% and 0.16%, respectively. CNIT also showed better sensitivity for detecting small amplifications and deletions. Furthermore, DNA pooling of 10 and 30 normal unrelated individuals were applied to the 100K SNP microarray, respectively, and 12 common CN-variable regions were identified, suggesting that DNA pooling can be applied to discover common CNVs.
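
    The use of a hidden Markov model over contiguous SNPs can be sketched with a small Viterbi decoder. This is not CNIT's algorithm: the Gaussian emission model on log2 intensity ratios, the standard deviation, the transition probability, and the data are all assumptions chosen so that the example runs on its own.

```python
import math


def gaussian_log_pdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi_copy_number(log_ratios, state_means, sd=0.2, stay_prob=0.95):
    """Most likely copy-number state per SNP, given log2 ratios ordered
    along the chromosome and state_means = {copy number: expected log2 ratio}."""
    states = list(state_means)
    stay = math.log(stay_prob)
    switch = math.log((1.0 - stay_prob) / (len(states) - 1))
    # Uniform prior over states at the first SNP.
    score = {s: -math.log(len(states)) + gaussian_log_pdf(log_ratios[0], state_means[s], sd)
             for s in states}
    back = []
    for x in log_ratios[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best = max(states, key=lambda p: score[p] + (stay if p == s else switch))
            new_score[s] = (score[best] + (stay if best == s else switch)
                            + gaussian_log_pdf(x, state_means[s], sd))
            pointers[s] = best
        score = new_score
        back.append(pointers)
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):  # trace the best path backwards
        path.append(pointers[path[-1]])
    return path[::-1]


# Expected log2 ratios for one, two and three copies relative to two copies.
means = {1: math.log2(1 / 2), 2: 0.0, 3: math.log2(3 / 2)}
ratios = [0.05, -0.02, -0.95, -1.1, -0.9, 0.03, -0.04, 0.6, 0.55, 0.62, 0.0, 0.02]
calls = viterbi_copy_number(ratios, means)
print(calls)
```

    Because switching states is penalized, a single noisy SNP is less likely to produce a spurious call than it would be with per-SNP thresholding, while short runs of consistently shifted SNPs are still detected.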

  19. Comparative analysis of mitochondrial genomes in Diplura (hexapoda, arthropoda): taxon sampling is crucial for phylogenetic inferences.

    PubMed

    Chen, Wan-Jun; Koch, Markus; Mallatt, Jon M; Luan, Yun-Xia

    2014-01-01

    Two-pronged bristletails (Diplura) are traditionally classified into three major superfamilies: Campodeoidea, Projapygoidea, and Japygoidea. The interrelationships of these three superfamilies and the monophyly of Diplura have been much debated. Few previous studies included Projapygoidea in their phylogenetic considerations, and its position within Diplura still is a puzzle from both morphological and molecular points of view. Until now, no mitochondrial genome has been sequenced for any projapygoid species. To fill in this gap, we determined and annotated the complete mitochondrial genome of Octostigma sinensis (Octostigmatidae, Projapygoidea), and of three more dipluran species, one each from the Campodeidae, Parajapygidae, and Japygidae. All four newly sequenced dipluran mtDNAs encode the same set of genes in the same gene order as shared by most crustaceans and hexapods. Secondary structure truncations have occurred in trnR, trnC, trnS1, and trnS2, and the reduction of transfer RNA D-arms was found to be taxonomically correlated, with Campodeoidea having experienced the most reduction. Partitioned phylogenetic analyses, based on both amino acids and nucleotides of the protein-coding genes plus the ribosomal RNA genes, retrieve significant support for a monophyletic Diplura within Pancrustacea, with Projapygoidea more closely related to Campodeoidea than to Japygoidea. Another key finding is that monophyly of Diplura cannot be recovered unless Projapygoidea is included in the phylogenetic analyses; this explains the dipluran polyphyly found by past mitogenomic studies. Including Projapygoidea increased the sampling density within Diplura and probably helped by breaking up a long-branch-attraction artifact. This finding provides an example of how proper sampling is significant for phylogenetic inference.

  20. Streamlining and Large Ancestral Genomes in Archaea Inferred with a Phylogenetic Birth-and-Death Model

    PubMed Central

    Miklós, István

    2009-01-01

    Homologous genes originate from a common ancestor through vertical inheritance, duplication, or horizontal gene transfer. Entire homolog families spawned by a single ancestral gene can be identified across multiple genomes based on protein sequence similarity. The sequences, however, do not always reveal conclusively the history of large families. To study the evolution of complete gene repertoires, we propose here a mathematical framework that does not rely on resolved gene family histories. We show that so-called phylogenetic profiles, formed by family sizes across multiple genomes, are sufficient to infer principal evolutionary trends. The main novelty in our approach is an efficient algorithm to compute the likelihood of a phylogenetic profile in a model of birth-and-death processes acting on a phylogeny. We examine known gene families in 28 archaeal genomes using a probabilistic model that involves lineage- and family-specific components of gene acquisition, duplication, and loss. The model enables us to consider all possible histories when inferring statistics about archaeal evolution. According to our reconstruction, most lineages are characterized by a net loss of gene families. Major increases in gene repertoire have occurred only a few times. Our reconstruction underlines the importance of persistent streamlining processes in shaping genome composition in Archaea. It also suggests that early archaeal genomes were as complex as typical modern ones, and even show signs, in the case of the methanogenic ancestor, of an extremely large gene repertoire. PMID:19570746

  1. Inferring chromatin-bound protein complexes from genome-wide binding assays

    PubMed Central

    Giannopoulou, Eugenia G.; Elemento, Olivier

    2013-01-01

    Genome-wide binding assays can determine where individual transcription factors bind in the genome. However, these factors rarely bind chromatin alone, but instead frequently bind to cis-regulatory elements (CREs) together with other factors thus forming protein complexes. Currently there are no integrative analytical approaches that can predict which complexes are formed on chromatin. Here, we describe a computational methodology to systematically capture protein complexes and infer their impact on gene expression. We applied our method to three human cell types, identified thousands of CREs, inferred known and undescribed complexes recruited to these CREs, and determined the role of the complexes as activators or repressors. Importantly, we found that the predicted complexes have a higher number of physical interactions between their members than expected by chance. Our work provides a mechanism for developing hypotheses about gene regulation via binding partners, and deciphering the interplay between combinatorial binding and gene expression. PMID:23554462

  2. An application of collaborative targeted maximum likelihood estimation in causal inference and genomics.

    PubMed

    Gruber, Susan; van der Laan, Mark J

    2010-01-01

    A concrete example of the collaborative double-robust targeted likelihood estimator (C-TMLE) introduced in a companion article in this issue is presented, and applied to the estimation of causal effects and variable importance parameters in genomic data. The focus is on non-parametric estimation in a point treatment data structure. Simulations illustrate the performance of C-TMLE relative to current competitors such as the augmented inverse probability of treatment weighted estimator that relies on an external non-collaborative estimator of the treatment mechanism, and inefficient estimation procedures including propensity score matching and standard inverse probability of treatment weighting. C-TMLE is also applied to the estimation of the covariate-adjusted marginal effect of individual HIV mutations on resistance to the anti-retroviral drug lopinavir. The influence curve of the C-TMLE is used to establish asymptotically valid statistical inference. The list of mutations found to have a statistically significant association with resistance is in excellent agreement with mutation scores provided by the Stanford HIVdb mutation scores database.

  3. Causal inference and the hierarchical structure of experience

    PubMed Central

    Johnson, Samuel G. B.; Keil, Frank C.

    2014-01-01

    Children and adults make rich causal inferences about the physical and social world, even in novel situations where they cannot rely on prior knowledge of causal mechanisms. We propose that this capacity is supported in part by constraints provided by event structure—the cognitive organization of experience into discrete events that are hierarchically organized. These event-structured causal inferences are guided by a level-matching principle, with events conceptualized at one level of an event hierarchy causally matched to other events at that same level, and a boundary-blocking principle, with events causally matched to other events that are parts of the same superordinate event. These principles are used to constrain inferences about plausible causal candidates in unfamiliar situations, both in diagnosing causes (Experiment 1) and predicting effects (Experiment 2). The results could not be explained by construal level (Experiment 3) or similarity-matching (Experiment 4), and were robust across a variety of physical and social causal systems. Taken together, these experiments demonstrate a novel way in which non-causal information we extract from the environment can help to constrain inferences about causal structure. PMID:25347533

  4. Towards the unification of inference structures in medical diagnostic tasks.

    PubMed

    Mira, J; Rives, J; Delgado, A E; Martínez, R

    1998-01-01

    The central purpose of artificial intelligence applied to medicine is to develop models for diagnosis and therapy planning at the knowledge level, in the Newell sense, and software environments to facilitate the reduction of these models to the symbol level. The usual methodology (KADS, Common-KADS, GAMES, HELIOS, Protégé, etc) has been to develop libraries of generic tasks and reusable problem-solving methods with explicit ontologies. The principal problem which clinicians have with these methodological developments concerns the diversity and complexity of new terms whose meaning is not sufficiently clear, precise, unambiguous and consensual for them to be accessible in the daily clinical environment. As a contribution to the solution of this problem, we develop in this article the conjecture that one inference structure is enough to describe the set of analysis tasks associated with medical diagnoses. To this end, we first propose a modification of the systematic diagnostic inference scheme to obtain an analysis generic task and then compare it with the monitoring and the heuristic classification task inference schemes using as comparison criteria the compatibility of domain roles (data structures), the similarity in the inferences, and the commonality in the set of assumptions which underlie the functionally equivalent models. The equivalences proposed are illustrated with several examples. Note that though our ongoing work aims to simplify the methodology and to increase the precision of the terms used, the proposal presented here should be viewed more in the nature of a conjecture.

  5. Structural Genomics on the Web

    PubMed Central

    Wixon, Jo

    2001-01-01

    In this review we provide a brief guide to some of the resources and databases that can be used to locate information and aid research in the growing field of structural genomics. The review will provide examples, for less experienced users, of what can be achieved using a selection of the available sites. We hope that this will encourage you to use these sites to their full potential and whet your appetite to search for other related sites. PMID:18628900

  6. Higher-level phylogeny of paraneopteran insects inferred from mitochondrial genome sequences

    PubMed Central

    Li, Hu; Shao, Renfu; Song, Nan; Song, Fan; Jiang, Pei; Li, Zhihong; Cai, Wanzhi

    2015-01-01

    Mitochondrial (mt) genome data have been proven to be informative for animal phylogenetic studies but may also suffer from systematic errors, due to the effects of accelerated substitution rate and compositional heterogeneity. We analyzed the mt genomes of 25 insect species from the four paraneopteran orders, aiming to better understand how accelerated substitution rate and compositional heterogeneity affect the inferences of the higher-level phylogeny of this diverse group of hemimetabolous insects. We found substantial heterogeneity in base composition and contrasting rates in nucleotide substitution among these paraneopteran insects, which complicate the inference of higher-level phylogeny. The phylogenies inferred with concatenated sequences of mt genes using maximum likelihood and Bayesian methods and homogeneous models failed to recover Psocodea and Hemiptera as monophyletic groups but grouped, instead, the taxa that had accelerated substitution rates together, including Sternorrhyncha (a suborder of Hemiptera), Thysanoptera, Phthiraptera and Liposcelididae (a family of Psocoptera). Bayesian inference with nucleotide sequences and heterogeneous models (CAT and CAT + GTR), however, recovered Psocodea, Thysanoptera and Hemiptera each as a monophyletic group. Within Psocodea, Liposcelididae is more closely related to Phthiraptera than to other species of Psocoptera. Furthermore, Thysanoptera was recovered as the sister group to Hemiptera. PMID:25704094

  7. Microarray Data Processing Techniques for Genome-Scale Network Inference from Large Public Repositories.

    PubMed

    Chockalingam, Sriram; Aluru, Maneesha; Aluru, Srinivas

    2016-09-19

    Pre-processing of microarray data is a well-studied problem. Furthermore, all popular platforms come with their own recommended best practices for differential analysis of genes. However, for genome-scale network inference using microarray data collected from large public repositories, these methods filter out a considerable number of genes. This is primarily due to the effects of aggregating a diverse array of experiments with different technical and biological scenarios. Here we introduce a pre-processing pipeline suitable for inferring genome-scale gene networks from large microarray datasets. We show that partitioning of the available microarray datasets according to biological relevance into tissue- and process-specific categories significantly extends the limits of downstream network construction. We demonstrate the effectiveness of our pre-processing pipeline by inferring genome-scale networks for the model plant Arabidopsis thaliana using two different construction methods and a collection of 11,760 Affymetrix ATH1 microarray chips. Our pre-processing pipeline and the datasets used in this paper are made available at http://alurulab.cc.gatech.edu/microarray-pp.

  8. 2004 Structural, Function and Evolutionary Genomics

    SciTech Connect

    Brutlag, Douglas L.; Gray, Nancy Ryan

    2005-03-23

    This Gordon conference will cover the areas of structural, functional and evolutionary genomics. It will take a systematic approach to genomics, examining the evolution of proteins, protein functional sites, protein-protein interactions, regulatory networks, and metabolic networks. Emphasis will be placed on what we can learn from comparative genomics and entire genomes and proteomes.

  9. Inference of gorilla demographic and selective history from whole-genome sequence data.

    PubMed

    McManus, Kimberly F; Kelley, Joanna L; Song, Shiya; Veeramah, Krishna R; Woerner, August E; Stevison, Laurie S; Ryder, Oliver A; Great Ape Genome Project; Kidd, Jeffrey M; Wall, Jeffrey D; Bustamante, Carlos D; Hammer, Michael F

    2015-03-01

    Although population-level genomic sequence data have been gathered extensively for humans, similar data from our closest living relatives are just beginning to emerge. Examination of genomic variation within great apes offers many opportunities to increase our understanding of the forces that have differentially shaped the evolutionary history of hominid taxa. Here, we expand upon the work of the Great Ape Genome Project by analyzing medium to high coverage whole-genome sequences from 14 western lowland gorillas (Gorilla gorilla gorilla), 2 eastern lowland gorillas (G. beringei graueri), and a single Cross River individual (G. gorilla diehli). We infer that the ancestors of western and eastern lowland gorillas diverged from a common ancestor approximately 261 ka, and that the ancestors of the Cross River population diverged from the western lowland gorilla lineage approximately 68 ka. Using a diffusion approximation approach to model the genome-wide site frequency spectrum, we infer a history of western lowland gorillas that includes an ancestral population expansion of 1.4-fold around 970 ka and a recent 5.6-fold contraction in population size 23 ka. The latter may correspond to a major reduction in African equatorial forests around the Last Glacial Maximum. We also analyze patterns of variation among western lowland gorillas to identify several genomic regions with strong signatures of recent selective sweeps. We find that processes related to taste, pancreatic and saliva secretion, sodium ion transmembrane transport, and cardiac muscle function are overrepresented in genomic regions predicted to have experienced recent positive selection. © The Author 2014. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

  10. Inference of Gorilla Demographic and Selective History from Whole-Genome Sequence Data

    PubMed Central

    McManus, Kimberly F.; Kelley, Joanna L.; Song, Shiya; Veeramah, Krishna R.; Woerner, August E.; Stevison, Laurie S.; Ryder, Oliver A.; Great Ape Genome Project; Kidd, Jeffrey M.; Wall, Jeffrey D.; Bustamante, Carlos D.; Hammer, Michael F.

    2015-01-01

    Although population-level genomic sequence data have been gathered extensively for humans, similar data from our closest living relatives are just beginning to emerge. Examination of genomic variation within great apes offers many opportunities to increase our understanding of the forces that have differentially shaped the evolutionary history of hominid taxa. Here, we expand upon the work of the Great Ape Genome Project by analyzing medium to high coverage whole-genome sequences from 14 western lowland gorillas (Gorilla gorilla gorilla), 2 eastern lowland gorillas (G. beringei graueri), and a single Cross River individual (G. gorilla diehli). We infer that the ancestors of western and eastern lowland gorillas diverged from a common ancestor approximately 261 ka, and that the ancestors of the Cross River population diverged from the western lowland gorilla lineage approximately 68 ka. Using a diffusion approximation approach to model the genome-wide site frequency spectrum, we infer a history of western lowland gorillas that includes an ancestral population expansion of 1.4-fold around 970 ka and a recent 5.6-fold contraction in population size 23 ka. The latter may correspond to a major reduction in African equatorial forests around the Last Glacial Maximum. We also analyze patterns of variation among western lowland gorillas to identify several genomic regions with strong signatures of recent selective sweeps. We find that processes related to taste, pancreatic and saliva secretion, sodium ion transmembrane transport, and cardiac muscle function are overrepresented in genomic regions predicted to have experienced recent positive selection. PMID:25534031

  11. Visualization of RNA structure models within the Integrative Genomics Viewer.

    PubMed

    Busan, Steven; Weeks, Kevin M

    2017-07-01

    Analyses of the interrelationships between RNA structure and function are increasingly important components of genomic studies. The SHAPE-MaP strategy enables accurate RNA structure probing and realistic structure modeling of kilobase-length noncoding RNAs and mRNAs. Existing tools for visualizing RNA structure models are not suitable for efficient analysis of long, structurally heterogeneous RNAs. In addition, structure models are often advantageously interpreted in the context of other experimental data and gene annotation information, for which few tools currently exist. We have developed a module within the widely used and well supported open-source Integrative Genomics Viewer (IGV) that allows visualization of SHAPE and other chemical probing data, including raw reactivities, data-driven structural entropies, and data-constrained base-pair secondary structure models, in context with linear genomic data tracks. We illustrate the usefulness of visualizing RNA structure in the IGV by exploring structure models for a large viral RNA genome, comparing bacterial mRNA structure in cells with its structure under cell- and protein-free conditions, and comparing a noncoding RNA structure modeled using SHAPE data with a base-pairing model inferred through sequence covariation analysis. © 2017 Busan and Weeks; Published by Cold Spring Harbor Laboratory Press for the RNA Society.

  12. Robust Inference of Genetic Exchange Communities from Microbial Genomes Using TF-IDF

    PubMed Central

    Cong, Yingnan; Chan, Yao-ban; Phillips, Charles A.; Langston, Michael A.; Ragan, Mark A.

    2017-01-01

    Bacteria and archaea can exchange genetic material across lineages through processes of lateral genetic transfer (LGT). Collectively, these exchange relationships can be modeled as a network and analyzed using concepts from graph theory. In particular, densely connected regions within an LGT network have been defined as genetic exchange communities (GECs). However, it has been problematic to construct networks in which edges solely represent LGT. Here we apply term frequency-inverse document frequency (TF-IDF), an alignment-free method originating from document analysis, to infer regions of lateral origin in bacterial genomes. We examine four empirical datasets of different size (number of genomes) and phyletic breadth, varying a key parameter (word length k) within bounds established in previous work. We map the inferred lateral regions to genes in recipient genomes, and construct networks in which the nodes are groups of genomes, and the edges natively represent LGT. We then extract maximum and maximal cliques (i.e., GECs) from these graphs, and identify nodes that belong to GECs across a wide range of k. Most surviving lateral transfer has happened within these GECs. Using Gene Ontology enrichment tests we demonstrate that biological processes associated with metabolism, regulation and transport are often over-represented among the genes affected by LGT within these communities. These enrichments are largely robust to change of k. PMID:28154557
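    As a rough illustration of the TF-IDF idea named in the abstract above, the sketch below scores words of length k (k-mers) in a set of sequences using the textbook definition: term frequency within one sequence times the log of the inverse document frequency across sequences. This is an assumption-laden toy, not the authors' pipeline; the function names `kmer_counts` and `tfidf`, the toy sequences, and the exact weighting formula are all hypothetical choices made here.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def tfidf(genomes, k):
    """Textbook TF-IDF per k-mer per genome (hypothetical helper).

    genomes: dict mapping a genome name to its sequence string.
    Returns a dict: name -> {kmer: tf * log(N / df)}.
    """
    counts = {name: kmer_counts(seq, k) for name, seq in genomes.items()}
    n_genomes = len(genomes)
    # Document frequency: in how many genomes does each k-mer occur?
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())
    scores = {}
    for name, c in counts.items():
        total = sum(c.values())
        scores[name] = {
            kmer: (cnt / total) * math.log(n_genomes / doc_freq[kmer])
            for kmer, cnt in c.items()
        }
    return scores


# Toy input (made up for illustration only).
toy = {"a": "ACGTAC", "b": "ACGGGG"}
toy_scores = tfidf(toy, k=2)
```

    In this toy, a k-mer shared by every sequence receives a score of zero, while a k-mer confined to one sequence scores higher; how such scores are turned into inferred lateral regions is not described in the abstract and is not attempted here.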

  13. ClonalFrameML: Efficient Inference of Recombination in Whole Bacterial Genomes

    PubMed Central

    Didelot, Xavier; Wilson, Daniel J.

    2015-01-01

    Recombination is an important evolutionary force in bacteria, but it remains challenging to reconstruct the imports that occurred in the ancestry of a genomic sample. Here we present ClonalFrameML, which uses maximum likelihood inference to simultaneously detect recombination in bacterial genomes and account for it in phylogenetic reconstruction. ClonalFrameML can analyse hundreds of genomes in a matter of hours, and we demonstrate its usefulness on simulated and real datasets. We find evidence for recombination hotspots associated with mobile elements in Clostridium difficile ST6 and a previously undescribed 310kb chromosomal replacement in Staphylococcus aureus ST582. ClonalFrameML is freely available at http://clonalframeml.googlecode.com/. PMID:25675341

  14. Genomic inferences of domestication events are corroborated by written records in Brassica rapa.

    PubMed

    Qi, Xinshuai; An, Hong; Ragsdale, Aaron P; Hall, Tara E; Gutenkunst, Ryan N; Chris Pires, J; Barker, Michael S

    2017-07-01

    Demographic modelling is often used with population genomic data to infer the relationships and ages among populations. However, relatively few analyses are able to validate these inferences with independent data. Here, we leverage written records that describe distinct Brassica rapa crops to corroborate demographic models of domestication. Brassica rapa crops are renowned for their outstanding morphological diversity, but the relationships and order of domestication remain unclear. We generated genomewide SNPs from 126 accessions collected globally using high-throughput transcriptome data. Analyses of more than 31,000 SNPs across the B. rapa genome revealed evidence for five distinct genetic groups and supported a European-Central Asian origin of B. rapa crops. Our results supported the traditionally recognized South Asian and East Asian B. rapa groups with evidence that pak choi, Chinese cabbage and yellow sarson are likely monophyletic groups. In contrast, the oil-type B. rapa subsp. oleifera and brown sarson were polyphyletic. We also found no evidence to support the contention that rapini is the wild type or the earliest domesticated subspecies of B. rapa. Demographic analyses suggested that B. rapa was introduced to Asia 2,400-4,100 years ago, and that Chinese cabbage originated 1,200-2,100 years ago via admixture of pak choi and European-Central Asian B. rapa. We also inferred significantly different levels of founder effect among the B. rapa subspecies. Written records from antiquity that document these crops are consistent with these inferences. The concordance between our age estimates of domestication events with historical records provides unique support for our demographic inferences. © 2017 John Wiley & Sons Ltd.

  15. Modulated Modularity Clustering as an Exploratory Tool for Functional Genomic Inference

    PubMed Central

    Stone, Eric A.; Ayroles, Julien F.

    2009-01-01

    In recent years, the advent of high-throughput assays, coupled with their diminishing cost, has facilitated a systems approach to biology. As a consequence, massive amounts of data are currently being generated, requiring efficient methodology aimed at the reduction of scale. Whole-genome transcriptional profiling is a standard component of systems-level analyses, and to reduce scale and improve inference clustering genes is common. Since clustering is often the first step toward generating hypotheses, cluster quality is critical. Conversely, because the validation of cluster-driven hypotheses is indirect, it is critical that quality clusters not be obtained by subjective means. In this paper, we present a new objective-based clustering method and demonstrate that it yields high-quality results. Our method, modulated modularity clustering (MMC), seeks community structure in graphical data. MMC modulates the connection strengths of edges in a weighted graph to maximize an objective function (called modularity) that quantifies community structure. The result of this maximization is a clustering through which tightly-connected groups of vertices emerge. Our application is to systems genetics, and we quantitatively compare MMC both to the hierarchical clustering method most commonly employed and to three popular spectral clustering approaches. We further validate MMC through analyses of human and Drosophila melanogaster expression data, demonstrating that the clusters we obtain are biologically meaningful. We show MMC to be effective and suitable to applications of large scale. In light of these features, we advocate MMC as a standard tool for exploration and hypothesis generation. PMID:19424432
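    The objective function mentioned in the abstract above can be illustrated with the standard Newman modularity of a weighted, undirected graph, Q = (1/2m) * sum over i,j of [w_ij - k_i k_j / (2m)] for pairs i,j in the same community. The sketch below only evaluates Q for a given partition; it does not implement the modulation or maximization steps of MMC, whose details are not given in the abstract. The function name `modularity` and the toy graph are assumptions made here.

```python
def modularity(weights, communities):
    """Newman modularity Q of a partition of an undirected weighted graph.

    weights: symmetric matrix (list of lists) of edge weights.
    communities: list giving the community label of each vertex.
    """
    n = len(weights)
    degree = [sum(row) for row in weights]   # weighted degree k_i
    two_m = sum(degree)                      # 2m = total weight, counted twice
    if two_m == 0:
        raise ValueError("graph has no edges")
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                q += weights[i][j] - degree[i] * degree[j] / two_m
    return q / two_m


# Toy graph (made up): two disconnected pairs of vertices.
toy_weights = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
q_split = modularity(toy_weights, [0, 0, 1, 1])   # natural two-group split
q_merged = modularity(toy_weights, [0, 0, 0, 0])  # everything in one group
```

    On this toy graph the natural split scores higher than lumping all vertices together, which is the behavior a modularity-maximizing clustering relies on.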

  16. ELISA: Structure-Function Inferences based on statistically significant and evolutionarily inspired observations

    PubMed Central

    Shakhnovich, Boris E; Harvey, John M; Comeau, Steve; Lorenz, David; DeLisi, Charles; Shakhnovich, Eugene

    2003-01-01

    The problem of functional annotation based on homology modeling is primary to current bioinformatics research. Researchers have noted regularities in sequence, structure and even chromosome organization that allow valid functional cross-annotation. However, these methods provide a lot of false negatives due to limited specificity inherent in the system. We want to create an evolutionarily inspired organization of data that would approach the issue of structure-function correlation from a new, probabilistic perspective. Such organization has possible applications in phylogeny, modeling of functional evolution and structural determination. ELISA (Evolutionary Lineage Inferred from Structural Analysis) is an online database that combines functional annotation with structure and sequence homology modeling to place proteins into sequence-structure-function "neighborhoods". The atomic unit of the database is a set of sequences and structural templates that those sequences encode. A graph that is built from the structural comparison of these templates is called PDUG (protein domain universe graph). We introduce a method of functional inference through a probabilistic calculation done on an arbitrary set of PDUG nodes. Further, all PDUG structures are mapped onto all fully sequenced proteomes allowing an easy interface for evolutionary analysis and research into comparative proteomics. ELISA is the first database with applicability to evolutionary structural genomics explicitly in mind. Availability: The database is available at . PMID:12952559

  17. Extensive Error in the Number of Genes Inferred from Draft Genome Assemblies

    PubMed Central

    Denton, James F.; Lugo-Martinez, Jose; Tucker, Abraham E.; Schrider, Daniel R.; Warren, Wesley C.; Hahn, Matthew W.

    2014-01-01

    Current sequencing methods produce large amounts of data, but genome assemblies based on these data are often woefully incomplete. These incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. In this paper we investigate the magnitude of the problem, both in terms of total gene number and the number of copies of genes in specific families. To do this, we compare multiple draft assemblies against higher-quality versions of the same genomes, using several new assemblies of the chicken genome based on both traditional and next-generation sequencing technologies, as well as published draft assemblies of chimpanzee. We find that upwards of 40% of all gene families are inferred to have the wrong number of genes in draft assemblies, and that these incorrect assemblies both add and subtract genes. Using simulated genome assemblies of Drosophila melanogaster, we find that the major cause of increased gene numbers in draft genomes is the fragmentation of genes onto multiple individual contigs. Finally, we demonstrate the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, largely by connecting genes that have been fragmented in the assembly process. PMID:25474019

  18. Bayesian Inference for Latent Biologic Structure with Determinantal Point Processes (DPP)

    PubMed Central

    Xu, Yanxun; Müller, Peter; Telesca, Donatello

    2016-01-01

    Summary: We discuss the use of the determinantal point process (DPP) as a prior for latent structure in biomedical applications, where inference often centers on the interpretation of latent features as biologically or clinically meaningful structure. Typical examples include mixture models, when the terms of the mixture are meant to represent clinically meaningful subpopulations (of patients, genes, etc.). Another class of examples are feature allocation models. We propose the DPP prior as a repulsive prior on latent mixture components in the first example, and as prior on feature-specific parameters in the second case. We argue that the DPP is in general an attractive prior model for latent structure when biologically relevant interpretation of such structure is desired. We illustrate the advantages of DPP prior in three case studies, including inference in mixture models for magnetic resonance images (MRI) and for protein expression, and a feature allocation model for gene expression using data from The Cancer Genome Atlas. An important part of our argument are efficient and straightforward posterior simulation methods. We implement a variation of reversible jump Markov chain Monte Carlo simulation for inference under the DPP prior, using a density with respect to the unit rate Poisson process. PMID:26873271

  19. The feasibility of genome-scale biological network inference using Graphics Processing Units.

    PubMed

    Thiagarajan, Raghuram; Alavi, Amir; Podichetty, Jagdeep T; Bazil, Jason N; Beard, Daniel A

    2017-01-01

    Systems research spanning fields from biology to finance involves the identification of models to represent the underpinnings of complex systems. Formal approaches for data-driven identification of network interactions include statistical inference-based approaches and methods to identify dynamical systems models that are capable of fitting multivariate data. Availability of large data sets and so-called 'big data' applications in biology present great opportunities as well as major challenges for systems identification/reverse engineering applications. For example, both inverse identification and forward simulations of genome-scale gene regulatory network models pose compute-intensive problems. This issue is addressed here by combining the processing power of Graphics Processing Units (GPUs) and a parallel reverse engineering algorithm for inference of regulatory networks. It is shown that, given an appropriate data set, information on genome-scale networks (systems of 1000 or more state variables) can be inferred using a reverse-engineering algorithm in a matter of days on a small-scale modern GPU cluster.

  20. Inferring drug-disease associations from integration of chemical, genomic and phenotype data using network propagation

    PubMed Central

    2013-01-01

    Background: In the last few years, knowledge of drugs, disease phenotypes and proteins has accumulated rapidly, and more and more scientists have turned their attention to inferring drug-disease associations by computational methods. Developing an integrated approach for systematically discovering drug-disease associations from these data is an important issue. Methods: We combine three different networks of drug, genomic and disease phenotype data and assign weights to the edges from available experimental data and knowledge. Given a specific disease, we use our network propagation approach to infer the drug-disease associations. Results: We use prostate cancer and colorectal cancer as our test data. We use the manually curated drug-disease associations from the Comparative Toxicogenomics Database as our benchmark. The ranked results show that our proposed method obtains higher specificity and sensitivity and clearly outperforms previous methods. Our results also show that our method with off-target information achieves higher performance than that with only primary drug targets on both test datasets. Conclusions: We clearly demonstrate the feasibility and benefits of using network-based analyses of chemical, genomic and phenotype data to reveal drug-disease associations. The potential associations inferred by our method provide new perspectives for toxicogenomics and drug repositioning evaluation. PMID:24565337
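    Network propagation of the kind named in the abstract above is often realized as a random walk with restart: scores start on seed nodes and repeatedly spread along weighted edges while a fixed fraction returns to the seeds. The sketch below shows that generic scheme only, under the assumption of a small undirected network with no isolated nodes; the abstract does not specify the authors' propagation rule, and the function name `propagate`, the restart value and the toy network are hypothetical.

```python
def propagate(adjacency, seeds, restart=0.5, iterations=200):
    """Random walk with restart on a weighted graph (generic sketch).

    adjacency: square matrix (list of lists) of non-negative edge weights.
    seeds: dict mapping a node index to its initial weight.
    Assumes every node has at least one edge, so column sums are positive.
    """
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    # Normalize the seed vector so the scores form a probability distribution.
    p0 = [seeds.get(i, 0.0) for i in range(n)]
    total = sum(p0)
    p0 = [x / total for x in p0]
    p = p0[:]
    for _ in range(iterations):
        nxt = []
        for i in range(n):
            # Mass flowing into node i through column-normalized edges.
            flow = sum(adjacency[i][j] * p[j] / col_sums[j] for j in range(n))
            nxt.append((1 - restart) * flow + restart * p0[i])
        p = nxt
    return p


# Toy network (made up): a path 0 - 1 - 2, seeded at node 0.
path = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
path_scores = propagate(path, {0: 1.0})
```

    In the toy path the seed keeps the highest score and the score decays with distance from it; in a drug-disease setting the seeds would correspond to the query disease and the ranked non-seed nodes to candidate associations.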

  1. An inference method from multi-layered structure of biomedical data.

    PubMed

    Kim, Myungjun; Nam, Yonghyun; Shin, Hyunjung

    2017-05-18

    A biological system is a multi-layered structure of omics with genome, epigenome, transcriptome, metabolome, proteome, etc., and can be further stretched to clinical/medical layers such as diseasome, drugs, and symptoms. One advantage of omics is that we can figure out an unknown component or its trait by inferring from known omics components. The component can be inferred from those in the same level of omics or those in different levels. To implement the inference process, an algorithm that can be applied to the multi-layered complex system is required. In this study, we develop a semi-supervised learning algorithm that can be applied to the multi-layered complex system. To verify the validity of the inference, it was applied to the prediction problem of disease co-occurrence with a two-layered network composed of a symptom layer and a disease layer. The symptom-disease layered network obtained a fairly high AUC of 0.74, a noticeable improvement over the 0.59 AUC of the single-layered disease network. If further stretched to the whole layered structure of omics, the proposed method is expected to produce more promising results. This research is novel in that it is a new integrative algorithm that incorporates the vertical structure of omics data, in contrast to existing methods that integrate the data in a parallel fashion. The results can provide an enhanced guideline for disease co-occurrence prediction, and thereby serve as a valuable tool for the inference process of multi-layered biological systems.

  2. Sigma: Strain-level inference of genomes from metagenomic analysis for biosurveillance

    DOE PAGES

    Ahn, Tae-Hyuk; Chai, Juanjuan; Pan, Chongle

    2014-09-29

    Motivation: Metagenomic sequencing of clinical samples provides a promising technique for direct pathogen detection and characterization in biosurveillance. Taxonomic analysis at the strain level can be used to resolve serotypes of a pathogen in biosurveillance. Sigma was developed for strain-level identification and quantification of pathogens using their reference genomes based on metagenomic analysis. Results: Sigma provides not only accurate strain-level inferences, but also three unique capabilities: (i) Sigma quantifies the statistical uncertainty of its inferences, which includes hypothesis testing of identified genomes and confidence interval estimation of their relative abundances; (ii) Sigma enables strain variant calling by assigning metagenomic reads to their most likely reference genomes; and (iii) Sigma supports parallel computing for fast analysis of large datasets. In conclusion, the algorithm performance was evaluated using simulated mock communities and fecal samples with spike-in pathogen strains. Availability and Implementation: Sigma was implemented in C++ with source codes and binaries freely available at http://sigma.omicsbio.org.

  3. Sigma: Strain-level inference of genomes from metagenomic analysis for biosurveillance

    PubMed Central

    Ahn, Tae-Hyuk; Chai, Juanjuan; Pan, Chongle

    2015-01-01

    Motivation: Metagenomic sequencing of clinical samples provides a promising technique for direct pathogen detection and characterization in biosurveillance. Taxonomic analysis at the strain level can be used to resolve serotypes of a pathogen in biosurveillance. Sigma was developed for strain-level identification and quantification of pathogens using their reference genomes based on metagenomic analysis. Results: Sigma provides not only accurate strain-level inferences, but also three unique capabilities: (i) Sigma quantifies the statistical uncertainty of its inferences, which includes hypothesis testing of identified genomes and confidence interval estimation of their relative abundances; (ii) Sigma enables strain variant calling by assigning metagenomic reads to their most likely reference genomes; and (iii) Sigma supports parallel computing for fast analysis of large datasets. The algorithm performance was evaluated using simulated mock communities and fecal samples with spike-in pathogen strains. Availability and Implementation: Sigma was implemented in C++ with source codes and binaries freely available at http://sigma.omicsbio.org. Contact: panc@ornl.gov Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25266224

  4. Inferring Bottlenecks from Genome-Wide Samples of Short Sequence Blocks

    PubMed Central

    Bunnefeld, Lynsey; Frantz, Laurent A. F.; Lohse, Konrad

    2015-01-01

    The advent of the genomic era has necessitated the development of methods capable of analyzing large volumes of genomic data efficiently. Being able to reliably identify bottlenecks—extreme population size changes of short duration—not only is interesting in the context of speciation and extinction but also matters (as a null model) when inferring selection. Bottlenecks can be detected in polymorphism data via their distorting effect on the shape of the underlying genealogy. Here, we use the generating function of genealogies to derive the probability of mutational configurations in short sequence blocks under a simple bottleneck model. Given a large number of nonrecombining blocks, we can compute maximum-likelihood estimates of the time and strength of the bottleneck. Our method relies on a simple summary of the joint distribution of polymorphic sites. We extend the site frequency spectrum by counting mutations in frequency classes in short sequence blocks. Using linkage information over short distances in this way gives greater power to detect bottlenecks than the site frequency spectrum and potentially opens up a wide range of demographic histories to blockwise inference. Finally, we apply our method to genomic data from a species of pig (Sus cebifrons) endemic to islands in the center and west of the Philippines to estimate whether a bottleneck occurred upon island colonization and compare our scheme to Li and Durbin’s pairwise sequentially Markovian coalescent (PSMC) both for the pig data and using simulations. PMID:26341659
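
    The blockwise summary described above (counting mutations per frequency class within each short block) can be sketched as follows. This is an illustrative reading, not the record's implementation: it assumes polarised 0/1 data (1 = derived allele), and the function names and toy data are hypothetical.

```python
from collections import Counter

# Each block is a list of sites; each site is a tuple of 0/1 alleles across the
# n sampled sequences (1 = derived). Polarised data is an assumption made here.

def block_config(block, n):
    """Count the sites of one block in each derived-allele frequency class 1..n-1."""
    counts = [0] * (n - 1)
    for site in block:
        k = sum(site)
        if 0 < k < n:  # ignore sites that are not polymorphic in the sample
            counts[k - 1] += 1
    return tuple(counts)


def blockwise_sfs(blocks, n):
    """Tally how many blocks show each mutational configuration."""
    return Counter(block_config(block, n) for block in blocks)


def site_frequency_spectrum(blocks, n):
    """Ordinary SFS: the same counts pooled over blocks, discarding linkage."""
    total = [0] * (n - 1)
    for block in blocks:
        for i, c in enumerate(block_config(block, n)):
            total[i] += c
    return total


if __name__ == "__main__":
    toy_blocks = [
        [(1, 0, 0, 0), (1, 1, 0, 0)],
        [(0, 0, 1, 0), (0, 0, 1, 1)],
        [(1, 1, 1, 0)],
        [],
    ]
    print(blockwise_sfs(toy_blocks, 4))
    print(site_frequency_spectrum(toy_blocks, 4))
```

    The pooled spectrum loses the information about which mutations co-occur in a block. The blockwise tally keeps it, which is the short-range linkage information the abstract refers to.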

  5. RegPredict: an integrated system for regulon inference in prokaryotes by comparative genomics approach

    SciTech Connect

    Novichkov, Pavel S.; Rodionov, Dmitry A.; Stavrovskaya, Elena D.; Novichkova, Elena S.; Kazakov, Alexey E.; Gelfand, Mikhail S.; Arkin, Adam P.; Mironov, Andrey A.; Dubchak, Inna

    2010-05-26

    The RegPredict web server provides comparative genomics tools for the reconstruction and analysis of microbial regulons. The server allows the user to rapidly generate reference sets of regulons and regulatory motif profiles in a group of prokaryotic genomes. The new concept of a cluster of co-regulated orthologous operons allows the user to distribute the analysis of large regulons and to perform the comparative analysis of multiple clusters independently. Two major workflows currently implemented in RegPredict are: (i) regulon reconstruction for a known regulatory motif and (ii) ab initio inference of a novel regulon using several scenarios for the generation of starting gene sets. RegPredict provides a comprehensive collection of manually curated positional weight matrices of regulatory motifs. It is based on genomic sequences and on ortholog and operon predictions from MicrobesOnline. An interactive web interface of RegPredict integrates and presents diverse genomic and functional information about the candidate regulon members from several web resources. RegPredict is freely accessible at http://regpredict.lbl.gov.

  6. Implementation of fuzzy inference with neural network: the NNFI structure

    NASA Astrophysics Data System (ADS)

    Shu, Shyh-Yeong; Hwang, Chung-Mu

    1993-12-01

    In many fuzzy system applications, the most difficult and time-consuming problem is building the fuzzy rule base. Usually, building the fuzzy rule base depends on a domain expert to reflect his experience. But for a complicated system, it is sometimes difficult for an expert to describe clearly the causal relationships among the linguistic variables. To overcome this problem, a dense connectionist structure of artificial neural network, called the NN-Fuzzy Inferencer (NNFI), is constructed to implement the fuzzy inference. The NNFI combines the effects of neural networks and fuzzy inference. It is trainable and yields a more desirable output value than a backpropagation neural network does. The idea of the NNFI architecture is derived from the traditional fuzzy inference method. It can spare the designer the difficulty not only of defining the causal relations between the input variables and output variables, but also of determining the membership function for each linguistic value. Furthermore, the system generates the weighting coefficients in the antecedent part and the consequent part, respectively, of every fuzzy rule.

  7. Inferring Where and When Replication Initiates from Genome-Wide Replication Timing Data

    NASA Astrophysics Data System (ADS)

    Baker, A.; Audit, B.; Yang, S. C.-H.; Bechhoefer, J.; Arneodo, A.

    2012-06-01

    Based on an analogy between DNA replication and one dimensional nucleation-and-growth processes, various attempts to infer the local initiation rate I(x,t) of DNA replication origins from replication timing data have been developed in the framework of phase transition kinetics theories. These works have all used curve-fit strategies to estimate I(x,t) from genome-wide replication timing data. Here, we show how to invert analytically the Kolmogorov-Johnson-Mehl-Avrami model and extract I(x,t) directly. Tests on both simulated and experimental budding-yeast data confirm the location and firing-time distribution of replication origins.

  8. Inferring human population size and separation history from multiple genome sequences

    PubMed Central

    Schiffels, Stephan; Durbin, Richard

    2014-01-01

    The availability of complete human genome sequences from populations across the world has given rise to new population genetic inference methods that explicitly model their ancestral relationship under recombination and mutation. So far, application of these methods to evolutionary history more recent than 20-30 thousand years ago and to population separations has been limited. Here we present a new method that overcomes these shortcomings. The Multiple Sequentially Markovian Coalescent (MSMC) analyses the observed pattern of mutations in multiple individuals, focusing on the first coalescence between any two individuals. Results from applying MSMC to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago, and give information about human population history as recently as 2,000 years ago, including the bottleneck in the peopling of the Americas, and separations within Africa, East Asia and Europe. PMID:24952747

  9. The root of the mammalian tree inferred from whole mitochondrial genomes.

    PubMed

    Phillips, Matthew J; Penny, David

    2003-08-01

    Morphological and molecular data are currently contradictory over the position of monotremes with respect to marsupial and placental mammals. As part of a re-evaluation of both forms of data we examine complete mitochondrial genomes in more detail. There is a particularly large discrepancy in the frequencies of thymine and cytosine (T-C) between mitochondrial genomes that appears to affect some deep divergences in the mammalian tree. We report that recoding nucleotides to RY-characters, and partitioning maximum-likelihood analyses among subsets of data reduces such biases, and improves the fit of models to the data, respectively. RY-coding also increases the signal on the internal branches relative to external, and thus increases the phylogenetic signal. In contrast to previous analyses of mitochondrial data, our analyses favor Theria (marsupials plus placentals) over Marsupionta (monotremes plus marsupials). However, a short therian stem lineage is inferred, which is at variance with the traditionally deep placement of monotremes on morphological data.

  10. Mitochondrial Genome Structure of Photosynthetic Eukaryotes.

    PubMed

    Yurina, N P; Odintsova, M S

    2016-02-01

    Current ideas of plant mitochondrial genome organization are presented. Data on the size and structural organization of mtDNA, gene content, and peculiarities are summarized. Special emphasis is given to characteristic features of the mitochondrial genomes of land plants and photosynthetic algae that distinguish them from the mitochondrial genomes of other eukaryotes. The data published before the end of 2014 are reviewed.

  11. mStruct: inference of population structure in light of both genetic admixing and allele mutations.

    PubMed

    Shringarpure, Suyash; Xing, Eric P

    2009-06-01

    Traditional methods for analyzing population structure, such as the Structure program, ignore the influence of the effect of allele mutations between the ancestral and current alleles of genetic markers, which can dramatically influence the accuracy of the structural estimation of current populations. Studying these effects can also reveal additional information about population evolution such as the divergence time and migration history of admixed populations. We propose mStruct, an admixture of population-specific mixtures of inheritance models that addresses the task of structure inference and mutation estimation jointly through a hierarchical Bayesian framework, and a variational algorithm for inference. We validated our method on synthetic data and used it to analyze the Human Genome Diversity Project-Centre d'Etude du Polymorphisme Humain (HGDP-CEPH) cell line panel of microsatellites and HGDP single-nucleotide polymorphism (SNP) data. A comparison of the structural maps of world populations estimated by mStruct and Structure is presented, and we also report potentially interesting mutation patterns in world populations estimated by mStruct.

  12. Proteomics-inferred genome typing (PIGT) demonstrates inter-population recombination as a strategy for environmental adaptation

    SciTech Connect

    Denef, Vincent; Verberkmoes, Nathan C; Shah, Manesh B; Abraham, Paul E; Lefsrud, Mark G; Hettich, Robert {Bob} L; Banfield, Jillian F.

    2009-01-01

    Analyses of ecological and evolutionary processes that shape microbial consortia are facilitated by comprehensive studies of ecosystems with low species richness. In the current study we evaluated the role of recombination in altering the fitness of chemoautotrophic bacteria in their natural environment. Proteomics-inferred genome typing (PIGT) was used to determine the genomic make-up of Leptospirillum group II populations in 27 biofilms sampled from six locations in the Richmond Mine acid mine drainage system (Iron Mountain, CA) over a four-year period. We observed six distinct genotypes that are recombinants comprised of segments from two parental genotypes. Community genomic analyses revealed additional low abundance recombinant variants. The dominance of some genotypes despite a larger available genome pool, and patterns of spatiotemporal distribution within the ecosystem, indicate selection for distinct recombinants. Genes involved in motility, signal transduction and transport were overrepresented in the tens to hundreds of kilobase recombinant blocks, whereas core metabolic functions were significantly underrepresented. Our findings demonstrate the power of PIGT and reveal that recombination is a mechanism for fine-scale adaptation in this system.

  13. Phylogenetics and biogeography of the dung beetle genus Onthophagus inferred from mitochondrial genomes.

    PubMed

    Breeschoten, Thijmen; Doorenweerd, Camiel; Tarasov, Sergei; Vogler, Alfried P

    2016-12-01

    Phylogenetic relationships of dung beetles in the tribe Onthophagini, including the species-rich, cosmopolitan genus Onthophagus, were inferred using whole mitochondrial genomes. Data were generated by shotgun sequencing of mixed genomic DNA from >100 individuals on 50% of an Illumina MiSeq flow cell. Genome assembly of the mixed reads produced contigs of 74 (nearly) complete mitogenomes. The final dataset included representatives of Onthophagus from all biogeographic regions, closely related genera of Onthophagini, and the related tribes Onitini and Oniticellini. The analysis defined four major clades of Onthophagini, which was paraphyletic for Oniticellini, with Onitini as sister group to all others. Several (sub)genera considered as members of Onthophagus in the older literature formed separate deep lineages. All New World species of Onthophagus formed a monophyletic group, and the Australian taxa are confined to a single or two closely related clades, one of which forms the sister group of the New World species. Dating the tree by constraining the basal splits with existing calibrations of Scarabaeoidea suggests an origin of Onthophagini sensu lato in the Eocene and a rapid spread from an African ancestral stock into the Oriental region, and secondarily to Australia and the Americas at about 20-24 Mya. The successful assembly of mitogenomes and the well-supported tree obtained from these sequences demonstrates the power of shotgun sequencing from total genomic DNA of species pools as an efficient tool in genus-level phylogenetics.

  14. BPhyOG: an interactive server for genome-wide inference of bacterial phylogenies based on overlapping genes.

    PubMed

    Luo, Yingqin; Fu, Cong; Zhang, Da-Yong; Lin, Kui

    2007-07-25

    Overlapping genes (OGs) in bacterial genomes are pairs of adjacent genes of which the coding sequences overlap partly or entirely. With the rapid accumulation of sequence data, many OGs in bacterial genomes have now been identified. Indeed, these might prove a consistent feature across all microbial genomes. Our previous work suggests that OGs can be considered as robust markers at the whole genome level for the construction of phylogenies. An online, interactive web server for inferring phylogenies is needed for biologists to analyze phylogenetic relationships among a set of bacterial genomes of interest. BPhyOG is an online interactive server for reconstructing the phylogenies of completely sequenced bacterial genomes on the basis of their shared overlapping genes. It provides two tree-reconstruction methods: Neighbor Joining (NJ) and Unweighted Pair-Group Method using Arithmetic averages (UPGMA). Users can apply the desired method to generate phylogenetic trees, which are based on an evolutionary distance matrix for the selected genomes. The distance between two genomes is defined by the normalized number of their shared OG pairs. BPhyOG also allows users to browse the OGs that were used to infer the phylogenetic relationships. It provides detailed annotation for each OG pair and the features of the component genes through hyperlinks. Users can also retrieve each of the homologous OG pairs that have been determined among 177 genomes. It is a useful tool for analyzing the tree of life and overlapping genes from a genomic standpoint. BPhyOG is a useful interactive web server for genome-wide inference of any potential evolutionary relationship among the genomes selected by users. It currently includes 177 completely sequenced bacterial genomes containing 79,855 OG pairs, the annotation and homologous OG pairs of which are integrated comprehensively. The reliability of phylogenies complemented by annotations makes BPhyOG a powerful web server for genomic and genetic
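
    The abstract defines the distance between two genomes by the normalized number of shared OG pairs but does not give the normalization. The sketch below therefore substitutes the Jaccard distance (one minus shared pairs over the union of pairs) purely as an assumed stand-in. The genome names and OG identifiers are hypothetical.

```python
from itertools import combinations

# Each genome is represented by the set of (homologous) OG-pair identifiers it
# contains. Jaccard distance is used as an assumed normalisation; the record
# does not state which normalisation BPhyOG applies.

def og_distance(pairs_a, pairs_b):
    """Jaccard distance between two sets of overlapping-gene pair identifiers."""
    union = pairs_a | pairs_b
    if not union:
        return 0.0  # two genomes without any OG pairs are treated as identical
    return 1.0 - len(pairs_a & pairs_b) / len(union)


def distance_matrix(genomes):
    """Pairwise distances keyed by (name_i, name_j), names in sorted order."""
    return {
        (a, b): og_distance(genomes[a], genomes[b])
        for a, b in combinations(sorted(genomes), 2)
    }


if __name__ == "__main__":
    toy_genomes = {
        "g1": {"og1", "og2", "og3"},
        "g2": {"og1", "og2"},
        "g3": {"og9"},
    }
    for pair, d in distance_matrix(toy_genomes).items():
        print(pair, round(d, 3))
```

    A matrix of this kind is the input that distance-based tree methods such as NJ or UPGMA operate on.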

  15. Inferring Planet Mass from Spiral Structures in Protoplanetary Disks

    NASA Astrophysics Data System (ADS)

    Fung, Jeffrey; Dong, Ruobing

    2015-12-01

    Recent observations of protoplanetary disks have reported spiral structures that are potential signatures of embedded planets, and modeling efforts have shown that a single planet can excite multiple spiral arms, in contrast to conventional disk-planet interaction theory. Using two- and three-dimensional hydrodynamics simulations to perform a systematic parameter survey, we confirm the existence of multiple spiral arms in disks with a single planet, and discover a scaling relation between the azimuthal separation of the primary and secondary arm, $\phi_{\rm sep}$, and the planet-to-star mass ratio $q$: $\phi_{\rm sep} = 102^\circ (q/0.001)^{0.2}$ for companions between Neptune mass and 16 Jupiter masses around a 1 solar mass star, and $\phi_{\rm sep} = 180^\circ$ for brown dwarf mass companions. This relation is independent of the disk’s temperature, and can be used to infer a planet’s mass to within an accuracy of about 30% given only the morphology of a face-on disk. Combining hydrodynamics and Monte-Carlo radiative transfer calculations, we verify that our numerical measurements of $\phi_{\rm sep}$ are accurate representations of what would be measured in near-infrared scattered light images, such as those expected to be taken by Gemini/GPI, Very Large Telescope/SPHERE, or Subaru/SCExAO in the future. Finally, we are able to infer, using our scaling relation, that the planet responsible for the spiral structure in SAO 206462 has a mass of about 6 Jupiter masses.
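
    The scaling relation quoted above, an arm separation of 102 degrees times (q/0.001) to the power 0.2, can be inverted to q = 0.001 (separation/102)^5. The sketch below evaluates both directions. The function names are hypothetical, and the code does not enforce the mass range stated in the abstract.

```python
# Arm-separation scaling relation quoted in the abstract, and its algebraic
# inverse. Valid only in the quoted regime (companions between Neptune mass and
# 16 Jupiter masses around a 1 solar mass star); that range is not checked here.

def arm_separation_deg(q):
    """Azimuthal separation of primary and secondary arm, in degrees."""
    if q <= 0:
        raise ValueError("mass ratio q must be positive")
    return 102.0 * (q / 0.001) ** 0.2


def mass_ratio_from_separation(phi_sep_deg):
    """Planet-to-star mass ratio q implied by a measured arm separation."""
    if phi_sep_deg <= 0:
        raise ValueError("arm separation must be positive")
    return 0.001 * (phi_sep_deg / 102.0) ** 5


if __name__ == "__main__":
    for q in (0.001, 0.004, 0.016):
        phi = arm_separation_deg(q)
        print(q, round(phi, 1), mass_ratio_from_separation(phi))
```

    Because q scales with the fifth power of the separation, a small error in the measured angle becomes a much larger error in the inferred mass, so the separation needs to be measured carefully.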

  16. From algae to angiosperms-inferring the phylogeny of green plants (Viridiplantae) from 360 plastid genomes.

    PubMed

    Ruhfel, Brad R; Gitzendanner, Matthew A; Soltis, Pamela S; Soltis, Douglas E; Burleigh, J Gordon

    2014-02-17

    Next-generation sequencing has provided a wealth of plastid genome sequence data from an increasingly diverse set of green plants (Viridiplantae). Although these data have helped resolve the phylogeny of numerous clades (e.g., green algae, angiosperms, and gymnosperms), their utility for inferring relationships across all green plants is uncertain. Viridiplantae originated 700-1500 million years ago and may comprise as many as 500,000 species. This clade represents a major source of photosynthetic carbon and contains an immense diversity of life forms, including some of the smallest and largest eukaryotes. Here we explore the limits and challenges of inferring a comprehensive green plant phylogeny from available complete or nearly complete plastid genome sequence data. We assembled protein-coding sequence data for 78 genes from 360 diverse green plant taxa with complete or nearly complete plastid genome sequences available from GenBank. Phylogenetic analyses of the plastid data recovered well-supported backbone relationships and strong support for relationships that were not observed in previous analyses of major subclades within Viridiplantae. However, there also is evidence of systematic error in some analyses. In several instances we obtained strongly supported but conflicting topologies from analyses of nucleotides versus amino acid characters, and the considerable variation in GC content among lineages and within single genomes affected the phylogenetic placement of several taxa. Analyses of the plastid sequence data recovered a strongly supported framework of relationships for green plants. This framework includes: i) the placement of Zygnematophyceae as sister to land plants (Embryophyta), ii) a clade of extant gymnosperms (Acrogymnospermae) with cycads + Ginkgo sister to remaining extant gymnosperms and with gnetophytes (Gnetophyta) sister to non-Pinaceae conifers (Gnecup trees), and iii) within the monilophyte clade (Monilophyta), Equisetales

  17. From algae to angiosperms–inferring the phylogeny of green plants (Viridiplantae) from 360 plastid genomes

    PubMed Central

    2014-01-01

    Background Next-generation sequencing has provided a wealth of plastid genome sequence data from an increasingly diverse set of green plants (Viridiplantae). Although these data have helped resolve the phylogeny of numerous clades (e.g., green algae, angiosperms, and gymnosperms), their utility for inferring relationships across all green plants is uncertain. Viridiplantae originated 700-1500 million years ago and may comprise as many as 500,000 species. This clade represents a major source of photosynthetic carbon and contains an immense diversity of life forms, including some of the smallest and largest eukaryotes. Here we explore the limits and challenges of inferring a comprehensive green plant phylogeny from available complete or nearly complete plastid genome sequence data. Results We assembled protein-coding sequence data for 78 genes from 360 diverse green plant taxa with complete or nearly complete plastid genome sequences available from GenBank. Phylogenetic analyses of the plastid data recovered well-supported backbone relationships and strong support for relationships that were not observed in previous analyses of major subclades within Viridiplantae. However, there also is evidence of systematic error in some analyses. In several instances we obtained strongly supported but conflicting topologies from analyses of nucleotides versus amino acid characters, and the considerable variation in GC content among lineages and within single genomes affected the phylogenetic placement of several taxa. Conclusions Analyses of the plastid sequence data recovered a strongly supported framework of relationships for green plants. This framework includes: i) the placement of Zygnematophyceae as sister to land plants (Embryophyta), ii) a clade of extant gymnosperms (Acrogymnospermae) with cycads + Ginkgo sister to remaining extant gymnosperms and with gnetophytes (Gnetophyta) sister to non-Pinaceae conifers (Gnecup trees), and iii) within the monilophyte clade

  18. The aggregate site frequency spectrum (aSFS) for comparative population genomic inference

    PubMed Central

    Xue, Alexander T.; Hickerson, Michael J.

    2015-01-01

    Understanding how assemblages of species responded to past climate change is a central goal of comparative phylogeography and comparative population genomics, an endeavor that has increasing potential to integrate with community ecology. New sequencing technology now provides the potential to perform complex demographic inference at unprecedented resolution across assemblages of non-model species. To this end, we introduce the aggregate site frequency spectrum (aSFS), an expansion of the site frequency spectrum to use single nucleotide polymorphism (SNP) datasets collected from multiple, co-distributed species for assemblage-level demographic inference. We describe how the aSFS is constructed over an arbitrary number of independent population samples and then demonstrate how the aSFS can differentiate various multi-species demographic histories under a wide range of sampling configurations while allowing effective population sizes and expansion magnitudes to vary independently. We subsequently couple the aSFS with a hierarchical approximate Bayesian computation (hABC) framework to estimate degree of temporal synchronicity in expansion times across taxa, including an empirical demonstration with a dataset consisting of five populations of the threespine stickleback (Gasterosteus aculeatus). Corroborating what is generally understood about the recent post-glacial origins of these populations, the joint aSFS/hABC analysis strongly suggests that the stickleback data are most consistent with synchronous expansion after the Last Glacial Maximum (posterior probability = 0.99). The aSFS will have general application for multi-level statistical frameworks to test models involving assemblages and/or communities and as large-scale SNP data from non-model species become routine, the aSFS expands the potential for powerful next-generation comparative population genomic inference. PMID:26769405

  19. Accurate inference of subtle population structure (and other genetic discontinuities) using principal coordinates

    USDA-ARS's Scientific Manuscript database

    Accurate inference of genetic discontinuities between populations is an essential component of intraspecific biodiversity and evolution studies, as well as associative genetics. The most widely used methods to infer population structure are model based, Bayesian MCMC procedures that minimize Hardy...

  20. High-Accuracy HLA Type Inference from Whole-Genome Sequencing Data Using Population Reference Graphs

    PubMed Central

    Dilthey, Alexander T.; Gourraud, Pierre-Antoine; McVean, Gil

    2016-01-01

    Genetic variation at the Human Leucocyte Antigen (HLA) genes is associated with many autoimmune and infectious disease phenotypes, is an important element of the immunological distinction between self and non-self, and shapes immune epitope repertoires. Determining the allelic state of the HLA genes (HLA typing) as a by-product of standard whole-genome sequencing data would therefore be highly desirable and enable the immunogenetic characterization of samples in currently ongoing population sequencing projects. Extensive hyperpolymorphism and sequence similarity between the HLA genes, however, pose problems for accurate read mapping and make HLA type inference from whole-genome sequencing data a challenging problem. We describe how to address these challenges in a Population Reference Graph (PRG) framework. First, we construct a PRG for 46 (mostly HLA) genes and pseudogenes, their genomic context and their characterized sequence variants, integrating a database of over 10,000 known allele sequences. Second, we present a sequence-to-PRG paired-end read mapping algorithm that enables accurate read mapping for the HLA genes. Third, we infer the most likely pair of underlying alleles at G group resolution from the IMGT/HLA database at each locus, employing a simple likelihood framework. We show that HLA*PRG, our algorithm, outperforms existing methods by a wide margin. We evaluate HLA*PRG on six classical class I and class II HLA genes (HLA-A, -B, -C, -DQA1, -DQB1, -DRB1) and on a set of 14 samples (3 samples with 2 x 100bp, 11 samples with 2 x 250bp Illumina HiSeq data). Of 158 alleles tested, we correctly infer 157 alleles (99.4%). We also identify and re-type two erroneous alleles in the original validation data. We conclude that HLA*PRG for the first time achieves accuracies comparable to gold-standard reference methods from standard whole-genome sequencing data, though high computational demands (currently ~30–250 CPU hours per sample) remain a significant

  1. High-Accuracy HLA Type Inference from Whole-Genome Sequencing Data Using Population Reference Graphs.

    PubMed

    Dilthey, Alexander T; Gourraud, Pierre-Antoine; Mentzer, Alexander J; Cereb, Nezih; Iqbal, Zamin; McVean, Gil

    2016-10-01

    Genetic variation at the Human Leucocyte Antigen (HLA) genes is associated with many autoimmune and infectious disease phenotypes, is an important element of the immunological distinction between self and non-self, and shapes immune epitope repertoires. Determining the allelic state of the HLA genes (HLA typing) as a by-product of standard whole-genome sequencing data would therefore be highly desirable and enable the immunogenetic characterization of samples in currently ongoing population sequencing projects. Extensive hyperpolymorphism and sequence similarity between the HLA genes, however, pose problems for accurate read mapping and make HLA type inference from whole-genome sequencing data a challenging problem. We describe how to address these challenges in a Population Reference Graph (PRG) framework. First, we construct a PRG for 46 (mostly HLA) genes and pseudogenes, their genomic context and their characterized sequence variants, integrating a database of over 10,000 known allele sequences. Second, we present a sequence-to-PRG paired-end read mapping algorithm that enables accurate read mapping for the HLA genes. Third, we infer the most likely pair of underlying alleles at G group resolution from the IMGT/HLA database at each locus, employing a simple likelihood framework. We show that HLA*PRG, our algorithm, outperforms existing methods by a wide margin. We evaluate HLA*PRG on six classical class I and class II HLA genes (HLA-A, -B, -C, -DQA1, -DQB1, -DRB1) and on a set of 14 samples (3 samples with 2 x 100bp, 11 samples with 2 x 250bp Illumina HiSeq data). Of 158 alleles tested, we correctly infer 157 alleles (99.4%). We also identify and re-type two erroneous alleles in the original validation data. We conclude that HLA*PRG for the first time achieves accuracies comparable to gold-standard reference methods from standard whole-genome sequencing data, though high computational demands (currently ~30-250 CPU hours per sample) remain a significant

  2. CONE: Community Oriented Network Estimation Is a Versatile Framework for Inferring Population Structure in Large Scale Sequencing Data.

    PubMed

    Kuismin, Markku O; Ahlinder, Jon; Sillanpää, Mikko J

    2017-08-22

    Estimation of genetic population structure based on molecular markers is a common task in population genetics and ecology. We apply a generalized linear model with LASSO regularization to infer relationships between individuals and populations from molecular marker data. Specifically, we apply a neighborhood selection algorithm to infer population genetic structure and gene flow between populations. The resulting relationships are used to construct an individual-level population graph. Different network substructures known as communities are then dissociated from each other using a community detection algorithm. Inference of population structure using networks combines the good properties of: (i) network theory (broad collection of tools, including aesthetically pleasing visualization), (ii) principal component analysis (dimension reduction together with simple visual inspection), and (iii) model-based methods (e.g. ancestry coefficient estimates). We have named our process CONE (Community Oriented Network Estimation). CONE has fewer restrictions than conventional assignment methods in that properties such as the number of subpopulations need not be fixed before the analysis, and the sample may include close relatives or involve uneven sampling. Applying CONE to simulated data sets resulted in more accurate estimates of the true number of subpopulations than model-based methods and provided comparable ancestry coefficient estimates. Inference on empirical data sets of teosinte single nucleotide polymorphism, bacterial disease outbreak, and human genome diversity panel illustrates that population structures estimated with CONE are consistent with the earlier findings. Copyright © 2017, G3: Genes, Genomes, Genetics.

  3. Structure identification in fuzzy inference using reinforcement learning

    NASA Technical Reports Server (NTRS)

    Berenji, Hamid R.; Khedkar, Pratap

    1993-01-01

    In our previous work on the GARIC architecture, we showed that the system can start with the surface structure of the knowledge base (i.e., the linguistic expression of the rules) and learn the deep structure (i.e., the fuzzy membership functions of the labels used in the rules) by using reinforcement learning. Assuming the surface structure, GARIC refines the fuzzy membership functions used in the consequents of the rules using a gradient descent procedure. This hybrid fuzzy logic and reinforcement learning approach can learn to balance a cart-pole system and to back up a truck to its docking location after a few trials. In this paper, we discuss how to do structure identification using reinforcement learning in fuzzy inference systems. This involves identifying both the surface and the deep structure of the knowledge base. The term set of fuzzy linguistic labels used in describing the values of each control variable must be derived. In this process, splitting a label refers to creating new labels which are more granular than the original label, and merging two labels creates a more general label. Splitting and merging of labels directly transform the structure of the action selection network used in GARIC by increasing or decreasing the number of hidden layer nodes.
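The splitting and merging of labels described in this abstract can be illustrated with a small sketch. It assumes triangular membership functions, which the abstract does not specify, and the names `TriangularLabel`, `split`, and `merge` are hypothetical. It is not GARIC's actual procedure, only one simple way to make a label more granular or more general.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriangularLabel:
    """A fuzzy linguistic label with a triangular membership function (assumed shape)."""
    name: str
    left: float    # membership is 0 at and below this value
    center: float  # membership peaks at 1 here
    right: float   # membership is 0 at and above this value

    def membership(self, x: float) -> float:
        if x <= self.left or x >= self.right:
            return 0.0
        if x <= self.center:
            return (x - self.left) / (self.center - self.left)
        return (self.right - x) / (self.right - self.center)


def split(label: TriangularLabel) -> tuple:
    """Replace one label by two narrower, overlapping labels on the same support."""
    third = (label.right - label.left) / 3
    a, b = label.left + third, label.left + 2 * third
    low = TriangularLabel(label.name + "_low", label.left, a, b)
    high = TriangularLabel(label.name + "_high", a, b, label.right)
    return low, high


def merge(first: TriangularLabel, second: TriangularLabel, name: str) -> TriangularLabel:
    """Combine two labels into one more general label spanning both supports."""
    return TriangularLabel(
        name,
        min(first.left, second.left),
        (first.center + second.center) / 2,
        max(first.right, second.right),
    )
```

In a network where each label corresponds to a hidden node, as the abstract suggests for GARIC's action selection network, `split` would add a node and `merge` would remove one. For a label symmetric about its center, merging the two halves produced by `split` recovers the original support and peak.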

  4. Genome-Wide SNP Discovery, Genotyping and Their Preliminary Applications for Population Genetic Inference in Spotted Sea Bass (Lateolabrax maculatus)

    PubMed Central

    Wang, Juan; Xue, Dong-Xiu; Zhang, Bai-Dong; Li, Yu-Long; Liu, Bing-Jian; Liu, Jin-Xian

    2016-01-01

    Next-generation sequencing and the collection of genome-wide single-nucleotide polymorphisms (SNPs) allow the identification of fine-scale population genetic structure and genomic regions under selection. The spotted sea bass (Lateolabrax maculatus) is a non-model species of ecological and commercial importance that is widely distributed in the northwestern Pacific. A total of 22,648 SNPs were discovered across the genome of L. maculatus by paired-end sequencing of restriction-site associated DNA (RAD-PE) for 30 individuals from two populations. The nucleotide diversity (π) was 0.0028±0.0001 in the Dandong population and 0.0018±0.0001 in the Beihai population. Shallow but significant genetic differentiation was detected between the two populations analyzed by using both the whole data set (FST = 0.0550, P < 0.001) and the putatively neutral SNPs (FST = 0.0347, P < 0.001). However, the two populations were highly differentiated based on the putatively adaptive SNPs (FST = 0.6929, P < 0.001). Moreover, a total of 356 SNPs representing 298 unique loci were detected as outliers putatively under divergent selection by FST-based outlier tests as implemented in BAYESCAN and LOSITAN. Functional annotation of the contigs containing putatively adaptive SNPs yielded hits for 22 of 55 (40%) significant BLASTX matches. Candidate genes for local selection constituted a wide array of functions, including binding, catalytic and metabolic activities. The analyses with the SNPs developed in the present study highlighted the importance of genome-wide genetic variation for inference of population structure and local adaptation in L. maculatus. PMID:27336696
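The FST values quoted in this abstract measure how much of the total genetic variation lies between populations. As a minimal sketch, the textbook definition FST = (H_T - H_S) / H_T for a single biallelic SNP in two equally weighted populations can be written as follows. The abstract does not state which estimator the authors used, so this is an assumption for illustration, and the function names are hypothetical.

```python
def expected_heterozygosity(p: float) -> float:
    """Expected heterozygosity 2p(1 - p) for a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_two_populations(p1: float, p2: float) -> float:
    """Wright's FST for one SNP from the allele frequencies in two populations.

    H_T is the heterozygosity of the pooled population (mean allele frequency);
    H_S is the mean within-population heterozygosity.
    """
    p_bar = (p1 + p2) / 2.0
    h_total = expected_heterozygosity(p_bar)
    if h_total == 0.0:
        # The SNP is monomorphic overall, so there is no variation to partition.
        return 0.0
    h_within = (expected_heterozygosity(p1) + expected_heterozygosity(p2)) / 2.0
    return (h_total - h_within) / h_total
```

Identical frequencies give 0, and alternate fixed alleles (frequencies 1 and 0) give 1. Frequencies of 0.6 and 0.4 give about 0.04, the same order as the shallow genome-wide differentiation reported above.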

  5. Inferring causal genomic alterations in breast cancer using gene expression data

    PubMed Central

    2011-01-01

    Background: One of the primary objectives in cancer research is to identify causal genomic alterations, such as somatic copy number variation (CNV) and somatic mutations, during tumor development. Many valuable studies lack genomic data to detect CNV; therefore, methods that are able to infer CNVs from gene expression data would help maximize the value of these studies. Results: We developed a framework for identifying recurrent regions of CNV and distinguishing the cancer driver genes from the passenger genes in the regions. By inferring CNV regions across many datasets we were able to identify 109 recurrent amplified/deleted CNV regions. Many of these regions are enriched for genes involved in many important processes associated with tumorigenesis and cancer progression. Genes in these recurrent CNV regions were then examined in the context of gene regulatory networks to prioritize putative cancer driver genes. The cancer driver genes uncovered by the framework include not only well-known oncogenes but also a number of novel cancer susceptibility genes validated via siRNA experiments. Conclusions: To our knowledge, this is the first effort to systematically identify and validate drivers for expression-based CNV regions in breast cancer. The framework, in which wavelet analysis of expression-based copy number alteration is coupled with gene regulatory network analysis, provides a blueprint for leveraging genomic data to identify key regulatory components and gene targets. This integrative approach can be applied to many other large-scale gene expression studies and other novel types of cancer data such as next-generation sequencing based expression (RNA-Seq) as well as CNV data. PMID:21806811

  6. Insights into structural variations and genome rearrangements in prokaryotic genomes.

    PubMed

    Periwal, Vinita; Scaria, Vinod

    2015-01-01

    Structural variations (SVs) are genomic rearrangements that affect fairly large fragments of DNA. Most of the SVs such as inversions, deletions and translocations have been largely studied in the context of genetic diseases in eukaryotes. However, recent studies demonstrate that genome rearrangements can also have a profound impact on prokaryotic genomes, leading to altered cell phenotype. In contrast to single-nucleotide variations, SVs provide a much deeper insight into the organization of bacterial genomes at a much better resolution. SVs can confer change in gene copy number, creation of new genes, altered gene expression and many other functional consequences. High-throughput technologies have now made it possible to explore SVs at a much finer resolution in bacterial genomes. Through this review, we aim to highlight the importance of the less explored field of SVs in prokaryotic genomes and their impact. We also discuss their potential applicability in the emerging fields of synthetic biology and genome engineering where targeted SVs could serve to create sophisticated and accurate genome editing. © The Author 2014. Published by Oxford University Press. All rights reserved.

  7. Impact of Sample Type and DNA Isolation Procedure on Genomic Inference of Microbiome Composition

    PubMed Central

    Munk, Patrick; Lukjancenko, Oksana; Priemé, Anders; Aarestrup, Frank M.

    2016-01-01

    ABSTRACT Explorations of complex microbiomes using genomics greatly enhance our understanding about their diversity, biogeography, and function. The isolation of DNA from microbiome specimens is a key prerequisite for such examinations, but challenges remain in obtaining sufficient DNA quantities required for certain sequencing approaches, achieving accurate genomic inference of microbiome composition, and facilitating comparability of findings across specimen types and sequencing projects. These aspects are particularly relevant for the genomics-based global surveillance of infectious agents and antimicrobial resistance from different reservoirs. Here, we compare in a stepwise approach a total of eight commercially available DNA extraction kits and 16 procedures based on these for three specimen types (human feces, pig feces, and hospital sewage). We assess DNA extraction using spike-in controls and different types of beads for bead beating, facilitating cell lysis. We evaluate DNA concentration, purity, and stability and microbial community composition using 16S rRNA gene sequencing and for selected samples using shotgun metagenomic sequencing. Our results suggest that inferred community composition was dependent on inherent specimen properties as well as DNA extraction method. We further show that bead beating or enzymatic treatment can increase the extraction of DNA from Gram-positive bacteria. Final DNA quantities could be increased by isolating DNA from a larger volume of cell lysate than that in standard protocols. Based on this insight, we designed an improved DNA isolation procedure optimized for microbiome genomics that can be used for the three examined specimen types and potentially also for other biological specimens. A standard operating procedure is available from https://dx.doi.org/10.6084/m9.figshare.3475406. 
    IMPORTANCE Sequencing-based analyses of microbiomes may lead to a breakthrough in our understanding of the microbial worlds associated with

  8. Impact of Sample Type and DNA Isolation Procedure on Genomic Inference of Microbiome Composition.

    PubMed

    Knudsen, Berith E; Bergmark, Lasse; Munk, Patrick; Lukjancenko, Oksana; Priemé, Anders; Aarestrup, Frank M; Pamp, Sünje J

    2016-01-01

    Explorations of complex microbiomes using genomics greatly enhance our understanding about their diversity, biogeography, and function. The isolation of DNA from microbiome specimens is a key prerequisite for such examinations, but challenges remain in obtaining sufficient DNA quantities required for certain sequencing approaches, achieving accurate genomic inference of microbiome composition, and facilitating comparability of findings across specimen types and sequencing projects. These aspects are particularly relevant for the genomics-based global surveillance of infectious agents and antimicrobial resistance from different reservoirs. Here, we compare in a stepwise approach a total of eight commercially available DNA extraction kits and 16 procedures based on these for three specimen types (human feces, pig feces, and hospital sewage). We assess DNA extraction using spike-in controls and different types of beads for bead beating, facilitating cell lysis. We evaluate DNA concentration, purity, and stability and microbial community composition using 16S rRNA gene sequencing and for selected samples using shotgun metagenomic sequencing. Our results suggest that inferred community composition was dependent on inherent specimen properties as well as DNA extraction method. We further show that bead beating or enzymatic treatment can increase the extraction of DNA from Gram-positive bacteria. Final DNA quantities could be increased by isolating DNA from a larger volume of cell lysate than that in standard protocols. Based on this insight, we designed an improved DNA isolation procedure optimized for microbiome genomics that can be used for the three examined specimen types and potentially also for other biological specimens. A standard operating procedure is available from https://dx.doi.org/10.6084/m9.figshare.3475406. IMPORTANCE Sequencing-based analyses of microbiomes may lead to a breakthrough in our understanding of the microbial worlds associated with humans

  9. Comparative genome analyses of Arabidopsis spp.: Inferring chromosomal rearrangement events in the evolutionary history of A. thaliana

    PubMed Central

    Yogeeswaran, Krithika; Frary, Amy; York, Thomas L.; Amenta, Alison; Lesser, Andrew H.; Nasrallah, June B.; Tanksley, Steven D.; Nasrallah, Mikhail E.

    2005-01-01

    Comparative genome analysis is a powerful tool that can facilitate the reconstruction of the evolutionary history of the genomes of modern-day species. The model plant Arabidopsis thaliana with its n = 5 genome is thought to be derived from an ancestral n = 8 genome. Pairwise comparative genome analyses of A. thaliana with polyploid and diploid Brassicaceae species have suggested that rapid genome evolution, manifested by chromosomal rearrangements and duplications, characterizes the polyploid, but not the diploid, lineages of this family. In this study, we constructed a low-density genetic linkage map of Arabidopsis lyrata ssp. lyrata (A. l. lyrata; n = 8, diploid), the closest known relative of A. thaliana (MRCA ∼5 Mya), using A. thaliana-specific markers that resolve into the expected eight linkage groups. We then performed comparative Bayesian analyses using raw mapping data from this study and from a Capsella study to infer the number and nature of rearrangements that distinguish the n = 8 genomes of A. l. lyrata and Capsella from the n = 5 genome of A. thaliana. We conclude that there is strong statistical support in favor of the parsimony scenarios of 10 major chromosomal rearrangements separating these n = 8 genomes from A. thaliana. These chromosomal rearrangement events contribute to a rate of chromosomal evolution higher than previously reported in this lineage. We infer that at least seven of these events, common to both sets of data, are responsible for the change in karyotype and underlie genome reduction in A. thaliana. PMID:15805492

  10. The Phylogeny and Evolutionary Timescale of Muscoidea (Diptera: Brachycera: Calyptratae) Inferred from Mitochondrial Genomes

    PubMed Central

    Wang, Ning; Cameron, Stephen L.; Mao, Meng; Wang, Yuyu; Xi, Yuqiang; Yang, Ding

    2015-01-01

    Muscoidea is a significant dipteran clade that includes house flies (Family Muscidae), latrine flies (F. Fanniidae), dung flies (F. Scathophagidae) and root maggot flies (F. Anthomyiidae). It comprises approximately 7000 described species. The monophyly of the Muscoidea and the precise relationships of muscoids to the closest superfamily, the Oestroidea (blow flies, flesh flies, etc.), are both unresolved. Until now, mitochondrial (mt) genomes were available for only two of the four muscoid families, precluding a thorough test of phylogenetic relationships using this data source. Here we present the first two mt genomes for the families Fanniidae (Euryomma sp.) and Anthomyiidae (Delia platura (Meigen, 1826)). We also conducted phylogenetic analyses of these newly sequenced mt genomes plus 15 other species representative of dipteran diversity to address the internal relationships of Muscoidea and its systematic position. Both maximum-likelihood and Bayesian analyses suggested that Muscoidea was not a monophyletic group with the relationship: (Fanniidae + Muscidae) + ((Anthomyiidae + Scathophagidae) + (Calliphoridae + Sarcophagidae)), supported by the majority of analysed datasets. This also implies that Oestroidea was paraphyletic in the majority of analyses. Divergence time estimation suggested that the earliest split within the Calyptratae, separating (Tachinidae + Oestridae) from the remaining families, occurred in the Early Eocene. The main divergence within the paraphyletic Muscoidea grade was between Fanniidae + Muscidae and the lineage ((Anthomyiidae + Scathophagidae) + (Calliphoridae + Sarcophagidae)), which occurred in the Late Eocene. PMID:26225760

  11. The influence of genomic context on mutation patterns in the human genome inferred from rare variants

    PubMed Central

    Schaibley, Valerie M.; Zawistowski, Matthew; Wegmann, Daniel; Ehm, Margaret G.; Nelson, Matthew R.; St. Jean, Pamela L.; Abecasis, Gonçalo R.; Novembre, John; Zöllner, Sebastian; Li, Jun Z.

    2013-01-01

    Understanding patterns of spontaneous mutations is of fundamental interest in studies of human genome evolution and genetic disease. Here, we used extremely rare variants in humans to model the molecular spectrum of single-nucleotide mutations. Compared to common variants in humans and human–chimpanzee fixed differences (substitutions), rare variants, on average, arose more recently in the human lineage and are less affected by the potentially confounding effects of natural selection, population demographic history, and biased gene conversion. We analyzed variants obtained from a population-based sequencing study of 202 genes in >14,000 individuals. We observed considerable variability in the per-gene mutation rate, which was correlated with local GC content, but not recombination rate. Using >20,000 variants with a derived allele frequency ≤10⁻⁴, we examined the effect of local GC content and recombination rate on individual variant subtypes and performed comparisons with common variants and substitutions. The influence of local GC content on rare variants differed from that on common variants or substitutions, and the differences varied by variant subtype. Furthermore, recombination rate and recombination hotspots have little effect on rare variants of any subtype, yet both have a relatively strong impact on multiple variant subtypes in common variants and substitutions. This observation is consistent with the effect of biased gene conversion or selection-dependent processes. Our results highlight the distinct biases inherent in the initial mutation patterns and subsequent evolutionary processes that affect segregating variants. PMID:23990608
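Two quantities in this abstract are easy to make concrete: the derived allele frequency threshold used to define rare variants, and local GC content. The sketch below assumes diploid samples and uses a round cohort size of 14,000 only as an illustration (the abstract says ">14,000"). The function names are hypothetical and are not taken from the study.

```python
def derived_allele_frequency(derived_count: int, n_individuals: int, ploidy: int = 2) -> float:
    """Fraction of sampled chromosomes that carry the derived allele."""
    return derived_count / (ploidy * n_individuals)


def is_rare(derived_count: int, n_individuals: int, threshold: float = 1e-4) -> bool:
    """True if the variant's derived allele frequency is at or below the threshold."""
    return derived_allele_frequency(derived_count, n_individuals) <= threshold


def gc_content(sequence: str) -> float:
    """Fraction of G and C among the called bases (A, C, G, T) of a sequence window."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        raise ValueError("sequence contains no called bases")
    return sum(base in "GC" for base in called) / len(called)
```

With 14,000 diploid individuals there are 28,000 chromosomes, so under these assumptions only variants seen once or twice fall at or below 10⁻⁴, while a variant seen three times (about 1.07 × 10⁻⁴) does not.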

  12. The influence of genomic context on mutation patterns in the human genome inferred from rare variants.

    PubMed

    Schaibley, Valerie M; Zawistowski, Matthew; Wegmann, Daniel; Ehm, Margaret G; Nelson, Matthew R; St Jean, Pamela L; Abecasis, Gonçalo R; Novembre, John; Zöllner, Sebastian; Li, Jun Z

    2013-12-01

    Understanding patterns of spontaneous mutations is of fundamental interest in studies of human genome evolution and genetic disease. Here, we used extremely rare variants in humans to model the molecular spectrum of single-nucleotide mutations. Compared to common variants in humans and human-chimpanzee fixed differences (substitutions), rare variants, on average, arose more recently in the human lineage and are less affected by the potentially confounding effects of natural selection, population demographic history, and biased gene conversion. We analyzed variants obtained from a population-based sequencing study of 202 genes in >14,000 individuals. We observed considerable variability in the per-gene mutation rate, which was correlated with local GC content, but not recombination rate. Using >20,000 variants with a derived allele frequency ≤ 10⁻⁴, we examined the effect of local GC content and recombination rate on individual variant subtypes and performed comparisons with common variants and substitutions. The influence of local GC content on rare variants differed from that on common variants or substitutions, and the differences varied by variant subtype. Furthermore, recombination rate and recombination hotspots have little effect on rare variants of any subtype, yet both have a relatively strong impact on multiple variant subtypes in common variants and substitutions. This observation is consistent with the effect of biased gene conversion or selection-dependent processes. Our results highlight the distinct biases inherent in the initial mutation patterns and subsequent evolutionary processes that affect segregating variants.

  13. Structural genomics of pathogenic protozoa: an overview.

    PubMed

    Fan, Erkang; Baker, David; Fields, Stanley; Gelb, Michael H; Buckner, Frederick S; Van Voorhis, Wesley C; Phizicky, Eric; Dumont, Mark; Mehlin, Christopher; Grayhack, Elizabeth; Sullivan, Mark; Verlinde, Christophe; Detitta, George; Meldrum, Deirdre R; Merritt, Ethan A; Earnest, Thomas; Soltis, Michael; Zucker, Frank; Myler, Peter J; Schoenfeld, Lori; Kim, David; Worthey, Liz; Lacount, Doug; Vignali, Marissa; Li, Jizhen; Mondal, Somnath; Massey, Archna; Carroll, Brian; Gulde, Stacey; Luft, Joseph; Desoto, Larry; Holl, Mark; Caruthers, Jonathan; Bosch, Jürgen; Robien, Mark; Arakaki, Tracy; Holmes, Margaret; Le Trong, Isolde; Hol, Wim G J

    2008-01-01

    The Structural Genomics of Pathogenic Protozoa (SGPP) Consortium aimed to determine crystal structures of proteins from trypanosomatid and malaria parasites in a high throughput manner. The pipeline of target selection, protein production, crystallization, and structure determination is sketched. Special emphasis is given to a number of technology developments including domain prediction, the use of "co-crystallants," and capillary crystallization. "Fragment cocktail crystallography" for medical structural genomics is also described.

  14. Chiropteran types I and II interferon genes inferred from genome sequencing traces by a statistical gene-family assembler

    PubMed Central

    2010-01-01

    Background: The rate of emergence of human pathogens is steadily increasing; most of these novel agents originate in wildlife. Bats, remarkably, are the natural reservoirs of many of the most pathogenic viruses in humans. There are two bat genome projects currently underway, a circumstance that promises to speed the discovery of host factors important in the coevolution of bats with their viruses. These genomes, however, are not yet assembled and one of them will provide only low coverage, making the inference of most genes of immunological interest error-prone. Many more wildlife genome projects are underway and intend to provide only shallow coverage. Results: We have developed a statistical method for the assembly of gene families from partial genomes. The method takes full advantage of the quality scores generated by base-calling software, incorporating them into a complete probabilistic error model, to overcome the limitation inherent in the inference of gene family members from partial sequence information. We validated the method by inferring the human IFNA genes from the genome trace archives, and used it to infer 61 type-I interferon genes, and single type-II interferon genes in the bats Pteropus vampyrus and Myotis lucifugus. We confirmed our inferences by direct cloning and sequencing of IFNA, IFNB, IFND, and IFNK in P. vampyrus, and by demonstrating transcription of some of the inferred genes by known interferon-inducing stimuli. Conclusion: The statistical trace assembler described here provides a reliable method for extracting information from the many available and forthcoming partial or shallow genome sequencing projects, thereby facilitating the study of a wider variety of organisms with ecological and biomedical significance to humans than would otherwise be possible. PMID:20663124

  15. Chiropteran types I and II interferon genes inferred from genome sequencing traces by a statistical gene-family assembler.

    PubMed

    Kepler, Thomas B; Sample, Christopher; Hudak, Kathryn; Roach, Jeffrey; Haines, Albert; Walsh, Allyson; Ramsburg, Elizabeth A

    2010-07-21

    The rate of emergence of human pathogens is steadily increasing; most of these novel agents originate in wildlife. Bats, remarkably, are the natural reservoirs of many of the most pathogenic viruses in humans. There are two bat genome projects currently underway, a circumstance that promises to speed the discovery of host factors important in the coevolution of bats with their viruses. These genomes, however, are not yet assembled and one of them will provide only low coverage, making the inference of most genes of immunological interest error-prone. Many more wildlife genome projects are underway and intend to provide only shallow coverage. We have developed a statistical method for the assembly of gene families from partial genomes. The method takes full advantage of the quality scores generated by base-calling software, incorporating them into a complete probabilistic error model, to overcome the limitation inherent in the inference of gene family members from partial sequence information. We validated the method by inferring the human IFNA genes from the genome trace archives, and used it to infer 61 type-I interferon genes, and single type-II interferon genes in the bats Pteropus vampyrus and Myotis lucifugus. We confirmed our inferences by direct cloning and sequencing of IFNA, IFNB, IFND, and IFNK in P. vampyrus, and by demonstrating transcription of some of the inferred genes by known interferon-inducing stimuli. The statistical trace assembler described here provides a reliable method for extracting information from the many available and forthcoming partial or shallow genome sequencing projects, thereby facilitating the study of a wider variety of organisms with ecological and biomedical significance to humans than would otherwise be possible.
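To illustrate how base-calling quality scores can feed a probabilistic error model, here is a minimal sketch. It uses the standard Phred relation P(error) = 10^(-Q/10) and assumes, purely for illustration, that an erroneous call is equally likely to be any of the three other bases. The function names are hypothetical, and this is not the authors' full error model, which the abstract does not spell out.

```python
import math


def phred_to_error(quality: float) -> float:
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10.0 ** (-quality / 10.0)


def read_log_likelihood(read: str, qualities: list, candidate: str) -> float:
    """Log-likelihood of an ungapped read given a candidate true sequence.

    Assumes independent per-base errors and a uniform substitution model:
    a matching base has probability 1 - e, a mismatching base e / 3.
    """
    if not (len(read) == len(qualities) == len(candidate)):
        raise ValueError("read, qualities and candidate must have equal length")
    total = 0.0
    for base, quality, true_base in zip(read, qualities, candidate):
        if quality <= 0:
            raise ValueError("quality scores must be positive")
        error = phred_to_error(quality)
        total += math.log(1.0 - error) if base == true_base else math.log(error / 3.0)
    return total
```

Under this model a mismatch at a low-quality position costs far less than one at a high-quality position. Weighting disagreements this way is what would let shallow-coverage reads be assigned among similar gene family members.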

  16. A Novel and Fast Approach for Population Structure Inference Using Kernel-PCA and Optimization

    PubMed Central

    Popescu, Andrei-Alin; Harper, Andrea L.; Trick, Martin; Bancroft, Ian; Huber, Katharina T.

    2014-01-01

    Population structure is a confounding factor in genome-wide association studies, increasing the rate of false positive associations. To correct for it, several model-based algorithms such as ADMIXTURE and STRUCTURE have been proposed. These tend to suffer from the fact that they have a considerable computational burden, limiting their applicability when used with large datasets, such as those produced by next generation sequencing techniques. To address this, nonmodel-based approaches such as sparse nonnegative matrix factorization (sNMF) and EIGENSTRAT have been proposed, which scale better with larger data. Here we present a novel nonmodel-based approach, population structure inference using kernel-PCA and optimization (PSIKO), which is based on a unique combination of linear kernel-PCA and least-squares optimization and allows for the inference of admixture coefficients, principal components, and number of founder populations of a dataset. PSIKO has been compared against existing leading methods on a variety of simulation scenarios, as well as on real biological data. We found that in addition to producing results of the same quality as other tested methods, PSIKO scales extremely well with dataset size, being considerably (up to 30 times) faster for longer sequences than even state-of-the-art methods such as sNMF. PSIKO and the accompanying manual are freely available at https://www.uea.ac.uk/computing/psiko. PMID:25326237

  17. Integration of Multiple Genomic and Phenotype Data to Infer Novel miRNA-Disease Associations

    PubMed Central

    Zhou, Meng; Cheng, Liang; Yang, Haixiu; Wang, Jing; Sun, Jie; Wang, Zhenzhen

    2016-01-01

    MicroRNAs (miRNAs) play an important role in the development and progression of human diseases. The identification of disease-associated miRNAs will be helpful for understanding the molecular mechanisms of diseases at the post-transcriptional level. Based on different types of genomic data sources, computational methods for miRNA-disease association prediction have been proposed. However, individual sources of genomic data tend to be incomplete and noisy; therefore, the integration of various types of genomic data for inferring reliable miRNA-disease associations is urgently needed. In this study, we present a computational framework, CHNmiRD, for identifying miRNA-disease associations by integrating multiple genomic and phenotype data, including protein-protein interaction data, gene ontology data, experimentally verified miRNA-target relationships, disease phenotype information and known miRNA-disease connections. The performance of CHNmiRD was evaluated by experimentally verified miRNA-disease associations, which achieved an area under the ROC curve (AUC) of 0.834 for 5-fold cross-validation. In particular, CHNmiRD displayed excellent performance for diseases without any known related miRNAs. The results of case studies for three human diseases (glioblastoma, myocardial infarction and type 1 diabetes) showed that all of the top 10 ranked miRNAs having no known associations with these three diseases in existing miRNA-disease databases were directly or indirectly confirmed by our latest literature mining. All these results demonstrated the reliability and efficiency of CHNmiRD, and it is anticipated that CHNmiRD will serve as a powerful bioinformatics method for mining novel disease-related miRNAs and providing a new perspective into molecular mechanisms underlying human diseases at the post-transcriptional level. CHNmiRD is freely available at http://www.bio-bigdata.com/CHNmiRD. PMID:26849207
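The evaluation metric quoted in this abstract, the area under the ROC curve under 5-fold cross-validation, can be sketched in a few lines. The code below computes AUC as the probability that a randomly chosen positive example outscores a randomly chosen negative one (ties count one half), together with a simple fold assignment. Both function names are hypothetical, and the way CHNmiRD itself partitions and scores associations is not described in the abstract.

```python
def roc_auc(scores: list, labels: list) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def k_folds(n_items: int, k: int = 5) -> list:
    """Assign item indices 0..n_items-1 to k folds in round-robin order."""
    return [list(range(fold, n_items, k)) for fold in range(k)]
```

In a cross-validation run, each fold in turn would be held out, the model fitted on the remaining folds, and `roc_auc` applied to the held-out scores. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking.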

  18. Integration of Multiple Genomic and Phenotype Data to Infer Novel miRNA-Disease Associations.

    PubMed

    Shi, Hongbo; Zhang, Guangde; Zhou, Meng; Cheng, Liang; Yang, Haixiu; Wang, Jing; Sun, Jie; Wang, Zhenzhen

    2016-01-01

    MicroRNAs (miRNAs) play an important role in the development and progression of human diseases. The identification of disease-associated miRNAs will be helpful for understanding the molecular mechanisms of diseases at the post-transcriptional level. Based on different types of genomic data sources, computational methods for miRNA-disease association prediction have been proposed. However, individual sources of genomic data tend to be incomplete and noisy; therefore, the integration of various types of genomic data for inferring reliable miRNA-disease associations is urgently needed. In this study, we present a computational framework, CHNmiRD, for identifying miRNA-disease associations by integrating multiple genomic and phenotype data, including protein-protein interaction data, gene ontology data, experimentally verified miRNA-target relationships, disease phenotype information and known miRNA-disease connections. The performance of CHNmiRD was evaluated by experimentally verified miRNA-disease associations, which achieved an area under the ROC curve (AUC) of 0.834 for 5-fold cross-validation. In particular, CHNmiRD displayed excellent performance for diseases without any known related miRNAs. The results of case studies for three human diseases (glioblastoma, myocardial infarction and type 1 diabetes) showed that all of the top 10 ranked miRNAs having no known associations with these three diseases in existing miRNA-disease databases were directly or indirectly confirmed by our latest literature mining. All these results demonstrated the reliability and efficiency of CHNmiRD, and it is anticipated that CHNmiRD will serve as a powerful bioinformatics method for mining novel disease-related miRNAs and providing a new perspective into molecular mechanisms underlying human diseases at the post-transcriptional level. CHNmiRD is freely available at http://www.bio-bigdata.com/CHNmiRD.

  19. Adaptation, Ecology, and Evolution of the Halophilic Stromatolite Archaeon Halococcus hamelinensis Inferred through Genome Analyses

    PubMed Central

    Gudhka, Reema K.; Neilan, Brett A.; Burns, Brendan P.

    2015-01-01

    Halococcus hamelinensis was the first archaeon isolated from stromatolites. These geomicrobial ecosystems are thought to be some of the earliest known on Earth, yet, despite their evolutionary significance, the role of Archaea in these systems is still not well understood. Detailed here is the genome sequencing and analysis of an archaeon isolated from stromatolites. The genome of H. hamelinensis consisted of 3,133,046 base pairs with an average G+C content of 60.08% and contained 3,150 predicted coding sequences or ORFs, 2,196 (68.67%) of which were protein-coding genes with functional assignments and 954 (29.83%) of which were of unknown function. Codon usage of the H. hamelinensis genome was consistent with a highly acidic proteome, a major adaptive mechanism towards high salinity. Amino acid transport and metabolism, inorganic ion transport and metabolism, energy production and conversion, ribosomal structure, and unknown function COG genes were overrepresented. The genome of H. hamelinensis also revealed characteristics reflecting its survival in its extreme environment, including putative genes/pathways involved in osmoprotection, oxidative stress response, and UV damage repair. Finally, genome analyses indicated the presence of putative transposases as well as positive matches of genes of H. hamelinensis against various genomes of Bacteria, Archaea, and viruses, suggesting the potential for horizontal gene transfer. PMID:25709556

  20. EFICAz: a comprehensive approach for accurate genome-scale enzyme function inference

    PubMed Central

    Tian, Weidong; Arakaki, Adrian K.; Skolnick, Jeffrey

    2004-01-01

    EFICAz (Enzyme Function Inference by Combined Approach) is an automatic engine for large-scale enzyme function inference that combines predictions from four different methods developed and optimized to achieve high prediction accuracy: (i) recognition of functionally discriminating residues (FDRs) in enzyme families obtained by a Conservation-controlled HMM Iterative procedure for Enzyme Family classification (CHIEFc), (ii) pairwise sequence comparison using a family specific Sequence Identity Threshold, (iii) recognition of FDRs in Multiple Pfam enzyme families, and (iv) recognition of multiple Prosite patterns of high specificity. For FDR (i.e. conserved positions in an enzyme family that discriminate between true and false members of the family) identification, we have developed an Evolutionary Footprinting method that uses evolutionary information from homofunctional and heterofunctional multiple sequence alignments associated with an enzyme family. The FDRs show a significant correlation with annotated active site residues. In a jackknife test, EFICAz shows high accuracy (92%) and sensitivity (82%) for predicting four EC digits in testing sequences that are <40% identical to any member of the corresponding training set. Applied to the Escherichia coli genome, EFICAz assigns more detailed enzymatic function than KEGG, and generates numerous novel predictions. PMID:15576349

  1. EFICAz: a comprehensive approach for accurate genome-scale enzyme function inference.

    PubMed

    Tian, Weidong; Arakaki, Adrian K; Skolnick, Jeffrey

    2004-01-01

    EFICAz (Enzyme Function Inference by Combined Approach) is an automatic engine for large-scale enzyme function inference that combines predictions from four different methods developed and optimized to achieve high prediction accuracy: (i) recognition of functionally discriminating residues (FDRs) in enzyme families obtained by a Conservation-controlled HMM Iterative procedure for Enzyme Family classification (CHIEFc), (ii) pairwise sequence comparison using a family specific Sequence Identity Threshold, (iii) recognition of FDRs in Multiple Pfam enzyme families, and (iv) recognition of multiple Prosite patterns of high specificity. For FDR (i.e. conserved positions in an enzyme family that discriminate between true and false members of the family) identification, we have developed an Evolutionary Footprinting method that uses evolutionary information from homofunctional and heterofunctional multiple sequence alignments associated with an enzyme family. The FDRs show a significant correlation with annotated active site residues. In a jackknife test, EFICAz shows high accuracy (92%) and sensitivity (82%) for predicting four EC digits in testing sequences that are <40% identical to any member of the corresponding training set. Applied to the Escherichia coli genome, EFICAz assigns more detailed enzymatic function than KEGG, and generates numerous novel predictions.

  2. Paleolithic Contingent in Modern Japanese: Estimation and Inference using Genome-wide Data

    PubMed Central

    He, Yungang; Wang, Wei R.; Xu, Shuhua; Jin, Li; SNP Consortium, Pan-Asia

    2012-01-01

    The genetic origins of Japanese populations have been controversial. Upper Paleolithic Japanese, i.e. Jomon, developed independently in the Japanese islands for more than 10,000 years until the isolation was ended by influxes of continental immigrants about 2,000 years ago. However, knowledge of the origin of Jomon and its contribution to the genetic pool of contemporary Japanese is still limited, despite extensive studies using mtDNA and Y chromosomes. In this report, we aimed to infer the origin of Jomon and to estimate its contribution to Japanese by fitting an admixture model with missing data from Jomon to genome-wide data from 94 worldwide populations. Our results showed that the genetic contributions of Jomon, the Paleolithic contingent in Japanese, are 54.3∼62.3% in Ryukyuans and 23.1∼39.5% in mainland Japanese, respectively. Utilizing inferred allele frequencies of the Jomon population, we further showed the Paleolithic contingent in Japanese had a Northeast Asia origin. PMID:22482036

  3. ABC inference of multi-population divergence with admixture from unphased population genomic data.

    PubMed

    Robinson, John D; Bunnefeld, Lynsey; Hearn, Jack; Stone, Graham N; Hickerson, Michael J

    2014-09-01

    Rapidly developing sequencing technologies and declining costs have made it possible to collect genome-scale data from population-level samples in nonmodel systems. Inferential tools for historical demography given these data sets are, at present, underdeveloped. In particular, approximate Bayesian computation (ABC) has yet to be widely embraced by researchers generating these data. Here, we demonstrate the promise of ABC for analysis of the large data sets that are now attainable from nonmodel taxa through current genomic sequencing technologies. We develop and test an ABC framework for model selection and parameter estimation, given histories of three-population divergence with admixture. We then explore different sampling regimes to illustrate how sampling more loci, longer loci or more individuals affects the quality of model selection and parameter estimation in this ABC framework. Our results show that inferences improved substantially with increases in the number and/or length of sequenced loci, while less benefit was gained by sampling large numbers of individuals. Optimal sampling strategies given our inferential models included at least 2000 loci, each approximately 2 kb in length, sampled from five diploid individuals per population, although specific strategies are model and question dependent. We tested our ABC approach through simulation-based cross-validations and illustrate its application using previously analysed data from the oak gall wasp, Biorhiza pallida. © 2014 The Authors. Molecular Ecology published by John Wiley & Sons Ltd.
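
    The core idea of ABC (draw parameters from a prior, simulate data, keep the draws whose simulated summary statistics fall close to the observed ones) can be shown with a toy rejection sampler. This sketch is not the authors' framework: the coin-flip "simulator", the uniform prior, the tolerance, and all names and numbers are assumptions chosen so the example stays self-contained. A demographic application would replace the simulator with a coalescent simulation and the success count with population-genetic summary statistics.

```python
import random


def abc_rejection(observed_successes, n_trials, n_draws, tolerance, seed=0):
    """Approximate the posterior of a Bernoulli success probability by ABC rejection.

    Toy stand-in for a demographic model: the "simulator" is n_trials coin flips
    and the summary statistic is the number of successes.
    """
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        p = rng.random()  # candidate parameter drawn from a Uniform(0, 1) prior
        simulated = sum(1 for _ in range(n_trials) if rng.random() < p)
        if abs(simulated - observed_successes) <= tolerance:
            accepted.append(p)  # keep parameters whose simulated data resemble the observation
    return accepted


# Hypothetical observation: 15 successes in 50 trials.
posterior = abc_rejection(observed_successes=15, n_trials=50, n_draws=20000, tolerance=1)
posterior_mean = sum(posterior) / len(posterior)
```

    The accepted draws approximate the posterior distribution; tightening the tolerance improves the approximation at the cost of accepting fewer draws.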

  4. ABC inference of multi-population divergence with admixture from unphased population genomic data

    PubMed Central

    Robinson, John D; Bunnefeld, Lynsey; Hearn, Jack; Stone, Graham N; Hickerson, Michael J

    2014-01-01

    Rapidly developing sequencing technologies and declining costs have made it possible to collect genome-scale data from population-level samples in nonmodel systems. Inferential tools for historical demography given these data sets are, at present, underdeveloped. In particular, approximate Bayesian computation (ABC) has yet to be widely embraced by researchers generating these data. Here, we demonstrate the promise of ABC for analysis of the large data sets that are now attainable from nonmodel taxa through current genomic sequencing technologies. We develop and test an ABC framework for model selection and parameter estimation, given histories of three-population divergence with admixture. We then explore different sampling regimes to illustrate how sampling more loci, longer loci or more individuals affects the quality of model selection and parameter estimation in this ABC framework. Our results show that inferences improved substantially with increases in the number and/or length of sequenced loci, while less benefit was gained by sampling large numbers of individuals. Optimal sampling strategies given our inferential models included at least 2000 loci, each approximately 2 kb in length, sampled from five diploid individuals per population, although specific strategies are model and question dependent. We tested our ABC approach through simulation-based cross-validations and illustrate its application using previously analysed data from the oak gall wasp, Biorhiza pallida. PMID:25113024

  5. Inferring Quantitative Trait Pathways Associated with Bull Fertility from a Genome-Wide Association Study

    PubMed Central

    Peñagaricano, Francisco; Weigel, Kent A.; Rosa, Guilherme J. M.; Khatib, Hasan

    2013-01-01

    Whole-genome association studies typically focus on genetic markers with the strongest evidence of association. However, single markers often explain only a small component of the genetic variance and hence offer a limited understanding of the trait under study. As such, the objective of this study was to perform a pathway-based association analysis in Holstein dairy cattle in order to identify relevant pathways involved in bull fertility. The results of a single-marker association analysis, using 1,755 bulls with sire conception rate data and genotypes for 38,650 single nucleotide polymorphisms (SNPs), were used in this study. A total of 16,819 annotated genes, including 2,767 significantly associated with bull fertility, were used to interrogate a total of 662 Gene Ontology (GO) terms and 248 InterPro (IP) entries using a test of proportions based on the cumulative hypergeometric distribution. After multiple-testing correction, 20 GO categories and one IP entry showed significant overrepresentation of genes statistically associated with bull fertility. Several of these functional categories such as small GTPases mediated signal transduction, neurogenesis, calcium ion binding, and cytoskeleton are known to be involved in biological processes closely related to male fertility. These results could provide insight into the genetic architecture of this complex trait in dairy cattle. In addition, this study shows that quantitative trait pathways inferred from single-marker analyses could enhance our interpretations of the results of genome-wide association studies. PMID:23335935
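
    The test of proportions based on the cumulative hypergeometric distribution can be sketched as follows. The gene totals (16,819 annotated genes, 2,767 significant) come from the abstract; the functional term with 40 genes, 15 of them significant, is a hypothetical example, and the function name is ours. The sketch computes the probability of seeing at least the observed number of significant genes in a term if genes were assigned to it at random, and it does not include the multiple-testing correction.

```python
from math import comb


def hypergeom_upper_tail(total_genes, significant_genes, term_genes, term_significant):
    """P(X >= term_significant) when term_genes are drawn without replacement
    from total_genes, of which significant_genes are significant."""
    denominator = comb(total_genes, term_genes)
    upper = min(term_genes, significant_genes)
    numerator = sum(
        comb(significant_genes, k) * comb(total_genes - significant_genes, term_genes - k)
        for k in range(term_significant, upper + 1)
    )
    return numerator / denominator


# Totals from the abstract; the term size and overlap are hypothetical.
p_value = hypergeom_upper_tail(16819, 2767, term_genes=40, term_significant=15)
```

    With these numbers about 6.6 significant genes would be expected in the term by chance (40 × 2,767 / 16,819), so an overlap of 15 yields a small p-value.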

  6. Orthology Inference in Nonmodel Organisms Using Transcriptomes and Low-Coverage Genomes: Improving Accuracy and Matrix Occupancy for Phylogenomics

    PubMed Central

    Yang, Ya; Smith, Stephen A.

    2014-01-01

    Orthology inference is central to phylogenomic analyses. Phylogenomic data sets commonly include transcriptomes and low-coverage genomes that are incomplete and contain errors and isoforms. These properties can severely violate the underlying assumptions of orthology inference with existing heuristics. We present a procedure that uses phylogenies for both homology and orthology assignment. The procedure first uses similarity scores to infer putative homologs that are then aligned, constructed into phylogenies, and pruned of spurious branches caused by deep paralogs, misassembly, frameshifts, or recombination. These final homologs are then used to identify orthologs. We explore four alternative tree-based orthology inference approaches, of which two are new. These accommodate gene and genome duplications as well as gene tree discordance. We demonstrate these methods in three published data sets including the grape family, Hymenoptera, and millipedes with divergence times ranging from approximately 100 to over 400 Ma. The procedure significantly increased the completeness and accuracy of the inferred homologs and orthologs. We also found that data sets that are more recently diverged and/or include more high-coverage genomes had more complete sets of orthologs. To explicitly evaluate sources of conflicting phylogenetic signals, we applied serial jackknife analyses of gene regions keeping each locus intact. The methods described here can scale to over 100 taxa. They have been implemented in python with independent scripts for each step, making it easy to modify or incorporate them into existing pipelines. All scripts are available from https://bitbucket.org/yangya/phylogenomic_dataset_construction. PMID:25158799

  7. Chapter 6: Structural variation and medical genomics.

    PubMed

    Raphael, Benjamin J

    2012-01-01

    Differences between individual human genomes, or between human and cancer genomes, range in scale from single nucleotide variants (SNVs) through intermediate and large-scale duplications, deletions, and rearrangements of genomic segments. The latter class, called structural variants (SVs), has received considerable attention in the past several years as it is a previously underappreciated source of variation in human genomes. Much of this recent attention is the result of the availability of higher-resolution technologies for measuring these variants, including both microarray-based techniques and, more recently, high-throughput DNA sequencing. We describe the genomic technologies and computational techniques currently used to measure SVs, focusing on applications in human and cancer genomics.

  8. Ceres' internal structure as inferred from its large craters

    NASA Astrophysics Data System (ADS)

    Marchi, Simone; Raymond, Carol; Fu, Roger; Ermakov, Anton I.; O'Brien, David P.; De Sanctis, Cristina; Ammannito, Eleonora; Russell, Christopher T.

    2016-10-01

    The Dawn spacecraft has gathered important data about the surface composition, internal structure, and geomorphology of Ceres, revealing a cratered landscape. Digital terrain models and global mosaics have been used to derive a global catalog of impact craters larger than 10 km in diameter. A surface dichotomy appears evident: a large fraction of the northern hemisphere is heavily cratered as the result of several billion years of collisions, while portions of the equatorial region and southern hemisphere are much less cratered. The latter are associated with the presence of the two largest (~270-280 km) impact craters, Kerwan and Yalode. The global crater count shows a severe depletion for diameters larger than 100-150 km with respect to collisional models and other large asteroids, like Vesta. This is a strong indication that a significant population of large cerean craters has been obliterated over geological time-scales. This observation is supported by the overall topographic power spectrum of Ceres, which shows that long wavelengths in topography are suppressed (that is, flatter surface) compared to short wavelengths. Viscous relaxation of topography may be a natural culprit for the observed paucity of large craters. Relaxation accommodated by the creep of water ice is expected to result in much more rapid and complete decay of topography than inferred. In contrast, we favor a strong crust composed of a mixture of silicates and salt species (<30% vol water ice) with viscosity decreasing by two to three orders of magnitude in the top 45-70 km of Ceres' crust. This model can account for the observed topography power spectrum and explain the lack of craters in the size range ~100-600 km. Interestingly, Ceres' surface exhibits an 800-km-wide, 4-km-deep depression, known as Vendimia Planitia. The overall topography of Vendimia Planitia is compatible with a partially relaxed mega impact structure. 
The presence of such a large scale depression bears implications for

  9. PICARA, an analytical pipeline providing probabilistic inference about a priori candidates genes underlying genome-wide association QTL in plants

    USDA-ARS's Scientific Manuscript database

    PICARA is an analytical pipeline designed to systematically summarize observed SNP/trait associations identified by genome wide association studies (GWAS) and to identify candidate genes involved in the regulation of complex trait variation. The pipeline provides probabilistic inference about a prio...

  10. Genomic analysis of circulating cell-free DNA infers breast cancer dormancy

    PubMed Central

    Shaw, Jacqueline A.; Page, Karen; Blighe, Kevin; Hava, Natasha; Guttery, David; Ward, Becky; Brown, James; Ruangpratheep, Chetana; Stebbing, Justin; Payne, Rachel; Palmieri, Carlo; Cleator, Suzy; Walker, Rosemary A.; Coombes, R. Charles

    2012-01-01

    Biomarkers in breast cancer to monitor minimal residual disease have remained elusive. We hypothesized that genomic analysis of circulating free DNA (cfDNA) isolated from plasma may form the basis for a means of detecting and monitoring breast cancer. We profiled 251 genomes using Affymetrix SNP 6.0 arrays to determine copy number variations (CNVs) and loss of heterozygosity (LOH), comparing 138 cfDNA samples with matched primary tumor and normal leukocyte DNA in 65 breast cancer patients and eight healthy female controls. Concordance of SNP genotype calls in paired cfDNA and leukocyte DNA samples distinguished between breast cancer patients and healthy female controls (P < 0.0001) and between preoperative patients and patients on follow-up who had surgery and treatment (P = 0.0016). Principal component analyses of cfDNA SNP/copy number results also separated presurgical breast cancer patients from the healthy controls, suggesting specific CNVs in cfDNA have clinical significance. We identified focal high-level DNA amplification in paired tumor and cfDNA clustered in a number of chromosome arms, some of which harbor genes with oncogenic potential, including USP17L2 (DUB3), BRF1, MTA1, and JAG2. Remarkably, in 50 patients on follow-up, specific CNVs were detected in cfDNA, mirroring the primary tumor, up to 12 yr after diagnosis despite no other evidence of disease. These data demonstrate the potential of SNP/CNV analysis of cfDNA to distinguish between patients with breast cancer and healthy controls during routine follow-up. The genomic profiles of cfDNA infer dormancy/minimal residual disease in the majority of patients on follow-up. PMID:21990379

  11. Genomic analysis of circulating cell-free DNA infers breast cancer dormancy.

    PubMed

    Shaw, Jacqueline A; Page, Karen; Blighe, Kevin; Hava, Natasha; Guttery, David; Ward, Becky; Brown, James; Ruangpratheep, Chetana; Stebbing, Justin; Payne, Rachel; Palmieri, Carlo; Cleator, Suzy; Walker, Rosemary A; Coombes, R Charles

    2012-02-01

    Biomarkers in breast cancer to monitor minimal residual disease have remained elusive. We hypothesized that genomic analysis of circulating free DNA (cfDNA) isolated from plasma may form the basis for a means of detecting and monitoring breast cancer. We profiled 251 genomes using Affymetrix SNP 6.0 arrays to determine copy number variations (CNVs) and loss of heterozygosity (LOH), comparing 138 cfDNA samples with matched primary tumor and normal leukocyte DNA in 65 breast cancer patients and eight healthy female controls. Concordance of SNP genotype calls in paired cfDNA and leukocyte DNA samples distinguished between breast cancer patients and healthy female controls (P < 0.0001) and between preoperative patients and patients on follow-up who had surgery and treatment (P = 0.0016). Principal component analyses of cfDNA SNP/copy number results also separated presurgical breast cancer patients from the healthy controls, suggesting specific CNVs in cfDNA have clinical significance. We identified focal high-level DNA amplification in paired tumor and cfDNA clustered in a number of chromosome arms, some of which harbor genes with oncogenic potential, including USP17L2 (DUB3), BRF1, MTA1, and JAG2. Remarkably, in 50 patients on follow-up, specific CNVs were detected in cfDNA, mirroring the primary tumor, up to 12 yr after diagnosis despite no other evidence of disease. These data demonstrate the potential of SNP/CNV analysis of cfDNA to distinguish between patients with breast cancer and healthy controls during routine follow-up. The genomic profiles of cfDNA infer dormancy/minimal residual disease in the majority of patients on follow-up.

  12. The evolutionary history of termites as inferred from 66 mitochondrial genomes.

    PubMed

    Bourguignon, Thomas; Lo, Nathan; Cameron, Stephen L; Šobotník, Jan; Hayashi, Yoshinobu; Shigenobu, Shuji; Watanabe, Dai; Roisin, Yves; Miura, Toru; Evans, Theodore A

    2015-02-01

    Termites have colonized many habitats and are among the most abundant animals in tropical ecosystems, which they modify considerably through their actions. The timing of their rise in abundance and of the dispersal events that gave rise to modern termite lineages is not well understood. To shed light on termite origins and diversification, we sequenced the mitochondrial genome of 48 termite species and combined them with 18 previously sequenced termite mitochondrial genomes for phylogenetic and molecular clock analyses using multiple fossil calibrations. The 66 genomes represent most major clades of termites. Unlike previous phylogenetic studies based on fewer molecular data, our phylogenetic tree is fully resolved for the lower termites. The phylogenetic positions of Macrotermitinae and Apicotermitinae are also resolved as the basal groups in the higher termites, but in the crown termitid groups, including Termitinae + Syntermitinae + Nasutitermitinae + Cubitermitinae, the position of some nodes remains uncertain. Our molecular clock tree indicates that the lineages leading to termites and Cryptocercus roaches diverged 170 Ma (153-196 Ma 95% confidence interval [CI]), that modern Termitidae arose 54 Ma (46-66 Ma 95% CI), and that the crown termitid group arose 40 Ma (35-49 Ma 95% CI). This indicates that the distribution of basal termite clades was influenced by the final stages of the breakup of Pangaea. Our inference of ancestral geographic ranges shows that the Termitidae, which includes more than 75% of extant termite species, most likely originated in Africa or Asia, and acquired their pantropical distribution after a series of dispersal and subsequent diversification events.

  13. Structure-based function inference using protein family-specific fingerprints

    PubMed Central

    Bandyopadhyay, Deepak; Huan, Jun; Liu, Jinze; Prins, Jan; Snoeyink, Jack; Wang, Wei; Tropsha, Alexander

    2006-01-01

    We describe a method to assign a protein structure to a functional family using family-specific fingerprints. Fingerprints represent amino acid packing patterns that occur in most members of a family but are rare in the background, a nonredundant subset of PDB; their information is additional to sequence alignments, sequence patterns, structural superposition, and active-site templates. Fingerprints were derived for 120 families in SCOP using Frequent Subgraph Mining. For a new structure, all occurrences of these family-specific fingerprints may be found by a fast algorithm for subgraph isomorphism; the structure can then be assigned to a family with a confidence value derived from the number of fingerprints found and their distribution in background proteins. In validation experiments, we infer the function of new members added to SCOP families and we discriminate between structurally similar, but functionally divergent TIM barrel families. We then apply our method to predict function for several structural genomics proteins, including orphan structures. Some predictions have been corroborated by other computational methods and some validated by subsequent functional characterization. PMID:16731985

  14. Genome Structure Gallery from the Mycobacterium Tuberculosis Structural Genomics Consortium

    DOE Data Explorer

    The TB Structural Genomics Consortium works with the structures of proteins from M. tuberculosis, analyzing these structures in the context of functional information that currently exists and that the Consortium generates. The database of linked structural and functional information constructed from this project will form a lasting basis for understanding M. tuberculosis pathogenesis and for structure-based drug design. The Consortium's structural and functional information is publicly available. The Structures Gallery makes more than 650 total structures available by PDB identifier. Some of these are not consortium targets, but all are viewable in 3D color and can be manipulated in various ways by Jmol, an open-source Java viewer for chemical structures in 3D from http://www.jmol.org/

  15. Bayesian inference of multiscale structures in porous media

    NASA Astrophysics Data System (ADS)

    Lefantzi, S.; McKenna, S. A.; Ray, J.; Van Bloemen Waanders, B.

    2011-12-01

    and a lengthscale proxy for the inclusion size. Probability density functions are developed for the quantities being inferred. We use this inversion scheme to investigate the information content of the measurements. The measurements are generated as synthetic data i.e., we have access to the "ground truth". We find that the measurements of upscaled permeability can provide information about the large spatial scales only. The breakthrough times contain information about both large and small-scale resolved spatial structures, but it is difficult to separate the two, without using the permeability measurements simultaneously to constrain the large scale features. We choose twenty samples from the posterior distribution to reconstruct instances of the fine-scale binary media and then use them to predict the flow and breakthrough times. The ensemble of predictions is combined using Bayesian Model Averaging (BMA). We compare the BMA predictions and those from the raw ensemble versus the "ground truth" and find that BMA improves predictions slightly. The problem is redone with fewer samples to gauge the robustness of the BMA predictions. We find that with fewer samples arithmetic averaging provides slightly better predictions than BMA.

  16. Inferring the structure and dynamics of interactions in schooling fish

    PubMed Central

    Katz, Yael; Tunstrøm, Kolbjørn; Ioannou, Christos C.; Huepe, Cristián; Couzin, Iain D.

    2011-01-01

    Determining individual-level interactions that govern highly coordinated motion in animal groups or cellular aggregates has been a long-standing challenge, central to understanding the mechanisms and evolution of collective behavior. Numerous models have been proposed, many of which display realistic-looking dynamics, but nonetheless rely on untested assumptions about how individuals integrate information to guide movement. Here we infer behavioral rules directly from experimental data. We begin by analyzing trajectories of golden shiners (Notemigonus crysoleucas) swimming in two-fish and three-fish shoals to map the mean effective forces as a function of fish positions and velocities. Speeding and turning responses are dynamically modulated and clearly delineated. Speed regulation is a dominant component of how fish interact, and changes in speed are transmitted to those both behind and ahead. Alignment emerges from attraction and repulsion, and fish tend to copy directional changes made by those ahead. We find no evidence for explicit matching of body orientation. By comparing data from two-fish and three-fish shoals, we challenge the standard assumption, ubiquitous in physics-inspired models of collective behavior, that individual motion results from averaging responses to each neighbor considered separately; three-body interactions make a substantial contribution to fish dynamics. However, pairwise interactions qualitatively capture the correct spatial interaction structure in small groups, and this structure persists in larger groups of 10 and 30 fish. The interactions revealed here may help account for the rapid changes in speed and direction that enable real animal groups to stay cohesive and amplify important social information. PMID:21795604

  17. Effect of sampling on the extent and accuracy of the inferred genetic history of recombining genome.

    PubMed

    Platt, Daniel E; Utro, Filippo; Parida, Laxmi

    2014-06-01

    Accessible biotechnology is enabling the cataloging of genetic variants in individuals in populations at unprecedented scales. The use of phylogeny of the individuals within populations allows a model-based approach to studying these variations, which is important in understanding relationships between and across populations. For the somatic genome, however, the phylogeny must take recombinations (and other genetic mixing events) into account. Hence the resulting topology is more complex than a tree. Unlike a tree topology, it is not as apparent which events are visible from the extant samples. An earlier work presented a mathematical model (called the minimal descriptor) for teasing apart the inherent visible information from that which any specific algorithm might see. We use this framework to study the effect of sampling sizes on the overall inferred genetic history. In this paper, we seek to understand the extent, characteristics (in terms of recent versus ancient genetic events) and reliability of what was resolvable within field samples drawn from modern populations. We observed that most of the visible ancient events are recoverable from relatively small sample sizes. However, without identification of this relatively small minority of ancient genetic events, most of the signal will appear to reflect modern events and admixtures. We also found that the more ancient events are likely to be reproduced with higher fidelity between multiple samplings, and that the identified older events are less likely to yield false positive discrimination between populations. We conclude that a recombinant phylogenetic reconstruction is necessary to identify which markers are most likely to discriminate ancient events, and to discriminate between populations with lower risk of false positives. 
Secondly, on a broader note, this study also provides a general methodology for a critical assessment of the inferred common genetic history of populations (say, in plant cultivars or

  18. Feature inference and the causal structure of categories.

    PubMed

    Rehder, Bob; Burnett, Russell C

    2005-05-01

    The purpose of this article was to establish how theoretical category knowledge (specifically, knowledge of the causal relations that link the features of categories) supports the ability to infer the presence of unobserved features. Our experiments were designed to test proposals that causal knowledge is represented psychologically as Bayesian networks. In five experiments we found that Bayes' nets generally predicted participants' feature inferences quite well. However, we also observed a pervasive violation of one of the defining principles of Bayes' nets, the causal Markov condition, because the presence of characteristic features invariably led participants to infer yet another characteristic feature. We argue that this effect arises from a domain-general bias to assume the presence of underlying mechanisms associated with the category. Specifically, people take an exemplar to be a "well functioning" category member when it has most or all of the category's characteristic features, and thus are likely to infer a characteristic value on an unobserved dimension.
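
    The causal Markov condition says that each variable is independent of its non-descendants given its direct causes. As an illustration, assume a common-cause network in which a feature C causes two other features F_1 and F_2. This network and notation are ours and are not taken from the article. Once C is known, F_1 should carry no further information about F_2:

```latex
\[
  P(F_2 \mid C, F_1) = P(F_2 \mid C)
\]
```

    Under this reading, the violation described above would correspond to judgments in which P(F_2 | C, F_1 present) exceeds P(F_2 | C, F_1 absent), even though the network predicts the two to be equal.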

  19. Genomic inference accurately predicts the timing and severity of a recent bottleneck in a non-model insect population

    PubMed Central

    McCoy, Rajiv C.; Garud, Nandita R.; Kelley, Joanna L.; Boggs, Carol L.; Petrov, Dmitri A.

    2015-01-01

    The analysis of molecular data from natural populations has allowed researchers to answer diverse ecological questions that were previously intractable. In particular, ecologists are often interested in the demographic history of populations, information that is rarely available from historical records. Methods have been developed to infer demographic parameters from genomic data, but it is not well understood how inferred parameters compare to true population history or depend on aspects of experimental design. Here we present and evaluate a method of SNP discovery using RNA-sequencing and demographic inference using the program δaδi, which uses a diffusion approximation to the allele frequency spectrum to fit demographic models. We test these methods in a population of the checkerspot butterfly Euphydryas gillettii. This population was intentionally introduced to Gothic, Colorado in 1977 and has since experienced extreme fluctuations including bottlenecks of fewer than 25 adults, as documented by nearly annual field surveys. Using RNA-sequencing of eight individuals from Colorado and eight individuals from a native population in Wyoming, we generate the first genomic resources for this system. While demographic inference is commonly used to examine ancient demography, our study demonstrates that our inexpensive, all-in-one approach to marker discovery and genotyping provides sufficient data to accurately infer the timing of a recent bottleneck. This demographic scenario is relevant for many species of conservation concern, few of which have sequenced genomes. Our results are remarkably insensitive to sample size or number of genomic markers, which has important implications for applying this method to other non-model systems. PMID:24237665
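
    The allele frequency spectrum mentioned above is, in essence, a histogram of how many sites carry k copies of the derived allele in the sample. A minimal sketch of that summary is given below, assuming diploid genotypes coded as derived-allele counts (0, 1 or 2); the function name and the toy data are hypothetical, and the sketch does not reproduce the study's pipeline or call the δaδi program.

```python
from collections import Counter


def site_frequency_spectrum(genotypes):
    """Unfolded site frequency spectrum from diploid genotypes.

    genotypes: list of sites; each site is a list of per-individual
    derived-allele counts (0, 1 or 2).
    Returns a list `sfs` where sfs[k] is the number of sites that carry
    exactly k derived alleles across the whole sample.
    """
    if not genotypes:
        return []
    n_chromosomes = 2 * len(genotypes[0])
    counts = Counter(sum(site) for site in genotypes)
    return [counts.get(k, 0) for k in range(n_chromosomes + 1)]


# Hypothetical toy data: 5 sites genotyped in 3 diploid individuals.
toy_sites = [
    [0, 1, 0],  # 1 derived allele in the sample
    [1, 1, 0],  # 2 derived alleles
    [0, 0, 1],  # 1 derived allele
    [2, 1, 0],  # 3 derived alleles
    [0, 0, 0],  # monomorphic site
]
sfs = site_frequency_spectrum(toy_sites)
print(sfs)
```

    A demographic model is then judged by how well its expected spectrum matches the observed one; that fitting step is what the diffusion approximation provides and is outside this sketch.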

  20. Identification of structural variation in mouse genomes

    PubMed Central

    Keane, Thomas M.; Wong, Kim; Adams, David J.; Flint, Jonathan; Reymond, Alexandre; Yalcin, Binnaz

    2014-01-01

    Structural variation is variation in the structure of DNA regions affecting DNA sequence length and/or orientation. It generally includes deletions, insertions, copy-number gains, inversions, and transposable elements. Traditionally, the identification of structural variation in genomes has been challenging. However, with the recent advances in high-throughput DNA sequencing and paired-end mapping (PEM) methods, the ability to identify structural variants and their association with human diseases has improved considerably. In this review, we describe our current knowledge of structural variation in the mouse, one of the prime model systems for studying human diseases and mammalian biology. We further present the evolutionary implications of structural variation on transposable elements. We conclude with future directions on the study of structural variation in mouse genomes that will increase our understanding of molecular architecture and functional consequences of structural variation. PMID:25071822

  1. epiG: statistical inference and profiling of DNA methylation from whole-genome bisulfite sequencing data.

    PubMed

    Vincent, Martin; Mundbjerg, Kamilla; Skou Pedersen, Jakob; Liang, Gangning; Jones, Peter A; Ørntoft, Torben Falck; Dalsgaard Sørensen, Karina; Wiuf, Carsten

    2017-02-21

    The study of epigenetic heterogeneity at the level of individual cells and in whole populations is the key to understanding cellular differentiation, organismal development, and the evolution of cancer. We develop a statistical method, epiG, to infer and differentiate between different epi-allelic haplotypes, annotated with CpG methylation status and DNA polymorphisms, from whole-genome bisulfite sequencing data, and nucleosome occupancy from NOMe-seq data. We demonstrate the capabilities of the method by inferring allele-specific methylation and nucleosome occupancy in cell lines, and colon and tumor samples, and by benchmarking the method against independent experimental data.

  2. GIGA: a simple, efficient algorithm for gene tree inference in the genomic age

    PubMed Central

    2010-01-01

    Background: Phylogenetic relationships between genes are not only of theoretical interest: they enable us to learn about human genes through the experimental work on their relatives in numerous model organisms from bacteria to fruit flies and mice. Yet the most commonly used computational algorithms for reconstructing gene trees can be inaccurate for numerous reasons, both algorithmic and biological. Additional information beyond gene sequence data has been shown to improve the accuracy of reconstructions, though at great computational cost. Results: We describe a simple, fast algorithm for inferring gene phylogenies, which makes use of information that was not available prior to the genomic age: namely, a reliable species tree spanning much of the tree of life, and knowledge of the complete complement of genes in a species' genome. The algorithm, called GIGA, constructs trees agglomeratively from a distance matrix representation of sequences, using simple rules to incorporate this genomic age information. GIGA makes use of a novel conceptualization of gene trees as being composed of orthologous subtrees (containing only speciation events), which are joined by other evolutionary events such as gene duplication or horizontal gene transfer. An important innovation in GIGA is that, at every step in the agglomeration process, the tree is interpreted/reinterpreted in terms of the evolutionary events that created it. Remarkably, GIGA performs well even when using a very simple distance metric (pairwise sequence differences) and no distance averaging over clades during the tree construction process. Conclusions: GIGA is efficient, allowing phylogenetic reconstruction of very large gene families and determination of orthologs on a large scale. It is exceptionally robust to adding more gene sequences, opening up the possibility of creating stable identifiers for referring to not only extant genes, but also their common ancestors.
We compared trees produced by GIGA to those in

  3. GIGA: a simple, efficient algorithm for gene tree inference in the genomic age.

    PubMed

    Thomas, Paul D

    2010-06-09

    Phylogenetic relationships between genes are not only of theoretical interest: they enable us to learn about human genes through the experimental work on their relatives in numerous model organisms from bacteria to fruit flies and mice. Yet the most commonly used computational algorithms for reconstructing gene trees can be inaccurate for numerous reasons, both algorithmic and biological. Additional information beyond gene sequence data has been shown to improve the accuracy of reconstructions, though at great computational cost. We describe a simple, fast algorithm for inferring gene phylogenies, which makes use of information that was not available prior to the genomic age: namely, a reliable species tree spanning much of the tree of life, and knowledge of the complete complement of genes in a species' genome. The algorithm, called GIGA, constructs trees agglomeratively from a distance matrix representation of sequences, using simple rules to incorporate this genomic age information. GIGA makes use of a novel conceptualization of gene trees as being composed of orthologous subtrees (containing only speciation events), which are joined by other evolutionary events such as gene duplication or horizontal gene transfer. An important innovation in GIGA is that, at every step in the agglomeration process, the tree is interpreted/reinterpreted in terms of the evolutionary events that created it. Remarkably, GIGA performs well even when using a very simple distance metric (pairwise sequence differences) and no distance averaging over clades during the tree construction process. GIGA is efficient, allowing phylogenetic reconstruction of very large gene families and determination of orthologs on a large scale. It is exceptionally robust to adding more gene sequences, opening up the possibility of creating stable identifiers for referring to not only extant genes, but also their common ancestors. 
We compared trees produced by GIGA to those in the TreeFam database, and they

  4. Causal inference of gene regulation with subnetwork assembly from genetical genomics data.

    PubMed

    Peng, Chien-Hua; Jiang, Yi-Zhi; Tai, An-Shun; Liu, Chun-Bin; Peng, Shih-Chi; Liao, Chun-Ta; Yen, Tzu-Chen; Hsieh, Wen-Ping

    2014-03-01

    Deciphering the causal networks of gene interactions is critical for identifying disease pathways and disease-causing genes. We introduce a method to reconstruct causal networks based on exploring phenotype-specific modules in the human interactome and including the expression quantitative trait loci (eQTLs) that underlie the joint expression variation of each module. Closely associated eQTLs help anchor the orientation of the network. To overcome the inherent computational complexity of causal network reconstruction, we first deduce the local causality of individual subnetworks using the selected eQTLs and module transcripts. These subnetworks are then integrated to infer a global causal network using a random-field ranking method, which was motivated by animal sociology. We demonstrate how effectively the inferred causality restores the regulatory structure of the networks that mediate lymph node metastasis in oral cancer. Network rewiring clearly characterizes the dynamic regulatory systems of distinct disease states. This study is the first to associate an RXRB-causal network with increased risks of nodal metastasis, tumor relapse, distant metastases and poor survival for oral cancer. Thus, identifying crucial upstream drivers of a signal cascade can facilitate the discovery of potential biomarkers and effective therapeutic targets.

  5. Feature Inference and the Causal Structure of Categories

    ERIC Educational Resources Information Center

    Rehder, B.; Burnett, R.C.

    2005-01-01

    The purpose of this article was to establish how theoretical category knowledge (specifically, knowledge of the causal relations that link the features of categories) supports the ability to infer the presence of unobserved features. Our experiments were designed to test proposals that causal knowledge is represented psychologically as Bayesian…

  6. Low-Pass Genome-Wide Sequencing and Variant Inference Using Identity-by-Descent in an Isolated Human Population

    PubMed Central

    Gusev, A.; Shah, M. J.; Kenny, E. E.; Ramachandran, A.; Lowe, J. K.; Salit, J.; Lee, C. C.; Levandowsky, E. C.; Weaver, T. N.; Doan, Q. C.; Peckham, H. E.; McLaughlin, S. F.; Lyons, M. R.; Sheth, V. N.; Stoffel, M.; De La Vega, F. M.; Friedman, J. M.; Breslow, J. L.

    2012-01-01

    Whole-genome sequencing in an isolated population with few founders directly ascertains variants from the population bottleneck that may be rare elsewhere. In such populations, shared haplotypes allow imputation of variants in unsequenced samples without resorting to complex statistical methods as in studies of outbred cohorts. We focus on an isolated population cohort from the Pacific Island of Kosrae, Micronesia, where we previously collected SNP array and rich phenotype data for the majority of the population. We report identification of long regions with haplotypes co-inherited between pairs of individuals and methodology to leverage such shared genetic content for imputation. Our estimates show that sequencing as few as 40 personal genomes allows for inference in up to 60% of the 3000-person cohort at the average locus. We ascertained a pilot data set of whole-genome sequences from seven Kosraean individuals, with average 5× coverage. This assay identified 5,735,306 unique sites of which 1,212,831 were previously unknown. Additionally, these variants are unusually enriched for alleles that are rare in other populations when compared to geographic neighbors (published Korean genome SJK). We used the presence of shared haplotypes between the seven Kosraean individuals to estimate expected imputation accuracy of known and novel homozygous variants at 99.6% and 97.3%, respectively. This study presents whole-genome analysis of a homogenous isolate population with emphasis on optimal rare variant inference. PMID:22135348
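
    The idea of imputing variants through shared haplotypes can be illustrated with a deliberately simplified sketch: wherever an unsequenced individual shares a segment with a sequenced one, alleles inside that segment are copied across. The data layout, the sample identifiers and the function name below are hypothetical, and the sketch ignores phasing, heterozygous sites and genotyping error, so it should not be read as the study's methodology.

```python
def impute_from_ibd(sequenced, segments, target):
    """Copy alleles into `target` over segments shared with sequenced donors.

    sequenced: {sample_id: {position: allele}} for sequenced individuals
    segments:  list of (sample_a, sample_b, start, end), bounds inclusive,
               each marking a haplotype shared by the two samples
    target:    id of the unsequenced individual to impute
    Returns {position: allele} for every position that could be imputed.
    """
    imputed = {}
    for sample_a, sample_b, start, end in segments:
        if target not in (sample_a, sample_b):
            continue
        donor = sample_b if sample_a == target else sample_a
        if donor not in sequenced:
            continue  # the partner was not sequenced, nothing to copy
        for position, allele in sequenced[donor].items():
            if start <= position <= end:
                # keep the first allele seen if several donors overlap
                imputed.setdefault(position, allele)
    return imputed


# Hypothetical example: S1 is sequenced, U7 is not, S2 is not sequenced either.
sequenced = {"S1": {100: "A", 250: "T", 900: "G"}}
segments = [("S1", "U7", 50, 300), ("S2", "U7", 800, 1000)]
imputed = impute_from_ibd(sequenced, segments, "U7")
print(imputed)
```

    In this toy case only the two positions inside the segment shared with the sequenced donor are filled in; position 900 stays unknown because the individual sharing that region was not sequenced.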

  7. Inferring the genetic history of lactase persistence along the Italian peninsula from a large genomic interval surrounding the LCT gene.

    PubMed

    De Fanti, Sara; Sazzini, Marco; Giuliani, Cristina; Frazzoni, Federica; Sarno, Stefania; Boattini, Alessio; Marasco, Elena; Mantovani, Vilma; Franceschi, Claudio; Moral, Pedro; Garagnani, Paolo; Luiselli, Donata

    2015-12-01

    Although genetic variants related to lactase persistence in European populations were supposed to have first undergone positive selection in farmers from the Balkans and Central Europe, demographic and evolutionary dynamics that subsequently shaped the distribution of this adaptive trait across the continent have still to be elucidated. To deepen the knowledge about potential routes of diffusion of lactase persistence to Western Europe we investigated variation at a large genomic region surrounding the LCT gene along the Italian peninsula, a geographical area that played a key role in population movements responsible for Neolithic diffusion across Europe. By genotyping 40 highly selected SNPs in more than 400 Italian individuals we described gradients of nucleotide and haplotype variation potentially related to lactase persistence and compared them with those observed in several European and Mediterranean human groups. Multiple migratory events responsible for earlier introduction of the examined alleles in Italy than in Northern European regions could be invoked. Different demic processes that occurred along the western and eastern sides of the peninsula were also inferred via linkage disequilibrium and population structure analyses. The appreciable genetic continuum observed between people from Northern or Central-Western Italy and Central European populations suggested a local arrival of lactase persistence-related variants mainly via overland routes. On the contrary, diversity of Central-Eastern and Southern Italian groups entailed also gene flow from South-Eastern Mediterranean regions, in accordance with the earlier entrance of the Neolithic in Southern Italy via maritime population movements along the Mediterranean coastlines. © 2015 Wiley Periodicals, Inc.

  8. Functional coverage of the human genome by existing structures, structural genomics targets, and homology models.

    PubMed

    Xie, Lei; Bourne, Philip E

    2005-08-01

    The bias in protein structure and function space resulting from experimental limitations and targeting of particular functional classes of proteins by structural biologists has long been recognized, but never continuously quantified. Using the Enzyme Commission and the Gene Ontology classifications as a reference frame, and integrating structure data from the Protein Data Bank (PDB), target sequences from the structural genomics projects, structure homology derived from the SUPERFAMILY database, and genome annotations from Ensembl and NCBI, we provide a quantified view, both at the domain and whole-protein levels, of the current and projected coverage of protein structure and function space relative to the human genome. Protein structures currently provide at least one domain that covers 37% of the functional classes identified in the genome; whole structure coverage exists for 25% of the genome. If all the structural genomics targets were solved (twice the current number of structures in the PDB), it is estimated that structures of one domain would cover 69% of the functional classes identified and complete structure coverage would be 44%. Homology models from existing experimental structures extend the 37% coverage to 56% of the genome as single domains and 25% to 31% for complete structures. Coverage from homology models is not evenly distributed by protein family, reflecting differing degrees of sequence and structure divergence within families. While these data provide coverage, conversely, they also systematically highlight functional classes of proteins for which structures should be determined. Current key functional families without structure representation are highlighted here; updated information on the "most wanted list" that should be solved is available on a weekly basis from http://function.rcsb.org:8080/pdb/function_distribution/index.html.

  9. A physical map for the Amborella trichopoda genome sheds light on the evolution of angiosperm genome structure.

    PubMed

    Zuccolo, Andrea; Bowers, John E; Estill, James C; Xiong, Zhiyong; Luo, Meizhong; Sebastian, Aswathy; Goicoechea, José Luis; Collura, Kristi; Yu, Yeisoo; Jiao, Yuannian; Duarte, Jill; Tang, Haibao; Ayyampalayam, Saravanaraj; Rounsley, Steve; Kudrna, Dave; Paterson, Andrew H; Pires, J Chris; Chanderbali, Andre; Soltis, Douglas E; Chamala, Srikar; Barbazuk, Brad; Soltis, Pamela S; Albert, Victor A; Ma, Hong; Mandoli, Dina; Banks, Jody; Carlson, John E; Tomkins, Jeffrey; dePamphilis, Claude W; Wing, Rod A; Leebens-Mack, Jim

    2011-01-01

    Recent phylogenetic analyses have identified Amborella trichopoda, an understory tree species endemic to the forests of New Caledonia, as sister to a clade including all other known flowering plant species. The Amborella genome is a unique reference for understanding the evolution of angiosperm genomes because it can serve as an outgroup to root comparative analyses. A physical map, BAC end sequences and sample shotgun sequences provide a first view of the 870 Mbp Amborella genome. Analysis of Amborella BAC ends sequenced from each contig suggests that the density of long terminal repeat retrotransposons is negatively correlated with that of protein coding genes. Syntenic, presumably ancestral, gene blocks were identified in comparisons of the Amborella BAC contigs and the sequenced Arabidopsis thaliana, Populus trichocarpa, Vitis vinifera and Oryza sativa genomes. Parsimony mapping of the loss of synteny corroborates previous analyses suggesting that the rate of structural change has been more rapid on lineages leading to Arabidopsis and Oryza compared with lineages leading to Populus and Vitis. The gamma paleohexiploidy event identified in the Arabidopsis, Populus and Vitis genomes is shown to have occurred after the divergence of all other known angiosperms from the lineage leading to Amborella. When placed in the context of a physical map, BAC end sequences representing just 5.4% of the Amborella genome have facilitated reconstruction of gene blocks that existed in the last common ancestor of all flowering plants. The Amborella genome is an invaluable reference for inferences concerning the ancestral angiosperm and subsequent genome evolution.

  10. A physical map for the Amborella trichopoda genome sheds light on the evolution of angiosperm genome structure

    PubMed Central

    2011-01-01

    Background: Recent phylogenetic analyses have identified Amborella trichopoda, an understory tree species endemic to the forests of New Caledonia, as sister to a clade including all other known flowering plant species. The Amborella genome is a unique reference for understanding the evolution of angiosperm genomes because it can serve as an outgroup to root comparative analyses. A physical map, BAC end sequences and sample shotgun sequences provide a first view of the 870 Mbp Amborella genome. Results: Analysis of Amborella BAC ends sequenced from each contig suggests that the density of long terminal repeat retrotransposons is negatively correlated with that of protein coding genes. Syntenic, presumably ancestral, gene blocks were identified in comparisons of the Amborella BAC contigs and the sequenced Arabidopsis thaliana, Populus trichocarpa, Vitis vinifera and Oryza sativa genomes. Parsimony mapping of the loss of synteny corroborates previous analyses suggesting that the rate of structural change has been more rapid on lineages leading to Arabidopsis and Oryza compared with lineages leading to Populus and Vitis. The gamma paleohexiploidy event identified in the Arabidopsis, Populus and Vitis genomes is shown to have occurred after the divergence of all other known angiosperms from the lineage leading to Amborella. Conclusions: When placed in the context of a physical map, BAC end sequences representing just 5.4% of the Amborella genome have facilitated reconstruction of gene blocks that existed in the last common ancestor of all flowering plants. The Amborella genome is an invaluable reference for inferences concerning the ancestral angiosperm and subsequent genome evolution. PMID:21619600

  11. An integrated approach to structural genomics.

    PubMed

    Heinemann, U; Frevert, J; Hofmann, K; Illing, G; Maurer, C; Oschkinat, H; Saenger, W

    2000-01-01

    Structural genomics aims at determining a set of protein structures that will represent all domain folds present in the biosphere. These structures can be used as the basis for the homology modelling of the majority of all remaining protein domains or, indeed, proteins. Structural genomics therefore promises to provide a comprehensive structural description of the protein universe. To achieve this, a broad scientific effort is required. The Berlin-based "Protein Structure Factory" (PSF) plans to contribute to this effort by setting up a local infrastructure for the low-cost, high-throughput analysis of soluble human proteins. In close collaboration with the German Human Genome Project (DHGP) protein-coding genes will be expressed in Escherichia coli or yeast. Affinity-tagged proteins will be purified semi-automatically for biophysical characterization and structure analysis by X-ray diffraction methods and NMR spectroscopy. In all steps of the structure analysis process, possibilities for automation, parallelization and standardization will be explored. Major new facilities that are created for the PSF include a robotic station for large-scale protein crystallization, an NMR center and an experimental station for protein crystallography at the synchrotron storage ring BESSY II in Berlin.

  12. Genome Alignment Spanning Major Poaceae Lineages Reveals Heterogeneous Evolutionary Rates and Alters Inferred Dates for Key Evolutionary Events.

    PubMed

    Wang, Xiyin; Wang, Jingpeng; Jin, Dianchuan; Guo, Hui; Lee, Tae-Ho; Liu, Tao; Paterson, Andrew H

    2015-06-01

    Multiple comparisons among genomes can clarify their evolution, speciation, and functional innovations. To date, the genome sequences of eight grasses representing the most economically important Poaceae (grass) clades have been published, and their genomic-level comparison is an essential foundation for evolutionary, functional, and translational research. Using a formal and conservative approach, we aligned these genomes. Direct comparison of paralogous gene pairs all duplicated simultaneously reveals striking variation in evolutionary rates among whole genomes, with nucleotide substitution slowest in rice and up to 48% faster in other grasses, adding a new dimension to the value of rice as a grass model. We reconstructed ancestral genome contents for major evolutionary nodes, potentially contributing to understanding the divergence and speciation of grasses. Recent fossil evidence suggests revisions of the estimated dates of key evolutionary events, implying that the pan-grass polyploidization occurred ∼96 million years ago and could not be related to the Cretaceous-Tertiary mass extinction as previously inferred. Adjusted dating to reflect both updated fossil evidence and lineage-specific evolutionary rates suggested that maize subgenome divergence and maize-sorghum divergence were virtually simultaneous, a coincidence that would be explained if polyploidization directly contributed to speciation. This work lays a solid foundation for Poaceae translational genomics. Copyright © 2015 The Author. Published by Elsevier Inc. All rights reserved.

  13. Learning about the internal structure of categories through classification and feature inference.

    PubMed

    Jee, Benjamin D; Wiley, Jennifer

    2014-01-01

    Previous research on category learning has found that classification tasks produce representations that are skewed toward diagnostic feature dimensions, whereas feature inference tasks lead to richer representations of within-category structure. Yet, prior studies often measure category knowledge through tasks that involve identifying only the typical features of a category. This neglects an important aspect of a category's internal structure: how typical and atypical features are distributed within a category. The present experiments tested the hypothesis that inference learning results in richer knowledge of internal category structure than classification learning. We introduced several new measures to probe learners' representations of within-category structure. Experiment 1 found that participants in the inference condition learned and used a wider range of feature dimensions than classification learners. Classification learners, however, were more sensitive to the presence of atypical features within categories. Experiment 2 provided converging evidence that classification learners were more likely to incorporate atypical features into their representations. Inference learners were less likely to encode atypical category features, even in a "partial inference" condition that focused learners' attention on the feature dimensions relevant to classification. Overall, these results are contrary to the hypothesis that inference learning produces superior knowledge of within-category structure. Although inference learning promoted representations that included a broad range of category-typical features, classification learning promoted greater sensitivity to the distribution of typical and atypical features within categories.

  14. Genome at Juncture of Early Human Migration: A Systematic Analysis of Two Whole Genomes and Thirteen Exomes from Kuwaiti Population Subgroup of Inferred Saudi Arabian Tribe Ancestry

    PubMed Central

    Alsmadi, Osama; Hebbar, Prashantha; Antony, Dinu; Behbehani, Kazem; Thanaraj, Thangavel Alphonse

    2014-01-01

    The population of the State of Kuwait is composed of three genetic subgroups of inferred Persian, Saudi Arabian tribe and Bedouin ancestry. The Saudi Arabian tribe subgroup traces its origin to the Najd region of Saudi Arabia. By sequencing two whole genomes and thirteen exomes from this subgroup at high coverage (>40X), we identify 4,950,724 Single Nucleotide Polymorphisms (SNPs), 515,802 indels and 39,762 structural variations. Of the identified variants, 10,098 (8.3%) exomic SNPs, 139,923 (2.9%) non-exomic SNPs, 5,256 (54.3%) exomic indels, and 374,959 (74.08%) non-exomic indels are ‘novel’. Up to 8,070 (79.9%) of the reported novel biallelic exomic SNPs are seen in low frequency (minor allele frequency <5%). We observe 5,462 known and 1,004 novel potentially deleterious nonsynonymous SNPs. Allele frequencies of common SNPs from the 15 exomes are significantly correlated with those from genotype data of a larger cohort of 48 individuals (Pearson correlation coefficient, 0.91; p < 2.2×10⁻¹⁶). A set of 2,485 SNPs show significantly different allele frequencies when compared to populations from other continents. Two notable variants having risk alleles in high frequencies in this subgroup are: a nonsynonymous deleterious SNP (rs2108622 [19:g.15990431C>T] from CYP4F2 gene [MIM:*604426]) associated with warfarin dosage levels [MIM:#122700] required to elicit normal anticoagulant response; and a 3′ UTR SNP (rs6151429 [22:g.51063477T>C] from ARSA gene [MIM:*607574]) associated with Metachromatic Leukodystrophy [MIM:#250100]. Hemoglobin Riyadh variant (identified for the first time in a Saudi Arabian woman) is observed in the exome data. The mitochondrial haplogroup profiles of the 15 individuals are consistent with the haplogroup diversity seen in Saudi Arabian natives, who are believed to have received substantial gene flow from Africa and eastern provenance. We present the first genome resource imperative for designing future genetic studies in Saudi Arabian

  15. Using Genomics for Natural Product Structure Elucidation.

    PubMed

    Tietz, Jonathan I; Mitchell, Douglas A

    2016-01-01

    Natural products (NPs) are the most historically bountiful source of chemical matter for drug development, especially for anti-infectives. With insights gleaned from genome mining, interest in natural product discovery has been reinvigorated. An essential stage in NP discovery is structural elucidation, which sheds light not only on the chemical composition of a molecule but also its novelty, properties, and derivatization potential. The history of structure elucidation is replete with technique-based revolutions: combustion analysis, crystallography, UV, IR, MS, and NMR have each provided game-changing advances; the latest such advance is genomics. All natural products have a genetic basis, and the ability to obtain and interpret genomic information for structure elucidation is increasingly available at low cost to non-specialists. In this review, we describe the value of genomics as a structural elucidation technique, especially from the perspective of the natural product chemist approaching an unknown metabolite. Herein we first introduce the databases and programs of interest to the natural products chemist, with an emphasis on those currently most suited for general usability. We describe strategies for linking observed natural product-linked phenotypes to their corresponding gene clusters. We then discuss techniques for extracting structural information from genes, illustrated with numerous case examples. We also provide an analysis of the biases and limitations of the field with recommendations for future development. Our overview is not only aimed at biologically-oriented researchers already at ease with bioinformatic techniques, but also, in particular, at natural product, organic, and/or medicinal chemists not previously familiar with genomic techniques.

  16. Robust inference of population structure for ancestry prediction and correction of stratification in the presence of relatedness.

    PubMed

    Conomos, Matthew P; Miller, Michael B; Thornton, Timothy A

    2015-05-01

    Population structure inference with genetic data has been motivated by a variety of applications in population genetics and genetic association studies. Several approaches have been proposed for the identification of genetic ancestry differences in samples where study participants are assumed to be unrelated, including principal components analysis (PCA), multidimensional scaling (MDS), and model-based methods for proportional ancestry estimation. Many genetic studies, however, include individuals with some degree of relatedness, and existing methods for inferring genetic ancestry fail in related samples. We present a method, PC-AiR, for robust population structure inference in the presence of known or cryptic relatedness. PC-AiR utilizes genome-screen data and an efficient algorithm to identify a diverse subset of unrelated individuals that is representative of all ancestries in the sample. The PC-AiR method directly performs PCA on the identified ancestry representative subset and then predicts components of variation for all remaining individuals based on genetic similarities. In simulation studies and in applications to real data from Phase III of the HapMap Project, we demonstrate that PC-AiR provides a substantial improvement over existing approaches for population structure inference in related samples. We also demonstrate significant efficiency gains, where a single axis of variation from PC-AiR provides better prediction of ancestry in a variety of structure settings than using 10 (or more) components of variation from widely used PCA and MDS approaches. Finally, we illustrate that PC-AiR can provide improved population stratification correction over existing methods in genetic association studies with population structure and relatedness.

  17. A data management system for structural genomics

    PubMed Central

    Raymond, Stéphane; O'Toole, Nicholas; Cygler, Miroslaw

    2004-01-01

    Background Structural genomics (SG) projects aim to determine thousands of protein structures by the development of high-throughput techniques for all steps of the experimental structure determination pipeline. Crucial to the success of such endeavours is the careful tracking and archiving of experimental and external data on protein targets. Results We have developed a sophisticated data management system for structural genomics. Central to the system is an Oracle-based, SQL-interfaced database. The database schema deals with all facets of the structure determination process, from target selection to data deposition. Users access the database via any web browser. Experimental data is input by users with pre-defined web forms. Data can be displayed according to numerous criteria. A list of all current target proteins can be viewed, with links for each target to associated entries in external databases. To avoid unnecessary work on targets, our data management system matches protein sequences weekly using BLAST to entries in the Protein Data Bank and to targets of other SG centers worldwide. Conclusion Our system is a working, effective and user-friendly data management tool for structural genomics projects. In this report we present a detailed summary of the various capabilities of the system, using real target data as examples, and indicate our plans for future enhancements. PMID:15210054

  18. Inference of Transmission Network Structure from HIV Phylogenetic Trees

    DOE PAGES

    Giardina, Federica; Romero-Severson, Ethan Obie; Albert, Jan; ...

    2017-01-13

    Phylogenetic inference is an attractive means to reconstruct transmission histories and epidemics. However, there is not a perfect correspondence between transmission history and virus phylogeny. Both node height and topological differences may occur, depending on the interaction between within-host evolutionary dynamics and between-host transmission patterns. To investigate these interactions, we added a within-host evolutionary model in epidemiological simulations and examined if the resulting phylogeny could recover different types of contact networks. To further improve realism, we also introduced patient-specific differences in infectivity across disease stages, and on the epidemic level we considered incomplete sampling and the age of the epidemic. Second, we implemented an inference method based on approximate Bayesian computation (ABC) to discriminate among three well-studied network models and jointly estimate both network parameters and key epidemiological quantities such as the infection rate. Our ABC framework used both topological and distance-based tree statistics for comparison between simulated and observed trees. Overall, our simulations showed that a virus time-scaled phylogeny (genealogy) may be substantially different from the between-host transmission tree. This has important implications for the interpretation of what a phylogeny reveals about the underlying epidemic contact network. In particular, we found that while the within-host evolutionary process obscures the transmission tree, the diversification process and infectivity dynamics also add discriminatory power to differentiate between different types of contact networks. We also found that the possibility to differentiate contact networks depends on how far an epidemic has progressed, where distance-based tree statistics have more power early in an epidemic. Finally, we applied our ABC inference on two different outbreaks from the Swedish HIV-1 epidemic.
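
    The abstract above relies on approximate Bayesian computation (ABC). As a minimal sketch of the general idea only, and not the authors' method, the following toy example applies ABC rejection sampling to a single infection rate. The simulator, the summary statistic (a count of infected contacts), the uniform prior, the tolerance, and all function names are assumptions made for illustration; the study itself compares tree statistics across network models.

```python
import random


def simulate_infections(rate, n_contacts, rng):
    """Toy simulator (hypothetical): each contact is infected independently with probability `rate`."""
    return sum(rng.random() < rate for _ in range(n_contacts))


def abc_rejection(observed, n_contacts, n_draws=20000, tolerance=2, seed=1):
    """Keep prior draws whose simulated summary statistic lies within `tolerance` of the observed one."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        rate = rng.random()  # draw from a uniform prior on [0, 1] (an assumption)
        simulated = simulate_infections(rate, n_contacts, rng)
        if abs(simulated - observed) <= tolerance:
            accepted.append(rate)
    return accepted


# Hypothetical observation: 30 of 100 contacts infected.
posterior = abc_rejection(observed=30, n_contacts=100)
estimate = sum(posterior) / len(posterior)
print(f"accepted {len(posterior)} draws; posterior mean rate is about {estimate:.2f}")
```

    The accepted draws approximate the posterior of the rate; a smaller tolerance gives a closer approximation at the cost of fewer accepted draws.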

  19. Inference of Transmission Network Structure from HIV Phylogenetic Trees.

    PubMed

    Giardina, Federica; Romero-Severson, Ethan Obie; Albert, Jan; Britton, Tom; Leitner, Thomas

    2017-01-01

    Phylogenetic inference is an attractive means to reconstruct transmission histories and epidemics. However, there is not a perfect correspondence between transmission history and virus phylogeny. Both node height and topological differences may occur, depending on the interaction between within-host evolutionary dynamics and between-host transmission patterns. To investigate these interactions, we added a within-host evolutionary model in epidemiological simulations and examined if the resulting phylogeny could recover different types of contact networks. To further improve realism, we also introduced patient-specific differences in infectivity across disease stages, and on the epidemic level we considered incomplete sampling and the age of the epidemic. Second, we implemented an inference method based on approximate Bayesian computation (ABC) to discriminate among three well-studied network models and jointly estimate both network parameters and key epidemiological quantities such as the infection rate. Our ABC framework used both topological and distance-based tree statistics for comparison between simulated and observed trees. Overall, our simulations showed that a virus time-scaled phylogeny (genealogy) may be substantially different from the between-host transmission tree. This has important implications for the interpretation of what a phylogeny reveals about the underlying epidemic contact network. In particular, we found that while the within-host evolutionary process obscures the transmission tree, the diversification process and infectivity dynamics also add discriminatory power to differentiate between different types of contact networks. We also found that the possibility to differentiate contact networks depends on how far an epidemic has progressed, where distance-based tree statistics have more power early in an epidemic. Finally, we applied our ABC inference on two different outbreaks from the Swedish HIV-1 epidemic.

  20. Inference of Transmission Network Structure from HIV Phylogenetic Trees

    PubMed Central

    Britton, Tom; Leitner, Thomas

    2017-01-01

    Phylogenetic inference is an attractive means to reconstruct transmission histories and epidemics. However, there is not a perfect correspondence between transmission history and virus phylogeny. Both node height and topological differences may occur, depending on the interaction between within-host evolutionary dynamics and between-host transmission patterns. To investigate these interactions, we added a within-host evolutionary model in epidemiological simulations and examined if the resulting phylogeny could recover different types of contact networks. To further improve realism, we also introduced patient-specific differences in infectivity across disease stages, and on the epidemic level we considered incomplete sampling and the age of the epidemic. Second, we implemented an inference method based on approximate Bayesian computation (ABC) to discriminate among three well-studied network models and jointly estimate both network parameters and key epidemiological quantities such as the infection rate. Our ABC framework used both topological and distance-based tree statistics for comparison between simulated and observed trees. Overall, our simulations showed that a virus time-scaled phylogeny (genealogy) may be substantially different from the between-host transmission tree. This has important implications for the interpretation of what a phylogeny reveals about the underlying epidemic contact network. In particular, we found that while the within-host evolutionary process obscures the transmission tree, the diversification process and infectivity dynamics also add discriminatory power to differentiate between different types of contact networks. We also found that the possibility to differentiate contact networks depends on how far an epidemic has progressed, where distance-based tree statistics have more power early in an epidemic. Finally, we applied our ABC inference on two different outbreaks from the Swedish HIV-1 epidemic. PMID:28085876

  1. Interrogating the druggable genome with structural informatics.

    PubMed

    Hambly, Kevin; Danzer, Joseph; Muskal, Steven; Debe, Derek A

    2006-08-01

    Structural genomics projects are producing protein structure data at an unprecedented rate. In this paper, we present the Target Informatics Platform (TIP), a novel structural informatics approach for amplifying the rapidly expanding body of experimental protein structure information to enhance the discovery and optimization of small molecule protein modulators on a genomic scale. In TIP, existing experimental structure information is augmented using a homology modeling approach, and binding sites across multiple target families are compared using a clique detection algorithm. We report here a detailed analysis of the structural coverage for the set of druggable human targets, highlighting drug target families where the level of structural knowledge is currently quite high, as well as those areas where structural knowledge is sparse. Furthermore, we demonstrate the utility of TIP's intra- and inter-family binding site similarity analysis using a series of retrospective case studies. Our analysis underscores the utility of a structural informatics infrastructure for extracting drug discovery-relevant information from structural data, aiding researchers in the identification of lead discovery and optimization opportunities as well as potential "off-target" liabilities.

  2. Assignment of homoeologs to parental genomes in allopolyploids for species tree inference, with an example from Fumaria (Papaveraceae).

    PubMed

    Bertrand, Yann J K; Scheen, Anne-Cathrine; Marcussen, Thomas; Pfeil, Bernard E; de Sousa, Filipe; Oxelman, Bengt

    2015-05-01

    There is a rising awareness that species trees are best inferred from multiple loci while taking into account processes affecting individual gene trees, such as substitution model error (failure of the model to account for the complexity of the data) and coalescent stochasticity (presence of incomplete lineage sorting [ILS]). Although most studies have been carried out in the context of dichotomous species trees, these processes operate also in more complex evolutionary histories involving multiple hybridizations and polyploidy. Recently, methods have been developed that accurately handle ILS in allopolyploids, but they are thus far restricted to networks of diploids and tetraploids. We propose a procedure that improves on this limitation by designing a workflow that assigns homoeologs to hypothetical diploid ancestral genomes prior to genome tree construction. Conflicting assignment hypotheses are evaluated against substitution model error and coalescent stochasticity. Incongruence that cannot be explained by stochastic mechanisms needs to be explained by other processes (e.g., homoploid hybridization or paralogy). The data can then be filtered to build multilabeled genome phylogenies using inference methods that can recover species trees, either in the face of substitution model error and coalescent stochasticity alone, or while simultaneously accounting for hybridization. Methods are already available for folding the resulting multilabeled genome phylogeny into a network. We apply the workflow to the reconstruction of the reticulate phylogeny of the plant genus Fumaria (Papaveraceae) with ploidal levels ranging from 2x to 14x. We describe the challenges in recovering nuclear NRPB2 homoeologs in high ploidy species while combining in vivo cloning and direct sequencing techniques. Using parametric bootstrapping simulations we assign nuclear homoeologs and chloroplast sequences (four concatenated loci) to their common

  3. Genome-Wide Views of Chromatin Structure

    PubMed Central

    Rando, Oliver J.; Chang, Howard Y.

    2010-01-01

    Eukaryotic genomes are packaged into a nucleoprotein complex known as chromatin, which affects most processes that occur on DNA. Along with genetic and biochemical studies of resident chromatin proteins and their modifying enzymes, mapping of chromatin structure in vivo is one of the main pillars in our understanding of how chromatin relates to cellular processes. In this review, we discuss the use of genomic technologies to characterize chromatin structure in vivo, with a focus on data from budding yeast and humans. The picture emerging from these studies is the detailed chromatin structure of a typical gene, where the typical behavior gives insight into the mechanisms and deep rules that establish chromatin structure. Important deviation from the archetype is also observed, usually as a consequence of unique regulatory mechanisms at special genomic loci. Chromatin structure shows substantial conservation from yeast to humans, but mammalian chromatin has additional layers of complexity that likely relate to the requirements of multicellularity such as the need to establish faithful gene regulatory mechanisms for cell differentiation. PMID:19317649

  4. A Detailed History of Intron-rich Eukaryotic Ancestors Inferred from a Global Survey of 100 Complete Genomes

    PubMed Central

    Csuros, Miklos; Rogozin, Igor B.; Koonin, Eugene V.

    2011-01-01

    Protein-coding genes in eukaryotes are interrupted by introns, but intron densities widely differ between eukaryotic lineages. Vertebrates, some invertebrates and green plants have intron-rich genes, with 6–7 introns per kilobase of coding sequence, whereas most of the other eukaryotes have intron-poor genes. We reconstructed the history of intron gain and loss using a probabilistic Markov model (Markov Chain Monte Carlo, MCMC) on 245 orthologous genes from 99 genomes representing the three of the five supergroups of eukaryotes for which multiple genome sequences are available. Intron-rich ancestors are confidently reconstructed for each major group, with 53 to 74% of the human intron density inferred with 95% confidence for the Last Eukaryotic Common Ancestor (LECA). The results of the MCMC reconstruction are compared with the reconstructions obtained using Maximum Likelihood (ML) and Dollo parsimony methods. An excellent agreement between the MCMC and ML inferences is demonstrated whereas Dollo parsimony introduces a noticeable bias in the estimations, typically yielding lower ancestral intron densities than MCMC and ML. Evolution of eukaryotic genes was dominated by intron loss, with substantial gain only at the bases of several major branches including plants and animals. The highest intron density, 120 to 130% of the human value, is inferred for the last common ancestor of animals. The reconstruction shows that the entire line of descent from LECA to mammals was intron-rich, a state conducive to the evolution of alternative splicing. PMID:21935348

  5. A detailed history of intron-rich eukaryotic ancestors inferred from a global survey of 100 complete genomes.

    PubMed

    Csuros, Miklos; Rogozin, Igor B; Koonin, Eugene V

    2011-09-01

    Protein-coding genes in eukaryotes are interrupted by introns, but intron densities widely differ between eukaryotic lineages. Vertebrates, some invertebrates and green plants have intron-rich genes, with 6-7 introns per kilobase of coding sequence, whereas most of the other eukaryotes have intron-poor genes. We reconstructed the history of intron gain and loss using a probabilistic Markov model (Markov Chain Monte Carlo, MCMC) on 245 orthologous genes from 99 genomes representing the three of the five supergroups of eukaryotes for which multiple genome sequences are available. Intron-rich ancestors are confidently reconstructed for each major group, with 53 to 74% of the human intron density inferred with 95% confidence for the Last Eukaryotic Common Ancestor (LECA). The results of the MCMC reconstruction are compared with the reconstructions obtained using Maximum Likelihood (ML) and Dollo parsimony methods. An excellent agreement between the MCMC and ML inferences is demonstrated whereas Dollo parsimony introduces a noticeable bias in the estimations, typically yielding lower ancestral intron densities than MCMC and ML. Evolution of eukaryotic genes was dominated by intron loss, with substantial gain only at the bases of several major branches including plants and animals. The highest intron density, 120 to 130% of the human value, is inferred for the last common ancestor of animals. The reconstruction shows that the entire line of descent from LECA to mammals was intron-rich, a state conducive to the evolution of alternative splicing.

  6. Triallelic Population Genomics for Inferring Correlated Fitness Effects of Same Site Nonsynonymous Mutations.

    PubMed

    Ragsdale, Aaron P; Coffman, Alec J; Hsieh, PingHsun; Struck, Travis J; Gutenkunst, Ryan N

    2016-05-01

    The distribution of mutational effects on fitness is central to evolutionary genetics. Typical univariate distributions, however, cannot model the effects of multiple mutations at the same site, so we introduce a model in which mutations at the same site have correlated fitness effects. To infer the strength of that correlation, we developed a diffusion approximation to the triallelic frequency spectrum, which we applied to data from Drosophila melanogaster. We found a moderate positive correlation between the fitness effects of nonsynonymous mutations at the same codon, suggesting that both mutation identity and location are important for determining fitness effects in proteins. We validated our approach by comparing it to biochemical mutational scanning experiments, finding strong quantitative agreement, even between different organisms. We also found that the correlation of mutational fitness effects was not affected by protein solvent exposure or structural disorder. Together, our results suggest that the correlation of fitness effects at the same site is a previously overlooked yet fundamental property of protein evolution.

  7. High-throughput Crystallography for Structural Genomics

    PubMed Central

    Joachimiak, Andrzej

    2009-01-01

    Protein X-ray crystallography recently celebrated its 50th anniversary. The structures of myoglobin and hemoglobin determined by Kendrew and Perutz provided the first glimpses into the complex protein architecture and chemistry. Since then, the field of structural molecular biology has experienced extraordinary progress and now over 53,000 protein structures have been deposited into the Protein Data Bank. In the past decade many advances in macromolecular crystallography have been driven by world-wide structural genomics efforts. This was made possible because of third-generation synchrotron sources, structure phasing approaches using anomalous signal and cryo-crystallography. Complementary progress in molecular biology, proteomics, hardware and software for crystallographic data collection, structure determination and refinement, computer science, databases, robotics and automation improved and accelerated many processes. These advancements provide the robust foundation for structural molecular biology and assure strong contribution to science in the future. In this report we focus mainly on reviewing structural genomics high-throughput X-ray crystallography technologies and their impact. PMID:19765976

  8. Stem-loop structures in prokaryotic genomes

    PubMed Central

    Petrillo, Mauro; Silvestro, Giustina; Di Nocera, Pier Paolo; Boccia, Angelo; Paolella, Giovanni

    2006-01-01

    Background Prediction of secondary structures in the expressed sequences of bacterial genomes allows investigation of the spontaneous folding of the corresponding RNA. This is particularly relevant in untranslated mRNA regions, where base pairing is less affected by interactions with the translation machinery. Relatively large stem-loops significantly contribute to the formation of more complex secondary structures, often important for the activity of sequence elements controlling gene expression. Results Systematic analysis of the distribution of stem-loop structures (SLSs) in 40 wholly-sequenced bacterial genomes is presented. SLSs were searched as stems measuring at least 12 bp, bordering loops 5 to 100 nt in length. G-U pairing in the stems was allowed. SLSs found in natural genomes are constantly more numerous and stable than those expected to randomly form in sequences of comparable size and composition. The large majority of SLSs fall within protein-coding regions but enrichment of specific, non-random, SLS sub-populations of higher stability was observed within the intergenic regions of the chromosomes of several species. In low-GC firmicutes, most higher stability intergenic SLSs resemble canonical rho-independent transcriptional terminators, but very frequently feature at the 5'-end an additional A-rich stretch complementary to the 3' uridines. In all species, a clearly biased SLS distribution was observed within the intergenic space, with most concentrating at the 3'-end side of flanking CDSs. Some intergenic SLS regions are members of novel repeated sequence families. Conclusion In-depth analysis of SLS features and distribution in 40 different bacterial genomes showed the presence of non-random populations of such structures in all species. Many of these structures are plausibly transcribed, and might be involved in the control of transcription termination, or might serve as RNA elements which can enhance either the stability or the turnover of cotranscribed
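
    The search criteria stated in this record (stems of at least 12 bp, loops of 5 to 100 nt, G-U pairing allowed) can be illustrated with a short sketch. It is not the authors' software: it assumes perfect stems with no mismatches or bulges, ignores thermodynamic stability, and uses hypothetical function and variable names throughout.

```python
# Watson-Crick pairs plus the G-U wobble pair, as allowed in the record above.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def find_stem_loops(rna, min_stem=12, min_loop=5, max_loop=100):
    """Return (stem_start, stem_length, loop_length) for perfect hairpins (0-based positions)."""
    rna = rna.upper().replace("T", "U")
    n = len(rna)
    hits = []
    for a in range(n):  # a: last base of the 5' arm
        for loop_len in range(min_loop, max_loop + 1):
            b = a + loop_len + 1  # b: first base of the 3' arm
            if b >= n:
                break
            # If the loop could be closed by one more pair, the tighter hairpin is reported instead.
            if loop_len - 2 >= min_loop and (rna[a + 1], rna[b - 1]) in PAIRS:
                continue
            length = 0
            # Extend the stem outward from the loop while the bases keep pairing.
            while a - length >= 0 and b + length < n and (rna[a - length], rna[b + length]) in PAIRS:
                length += 1
            if length >= min_stem:
                hits.append((a - length + 1, length, loop_len))
    return hits


# Hypothetical test sequence: a 12-bp stem around a 6-nt loop, with short flanks.
example = "AAA" + "GGACUCGAUCCG" + "AAAAAA" + "CGGAUCGAGUCC" + "AAA"
for stem_start, stem_length, loop_length in find_stem_loops(example):
    print(stem_start, stem_length, loop_length)
```

    The scan is quadratic in the worst case, which is acceptable for short regions; a genome-scale survey would need a more efficient search and a stability filter.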

  9. Inferring friendship network structure by using mobile phone data.

    PubMed

    Eagle, Nathan; Pentland, Alex Sandy; Lazer, David

    2009-09-08

    Data collected from mobile phones have the potential to provide insight into the relational dynamics of individuals. This paper compares observational data from mobile phones with standard self-report survey data. We find that the information from these two data sources is overlapping but distinct. For example, self-reports of physical proximity deviate from mobile phone records depending on the recency and salience of the interactions. We also demonstrate that it is possible to accurately infer 95% of friendships based on the observational data alone, where friend dyads demonstrate distinctive temporal and spatial patterns in their physical proximity and calling patterns. These behavioral patterns, in turn, allow the prediction of individual-level outcomes such as job satisfaction.

  10. Comparative population genomics: power and principles for the inference of functionality

    PubMed Central

    Lawrie, David S.; Petrov, Dmitri A.

    2014-01-01

    The availability of sequenced genomes from multiple related organisms allows the detection and localization of functional genomic elements based on the idea that such elements evolve more slowly than neutral sequences. Although such comparative genomics methods have proven useful in discovering functional elements and ascertaining levels of functional constraint in the genome as a whole, here we outline limitations intrinsic to this approach that cannot be overcome by sequencing more species. We argue that it is essential to supplement comparative genomics with ultra-deep sampling of populations from closely related species to enable substantially more powerful genomic scans for functional elements. The convergence of sequencing technology and population genetics theory has made such projects feasible and has exciting implications for functional genomics. PMID:24656563

  11. Comparative population genomics: power and principles for the inference of functionality.

    PubMed

    Lawrie, David S; Petrov, Dmitri A

    2014-04-01

    The availability of sequenced genomes from multiple related organisms allows the detection and localization of functional genomic elements based on the idea that such elements evolve more slowly than neutral sequences. Although such comparative genomics methods have proven useful in discovering functional elements and ascertaining levels of functional constraint in the genome as a whole, here we outline limitations intrinsic to this approach that cannot be overcome by sequencing more species. We argue that it is essential to supplement comparative genomics with ultra-deep sampling of populations from closely related species to enable substantially more powerful genomic scans for functional elements. The convergence of sequencing technology and population genetics theory has made such projects feasible and has exciting implications for functional genomics. Copyright © 2014 Elsevier Ltd. All rights reserved.

  12. Haemonchus contortus: Genome Structure, Organization and Comparative Genomics.

    PubMed

    Laing, R; Martinelli, A; Tracey, A; Holroyd, N; Gilleard, J S; Cotton, J A

    2016-01-01

    One of the first genome sequencing projects for a parasitic nematode was that for Haemonchus contortus. The open access data from the Wellcome Trust Sanger Institute provided a valuable early resource for the research community, particularly for the identification of specific genes and genetic markers. Later, a second sequencing project was initiated by the University of Melbourne, and the two draft genome sequences for H. contortus were published back-to-back in 2013. There is a pressing need for long-range genomic information for genetic mapping, population genetics and functional genomic studies, so we are continuing to improve the Wellcome Trust Sanger Institute assembly to provide a finished reference genome for H. contortus. This review describes this process, compares the H. contortus genome assemblies with draft genomes from other members of the strongylid group and discusses future directions for parasite genomics using the H. contortus model. Copyright © 2016 Elsevier Ltd. All rights reserved.

  13. The phylogenomic position of the grey nurse shark Carcharias taurus Rafinesque, 1810 (Lamniformes, Odontaspididae) inferred from the mitochondrial genome.

    PubMed

    Bowden, Deborah L; Vargas-Caro, Carolina; Ovenden, Jennifer R; Bennett, Michael B; Bustamante, Carlos

    2016-11-01

    The complete mitochondrial genome of the grey nurse shark Carcharias taurus is described from 25 963 828 sequences obtained using Illumina NGS technology. Total length of the mitogenome is 16 715 bp, consisting of 2 rRNAs, 13 protein-coding regions, 22 tRNAs and 2 non-coding regions, thus updating the previously published mitogenome for this species. The phylogenomic reconstruction inferred from the mitogenome of 15 species of Lamniform and Carcharhiniform sharks supports the inclusion of C. taurus in a clade with the Lamnidae and Cetorhinidae. This complete mitogenome contributes to ongoing investigation into the monophyly of the Family Odontaspididae.

  14. The Plasmodium apicoplast genome: conserved structure and close relationship of P. ovale to rodent malaria parasites.

    PubMed

    Arisue, Nobuko; Hashimoto, Tetsuo; Mitsui, Hideya; Palacpac, Nirianne M Q; Kaneko, Akira; Kawai, Satoru; Hasegawa, Masami; Tanabe, Kazuyuki; Horii, Toshihiro

    2012-09-01

    Apicoplast, a nonphotosynthetic plastid derived from secondary symbiotic origin, is essential for the survival of malaria parasites of the genus Plasmodium. Elucidation of the evolution of the apicoplast genome in Plasmodium species is important to better understand the functions of the organelle. However, the complete apicoplast genome is available for only the most virulent human malaria parasite, Plasmodium falciparum. Here, we obtained the near-complete apicoplast genome sequences from eight Plasmodium species that infect a wide variety of vertebrate hosts and performed structural and phylogenetic analyses. We found that gene repertoire, gene arrangement, and other structural attributes were highly conserved. Phylogenetic reconstruction using 30 protein-coding genes of the apicoplast genome inferred, for the first time, a close relationship between P. ovale and rodent parasites. This close relatedness was robustly supported using multiple evolutionary assumptions and models. The finding suggests that an ancestral host switch occurred between rodent and human Plasmodium parasites.

  15. Unifying Inference of Meso-Scale Structures in Networks.

    PubMed

    Tunç, Birkan; Verma, Ragini

    2015-01-01

    Networks are among the most prevalent formal representations in scientific studies, employed to depict interactions between objects such as molecules, neuronal clusters, or social groups. Studies performed at meso-scale that involve grouping of objects based on their distinctive interaction patterns form one of the main lines of investigation in network science. In a social network, for instance, meso-scale structures can correspond to isolated social groupings or groups of individuals that serve as a communication core. Currently, the research on different meso-scale structures such as community and core-periphery structures has been conducted via independent approaches, which precludes the possibility of an algorithmic design that can handle multiple meso-scale structures and decide which structure explains the observed data better. In this study, we propose a unified formulation for the algorithmic detection and analysis of different meso-scale structures. This facilitates the investigation of hybrid structures that capture the interplay between multiple meso-scale structures and statistical comparison of competing structures, all of which have been hitherto unavailable. We demonstrate the applicability of the methodology in analyzing the human brain network, by determining the dominant organizational structure (communities) of the brain, as well as its auxiliary characteristics (core-periphery).

  16. Bayesian inference of protein structure from chemical shift data.

    PubMed

    Bratholm, Lars A; Christensen, Anders S; Hamelryck, Thomas; Jensen, Jan H

    2015-01-01

    Protein chemical shifts are routinely used to augment molecular mechanics force fields in protein structure simulations, with weights of the chemical shift restraints determined empirically. These weights, however, might not be an optimal descriptor of a given protein structure and predictive model, and a bias is introduced which might result in incorrect structures. In the inferential structure determination framework, both the unknown structure and the disagreement between experimental and back-calculated data are formulated as a joint probability distribution, thus utilizing the full information content of the data. Here, we present the formulation of such a probability distribution where the error in chemical shift prediction is described by either a Gaussian or Cauchy distribution. The methodology is demonstrated and compared to a set of empirically weighted potentials through Markov chain Monte Carlo simulations of three small proteins (ENHD, Protein G and the SMN Tudor Domain) using the PROFASI force field and the chemical shift predictor CamShift. Using a clustering criterion for identifying the best structure, together with the addition of a solvent exposure scoring term, the simulations suggest that sampling both the structure and the uncertainties in chemical shift prediction leads to more accurate structures compared to conventional methods using empirically determined weights. The Cauchy distribution, using either sampled uncertainties or predetermined weights, did, however, result in overall better convergence to the native fold, suggesting that both types of distribution might be useful in different aspects of the protein structure prediction.
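
    To make the Gaussian versus Cauchy error models in this record concrete, the sketch below evaluates both log-likelihoods on a set of prediction residuals (experimental minus back-calculated shifts). It is an illustration only: the residual values, the scale parameters, and the function names are assumptions, and the study additionally samples the uncertainty parameters within a Markov chain Monte Carlo simulation, which is not shown here.

```python
import math


def gaussian_loglik(residuals, sigma):
    """Log-likelihood of residuals under a zero-mean Gaussian with standard deviation `sigma`."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - r ** 2 / (2 * sigma ** 2)
        for r in residuals
    )


def cauchy_loglik(residuals, gamma):
    """Log-likelihood of residuals under a zero-centred Cauchy with scale `gamma`."""
    return sum(-math.log(math.pi * gamma * (1 + (r / gamma) ** 2)) for r in residuals)


# Hypothetical residuals in ppm; the last value is a deliberate outlier.
residuals = [0.1, -0.2, 0.05, 5.0]
print("Gaussian:", gaussian_loglik(residuals, sigma=0.3))
print("Cauchy:  ", cauchy_loglik(residuals, gamma=0.3))
```

    Because the Cauchy density has heavy tails, a single badly predicted shift lowers its log-likelihood far less than it lowers the Gaussian one, which is one plausible reason the two error models behave differently in structure sampling.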

  17. Bayesian inference of protein structure from chemical shift data

    PubMed Central

    Bratholm, Lars A.; Christensen, Anders S.; Hamelryck, Thomas

    2015-01-01

    Protein chemical shifts are routinely used to augment molecular mechanics force fields in protein structure simulations, with weights of the chemical shift restraints determined empirically. These weights, however, might not be an optimal descriptor of a given protein structure and predictive model, and a bias is introduced which might result in incorrect structures. In the inferential structure determination framework, both the unknown structure and the disagreement between experimental and back-calculated data are formulated as a joint probability distribution, thus utilizing the full information content of the data. Here, we present the formulation of such a probability distribution where the error in chemical shift prediction is described by either a Gaussian or Cauchy distribution. The methodology is demonstrated and compared to a set of empirically weighted potentials through Markov chain Monte Carlo simulations of three small proteins (ENHD, Protein G and the SMN Tudor Domain) using the PROFASI force field and the chemical shift predictor CamShift. Using a clustering criterion for identifying the best structure, together with the addition of a solvent exposure scoring term, the simulations suggest that sampling both the structure and the uncertainties in chemical shift prediction leads to more accurate structures compared to conventional methods using empirically determined weights. The Cauchy distribution, using either sampled uncertainties or predetermined weights, did, however, result in overall better convergence to the native fold, suggesting that both types of distribution might be useful in different aspects of the protein structure prediction. PMID:25825683

  18. Computational analysis of conserved RNA secondary structure in transcriptomes and genomes

    PubMed Central

    Eddy, Sean R.

    2017-01-01

    Transcriptomics experiments and computational predictions both enable systematic discovery of new functional RNAs, but many putative noncoding transcripts arise instead from artifacts and biological noise, and current computational prediction methods have high false positive rates. I discuss prospects for improving computational methods for analyzing and identifying functional RNAs, with a focus on detecting signatures of conserved RNA secondary structure. An interesting new front is the application of chemical and enzymatic RNA structure probing experiments on a transcriptome-wide scale. I review several proposed approaches for incorporating structure probing data into computational RNA secondary structure prediction. Using probabilistic inference formalisms, I show how all these approaches can be unified in a well-principled framework. Using that framework, RNA probing data can easily be integrated into a wide range of different analyses that depend on RNA secondary structure inference, including homology search and genome-wide detection of new structural RNAs. PMID:24895857

  19. Demographic Divergence History of Pied Flycatcher and Collared Flycatcher Inferred from Whole-Genome Re-sequencing Data

    PubMed Central

    Nadachowska-Brzyska, Krystyna; Burri, Reto; Olason, Pall I.; Kawakami, Takeshi; Smeds, Linnéa; Ellegren, Hans

    2013-01-01

    Profound knowledge of demographic history is a prerequisite for the understanding and inference of processes involved in the evolution of population differentiation and speciation. Together with new coalescent-based methods, the recent availability of genome-wide data enables investigation of differentiation and divergence processes at unprecedented depth. We combined two powerful approaches, full Approximate Bayesian Computation analysis (ABC) and pairwise sequentially Markovian coalescent modeling (PSMC), to reconstruct the demographic history of the split between two avian speciation model species, the pied flycatcher and collared flycatcher. Using whole-genome re-sequencing data from 20 individuals, we investigated 15 demographic models including different levels and patterns of gene flow, and changes in effective population size over time. ABC provided high support for recent (mode 0.3 my, range <0.7 my) species divergence, declines in effective population size of both species since their initial divergence, and unidirectional recent gene flow from pied flycatcher into collared flycatcher. The estimated divergence time and population size changes, supported by PSMC results, suggest that the ancestral species persisted through one of the glacial periods of middle Pleistocene and then split into two large populations that first increased in size before going through severe bottlenecks and expanding into their current ranges. Secondary contact appears to have been established after the last glacial maximum. The severity of the bottlenecks at the last glacial maximum is indicated by the discrepancy between current effective population sizes (20,000–80,000) and census sizes (5–50 million birds) of the two species. The recent divergence time challenges the supposition that avian speciation is a relatively slow process with extended times for intrinsic postzygotic reproductive barriers to evolve. Our study emphasizes the importance of using genome-wide data to

  20. Inferring Selective Constraint from Population Genomic Data Suggests Recent Regulatory Turnover in the Human Brain

    PubMed Central

    Schrider, Daniel R.; Kern, Andrew D.

    2015-01-01

    The comparative genomics revolution of the past decade has enabled the discovery of functional elements in the human genome via sequence comparison. While that is so, an important class of elements, those specific to humans, is entirely missed by searching for sequence conservation across species. Here we present an analysis based on variation data among human genomes that utilizes a supervised machine learning approach for the identification of human-specific purifying selection in the genome. Using only allele frequency information from the complete low-coverage 1000 Genomes Project data set in conjunction with a support vector machine trained from known functional and nonfunctional portions of the genome, we are able to accurately identify portions of the genome constrained by purifying selection. Our method identifies previously known human-specific gains or losses of function and uncovers many novel candidates. Candidate targets for gain and loss of function along the human lineage include numerous putative regulatory regions of genes essential for normal development of the central nervous system, including a significant enrichment of gain of function events near neurotransmitter receptor genes. These results are consistent with regulatory turnover being a key mechanism in the evolution of human-specific characteristics of brain development. Finally, we show that the majority of the genome is unconstrained by natural selection currently, in agreement with what has been estimated from phylogenetic methods but in sharp contrast to estimates based on transcriptomics or other high-throughput functional methods. PMID:26590212

  1. Reference set of regulons in Desulfovibrionales inferred by comparative genomics approach

    SciTech Connect

    Kazakov, A.E.; Rodionov, D.A.; Price, M.N.; Arkin, A.P.; Dubchak, I.; Novichkov, P.S.

    2010-11-15

    In this study, we carried out large-scale comparative genomics analysis of regulatory interactions in Desulfovibrio vulgaris and 12 related genomes from Desulfovibrionales order using our recently developed web server RegPredict (http://regpredict.lbl.gov). An overall reference collection of 26 Desulfovibrionales regulogs can be accessed through RegPrecise database (http://regpredict.lbl.gov).

  2. Mechanisms underlying structural variant formation in genomic disorders

    PubMed Central

    Carvalho, Claudia M. B.; Lupski, James R.

    2016-01-01

    With the recent burst of technological developments in genomics, and the clinical implementation of genome-wide assays, our understanding of the molecular basis of genomic disorders, specifically the contribution of structural variation to disease burden, is evolving quickly. Ongoing studies have revealed a ubiquitous role for genome architecture in the formation of structural variants at a given locus, both in DNA recombination-based processes and in replication-based processes. These reports showcase the influence of repeat sequences on genomic stability and structural variant complexity and also highlight the tremendous plasticity and dynamic nature of our genome in evolution, health and disease susceptibility. PMID:26924765

  3. Predicting Protein Function by Genomic Context: Quantitative Evaluation and Qualitative Inferences

    PubMed Central

    Huynen, Martijn; Snel, Berend; Lathe, Warren; Bork, Peer

    2000-01-01

    Various new methods have been proposed to predict functional interactions between proteins based on the genomic context of their genes. The types of genomic context that they use are Type I: the fusion of genes; Type II: the conservation of gene-order or co-occurrence of genes in potential operons; and Type III: the co-occurrence of genes across genomes (phylogenetic profiles). Here we compare these types for their coverage, their correlations with various types of functional interaction, and their overlap with homology-based function assignment. We apply the methods to Mycoplasma genitalium, the standard benchmarking genome in computational and experimental genomics. Quantitatively, conservation of gene order is the technique with the highest coverage, applying to 37% of the genes. By combining gene order conservation with gene fusion (6%), the co-occurrence of genes in operons in absence of gene order conservation (8%), and the co-occurrence of genes across genomes (11%), significant context information can be obtained for 50% of the genes (the categories overlap). Qualitatively, we observe that the functional interactions between genes are stronger as the requirements for physical neighborhood on the genome are more stringent, while the fraction of potential false positives decreases. Moreover, only in cases in which gene order is conserved in a substantial fraction of the genomes, in this case six out of twenty-five, does a single type of functional interaction (physical interaction) clearly dominate (>80%). In other cases, complementary function information from homology searches, which is available for most of the genes with significant genomic context, is essential to predict the type of interaction. Using a combination of genomic context and homology searches, new functional features can be predicted for 10% of M. genitalium genes. PMID:10958638

  4. Inferring Meaning from Syntactic Structures in Acquisition: The Case of Transitivity and Telicity

    ERIC Educational Resources Information Center

    Wagner, Laura

    2010-01-01

    This paper investigated children's ability to use syntactic structures to infer semantic information. The particular syntax-semantics link examined was the one between transitivity (transitive/intransitive structures) and telicity (telic/atelic perspectives; that is, boundedness). Although transitivity is an important syntactic reflex of telicity,…

  5. Genome instability mechanisms and the structure of cancer genomes.

    PubMed

    Cassidy, Liam D; Venkitaraman, Ashok R

    2012-02-01

    Genomic instability is a hallmark of cancer cells, and arises from the aberrations that these cells exhibit in the normal biological mechanisms that repair and replicate the genome, or ensure its accurate segregation during cell division. Increasingly detailed descriptions of cancer genomes have begun to emerge from next-generation sequencing (NGS), providing snapshots of their nature and heterogeneity in different cancers at different stages in their evolution. Here, we attempt to extract from these sequencing studies insights into the role of genome instability mechanisms in carcinogenesis, and to identify challenges impeding further progress.

  6. Systematic Prioritization of Druggable Mutations in ∼5000 Genomes Across 16 Cancer Types Using a Structural Genomics-based Approach

    PubMed Central

    Zhao, Junfei; Cheng, Feixiong; Wang, Yuanyuan; Arteaga, Carlos L.; Zhao, Zhongming

    2016-01-01

    A massive amount of somatic mutations has been cataloged in large-scale projects such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium projects. The majority of the somatic mutations found in tumor genomes are neutral “passenger” rather than damaging “driver” mutations. Now, understanding their biological consequences and prioritizing them for druggable targets are urgently needed. Thanks to the rapid advances in structural genomics technologies (e.g. X-ray), large-scale protein structural data has now been made available, providing critical information for deciphering functional roles of mutations in cancer and prioritizing those alterations that may mediate drug binding at the atom resolution and, as such, be druggable targets. We hypothesized that mutations at protein–ligand binding-site residues are likely to be druggable targets. Thus, to prioritize druggable mutations, we developed SGDriver, a structural genomics-based method incorporating the somatic missense mutations into protein–ligand binding-site residues using a Bayes inference statistical framework. We applied SGDriver to 746,631 missense mutations observed in 4997 tumor-normal pairs across 16 cancer types from The Cancer Genome Atlas. SGDriver detected 14,471 potential druggable mutations in 2091 proteins (including 1,516 recurrently mutated proteins) across 3558 cancer genomes (71.2%), and further identified 298 proteins harboring mutations that were significantly enriched at protein–ligand binding-site residues (adjusted p value < 0.05). The identified proteins are significantly enriched in both oncoproteins and tumor suppressors. The follow-up drug-target network analysis suggested 98 known and 126 repurposed druggable anticancer targets (e.g. SPOP and NR3C1). Furthermore, our integrative analysis indicated that 13% of patients might benefit from current targeted therapy, and this proportion would increase to 31% when considering drug repositioning

  7. Protein NMR Structure Refinement based on Bayesian Inference

    NASA Astrophysics Data System (ADS)

    Ikeya, Teppei; Ikeda, Shiro; Kigawa, Takanori; Ito, Yutaka; Güntert, Peter

    2016-03-01

    Nuclear Magnetic Resonance (NMR) spectroscopy is a tool to investigate three-dimensional (3D) structures and dynamics of biomacromolecules at atomic resolution in solution or more natural environments such as living cells. Since NMR data are principally only spectra with peak signals, it is required to properly deduce structural information from the sparse experimental data with their imperfections and uncertainty, and to visualize 3D conformations by NMR structure calculation. In order to efficiently analyse the data, Rieping et al. proposed a new structure calculation method based on Bayes’ theorem. We implemented a similar approach into the program CYANA with some modifications. It allows us to handle automatic NOE cross peak assignments in unambiguous and ambiguous usages, and to create a prior distribution based on a physical force field with the generalized Born implicit water model. The sampling scheme for obtaining the posterior is performed by a hybrid Monte Carlo algorithm combined with Markov chain Monte Carlo (MCMC) by the Gibbs sampler, and molecular dynamics simulation (MD) for obtaining a canonical ensemble of conformations. Since it is not trivial to search the entire function space particularly for exploring the conformational prior due to the extraordinarily large conformation space of proteins, the replica exchange method is performed, in which several MCMC calculations with different temperatures run in parallel as replicas. It is shown with simulated data or randomly deleted experimental peaks that the new structure calculation method can provide accurate structures even with fewer peaks, especially compared with the conventional method. In particular, it dramatically improves in-cell structures of the proteins GB1 and TTHA1718 using exclusively information obtained in living Escherichia coli (E. coli) cells.

  8. High-level phylogeny of the Coleoptera inferred with mitochondrial genome sequences.

    PubMed

    Yuan, Ming-Long; Zhang, Qi-Lin; Zhang, Li; Guo, Zhong-Long; Liu, Yong-Jian; Shen, Yu-Ying; Shao, Renfu

    2016-11-01

    The Coleoptera (beetles) exhibits tremendous morphological, ecological, and behavioral diversity. To better understand the phylogenetics and evolution of beetles, we sequenced three complete mitogenomes from two families (Cleridae and Meloidae), which share conserved mitogenomic features with other completely sequenced beetles. We assessed the influence of six datasets and three inference methods on topology and nodal support within the Coleoptera. We found that both Bayesian inference and maximum likelihood with homogeneous-site models were greatly affected by nucleotide compositional heterogeneity, while the heterogeneous-site mixture model in PhyloBayes could provide better phylogenetic signals for the Coleoptera. The amino acid dataset generated more reliable tree topology at the higher taxonomic levels (i.e. suborders and series), where the inclusion of rRNA genes and the third positions of protein-coding genes improved phylogenetic inference at the superfamily level, especially under a heterogeneous-site model. We recovered the suborder relationships as (Archostemata+Adephaga)+(Myxophaga+Polyphaga). The series relationships within Polyphaga were recovered as (Scirtiformia+(Elateriformia+((Bostrichiformia+Scarabaeiformia+Staphyliniformia)+Cucujiformia))). All superfamilies within Cucujiformia were recovered as monophyletic. We obtained a cucujiform phylogeny of (Cleroidea+(Coccinelloidea+((Lymexyloidea+Tenebrionoidea)+(Cucujoidea+(Chrysomeloidea+Curculionoidea))))). This study showed that although tree topologies were sensitive to data types and inference methods, mitogenomic data could provide useful information for resolving the Coleoptera phylogeny at various taxonomic levels by using suitable datasets and heterogeneous-site models.

  9. The Generator of the Event Structure Lexicon (GESL): Automatic Annotation of Event Structure for Textual Inference Tasks

    ERIC Educational Resources Information Center

    Im, Seohyun

    2013-01-01

    This dissertation aims to develop the Generator of the Event Structure Lexicon (GESL) which is a tool to automate annotating the event structure of verbs in text to support textual inference tasks related to lexically entailed subevents. The output of the GESL is the Event Structure Lexicon (ESL), which is a lexicon of verbs in text which includes…

  11. Complete Chloroplast Genome of the Wollemi Pine (Wollemia nobilis): Structure and Evolution

    PubMed Central

    Yap, Jia-Yee S.; Rohner, Thore; Greenfield, Abigail; Van Der Merwe, Marlien; McPherson, Hannah; Glenn, Wendy; Kornfeld, Geoff; Marendy, Elessa; Pan, Annie Y. H.; Wilkins, Marc R.; Rossetto, Maurizio; Delaney, Sven K.

    2015-01-01

    The Wollemi pine (Wollemia nobilis) is a rare Southern conifer with striking morphological similarity to fossil pines. A small population of W. nobilis was discovered in 1994 in a remote canyon system in the Wollemi National Park (near Sydney, Australia). This population contains fewer than 100 individuals and is critically endangered. Previous genetic studies of the Wollemi pine have investigated its evolutionary relationship with other pines in the family Araucariaceae, and have suggested that the Wollemi pine genome contains little or no variation. However, these studies were performed prior to the widespread use of genome sequencing, and their conclusions were based on a limited fraction of the Wollemi pine genome. In this study, we address this problem by determining the entire sequence of the W. nobilis chloroplast genome. A detailed analysis of the structure of the genome is presented, and the evolution of the genome is inferred by comparison with the chloroplast sequences of other members of the Araucariaceae and the related family Podocarpaceae. Pairwise alignments of whole genome sequences, and the presence of unique pseudogenes, gene duplications and insertions in W. nobilis and Araucariaceae, indicate that the W. nobilis chloroplast genome is most similar to that of its sister taxon Agathis. However, the W. nobilis genome contains an unusually high number of repetitive sequences, and these could be used in future studies to investigate and conserve any remnant genetic diversity in the Wollemi pine. PMID:26061691

  12. Chloroplast genome structure in Ilex (Aquifoliaceae)

    PubMed Central

    Yao, Xin; Tan, Yun-Hong; Liu, Ying-Ying; Song, Yu; Yang, Jun-Bo; Corlett, Richard T.

    2016-01-01

    Aquifoliaceae is the largest family in the campanulid order Aquifoliales. It consists of a single genus, Ilex, the hollies, which is the largest woody dioecious genus in the angiosperms. Most species are in East Asia or South America. The taxonomy and evolutionary history remain unclear due to the lack of a robust species-level phylogeny. We produced the first complete chloroplast genomes in this family, including seven Ilex species, by Illumina sequencing of long-range PCR products and subsequent reference-guided de novo assembly. These genomes have a typical bicyclic structure with a conserved genome arrangement and moderate divergence. The total length is 157,741 bp and there is one large single-copy region (LSC) with 87,109 bp, one small single-copy with 18,436 bp, and a pair of inverted repeat regions (IR) with 52,196 bp. A total of 144 genes were identified, including 96 protein-coding genes, 40 tRNA and 8 rRNA. Thirty-four repetitive sequences were identified in Ilex pubescens, with lengths >14 bp and identity >90%, and 11 divergence hotspot regions that could be targeted for phylogenetic markers. This study will contribute to improved resolution of deep branches of the Ilex phylogeny and facilitate identification of Ilex species. PMID:27378489

  13. Chloroplast genome structure in Ilex (Aquifoliaceae).

    PubMed

    Yao, Xin; Tan, Yun-Hong; Liu, Ying-Ying; Song, Yu; Yang, Jun-Bo; Corlett, Richard T

    2016-07-05

    Aquifoliaceae is the largest family in the campanulid order Aquifoliales. It consists of a single genus, Ilex, the hollies, which is the largest woody dioecious genus in the angiosperms. Most species are in East Asia or South America. The taxonomy and evolutionary history remain unclear due to the lack of a robust species-level phylogeny. We produced the first complete chloroplast genomes in this family, including seven Ilex species, by Illumina sequencing of long-range PCR products and subsequent reference-guided de novo assembly. These genomes have a typical bicyclic structure with a conserved genome arrangement and moderate divergence. The total length is 157,741 bp and there is one large single-copy region (LSC) with 87,109 bp, one small single-copy with 18,436 bp, and a pair of inverted repeat regions (IR) with 52,196 bp. A total of 144 genes were identified, including 96 protein-coding genes, 40 tRNA and 8 rRNA. Thirty-four repetitive sequences were identified in Ilex pubescens, with lengths >14 bp and identity >90%, and 11 divergence hotspot regions that could be targeted for phylogenetic markers. This study will contribute to improved resolution of deep branches of the Ilex phylogeny and facilitate identification of Ilex species.

  14. Non-Bayesian Inference: Causal Structure Trumps Correlation

    ERIC Educational Resources Information Center

    Bes, Benedicte; Sloman, Steven; Lucas, Christopher G.; Raufaste, Eric

    2012-01-01

    The study tests the hypothesis that conditional probability judgments can be influenced by causal links between the target event and the evidence even when the statistical relations among variables are held constant. Three experiments varied the causal structure relating three variables and found that (a) the target event was perceived as more…

  15. Alternative Multiple Imputation Inference for Mean and Covariance Structure Modeling

    ERIC Educational Resources Information Center

    Lee, Taehun; Cai, Li

    2012-01-01

    Model-based multiple imputation has become an indispensable method in the educational and behavioral sciences. Mean and covariance structure models are often fitted to multiply imputed data sets. However, the presence of multiple random imputations complicates model fit testing, which is an important aspect of mean and covariance structure…

  18. The Lithospheric Structure of Southeast China, Inferred from Magnetotelluric Data

    NASA Astrophysics Data System (ADS)

    Xu, S.; Unsworth, M. J.; Hu, X.

    2016-12-01

    The South China block is a major structural unit in southern China which was assembled in the Neoproterozoic by the collision of the Yangtze craton in the west and the Cathaysia in the east along the Jiangshan-Shaoxing suture. Despite a significant number of studies, a number of questions about the structure and tectonic history of the South China block remain unresolved. These include the location and geometry of the Jiangshan-Shaoxing suture and the geodynamic processes that caused the Mesozoic extension and magmatism. The lithospheric structure of South China block is significant to understanding the geological framework of the Eurasian continent. Magnetotellurics (MT) is a useful tool to study the structure of ancient subduction zones, extensional regimes and tectonothermal events. We used broadband MT data to study the resistivity structure of the South China block. This included MT data collected from 2009 to 2010 during the Sinoprobe project and additional MT surveys in 2016. A 2-D inversion was first performed to derive resistivity models along seven profiles. However, these 2-D inversions were unable to fit some of data that exhibited out-of-quadrant phases, which suggested the existence of 3-D resistivity structure or current channeling. Therefore, 3-D resistivity structures were investigated using inversions of the full impedance tensor with the ModEM inversion code. The 3-D model showed a number of major resistivity features: (1) The Jiangshan-Shaoxing suture was imaged as an east dipping conductive zone. (2) High resistivity anomalies were found at the Jiangnan orogen, corresponding to the thickened lithosphere. (3) To the east of the Jiangnan orogen, low resistivity zones resolved in the lower crust and upper mantle suggested lithospheric thinning beneath the Wuyi-yunkai orogen. This may indicate the Paleozoic Wuyi-yunkai orogen was largely reworked in the Mesozoic extension of the South China block. However, the extension did not lead to a consistent

  19. Inferring Speciation Processes from Patterns of Natural Variation in Microbial Genomes

    PubMed Central

    Krause, David J.; Whitaker, Rachel J.

    2015-01-01

    Microbial species concepts have long been the focus of contentious debate, fueled by technological limitations to the genetic resolution of species, by the daunting task of investigating phenotypic variation among individual microscopic organisms, and by a lack of understanding of gene flow in reproductively asexual organisms that are prone to promiscuous horizontal gene transfer. Population genomics, the emerging approach of analyzing the complete genomes of a multitude of closely related organisms, is poised to overcome these limitations by providing a window into patterns of genome variation revealing the evolutionary processes through which species diverge. This new approach is more than just an extension of previous multilocus sequencing technologies, in that it provides a comprehensive view of interacting evolutionary processes. Here we argue that the application of population genomic tools in a rigorous population genetic framework will help to identify the processes of microbial speciation and ultimately lead to a general species concept based on the unique biology and ecology of microorganisms. PMID:26316424

  20. Demographic History of the Genus Pan Inferred from Whole Mitochondrial Genome Reconstructions.

    PubMed

    Lobon, Irene; Tucci, Serena; de Manuel, Marc; Ghirotto, Silvia; Benazzo, Andrea; Prado-Martinez, Javier; Lorente-Galdos, Belen; Nam, Kiwoong; Dabad, Marc; Hernandez-Rodriguez, Jessica; Comas, David; Navarro, Arcadi; Schierup, Mikkel H; Andres, Aida M; Barbujani, Guido; Hvilsom, Christina; Marques-Bonet, Tomas

    2016-07-03

    The genus Pan is the closest genus to our own and it includes two species, Pan paniscus (bonobos) and Pan troglodytes (chimpanzees). The latter is constituted by four subspecies, all highly endangered. The study of the Pan genera has been incessantly complicated by the intricate relationship among subspecies and the statistical limitations imposed by the reduced number of samples or genomic markers analyzed. Here, we present a new method to reconstruct complete mitochondrial genomes (mitogenomes) from whole genome shotgun (WGS) datasets, mtArchitect, showing that its reconstructions are highly accurate and consistent with long-range PCR mitogenomes. We used this approach to build the mitochondrial genomes of 20 newly sequenced samples which, together with available genomes, allowed us to analyze the hitherto most complete Pan mitochondrial genome dataset including 156 chimpanzee and 44 bonobo individuals, with a proportional contribution from all chimpanzee subspecies. We estimated the separation time between chimpanzees and bonobos around 1.15 million years ago (Mya) [0.81-1.49]. Further, we found that under the most probable genealogical model the two clades of chimpanzees, Western + Nigeria-Cameroon and Central + Eastern, separated at 0.59 Mya [0.41-0.78] with further internal separations at 0.32 Mya [0.22-0.43] and 0.16 Mya [0.17-0.34], respectively. Finally, for a subset of our samples, we compared nuclear versus mitochondrial genomes and we found that chimpanzee subspecies have different patterns of nuclear and mitochondrial diversity, which could be a result of either processes affecting the mitochondrial genome, such as hitchhiking or background selection, or a result of population dynamics. © The Author(s) 2016. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

  1. Inferring Selective Constraint from Population Genomic Data Suggests Recent Regulatory Turnover in the Human Brain.

    PubMed

    Schrider, Daniel R; Kern, Andrew D

    2015-11-19

    The comparative genomics revolution of the past decade has enabled the discovery of functional elements in the human genome via sequence comparison. While that is so, an important class of elements, those specific to humans, is entirely missed by searching for sequence conservation across species. Here we present an analysis based on variation data among human genomes that utilizes a supervised machine learning approach for the identification of human-specific purifying selection in the genome. Using only allele frequency information from the complete low-coverage 1000 Genomes Project data set in conjunction with a support vector machine trained from known functional and nonfunctional portions of the genome, we are able to accurately identify portions of the genome constrained by purifying selection. Our method identifies previously known human-specific gains or losses of function and uncovers many novel candidates. Candidate targets for gain and loss of function along the human lineage include numerous putative regulatory regions of genes essential for normal development of the central nervous system, including a significant enrichment of gain of function events near neurotransmitter receptor genes. These results are consistent with regulatory turnover being a key mechanism in the evolution of human-specific characteristics of brain development. Finally, we show that the majority of the genome is unconstrained by natural selection currently, in agreement with what has been estimated from phylogenetic methods but in sharp contrast to estimates based on transcriptomics or other high-throughput functional methods. © The Author(s) 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

  2. Demographic History of the Genus Pan Inferred from Whole Mitochondrial Genome Reconstructions

    PubMed Central

    Tucci, Serena; de Manuel, Marc; Ghirotto, Silvia; Benazzo, Andrea; Prado-Martinez, Javier; Lorente-Galdos, Belen; Nam, Kiwoong; Dabad, Marc; Hernandez-Rodriguez, Jessica; Comas, David; Navarro, Arcadi; Schierup, Mikkel H.; Andres, Aida M.; Barbujani, Guido; Hvilsom, Christina; Marques-Bonet, Tomas

    2016-01-01

    The genus Pan is the closest genus to our own and it includes two species, Pan paniscus (bonobos) and Pan troglodytes (chimpanzees). The latter is constituted by four subspecies, all highly endangered. The study of the genus Pan has been incessantly complicated by the intricate relationship among subspecies and the statistical limitations imposed by the reduced number of samples or genomic markers analyzed. Here, we present a new method to reconstruct complete mitochondrial genomes (mitogenomes) from whole genome shotgun (WGS) datasets, mtArchitect, showing that its reconstructions are highly accurate and consistent with long-range PCR mitogenomes. We used this approach to build the mitochondrial genomes of 20 newly sequenced samples which, together with available genomes, allowed us to analyze the hitherto most complete Pan mitochondrial genome dataset including 156 chimpanzee and 44 bonobo individuals, with a proportional contribution from all chimpanzee subspecies. We estimated the separation time between chimpanzees and bonobos around 1.15 million years ago (Mya) [0.81–1.49]. Further, we found that under the most probable genealogical model the two clades of chimpanzees, Western + Nigeria-Cameroon and Central + Eastern, separated at 0.59 Mya [0.41–0.78] with further internal separations at 0.32 Mya [0.22–0.43] and 0.16 Mya [0.17–0.34], respectively. Finally, for a subset of our samples, we compared nuclear versus mitochondrial genomes and we found that chimpanzee subspecies have different patterns of nuclear and mitochondrial diversity, which could be a result of either processes affecting the mitochondrial genome, such as hitchhiking or background selection, or a result of population dynamics. PMID:27345955

  3. Simplified DGS procedure for large-scale genome structural study.

    PubMed

    Jung, Yong-Chul; Xu, Jia; Chen, Jun; Kim, Yeong; Winchester, David; Wang, San Ming

    2009-11-01

    Ditag genome scanning (DGS) uses next-generation DNA sequencing to sequence the ends of ditag fragments produced by restriction enzymes. These sequences are compared to known genome sequences to determine their structure. In order to use DGS for large-scale genome structural studies, we have substantially revised the original protocol by replacing the in vivo genomic DNA cloning with in vitro adaptor ligation, eliminating the ditag concatemerization steps, and replacing the 454 sequencer with Solexa or SOLiD sequencers for ditag sequence collection. This revised protocol further increases genome coverage and resolution and allows DGS to be used to analyze multiple genomes simultaneously.

  4. Use of Bayesian Inference in Crystallographic Structure Refinement via Full Diffraction Profile Analysis

    PubMed Central

    Fancher, Chris M.; Han, Zhen; Levin, Igor; Page, Katharine; Reich, Brian J.; Smith, Ralph C.; Wilson, Alyson G.; Jones, Jacob L.

    2016-01-01

    A Bayesian inference method for refining crystallographic structures is presented. The distribution of model parameters is stochastically sampled using Markov chain Monte Carlo. Posterior probability distributions are constructed for all model parameters to properly quantify uncertainty by appropriately modeling the heteroskedasticity and correlation of the error structure. The proposed method is demonstrated by analyzing a National Institute of Standards and Technology silicon standard reference material. The results obtained by Bayesian inference are compared with those determined by Rietveld refinement. Posterior probability distributions of model parameters provide both estimates and uncertainties. The new method better estimates the true uncertainties in the model as compared to the Rietveld method. PMID:27550221

  5. Use of Bayesian Inference in Crystallographic Structure Refinement via Full Diffraction Profile Analysis.

    PubMed

    Fancher, Chris M; Han, Zhen; Levin, Igor; Page, Katharine; Reich, Brian J; Smith, Ralph C; Wilson, Alyson G; Jones, Jacob L

    2016-08-23

    A Bayesian inference method for refining crystallographic structures is presented. The distribution of model parameters is stochastically sampled using Markov chain Monte Carlo. Posterior probability distributions are constructed for all model parameters to properly quantify uncertainty by appropriately modeling the heteroskedasticity and correlation of the error structure. The proposed method is demonstrated by analyzing a National Institute of Standards and Technology silicon standard reference material. The results obtained by Bayesian inference are compared with those determined by Rietveld refinement. Posterior probability distributions of model parameters provide both estimates and uncertainties. The new method better estimates the true uncertainties in the model as compared to the Rietveld method.

  6. Stock Portfolio Structure of Individual Investors Infers Future Trading Behavior

    PubMed Central

    Bohlin, Ludvig; Rosvall, Martin

    2014-01-01

    Although the understanding of and motivation behind individual trading behavior is an important puzzle in finance, little is known about the connection between an investor's portfolio structure and her trading behavior in practice. In this paper, we investigate the relation between what stocks investors hold, and what stocks they buy, and show that investors with similar portfolio structures to a great extent trade in a similar way. With data from the central register of shareholdings in Sweden, we model the market in a similarity network, by considering investors as nodes, connected with links representing portfolio similarity. From the network, we find investor groups that not only identify different investment strategies, but also represent individual investors trading in a similar way. These findings suggest that the stock portfolios of investors hold meaningful information, which could be used to earn a better understanding of stock market dynamics. PMID:25068302

  7. Stock portfolio structure of individual investors infers future trading behavior.

    PubMed

    Bohlin, Ludvig; Rosvall, Martin

    2014-01-01

    Although the understanding of and motivation behind individual trading behavior is an important puzzle in finance, little is known about the connection between an investor's portfolio structure and her trading behavior in practice. In this paper, we investigate the relation between what stocks investors hold, and what stocks they buy, and show that investors with similar portfolio structures to a great extent trade in a similar way. With data from the central register of shareholdings in Sweden, we model the market in a similarity network, by considering investors as nodes, connected with links representing portfolio similarity. From the network, we find investor groups that not only identify different investment strategies, but also represent individual investors trading in a similar way. These findings suggest that the stock portfolios of investors hold meaningful information, which could be used to earn a better understanding of stock market dynamics.

  8. Parameter and Structure Inference for Nonlinear Dynamical Systems

    NASA Technical Reports Server (NTRS)

    Morris, Robin D.; Smelyanskiy, Vadim N.; Millonas, Mark

    2006-01-01

    A great many systems can be modeled in the non-linear dynamical systems framework as dx/dt = f(x) + ξ(t), where f() is the potential function for the system and ξ is the excitation noise. Modeling the potential using a set of basis functions, we derive the posterior for the basis coefficients. A more challenging problem is to determine the set of basis functions that are required to model a particular system. We show that, using the Bayesian Information Criterion (BIC) to rank models together with the beam search technique, we can accurately determine the structure of simple non-linear dynamical system models, and the structure of the coupling between non-linear dynamical systems where the individual systems are known. This last case has important ecological applications.
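
    As a rough illustration of the model-ranking idea in this abstract, the sketch below fits several candidate basis sets to noisy samples of a one-dimensional system and ranks them by BIC. It is not the authors' implementation. The synthetic system (dx/dt = x - x^3 plus Gaussian noise), the three candidate basis sets, the helper names solve_linear and bic_for_basis, and the use of ordinary least squares in place of a full posterior are all assumptions made for illustration. The model is read here as dx/dt = f(x) + ξ(t).

```python
import math
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def bic_for_basis(xs, dxs, basis):
    """Least-squares fit dx/dt ~ sum_k c_k * basis[k](x); return (BIC, coefficients)."""
    n, k = len(xs), len(basis)
    phi = [[f(x) for f in basis] for x in xs]  # design matrix
    gram = [[sum(row[i] * row[j] for row in phi) for j in range(k)] for i in range(k)]
    rhs = [sum(row[i] * d for row, d in zip(phi, dxs)) for i in range(k)]
    coef = solve_linear(gram, rhs)  # normal equations
    rss = sum((d - sum(c * p for c, p in zip(coef, row))) ** 2
              for row, d in zip(phi, dxs))
    # BIC for Gaussian errors, up to an additive constant shared by all models.
    return n * math.log(rss / n) + k * math.log(n), coef


# Synthetic data (an assumption for illustration): dx/dt = x - x^3 + noise.
rng = random.Random(0)
xs = [rng.uniform(-2.0, 2.0) for _ in range(2000)]
dxs = [x - x ** 3 + rng.gauss(0.0, 0.1) for x in xs]

# Hypothetical candidate basis sets; the lowest BIC is preferred.
candidates = {
    "x": [lambda x: x],
    "x, x^3": [lambda x: x, lambda x: x ** 3],
    "1, x, x^2, x^3": [lambda x: 1.0, lambda x: x, lambda x: x ** 2, lambda x: x ** 3],
}

ranked = sorted((bic_for_basis(xs, dxs, basis)[0], name)
                for name, basis in candidates.items())
for bic, name in ranked:
    print(f"{name:>15s}  BIC = {bic:.1f}")
```

    The sketch scores a fixed, hand-written list of candidates. The beam search mentioned in the abstract would instead grow candidate basis sets step by step and keep only the best-scoring few at each step; that part is not shown here.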

  9. Genome-Wide Structural Variation Detection by Genome Mapping on Nanochannel Arrays.

    PubMed

    Mak, Angel C Y; Lai, Yvonne Y Y; Lam, Ernest T; Kwok, Tsz-Piu; Leung, Alden K Y; Poon, Annie; Mostovoy, Yulia; Hastie, Alex R; Stedman, William; Anantharaman, Thomas; Andrews, Warren; Zhou, Xiang; Pang, Andy W C; Dai, Heng; Chu, Catherine; Lin, Chin; Wu, Jacob J K; Li, Catherine M L; Li, Jing-Woei; Yim, Aldrin K Y; Chan, Saki; Sibert, Justin; Džakula, Željko; Cao, Han; Yiu, Siu-Ming; Chan, Ting-Fung; Yip, Kevin Y; Xiao, Ming; Kwok, Pui-Yan

    2016-01-01

    Comprehensive whole-genome structural variation detection is challenging with current approaches. With diploid cells as DNA source and the presence of numerous repetitive elements, short-read DNA sequencing cannot be used to detect structural variation efficiently. In this report, we show that genome mapping with long, fluorescently labeled DNA molecules imaged on nanochannel arrays can be used for whole-genome structural variation detection without sequencing. While whole-genome haplotyping is not achieved, local phasing (across >150-kb regions) is routine, as molecules from the parental chromosomes are examined separately. In one experiment, we generated genome maps from a trio from the 1000 Genomes Project, compared the maps against that derived from the reference human genome, and identified structural variations that are >5 kb in size. We find that these individuals have many more structural variants than those published, including some with the potential of disrupting gene function or regulation. Copyright © 2016 by the Genetics Society of America.

  10. Structural influence of gene networks on their inference: analysis of C3NET

    PubMed Central

    2011-01-01

    Background: The availability of large-scale high-throughput data poses considerable challenges for their functional analysis. For this reason, gene network inference methods have gained considerable interest. However, our current knowledge, especially about the influence of the structure of a gene network on its inference, is limited. Results: In this paper we present a comprehensive investigation of the structural influence of gene networks on the inferential characteristics of C3NET - a recently introduced gene network inference algorithm. We employ local as well as global performance metrics in combination with an ensemble approach. The results from our numerical study for various biological and synthetic network structures and simulation conditions, also comparing C3NET with other inference algorithms, lead to a multitude of theoretical and practical insights into the working behavior of C3NET. In addition, in order to facilitate the practical usage of C3NET we provide a user-friendly R package, called c3net, and describe its functionality. It is available from https://r-forge.r-project.org/projects/c3net and from the CRAN package repository. Conclusions: The availability of gene network inference algorithms with known inferential properties opens a new era of large-scale screening experiments that could be equally beneficial for basic biological and biomedical research with auspicious prospects. The availability of our easy to use software package c3net may contribute to the popularization of such methods. Reviewers: This article was reviewed by Lev Klebanov, Joel Bader and Yuriy Gusev. PMID:21696592

  11. Comparative genomics of four Liliales families inferred from the complete chloroplast genome sequence of Veratrum patulum O. Loes. (Melanthiaceae).

    PubMed

    Do, Hoang Dang Khoa; Kim, Jung Sung; Kim, Joo-Hwan

    2013-11-10

    The sequence of the chloroplast genome, which is inherited maternally, contains useful information for many scientific fields such as plant systematics, biogeography and biotechnology because its characteristics are highly conserved among species. The number of sequenced angiosperm chloroplast genomes has increased in recent years. In this study, the nucleotide sequence of the chloroplast genome (cpDNA) of Veratrum patulum Loes. (Melanthiaceae, Liliales) was analyzed completely. The circular double-stranded DNA of 153,699 bp consists of two inverted repeat (IR) regions of 26,360 bp each, a large single copy of 83,372 bp, and a small single copy of 17,607 bp. This plastome contains 81 protein-coding genes, 30 distinct tRNA genes and four rRNA genes. In addition, there are six hypothetical coding regions (ycf1, ycf2, ycf3, ycf4, ycf15 and ycf68) and two open reading frames (ORF42 and ORF56), which are also found in the chloroplast genomes of the other species. The gene order and gene content of the V. patulum plastid genome are similar to those of Smilax china, Lilium longiflorum and Alstroemeria aurea, members of the Smilacaceae, Liliaceae and Alstroemeriaceae (Liliales), respectively. However, the loss of rps16 exon 2 in V. patulum results in a difference in the large single copy region in comparison with other species. The base substitution rate is quite similar among genes of these species. Additionally, the base substitution rate of the inverted repeat region was smaller than that of the single copy regions in all observed species of Liliales. The IR regions were expanded to trnH_GUG in V. patulum, a part of rps19 in L. longiflorum and A. aurea, and the whole sequence of rps19 in S. china. Furthermore, the IGS lengths of the rbcL-accD-psaI region were variable among Liliales species, suggesting that this region might be a hotspot of indel events and an informative site for phylogenetic studies in Liliales. In general, the whole chloroplast genome of V. patulum, a

  12. Inference of Expanded Lrp-Like Feast/Famine Transcription Factor Targets in a Non-Model Organism Using Protein Structure-Based Prediction

    PubMed Central

    Ashworth, Justin; Plaisier, Christopher L.; Lo, Fang Yin; Reiss, David J.; Baliga, Nitin S.

    2014-01-01

    Widespread microbial genome sequencing presents an opportunity to understand the gene regulatory networks of non-model organisms. This requires knowledge of the binding sites for transcription factors whose DNA-binding properties are unknown or difficult to infer. We adapted a protein structure-based method to predict the specificities and putative regulons of homologous transcription factors across diverse species. As a proof-of-concept we predicted the specificities and transcriptional target genes of divergent archaeal feast/famine regulatory proteins, several of which are encoded in the genome of Halobacterium salinarum. This was validated by comparison to experimentally determined specificities for transcription factors in distantly related extremophiles, chromatin immunoprecipitation experiments, and cis-regulatory sequence conservation across eighteen related species of halobacteria. Through this analysis we were able to infer that Halobacterium salinarum employs a divergent local trans-regulatory strategy to regulate genes (carA and carB) involved in arginine and pyrimidine metabolism, whereas Escherichia coli employs an operon. The prediction of gene regulatory binding sites using structure-based methods is useful for the inference of gene regulatory relationships in new species that are otherwise difficult to infer. PMID:25255272

  13. Inferring Earth structure from the response to ocean tidal loads

    NASA Astrophysics Data System (ADS)

    Martens, H. R.; Simons, M.; Ito, T.

    2012-12-01

    Tidal forces, generated primarily by gravitational interactions with the moon and Sun, distort the shape of Earth's solid interior (body tides) and redistribute the mass of the oceans (ocean tides). The periodic shifting of ocean mass places cyclic loads on Earth, with the response to these loads observable as spatial displacements in Global Positioning System (GPS) data. Gravitational and elastic responses of the solid Earth to ocean tidal loads (OTLs) are controlled by the material properties of Earth's interior and may hence be used to constrain independently the absolute values of density and the elastic moduli down to c. 300km depth. Previous analysis of this type focused on structure in the western United States. We present observational results and modeled predictions for OTL-induced surface displacements at nearly 100 GPS stations across Brazil, Argentina, and Uruguay. Relative to the earlier study region, eastern South America is an ideal geographic location to study the effects of OTLs because it is composed primarily of stable shield and platform provinces, implying less structural complexity. Furthermore, the region is bounded to the north and east by large amplitude ocean tides. Obtaining absolute values for material properties in the crust and upper mantle beneath South America could provide valuable insight into the structure of the Amazonian craton and hence knowledge about its long-term stability against tectonic deformation. We extract the amplitude and phase of several main tidal constituents from the GPS data using classical harmonic analysis. We then compare our observations with theoretical predictions drawn from a variety of Earth models. Predicted surface displacements derived from radially symmetric Earth models, such as PREM and ad hoc perturbations to PREM, exhibit spatially correlated residuals, suggesting a need to explore a wider family of models, including those with lateral heterogeneity. Initially we have relied on one

  14. Crustal structure across southern Mexico inferred from gravity data

    NASA Astrophysics Data System (ADS)

    Campos-Enríquez, J. O.; Sánchez-Zamora, O.

    2000-11-01

    We present a gravity model of the crustal structure in southern Mexico based on interpretation of a detailed marine gravity profile perpendicularly across the Middle America Trench offshore from Acapulco, and a regional gravity transect extending into continental Mexico across the Sierra Madre del Sur, the central sector of the Trans-Mexican Volcanic Belt, the Sierra Madre Oriental, the Coastal Plain, and into the Gulf of Mexico. The elastic thickness of the Cocos lithospheric plate was found to be 30 km. In agreement with a previous seismic refraction study, no major differences in crustal structure were observed on both sides of the O'Gorman Fracture Zone. The gravity high seaward of the trench is interpreted as due to the incipient flexure and crustal thinning. The gravity low at the axis of the trench is explained by the increase in water depth and the existence of low-density accreted or continental-derived sediments (2.25 and 2.40 g/cm³). A gravity high of 50 mGal extending about 100 km landward is interpreted as caused by local shoaling of the Moho. The crust attains a thickness of 42 km under the Trans-Mexican Volcanic Belt but thins beneath the Coastal Plain and the continental slope of the Gulf of Mexico. Gravity highs around the Sierra de Tamaulipas are interpreted in terms of relief of the lower-upper crustal interface, implying a shallow basement.

  15. Sparse Bayesian Inference and the Temperature Structure of the Solar Corona

    NASA Astrophysics Data System (ADS)

    Warren, Harry P.; Byers, Jeff M.; Crump, Nicholas A.

    2017-02-01

    Measuring the temperature structure of the solar atmosphere is critical to understanding how it is heated to high temperatures. Unfortunately, the temperature of the upper atmosphere cannot be observed directly, but must be inferred from spectrally resolved observations of individual emission lines that span a wide range of temperatures. Such observations are “inverted” to determine the distribution of plasma temperatures along the line of sight. This inversion is ill posed and, in the absence of regularization, tends to produce wildly oscillatory solutions. We introduce the application of sparse Bayesian inference to the problem of inferring the temperature structure of the solar corona. Within a Bayesian framework a preference for solutions that utilize a minimum number of basis functions can be encoded into the prior and many ad hoc assumptions can be avoided. We demonstrate the efficacy of the Bayesian approach by considering a test library of 40 assumed temperature distributions.

  16. Submarine structure of Reunion Island (Indian Ocean) inferred from gravity

    NASA Astrophysics Data System (ADS)

    Gailler, L.; Lénat, J.

    2008-12-01

    La Réunion is a large (diameter: 220 km; height: 7 km), mostly submerged (97%) oceanic volcanic system. New land and marine gravity data are used to study the structure of its submarine part. The gravity models are interpreted jointly with the published geological interpretations and compared with magnetic models. This allows us to derive a new model of the shallow and internal structure of the submarine flanks. Recent cruises have collected high quality gravity, magnetic and multi-beam swath bathymetry data over the submarine flanks of La Réunion and the surrounding oceanic plate. A new Bouguer anomaly map has been computed for a reduction density of 2.67 × 10³ kg m⁻³. A magnetic anomaly map covering the same area has also been built. Studies based on bathymetric and acoustic data have previously shown the presence of different types of submarine features: a coastal shelf, huge bulges built by debris avalanches and sediment deposits, erosion canyons, volcanic constructions near the coast, isolated seamounts offshore, and elongate volcanic ridges on the Mascarene plate. On the new Bouguer anomaly map, all these features are associated with negative anomalies. They have been modeled using 2¾-D modeling techniques. The short wavelength anomalies over the coastal shelf area can be explained by piles of low density layers. This suggests that they are mostly built by hyaloclastites, which are generally characterized by lower densities than lava flows. The voluminous debris avalanche deposits which formed the huge Submarine Bulges to the east, north, west, and south of the island have also been modeled as low density formations. Each bulge is modeled with an overall density less than 2.67 × 10³ kg m⁻³, in order to account for its long wavelength anomaly. Some shorter wavelength features are superimposed on these long wavelength negative anomalies. They probably represent heterogeneities within the bulges. Some shallow ones can be associated with observed surface geological

  17. Crustal structure beneath northeast India inferred from receiver function modeling

    NASA Astrophysics Data System (ADS)

    Borah, Kajaljyoti; Bora, Dipok K.; Goyal, Ayush; Kumar, Raju

    2016-09-01

    We estimated crustal shear velocity structure beneath ten broadband seismic stations of northeast India using the H-Vp/Vs stacking method and a non-linear direct search approach, the Neighbourhood Algorithm (NA) technique, followed by joint inversion of Rayleigh wave group velocity and receiver functions calculated from teleseismic earthquake data. Results show significant variations of thickness, shear velocities (Vs) and Vp/Vs ratio in the crust of the study region. The inverted shear wave velocity models show crustal thickness variations of 32-36 km in Shillong Plateau (North), 36-40 km in Assam Valley and ∼44 km in Lesser Himalaya (South). Average Vp/Vs ratio in Shillong Plateau is lower (1.73-1.77) than in Assam Valley and Lesser Himalaya (∼1.80). Average crustal shear velocity beneath the study region varies from 3.4 to 3.5 km/s. Sediment structure beneath Shillong Plateau and Assam Valley shows a 1-2 km thick sediment layer with low Vs (2.5-2.9 km/s) and high Vp/Vs ratio (1.8-2.1), while it is observed to be of greater thickness (4 km) with similar Vs and high Vp/Vs (∼2.5) in RUP (Lesser Himalaya). Both Shillong Plateau and Assam Valley show thick upper and middle crust (10-20 km), and thin (4-9 km) lower crust. Average Vp/Vs ratios in Assam Valley and Shillong Plateau suggest that the crust is felsic-to-intermediate and intermediate-to-mafic beneath Shillong Plateau and Assam Valley, respectively. Results show that lower crust rocks beneath the Shillong Plateau and Assam Valley lie between mafic granulite and mafic garnet granulite.

  18. The evolutionary history of Saccharomyces species inferred from completed mitochondrial genomes and revision in the 'yeast mitochondrial genetic code'.

    PubMed

    Sulo, Pavol; Szabóová, Dana; Bielik, Peter; Poláková, Silvia; Šoltys, Katarína; Jatzová, Katarína; Szemes, Tomáš

    2017-06-15

    The yeast Saccharomyces are widely used to test ecological and evolutionary hypotheses. A large number of nuclear genomic DNA sequences are available, but mitochondrial genomic data are insufficient. We completed mitochondrial DNA (mtDNA) sequencing from Illumina MiSeq reads for all Saccharomyces species. All are circularly mapped molecules decreasing in size with phylogenetic distance from Saccharomyces cerevisiae but with similar gene content including regulatory and selfish elements like origins of replication, introns, free-standing open reading frames or GC clusters. Their most profound feature is species-specific alteration in gene order. The genetic code slightly differs from well-established yeast mitochondrial code as GUG is used rarely as the translation start and CGA and CGC code for arginine. The multilocus phylogeny, inferred from mtDNA, does not correlate with the trees derived from nuclear genes. mtDNA data demonstrate that Saccharomyces cariocanus should be assigned as a separate species and Saccharomyces bayanus CBS 380T should not be considered as a distinct species due to mtDNA nearly identical to Saccharomyces uvarum mtDNA. Apparently, comparison of mtDNAs should not be neglected in genomic studies as it is an important tool to understand the origin and evolutionary history of some yeast species. © The Author 2017. Published by Oxford University Press on behalf of Kazusa DNA Research Institute.

  19. Inferences of drug responses in cancer cells from cancer genomic features and compound chemical and therapeutic properties

    PubMed Central

    Wang, Yongcui; Fang, Jianwen; Chen, Shilong

    2016-01-01

    Accurately predicting the response of a cancer patient to a therapeutic agent is a core goal of precision medicine. Existing approaches relied primarily on genomic alterations in cancer cells that have been treated with different drugs. Here we focus on predicting drug response based on integration of heterogeneous pharmacogenomics data from both the cell and drug sides. Through a systematic approach, named PDRCC (Predict Drug Response in Cancer Cells), the cancer genomic alterations and compound chemical and therapeutic properties were incorporated to determine the chemotherapeutic response in cancer patients. Using the Cancer Cell Line Encyclopedia (CCLE) study as the benchmark dataset, all pharmacogenomics data exhibited their roles in inferring the relationships between cancer cells and drugs. When integrating both genomic resources and compound information, the prediction coverage was significantly increased. The validity of PDRCC was also supported by its effectiveness in uncovering unknown cell-drug associations with database and literature evidence. It set the stage for clinical testing of novel therapeutic strategies, such as the sensitive association between cancer cell ‘A549_LUNG’ and compound ‘Topotecan’. In conclusion, PDRCC offers the possibility for faster, safer, and cheaper development of novel anti-cancer therapeutics in early-stage clinical trials. PMID:27645580

  20. Inferences of drug responses in cancer cells from cancer genomic features and compound chemical and therapeutic properties

    NASA Astrophysics Data System (ADS)

    Wang, Yongcui; Fang, Jianwen; Chen, Shilong

    2016-09-01

    Accurately predicting the response of a cancer patient to a therapeutic agent is a core goal of precision medicine. Existing approaches relied primarily on genomic alterations in cancer cells that have been treated with different drugs. Here we focus on predicting drug response based on integration of heterogeneous pharmacogenomics data from both the cell and drug sides. Through a systematic approach, named PDRCC (Predict Drug Response in Cancer Cells), the cancer genomic alterations and compound chemical and therapeutic properties were incorporated to determine the chemotherapeutic response in cancer patients. Using the Cancer Cell Line Encyclopedia (CCLE) study as the benchmark dataset, all pharmacogenomics data exhibited their roles in inferring the relationships between cancer cells and drugs. When integrating both genomic resources and compound information, the prediction coverage was significantly increased. The validity of PDRCC was also supported by its effectiveness in uncovering unknown cell-drug associations with database and literature evidence. It set the stage for clinical testing of novel therapeutic strategies, such as the sensitive association between cancer cell ‘A549_LUNG’ and compound ‘Topotecan’. In conclusion, PDRCC offers the possibility for faster, safer, and cheaper development of novel anti-cancer therapeutics in early-stage clinical trials.

  1. Epigenomics and the structure of the living genome

    PubMed Central

    Friedman, Nir; Rando, Oliver J.

    2015-01-01

    Eukaryotic genomes are packaged into an extensively folded state known as chromatin. Analysis of the structure of eukaryotic chromosomes has been revolutionized by development of a suite of genome-wide measurement technologies, collectively termed “epigenomics.” We review major advances in epigenomic analysis of eukaryotic genomes, covering aspects of genome folding at scales ranging from whole chromosome folding down to nucleotide-resolution assays that provide structural insights into protein-DNA interactions. We then briefly outline several challenges remaining and highlight new developments such as single-cell epigenomic assays that will help provide us with a high-resolution structural understanding of eukaryotic genomes. PMID:26430158

  2. Epigenomics and the structure of the living genome.

    PubMed

    Friedman, Nir; Rando, Oliver J

    2015-10-01

    Eukaryotic genomes are packaged into an extensively folded state known as chromatin. Analysis of the structure of eukaryotic chromosomes has been revolutionized by development of a suite of genome-wide measurement technologies, collectively termed "epigenomics." We review major advances in epigenomic analysis of eukaryotic genomes, covering aspects of genome folding at scales ranging from whole chromosome folding down to nucleotide-resolution assays that provide structural insights into protein-DNA interactions. We then briefly outline several challenges remaining and highlight new developments such as single-cell epigenomic assays that will help provide us with a high-resolution structural understanding of eukaryotic genomes.

  3. Genome-wide inference of natural selection on human transcription factor binding sites.

    PubMed

    Arbiza, Leonardo; Gronau, Ilan; Aksoy, Bulent A; Hubisz, Melissa J; Gulko, Brad; Keinan, Alon; Siepel, Adam

    2013-07-01

    For decades, it has been hypothesized that gene regulation has had a central role in human evolution, yet much remains unknown about the genome-wide impact of regulatory mutations. Here we use whole-genome sequences and genome-wide chromatin immunoprecipitation and sequencing data to demonstrate that natural selection has profoundly influenced human transcription factor binding sites since the divergence of humans from chimpanzees 4-6 million years ago. Our analysis uses a new probabilistic method, called INSIGHT, for measuring the influence of selection on collections of short, interspersed noncoding elements. We find that, on average, transcription factor binding sites have experienced somewhat weaker selection than protein-coding genes. However, the binding sites of several transcription factors show clear evidence of adaptation. Several measures of selection are strongly correlated with predicted binding affinity. Overall, regulatory elements seem to contribute substantially to both adaptive substitutions and deleterious polymorphisms with key implications for human evolution and disease.

  4. Phylogenetic position of the coral symbiont Ostreobium (Ulvophyceae) inferred from chloroplast genome data.

    PubMed

    Verbruggen, Heroen; Marcelino, Vanessa R; Guiry, Michael D; Cremen, M Chiela M; Jackson, Christopher J

    2017-04-10

    The green algal genus Ostreobium is an important symbiont of corals, playing roles in reef decalcification and providing photosynthates to the coral during bleaching events. A chloroplast genome of a cultured strain of Ostreobium was available, but low taxon sampling and Ostreobium's early-branching nature left doubt about its phylogenetic position. Here we generate and describe chloroplast genomes from four Ostreobium strains as well as Avrainvillea mazei and Neomeris sp., strategically sampled early-branching lineages in the Bryopsidales and Dasycladales, respectively. At 80,584 bp, the chloroplast genome of Ostreobium sp. HV05042 is the most compact yet found in the Ulvophyceae. The Avrainvillea chloroplast genome is ca. 94 kbp and contains introns in infA and cysT that have nearly complete sequence identity except for an ORF in infA that is not present in cysT. In line with other bryopsidalean species, it also contains regions with possibly bacteria-derived ORFs. The Neomeris data did not assemble into a canonical circular chloroplast genome but a large number of contigs containing fragments of chloroplast genes and showing evidence of long introns and intergenic regions, and the Neomeris chloroplast genome size was estimated to exceed 1.87 Mb. Chloroplast phylogenomics and 18S nrDNA data showed strong support for the Ostreobium lineage being sister to the remaining Bryopsidales. There were differences in branch support when outgroups were varied, but the overall support for the placement of Ostreobium was strong. These results permitted us to validate two suborders and introduce a third, the Ostreobineae.

  5. Partial short-read sequencing of a highly inbred Iberian pig and genomics inference thereof

    PubMed Central

    Esteve-Codina, A; Kofler, R; Himmelbauer, H; Ferretti, L; Vivancos, A P; Groenen, M A M; Folch, J M; Rodríguez, M C; Pérez-Enciso, M

    2011-01-01

    Despite dramatic reduction in sequencing costs with the advent of next generation sequencing technologies, obtaining a complete mammalian genome sequence at sufficient depth is still costly. An alternative is partial sequencing. Here, we have sequenced a reduced representation library of an Iberian sow from the Guadyerbas strain, a highly inbred strain that has been used in numerous QTL studies because of its extreme phenotypic characteristics. Using the Illumina Genome Analyzer II (San Diego, CA, USA), we resequenced ∼1% of the genome with average 4 × depth, identifying 68 778 polymorphisms. Of these, 55 457 were putative fixed differences with respect to the assembly, based on the genome of a Duroc pig, and 13 321 were heterozygous positions within Guadyerbas. Despite being highly inbred, the estimate of heterozygosity within Guadyerbas was ∼0.78 kb−1 in autosomes, after correcting for low depth. Nucleotide variability was consistently higher at the telomeric regions than on the rest of the chromosome, likely a result of increased recombination rates. Further, variability was 50% lower in the X-chromosome than in autosomes, which may be explained by a recent bottleneck or by selection. We divided the whole genome in 500 kb windows and we analyzed overrepresented gene ontology terms in regions of low and high variability. Multi organism process, pigmentation and cell killing were overrepresented in high variability regions and metabolic process ontology, within low variability regions. Further, a genome wide Hudson–Kreitman–Aguadé test was carried out per window; overall, variability was in agreement with neutral expectations. PMID:21407255

  6. A core phylogeny of Dictyostelia inferred from genomes representative of the eight major and minor taxonomic divisions of the group.

    PubMed

    Singh, Reema; Schilde, Christina; Schaap, Pauline

    2016-11-17

    Dictyostelia are a well-studied group of organisms with colonial multicellularity, which are members of the mostly unicellular Amoebozoa. A phylogeny based on SSU rDNA data subdivided all Dictyostelia into four major groups, but left the position of the root and of six group-intermediate taxa unresolved. Recent phylogenies inferred from 30 or 213 proteins from sequenced genomes, positioned the root between two branches, each containing two major groups, but lacked data to position the group-intermediate taxa. Since the positions of these early diverging taxa are crucial for understanding the evolution of phenotypic complexity in Dictyostelia, we sequenced six representative genomes of early diverging taxa. We retrieved orthologs of 47 housekeeping proteins with an average size of 890 amino acids from six newly sequenced and eight published genomes of Dictyostelia and unicellular Amoebozoa and inferred phylogenies from single and concatenated protein sequence alignments. Concatenated alignments of all 47 proteins, and four out of five subsets of nine concatenated proteins all produced the same consensus phylogeny with 100% statistical support. Trees inferred from just two out of the 47 proteins, individually reproduced the consensus phylogeny, highlighting that single gene phylogenies will rarely reflect correct species relationships. However, sets of two or three concatenated proteins again reproduced the consensus phylogeny, indicating that a small selection of genes suffices for low cost classification of as yet unincorporated or newly discovered dictyostelid and amoebozoan taxa by gene amplification. The multi-locus consensus phylogeny shows that groups 1 and 2 are sister clades in branch I, with the group-intermediate taxon D. polycarpum positioned as outgroup to group 2. Branch II consists of groups 3 and 4, with the group-intermediate taxon Polysphondylium violaceum positioned as sister to group 4, and the group-intermediate taxon Dictyostelium polycephalum

  7. Seismic structure of the oceanic lithosphere inferred from guided wave

    NASA Astrophysics Data System (ADS)

    Shito, A.; Suetsugu, D.; Furumura, T.; Sugioka, H.; Ito, A.

    2012-12-01

    Characteristic seismic waves were observed in a seismological experiment using Broad-Band Ocean Bottom Seismometers (BBOBSs) conducted in the northwestern Pacific from 2007 to 2008 and from 2010 to 2011. The seismic waves have a low frequency onset (< 1 Hz) followed by high frequency later phases (2.5-10 Hz). The high frequency later phases have large amplitude and long duration for both P and S waves. The seismic waves are commonly observed at the BBOBS array from events in the subducting Pacific plate. Investigating the generation and propagation mechanisms of these waves will help us to understand the seismic structure and the origin of the oceanic lithosphere. High frequency phases travelling efficiently through the oceanic lithosphere for more than 3000 km are a well-known phenomenon. These phases were previously called Po/So waves. Po/So waves were observed as early as 1935, and were studied actively from the 1970s to 1990s. However, the mechanisms of generation and propagation of the phases are still controversial. Guided waves propagating in the subducting plate are also a common phenomenon in subduction zones. The waves are generally characterized by separation of low frequency and high frequency components. In order to explain the separation, Martin and Rietbrock [2003] considered the trapping of waves in the waveguide formed by thin low velocity former oceanic crust at the top of the plate. However, the large amplitude and long duration of the high frequency component cannot be achieved by the model. From the analysis of waveforms observed at the eastern seaboard of northern Japan and numerical simulation of seismic wave propagation, Furumura and Kennett [2005] demonstrated that the guided wave travelling in the subducting plate is produced by multiple forward scattering of high-frequency seismic waves due to small-scale random heterogeneity in the plate structure. We apply the method proposed by Furumura and Kennett [2005] to reproduce the seismograms recorded by

  8. Upper Mantle Structure beneath Afar: inferences from surface waves.

    NASA Astrophysics Data System (ADS)

    Sicilia, D.; Montagner, J.; Debayle, E.; Lepine, J.; Leveque, J.; Cara, M.; Ataley, A.; Sholan, J.

    2001-12-01

    The Afar hotspot is related to one of the most important plumes from a geodynamic point of view. It has been advocated to be the surface expression of the South-West African Superswell. Below the lithosphere, the Afar plume might feed other hotspots in central Africa (Hadiouche et al., 1989; Ebinger & Sleep, 1998). The processes of interaction between crust, lithosphere and plume are not well understood. In order to gain insight into this issue, we have performed a surface-wave tomography covering the Horn of Africa. A data set of 1404 paths for Rayleigh waves and 473 paths for Love waves was selected in the period range 45-200 s. They were collected from the permanent IRIS and GEOSCOPE networks and from the PASSCAL experiment, in Tanzania and Saudi Arabia. Other data come from the broadband stations deployed in Ethiopia and Yemen in the framework of the French INSU program "Horn of Africa". The results presented here come from path-averaged phase velocities obtained with a method based on a least-squares minimization (Beucler et al., 2000). The local phase velocity distribution and the azimuthal anisotropy were simultaneously retrieved by using the tomographic technique of Montagner (1986). A correction of the data is applied according to the crustal structure of the 3SMAC model (Nataf & Ricard, 1996). We find low velocities down to 200 km depth beneath the Red Sea, the Gulf of Aden, Afar, the Ethiopian Plateau and southern Arabia. High velocities are present in eastern Arabia and the Tanzania Craton. The anisotropy beneath Afar seems to be complex, but makes it possible to map the flow pattern at the lithosphere-asthenosphere interface. The results presented here are complementary to those obtained by Debayle et al. (2001) at upper-mantle transition zone depths using waveform inversion of higher Rayleigh modes.

  9. Inference of Candidate Germline Mutator Loci in Humans from Genome-Wide Haplotype Data

    PubMed Central

    2017-01-01

    The rate of germline mutation varies widely between species but little is known about the extent of variation in the germline mutation rate between individuals of the same species. Here we demonstrate that an allele that increases the rate of germline mutation can result in a distinctive signature in the genomic region linked to the affected locus, characterized by a number of haplotypes with a locally high proportion of derived alleles, against a background of haplotypes carrying a typical proportion of derived alleles. We searched for this signature in human haplotype data from phase 3 of the 1000 Genomes Project and report a number of candidate mutator loci, several of which are located close to or within genes involved in DNA repair or the DNA damage response. To investigate whether mutator alleles remained active at any of these loci, we used de novo mutation counts from human parent-offspring trios in the 1000 Genomes and Genome of the Netherlands cohorts, looking for an elevated number of de novo mutations in the offspring of parents carrying a candidate mutator haplotype at each of these loci. We found some support for two of the candidate loci, including one locus just upstream of the BRSK2 gene, which is expressed in the testis and has been reported to be involved in the response to DNA damage. PMID:28095480

  10. Interspecific chromosome substitution lines as genetic resources for improvement, trait analysis and genomic inference

    USDA-ARS's Scientific Manuscript database

    The genetic base that cotton breeders commonly use to improve Upland cultivars is very narrow. The AD-genome species G. barbadense, G. tomentosum, and G. mustelinum are part of the primary germplasm pool, too, and constitute genetic reservoirs of genes for resistance to abiotic stress, pests and pa...

  11. Linear Models for Item Scores: Reliability, Covariance Structure, and Psychometric Inference.

    ERIC Educational Resources Information Center

    Woodruff, David

    Two analyses of variance (ANOVA) models for item scores are compared. The first is an items by subject random effect ANOVA. The second is a mixed effects ANOVA with items fixed and subjects random. Comparisons regarding reliability, Cronbach's alpha coefficient, psychometric inference, and inter-item covariance structure are made between the…

  12. Earth's structure and evolution inferred from topography, gravity, and seismicity.

    NASA Astrophysics Data System (ADS)

    Watkinson, A. J.; Menard, J.; Patton, R. L.

    2016-12-01

    Earth's wavelength-dependent response to loading, reflected in observed topography, gravity, and seismicity, can be interpreted in terms of a stack of layers under the assumption of transverse isotropy. The theory of plate tectonics holds that the outermost layers of this stack are mobile, produced at oceanic ridges, and consumed at subduction zones. Their toroidal motions are generally consistent with those of several rigid bodies, except in the world's active mountain belts where strains are partitioned and preserved in tectonite fabrics. Even portions of the oceanic lithosphere exhibit non-rigid behavior. Earth's gravity-topography cross-spectrum exhibits notable variations in signal amplitude and character at spherical harmonic degrees l=13, 116, 416, and 1389. Corresponding Cartesian wavelengths are approximately equal to the respective thicknesses of Earth's mantle, continental mantle lithosphere, oceanic thermal lithosphere, and continental crust, all known from seismology. Regional variations in seismic moment release with depth, derived from the global Centroid Moment Tensor catalog, are also evident in the crust and mantle lithosphere. Combined, these observations provide powerful constraints for the structure and evolution of the crust, mantle lithosphere, and mantle as a whole. All that is required is a dynamically consistent mechanism relating wavelength to layer thickness and shear-strain localization. A statistically-invariant 'diharmonic' relation exhibiting these properties appears as the leading order approximation to toroidal motions on a self-gravitating body of differential grade-2 material. We use this relation, specifically its predictions of weakness and rigidity, and of folding and shear banding response as a function of wavelength-to-thickness ratio, to interpret Earth's gravity, topography, and seismicity in four-dimensions. We find the mantle lithosphere to be about 255-km thick beneath the Himalaya and the Andes, and the long

  13. A Novel Candidate Vaccine for Cytauxzoonosis Inferred from Comparative Apicomplexan Genomics

    PubMed Central

    Tarigo, Jaime L.; Scholl, Elizabeth H.; Bird, David McK.; Brown, Corrie C.; Cohn, Leah A.; Dean, Gregg A.; Levy, Michael G.; Doolan, Denise L.; Trieu, Angela; Nordone, Shila K.; Felgner, Philip L.; Vigil, Adam; Birkenheuer, Adam J.

    2013-01-01

    Cytauxzoonosis is an emerging infectious disease of domestic cats (Felis catus) caused by the apicomplexan protozoan parasite Cytauxzoon felis. The growing epidemic, with its high morbidity and mortality points to the need for a protective vaccine against cytauxzoonosis. Unfortunately, the causative agent has yet to be cultured continuously in vitro, rendering traditional vaccine development approaches beyond reach. Here we report the use of comparative genomics to computationally and experimentally interpret the C. felis genome to identify a novel candidate vaccine antigen for cytauxzoonosis. As a starting point we sequenced, assembled, and annotated the C. felis genome and the proteins it encodes. Whole genome alignment revealed considerable conserved synteny with other apicomplexans. In particular, alignments with the bovine parasite Theileria parva revealed that a C. felis gene, cf76, is syntenic to p67 (the leading vaccine candidate for bovine theileriosis), despite a lack of significant sequence similarity. Recombinant subdomains of cf76 were challenged with survivor-cat antiserum and found to be highly seroreactive. Comparison of eleven geographically diverse samples from the south-central and southeastern USA demonstrated 91–100% amino acid sequence identity across cf76, including a high level of conservation in an immunogenic 226 amino acid (24 kDa) carboxyl terminal domain. Using in situ hybridization, transcription of cf76 was documented in the schizogenous stage of parasite replication, the life stage that is believed to be the most important for development of a protective immune response. Collectively, these data point to identification of the first potential vaccine candidate antigen for cytauxzoonosis. Further, our bioinformatic approach emphasizes the use of comparative genomics as an accelerated path to developing vaccines against experimentally intractable pathogens. PMID:23977000

  14. Chromosomal instability in Afrotheria: fragile sites, evolutionary breakpoints and phylogenetic inference from genome sequence assemblies

    PubMed Central

    Ruiz-Herrera, Aurora; Robinson, Terence J

    2007-01-01

    Background: Extant placental mammals are divided into four major clades (Laurasiatheria, Supraprimates, Xenarthra and Afrotheria). Given that Afrotheria is generally thought to root the eutherian tree in phylogenetic analysis of large nuclear gene data sets, the study of the organization of the genomes of afrotherian species provides new insights into the dynamics of mammalian chromosomal evolution. Here we test if there are chromosomal bands with a high tendency to break and reorganize in Afrotheria, and by analyzing the expression of aphidicolin-induced common fragile sites in three afrotherian species, whether these are coincidental with recognized evolutionary breakpoints. Results: We described 29 fragile sites in the aardvark (OAF) genome, 27 in the golden mole (CAS), and 35 in the elephant-shrew (EED) genome. We show that fragile sites are conserved among afrotherian species and these are correlated with evolutionary breakpoints when compared to the human (HSA) genome. In addition, by computationally scanning the newly released opossum (Monodelphis domestica) and chicken sequence assemblies for use as outgroups to Placentalia, we validate the HSA 3/21/5 chromosomal synteny as a rare genomic change that defines the monophyly of this ancient African clade of mammals. On the other hand, support for HSA 1/19p, which is also thought to underpin Afrotheria, is currently ambiguous. Conclusion: We provide evidence that (i) the evolutionary breakpoints that characterise human syntenies detected in the basal Afrotheria correspond at the chromosomal band level with fragile sites, (ii) that HSA 3p/21 was in the amniote ancestor (i.e., common to turtles, lepidosaurs, crocodilians, birds and mammals) and was subsequently disrupted in the lineage leading to marsupials. Its expansion to include HSA 5 in Afrotheria is unique and (iii) that its fragmentation to HSA 3p/21 + HSA 5/21 in elephant and manatee was due to a fission within HSA 21 that is probably shared by all

  15. The admixed population structure in Danish Jersey dairy cattle challenges accurate genomic predictions.

    PubMed

    Thomasen, J R; Sørensen, A C; Su, G; Madsen, P; Lund, M S; Guldbrandtsen, B

    2013-07-01

    The main purpose of this study was to evaluate whether the population structure in Danish Jersey (DJ) known from the history of the breed also is reflected in its genomic structure. This is done by comparing the linkage disequilibrium and persistence of phase for subgroups of Jersey animals with high proportions of Danish (DNK) or United States (USJ) origin. Furthermore, it is investigated whether a model explicitly incorporating breed origin of animals, inferred either through the known pedigree or from SNP marker data, leads to improved genomic predictions compared with a model ignoring breed origin. The study of the population structure incorporated 1,730 genotyped Jersey animals. In total 39,542 SNP markers were included in the analysis. The 1,079 genotyped bulls with de-regressed proof for udder health were used in the analysis for the predictions of the genomic breeding values. A range of random regressions models that included the breed origin were analyzed and compared with a basic genomic model that assumes a homogeneous breed structure. The main finding in this study is that the importation of germplasm from the USJ population is readily reflected in the genomes of modern DJ animals. First, linkage disequilibrium in the group of admixed DJ animals is lower compared with the groups of the original DNK and USJ animals. Second, persistence of linkage disequilibrium phase is not conserved for longer marker distances between animals with mainly Danish or United States origin. Third, the STRUCTURE analysis could retrieve genomic-based breed proportions in alignment to the pedigree-based breed proportions. However, including this population structure in a random regression prediction model did not clearly improve the reliabilities of the genomic predictions compared with a basic genomic model.
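
    The two quantities compared in this abstract, linkage disequilibrium and its phase, can be illustrated with a short calculation. The following is a minimal sketch, not the authors' pipeline: it assumes phased haplotypes coded 0/1 at two SNPs, and the function name ld_r is hypothetical. The squared value r**2 is the usual LD strength, while the sign of r is the phase whose agreement between two populations is what "persistence of phase" measures.

```python
def ld_r(hap_a, hap_b):
    """Signed LD correlation r between two SNPs.

    hap_a, hap_b: equal-length lists of 0/1 alleles, one entry per
    phased haplotype (an assumption of this sketch).
    """
    n = len(hap_a)
    p_a = sum(hap_a) / n                      # allele frequency at SNP A
    p_b = sum(hap_b) / n                      # allele frequency at SNP B
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n  # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = (p_a * (1 - p_a) * p_b * (1 - p_b)) ** 0.5
    if denom == 0:
        raise ValueError("r is undefined when either SNP is monomorphic")
    return d / denom


# Perfect coupling and perfect repulsion give r = +1 and r = -1;
# both have r**2 = 1, so only the sign distinguishes the phase.
coupled = ld_r([1, 1, 0, 0], [1, 1, 0, 0])
repulsed = ld_r([1, 1, 0, 0], [0, 0, 1, 1])
```

    Computing r for the same marker pairs in two groups of animals and correlating the two sets of r values is one common way to summarize persistence of phase; whether the study used exactly this summary is not stated in the abstract.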

  16. Phylogeny and physiology of candidate phylum 'Atribacteria' (OP9/JS1) inferred from cultivation-independent genomics

    PubMed Central

    Nobu, Masaru K; Dodsworth, Jeremy A; Murugapiran, Senthil K; Rinke, Christian; Gies, Esther A; Webster, Gordon; Schwientek, Patrick; Kille, Peter; Parkes, R John; Sass, Henrik; Jørgensen, Bo B; Weightman, Andrew J; Liu, Wen-Tso; Hallam, Steven J; Tsiamis, George; Woyke, Tanja; Hedlund, Brian P

    2016-01-01

    The 'Atribacteria' is a candidate phylum in the Bacteria recently proposed to include members of the OP9 and JS1 lineages. OP9 and JS1 are globally distributed, and in some cases abundant, in anaerobic marine sediments, geothermal environments, anaerobic digesters and reactors and petroleum reservoirs. However, the monophyly of OP9 and JS1 has been questioned and their physiology and ecology remain largely enigmatic due to a lack of cultivated representatives. Here cultivation-independent genomic approaches were used to provide a first comprehensive view of the phylogeny, conserved genomic features and metabolic potential of members of this ubiquitous candidate phylum. Previously available and heretofore unpublished OP9 and JS1 single-cell genomic data sets were used as recruitment platforms for the reconstruction of atribacterial metagenome bins from a terephthalate-degrading reactor biofilm and from the monimolimnion of meromictic Sakinaw Lake. The single-cell genomes and metagenome bins together comprise six species- to genus-level groups that represent most major lineages within OP9 and JS1. Phylogenomic analyses of these combined data sets confirmed the monophyly of the 'Atribacteria' inclusive of OP9 and JS1. Additional conserved features within the 'Atribacteria' were identified, including a gene cluster encoding putative bacterial microcompartments that may be involved in aldehyde and sugar metabolism, energy conservation and carbon storage. Comparative analysis of the metabolic potential inferred from these data sets revealed that members of the 'Atribacteria' are likely to be heterotrophic anaerobes that lack respiratory capacity, with some lineages predicted to specialize in either primary fermentation of carbohydrates or secondary fermentation of organic acids, such as propionate. PMID:26090992

  17. Inferring Population Size History from Large Samples of Genome-Wide Molecular Data - An Approximate Bayesian Computation Approach

    PubMed Central

    Boitard, Simon; Rodríguez, Willy; Jay, Flora; Mona, Stefano; Austerlitz, Frédéric

    2016-01-01

    Inferring the ancestral dynamics of effective population size is a long-standing question in population genetics, which can now be tackled much more accurately thanks to the massive genomic data available in many species. Several promising methods that take advantage of whole-genome sequences have been recently developed in this context. However, they can only be applied to rather small samples, which limits their ability to estimate recent population size history. Besides, they can be very sensitive to sequencing or phasing errors. Here we introduce a new approximate Bayesian computation approach named PopSizeABC that allows estimating the evolution of the effective population size through time, using a large sample of complete genomes. This sample is summarized using the folded allele frequency spectrum and the average zygotic linkage disequilibrium at different bins of physical distance, two classes of statistics that are widely used in population genetics and can be easily computed from unphased and unpolarized SNP data. Our approach provides accurate estimations of past population sizes, from the very first generations before present back to the expected time to the most recent common ancestor of the sample, as shown by simulations under a wide range of demographic scenarios. When applied to samples of 15 or 25 complete genomes in four cattle breeds (Angus, Fleckvieh, Holstein and Jersey), PopSizeABC revealed a series of population declines, related to historical events such as domestication or modern breed creation. We further highlight that our approach is robust to sequencing errors, provided summary statistics are computed from SNPs with common alleles. PMID:26943927
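
    One of the two summary statistics named in this abstract, the folded allele frequency spectrum, is simple to compute from unphased and unpolarized SNP data. The sketch below is an illustration only and is not taken from PopSizeABC: it assumes diploid genotypes coded as 0, 1 or 2 copies of an arbitrarily chosen allele, and the function name folded_sfs is hypothetical.

```python
def folded_sfs(genotypes):
    """Folded allele frequency spectrum from unphased diploid genotypes.

    genotypes: list of SNPs; each SNP is a list with one entry per
    individual giving 0, 1 or 2 copies of an arbitrary allele
    (coding assumed for this sketch). Returns a list whose entry k
    is the number of SNPs with minor allele count k.
    """
    n_chrom = 2 * len(genotypes[0])           # two chromosomes per individual
    sfs = [0] * (n_chrom // 2 + 1)            # minor counts run from 0 to n
    for snp in genotypes:
        count = sum(snp)                      # copies of the counted allele
        minor = min(count, n_chrom - count)   # folding: ancestral state not needed
        sfs[minor] += 1
    return sfs


# Three individuals, four SNPs.
example = [[0, 1, 0], [2, 2, 1], [1, 1, 1], [0, 0, 0]]
spectrum = folded_sfs(example)                # [1, 2, 0, 1]
```

    Because only the minor allele count is used, the statistic needs neither phasing nor an outgroup. Dropping the low-count bins before inference would correspond to the abstract's remark about restricting to SNPs with common alleles, although the exact cutoff used by the authors is not given here.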

  18. Phylogeny and physiology of candidate phylum 'Atribacteria' (OP9/JS1) inferred from cultivation-independent genomics.

    PubMed

    Nobu, Masaru K; Dodsworth, Jeremy A; Murugapiran, Senthil K; Rinke, Christian; Gies, Esther A; Webster, Gordon; Schwientek, Patrick; Kille, Peter; Parkes, R John; Sass, Henrik; Jørgensen, Bo B; Weightman, Andrew J; Liu, Wen-Tso; Hallam, Steven J; Tsiamis, George; Woyke, Tanja; Hedlund, Brian P

    2016-02-01

    The 'Atribacteria' is a candidate phylum in the Bacteria recently proposed to include members of the OP9 and JS1 lineages. OP9 and JS1 are globally distributed, and in some cases abundant, in anaerobic marine sediments, geothermal environments, anaerobic digesters and reactors and petroleum reservoirs. However, the monophyly of OP9 and JS1 has been questioned and their physiology and ecology remain largely enigmatic due to a lack of cultivated representatives. Here cultivation-independent genomic approaches were used to provide a first comprehensive view of the phylogeny, conserved genomic features and metabolic potential of members of this ubiquitous candidate phylum. Previously available and heretofore unpublished OP9 and JS1 single-cell genomic data sets were used as recruitment platforms for the reconstruction of atribacterial metagenome bins from a terephthalate-degrading reactor biofilm and from the monimolimnion of meromictic Sakinaw Lake. The single-cell genomes and metagenome bins together comprise six species- to genus-level groups that represent most major lineages within OP9 and JS1. Phylogenomic analyses of these combined data sets confirmed the monophyly of the 'Atribacteria' inclusive of OP9 and JS1. Additional conserved features within the 'Atribacteria' were identified, including a gene cluster encoding putative bacterial microcompartments that may be involved in aldehyde and sugar metabolism, energy conservation and carbon storage. Comparative analysis of the metabolic potential inferred from these data sets revealed that members of the 'Atribacteria' are likely to be heterotrophic anaerobes that lack respiratory capacity, with some lineages predicted to specialize in either primary fermentation of carbohydrates or secondary fermentation of organic acids, such as propionate.

  19. Inferring action structure and causal relationships in continuous sequences of human action.

    PubMed

    Buchsbaum, Daphna; Griffiths, Thomas L; Plunkett, Dillon; Gopnik, Alison; Baldwin, Dare

    2015-02-01

    In the real world, causal variables do not come pre-identified or occur in isolation, but instead are embedded within a continuous temporal stream of events. A challenge faced by both human learners and machine learning algorithms is identifying subsequences that correspond to the appropriate variables for causal inference. A specific instance of this problem is action segmentation: dividing a sequence of observed behavior into meaningful actions, and determining which of those actions lead to effects in the world. Here we present a Bayesian analysis of how statistical and causal cues to segmentation should optimally be combined, as well as four experiments investigating human action segmentation and causal inference. We find that both people and our model are sensitive to statistical regularities and causal structure in continuous action, and are able to combine these sources of information in order to correctly infer both causal relationships and segmentation boundaries.

  20. Climate-induced changes in lake ecosystem structure inferred from coupled neo- and paleoecological approaches

    USGS Publications Warehouse

    Saros, Jasmine E.; Stone, Jeffery R.; Pederson, Gregory T.; Slemmons, Krista; Spanbauer, Trisha; Schliep, Anna; Cahl, Douglas; Williamson, Craig E.; Engstrom, Daniel R.

    2015-01-01

    Over the 20th century, surface water temperatures have increased in many lake ecosystems around the world, but long-term trends in the vertical thermal structure of lakes remain unclear, despite the strong control that thermal stratification exerts on the biological response of lakes to climate change. Here we used both neo- and paleoecological approaches to develop a fossil-based inference model for lake mixing depths and thereby refine understanding of lake thermal structure change. We focused on three common planktonic diatom taxa, the distributions of which previous research suggests might be affected by mixing depth. Comparative lake surveys and growth rate experiments revealed that these species respond to lake thermal structure when nitrogen is sufficient, with species optima ranging from shallower to deeper mixing depths. The diatom-based mixing depth model was applied to sedimentary diatom profiles extending back to 1750 AD in two lakes with moderate nitrate concentrations but differing climate settings. Thermal reconstructions were consistent with expected changes, with shallower mixing depths inferred for an alpine lake where treeline has advanced, and deeper mixing depths inferred for a boreal lake where wind strength has increased. The inference model developed here provides a new tool to expand and refine understanding of climate-induced changes in lake ecosystems.

  1. A Symmetry-Based Method to Infer Structural Brain Networks from Probabilistic Tractography Data

    PubMed Central

    Shadi, Kamal; Bakhshi, Saideh; Gutman, David A.; Mayberg, Helen S.; Dovrolis, Constantine

    2016-01-01

    Recent progress in diffusion MRI and tractography algorithms as well as the launch of the Human Connectome Project (HCP) have provided brain research with an abundance of structural connectivity data. In this work, we describe and evaluate a method that can infer the structural brain network that interconnects a given set of Regions of Interest (ROIs) from probabilistic tractography data. The proposed method, referred to as Minimum Asymmetry Network Inference Algorithm (MANIA), does not determine the connectivity between two ROIs based on an arbitrary connectivity threshold. Instead, we exploit a basic limitation of the tractography process: the observed streamlines from a source to a target do not provide any information about the polarity of the underlying white matter, and so if there are some fibers connecting two voxels (or two ROIs) X and Y, tractography should be able in principle to follow this connection in both directions, from X to Y and from Y to X. We leverage this limitation to formulate the network inference process as an optimization problem that minimizes the (appropriately normalized) asymmetry of the observed network. We evaluate the proposed method using both the FiberCup dataset and a noise model that randomly corrupts the observed connectivity of synthetic networks. As a case study, we apply MANIA to diffusion MRI data from 28 healthy subjects to infer the structural network between 18 corticolimbic ROIs that are associated with various neuropsychiatric conditions including depression, anxiety and addiction. PMID:27867354
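
    The idea of scoring a candidate network by its asymmetry can be sketched in a few lines. This is a simplified illustration and not the MANIA algorithm itself: the matrix of source-to-target connection strengths, the candidate thresholds, the raw one-way-edge ratio used as the asymmetry score, and both function names are assumptions made for this example. In particular, the abstract specifies an appropriately normalized asymmetry, and that normalization is not reproduced here.

```python
def asymmetry(weights, threshold):
    """Fraction of connected ROI pairs that are linked in one direction only.

    weights[i][j]: tractography strength from source ROI i to target
    ROI j (hypothetical input). Returns None if no pair is connected.
    """
    n = len(weights)
    one_way = both = 0
    for i in range(n):
        for j in range(i + 1, n):
            fwd = weights[i][j] >= threshold
            back = weights[j][i] >= threshold
            if fwd and back:
                both += 1
            elif fwd or back:
                one_way += 1
    connected = one_way + both
    return one_way / connected if connected else None


def least_asymmetric_threshold(weights, candidates):
    """Candidate threshold whose network has the lowest raw asymmetry."""
    scored = [(asymmetry(weights, t), t) for t in candidates]
    scored = [(a, t) for a, t in scored if a is not None]
    return min(scored)[1]                     # ties go to the smaller threshold


w = [[0.0, 0.9, 0.2],
     [0.8, 0.0, 0.05],
     [0.6, 0.4, 0.0]]
best = least_asymmetric_threshold(w, [0.1, 0.5, 0.7, 0.95])   # 0.7
```

    The raw ratio favours very sparse networks, since a high threshold that keeps a single bidirectional pair scores zero. This is one reason a normalization of the kind the abstract mentions is needed in practice.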

  2. Simple Math is Enough: Two Examples of Inferring Functional Associations from Genomic Data

    NASA Technical Reports Server (NTRS)

    Liang, Shoudan

    2003-01-01

    Non-random features in the genomic data are usually biologically meaningful. The key is to choose the feature well. Having a p-value based score prioritizes the findings. If two proteins share an unusually large number of common interaction partners, they tend to be involved in the same biological process. We used this finding to predict the functions of 81 un-annotated proteins in yeast.
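
    One common way to turn "an unusually large number of common interaction partners" into a p-value is a hypergeometric tail probability. The sketch below assumes that choice; the abstract does not state which test the author used, and the function name and the numbers in the example are hypothetical.

```python
from math import comb


def shared_partner_pvalue(n_total, deg_a, deg_b, shared):
    """P(X >= shared) for X ~ Hypergeometric(n_total, deg_a, deg_b).

    n_total: number of candidate partner proteins
    deg_a, deg_b: number of interaction partners of protein A and of protein B
    shared: observed number of partners common to A and B
    """
    upper = min(deg_a, deg_b)
    total = comb(n_total, deg_b)
    tail = sum(comb(deg_a, k) * comb(n_total - deg_a, deg_b - k)
               for k in range(shared, upper + 1))
    return tail / total


# Illustrative numbers only: 6000 candidate proteins, two proteins with
# 20 and 15 partners, 5 of which are shared.
print(shared_partner_pvalue(6000, 20, 15, 5))
```

    A smaller p-value means the overlap is less likely to arise by chance, so ranking protein pairs by this score puts the strongest candidate associations first.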

  3. Models of earth structure inferred from neodymium and strontium isotopic abundances

    PubMed Central

    Wasserburg, G. J.; DePaolo, D. J.

    1979-01-01

    A simplified model of earth structure based on the Nd and Sr isotopic characteristics of oceanic and continental tholeiitic flood basalts is presented, taking into account the motion of crustal plates and a chemical balance for trace elements. The resulting structure that is inferred consists of a lower mantle that is still essentially undifferentiated, overlain by an upper mantle that is the residue of the original source from which the continents were derived. PMID:16592688

  4. Low rate of genomic repatterning in Xenarthra inferred from chromosome painting data.

    PubMed

    Dobigny, G; Yang, F; O'Brien, P C M; Volobouev, V; Kovács, A; Pieczarka, J C; Ferguson-Smith, M A; Robinson, T J

    2005-01-01

    Comparative cytogenetic studies on Xenarthra, one of the most basal mammalian clades in the Placentalia, are virtually absent, being restricted largely to descriptions of conventional karyotypes and diploid numbers. We present a molecular cytogenetic comparison of chromosomes from the two-toed (Choloepus didactylus, 2n = 65) and three-toed sloth species (Bradypus tridactylus, 2n = 52), an anteater (Tamandua tetradactyla, 2n = 54) which, together with some data on the six-banded armadillo (Euphractus sexcinctus, 2n = 58), collectively represent all the major xenarthran lineages. Our results, based on interspecific chromosome painting using flow-sorted two-toed sloth chromosomes as painting probes, show the sloth species to be karyotypically closely related but markedly different from the anteater. We also test the synteny disruptions and segmental associations identified within Pilosa (anteaters and sloths) against the chromosomes of the six-banded armadillo as outgroup taxon. We could thus polarize the 35 non-ambiguously identified chromosomal changes characterizing the evolution of the anteater and sloth genomes and map these to a published sequence-based phylogeny for the group. These data suggest a low rate of genomic repatterning when placed in the context of divergence estimates based on molecular and fossil data. Finally, our results provide a glimpse of a likely ancestral karyotype for the extant Xenarthra, a pivotal group for understanding eutherian genome evolution.

  5. The Structure of a Rigorously Conserved RNA Element within the SARS Virus Genome

    PubMed Central

    Robertson, Michael P; Igel, Haller; Baertsch, Robert; Haussler, David; Ares, Manuel

    2005-01-01

    We have solved the three-dimensional crystal structure of the stem-loop II motif (s2m) RNA element of the SARS virus genome to 2.7-Å resolution. SARS and related coronaviruses and astroviruses all possess a motif at the 3′ end of their RNA genomes, called the s2m, whose pathogenic importance is inferred from its rigorous sequence conservation in an otherwise rapidly mutable RNA genome. We find that this extreme conservation is clearly explained by the requirement to form a highly structured RNA whose unique tertiary structure includes a sharp 90° kink of the helix axis and several novel longer-range tertiary interactions. The tertiary base interactions create a tunnel that runs perpendicular to the main helical axis whose interior is negatively charged and binds two magnesium ions. These unusual features likely form interaction surfaces with conserved host cell components or other reactive sites required for virus function. Based on its conservation in viral pathogen genomes and its absence in the human genome, we suggest that these unusual structural features in the s2m RNA element are attractive targets for the design of anti-viral therapeutic agents. Structural genomics has sought to deduce protein function based on three-dimensional homology. Here we have extended this approach to RNA by proposing potential functions for a rigorously conserved set of RNA tertiary structural interactions that occur within the SARS RNA genome itself. Based on tertiary structural comparisons, we propose the s2m RNA binds one or more proteins possessing an oligomer-binding-like fold, and we suggest a possible mechanism for SARS viral RNA hijacking of host protein synthesis, both based upon observed s2m RNA macromolecular mimicry of a relevant ribosomal RNA fold. PMID:15630477

  6. Structure and function of the mammalian middle ear. II: Inferring function from structure.

    PubMed

    Mason, Matthew J

    2016-02-01

    Anatomists and zoologists who study middle ear morphology are often interested to know what the structure of an ear can reveal about the auditory acuity and hearing range of the animal in question. This paper represents an introduction to middle ear function targeted towards biological scientists with little experience in the field of auditory acoustics. Simple models of impedance matching are first described, based on the familiar concepts of the area and lever ratios of the middle ear. However, using the Mongolian gerbil Meriones unguiculatus as a test case, it is shown that the predictions made by such 'ideal transformer' models are generally not consistent with measurements derived from recent experimental studies. Electrical analogue models represent a better way to understand some of the complex, frequency-dependent responses of the middle ear: these have been used to model the effects of middle ear subcavities, and the possible function of the auditory ossicles as a transmission line. The concepts behind such models are explained here, again aimed at those with little background knowledge. Functional inferences based on middle ear anatomy are more likely to be valid at low frequencies. Acoustic impedance at low frequencies is dominated by compliance; expanded middle ear cavities, found in small desert mammals including gerbils, jerboas and the sengi Macroscelides, are expected to improve low-frequency sound transmission, as long as the ossicular system is not too stiff. © 2015 Anatomical Society.

  7. ARG-walker: inference of individual specific strengths of meiotic recombination hotspots by population genomics analysis

    PubMed Central

    2015-01-01

    Background: Meiotic recombination hotspots play important roles in various aspects of genomics, but the underlying mechanisms for regulating the locations and strengths of recombination hotspots are not yet fully revealed. Most existing algorithms for estimating recombination rates from sequence polymorphism data can only output average recombination rates of a population, although there is evidence for the heterogeneity in recombination rates among individuals. For genome-wide association studies (GWAS) of recombination hotspots, an efficient algorithm that estimates the individualized strengths of recombination hotspots is highly desirable. Results: In this work, we propose a novel graph mining algorithm named ARG-walker, based on random walks on ancestral recombination graphs (ARG), to estimate individual-specific recombination hotspot strengths. Extensive simulations demonstrate that ARG-walker is able to distinguish the hot allele of a recombination hotspot from the cold allele. Integrated with output of ARG-walker, we performed GWAS on the phased haplotype data of the 22 autosome chromosomes of the HapMap Asian population samples of Chinese and Japanese (JPT+CHB). Significant cis-regulatory signals have been detected, which is corroborated by the enrichment of the well-known 13-mer motif CCNCCNTNNCCNC of PRDM9 protein. Moreover, two new DNA motifs have been identified in the flanking regions of the significantly associated SNPs (single nucleotide polymorphisms), which are likely to be new cis-regulatory elements of meiotic recombination hotspots of the human genome. Conclusions: Our results on both simulated and real data suggest that ARG-walker is a promising new method for estimating the individual recombination variations. In the future, it could be used to uncover the mechanisms of recombination regulation and human diseases related to recombination hotspots. PMID:26679564

  8. Comparative genomics of Eucalyptus and Corymbia reveals low rates of genome structural rearrangement.

    PubMed

    Butler, J B; Vaillancourt, R E; Potts, B M; Lee, D J; King, G J; Baten, A; Shepherd, M; Freeman, J S

    2017-05-22

    Previous studies suggest genome structure is largely conserved between Eucalyptus species. However, it is unknown if this conservation extends to more divergent eucalypt taxa. We performed comparative genomics between the eucalypt genera Eucalyptus and Corymbia. Our results will facilitate transfer of genomic information between these important taxa and provide further insights into the rate of structural change in tree genomes. We constructed three high density linkage maps for two Corymbia species (Corymbia citriodora subsp. variegata and Corymbia torelliana) which were used to compare genome structure between both species and Eucalyptus grandis. Genome structure was highly conserved between the Corymbia species. However, the comparison of Corymbia and E. grandis suggests large (from 1-13 MB) intra-chromosomal rearrangements have occurred on seven of the 11 chromosomes. Most rearrangements were supported through comparisons of the three independent Corymbia maps to the E. grandis genome sequence, and to other independently constructed Eucalyptus linkage maps. These are the first large scale chromosomal rearrangements discovered between eucalypts. Nonetheless, in the general context of plants, the genomic structure of the two genera was remarkably conserved; adding to a growing body of evidence that conservation of genome structure is common amongst woody angiosperms.

  9. Genomic hypomethylation in the human germline associates with selective structural mutability in the human genome.

    PubMed

    Li, Jian; Harris, R Alan; Cheung, Sau Wai; Coarfa, Cristian; Jeong, Mira; Goodell, Margaret A; White, Lisa D; Patel, Ankita; Kang, Sung-Hae; Shaw, Chad; Chinault, A Craig; Gambin, Tomasz; Gambin, Anna; Lupski, James R; Milosavljevic, Aleksandar

    2012-01-01

    The hotspots of structural polymorphisms and structural mutability in the human genome remain to be explained mechanistically. We examine associations of structural mutability with germline DNA methylation and with non-allelic homologous recombination (NAHR) mediated by low-copy repeats (LCRs). Combined evidence from four human sperm methylome maps, human genome evolution, structural polymorphisms in the human population, and previous genomic and disease studies consistently points to a strong association of germline hypomethylation and genomic instability. Specifically, methylation deserts, the ~1% fraction of the human genome with the lowest methylation in the germline, show a tenfold enrichment for structural rearrangements that occurred in the human genome since the branching of chimpanzee and are highly enriched for fast-evolving loci that regulate tissue-specific gene expression. Analysis of copy number variants (CNVs) from 400 human samples identified using a custom-designed array comparative genomic hybridization (aCGH) chip, combined with publicly available structural variation data, indicates that association of structural mutability with germline hypomethylation is comparable in magnitude to the association of structural mutability with LCR-mediated NAHR. Moreover, rare CNVs occurring in the genomes of individuals diagnosed with schizophrenia, bipolar disorder, and developmental delay and de novo CNVs occurring in those diagnosed with autism are significantly more concentrated within hypomethylated regions. These findings suggest a new connection between the epigenome, selective mutability, evolution, and human disease.

  10. Genomic Hypomethylation in the Human Germline Associates with Selective Structural Mutability in the Human Genome

    PubMed Central

    Li, Jian; Harris, R. Alan; Cheung, Sau Wai; Coarfa, Cristian; Jeong, Mira; Goodell, Margaret A.; White, Lisa D.; Patel, Ankita; Kang, Sung-Hae; Shaw, Chad; Chinault, A. Craig; Gambin, Tomasz; Gambin, Anna; Lupski, James R.; Milosavljevic, Aleksandar

    2012-01-01

    The hotspots of structural polymorphisms and structural mutability in the human genome remain to be explained mechanistically. We examine associations of structural mutability with germline DNA methylation and with non-allelic homologous recombination (NAHR) mediated by low-copy repeats (LCRs). Combined evidence from four human sperm methylome maps, human genome evolution, structural polymorphisms in the human population, and previous genomic and disease studies consistently points to a strong association of germline hypomethylation and genomic instability. Specifically, methylation deserts, the ∼1% fraction of the human genome with the lowest methylation in the germline, show a tenfold enrichment for structural rearrangements that occurred in the human genome since the branching of chimpanzee and are highly enriched for fast-evolving loci that regulate tissue-specific gene expression. Analysis of copy number variants (CNVs) from 400 human samples identified using a custom-designed array comparative genomic hybridization (aCGH) chip, combined with publicly available structural variation data, indicates that association of structural mutability with germline hypomethylation is comparable in magnitude to the association of structural mutability with LCR–mediated NAHR. Moreover, rare CNVs occurring in the genomes of individuals diagnosed with schizophrenia, bipolar disorder, and developmental delay and de novo CNVs occurring in those diagnosed with autism are significantly more concentrated within hypomethylated regions. These findings suggest a new connection between the epigenome, selective mutability, evolution, and human disease. PMID:22615578

  11. Revealing Less Derived Nature of Cartilaginous Fish Genomes with Their Evolutionary Time Scale Inferred with Nuclear Genes

    PubMed Central

    Renz, Adina J.; Meyer, Axel; Kuraku, Shigehiro

    2013-01-01

    Cartilaginous fishes, divided into Holocephali (chimaeras) and Elasmobranchii (sharks, rays and skates), occupy a key phylogenetic position among extant vertebrates in reconstructing their evolutionary processes. Their accurate evolutionary time scale is indispensable for better understanding of the relationship between phenotypic and molecular evolution of cartilaginous fishes. However, our current knowledge on the time scale of cartilaginous fish evolution largely relies on estimates using mitochondrial DNA sequences. In this study, making the best use of the still partial, but large-scale sequencing data of cartilaginous fish species, we estimate the divergence times between the major cartilaginous fish lineages employing nuclear genes. By rigorous orthology assessment based on available genomic and transcriptomic sequence resources for cartilaginous fishes, we selected 20 protein-coding genes in the nuclear genome, spanning 2973 amino acid residues. Our analysis based on the Bayesian inference resulted in the mean divergence time of 421 Ma, the late Silurian, for the Holocephali-Elasmobranchii split, and 306 Ma, the late Carboniferous, for the split between sharks and rays/skates. By applying these results and other documented divergence times, we measured the relative evolutionary rate of the Hox A cluster sequences in the cartilaginous fish lineages, which resulted in a lower substitution rate with a factor of at least 2.4 in comparison to tetrapod lineages. The obtained time scale enables mapping phenotypic and molecular changes in a quantitative framework. It is of great interest to corroborate the less derived nature of cartilaginous fish at the molecular level as a genome-wide phenomenon. PMID:23825540

  12. Inferring Properties of Ancient Cyanobacteria from Biogeochemical Activity and Genomes of Siderophilic Cyanobacteria

    NASA Technical Reports Server (NTRS)

    McKay, David S.; Brown, I. I.; Tringe, S. G.; Thomas-Keprta, K. E.; Bryant, D. A.; Sarkisova, S. S.; Malley, K.; Sosa, O.; Klatt, C. G.; McKay, D. S.

    2010-01-01

    Interrelationships between life and the planetary system could have simultaneously left landmarks in genomes of microbes and physicochemical signatures in the lithosphere. Verifying the links between genomic features in living organisms and the mineralized signatures generated by these organisms will help to reveal traces of life on Earth and beyond. Among contemporary environments, iron-depositing hot springs (IDHS) may represent one of the most appropriate natural models [1] for insights into ancient life since organisms may have originated on Earth and probably Mars in association with hydrothermal activity [2,3]. IDHS also seem to be appropriate models for studying certain biogeochemical processes that could have taken place in the late Archean and/or early Paleoproterozoic eras [4, 5]. It has been suggested that inorganic polyphosphate (PPi), in chains of tens to hundreds of phosphate residues linked by high-energy bonds, is environmentally ubiquitous and abundant [6]. Cyanobacteria (CB) react to increased heavy metal concentrations and UV by enhanced generation of PPi bodies (PPB) [7], which are believed to be signatures of life [8]. However, the role of PPi in oxygenic prokaryotes for the suppression of oxidative stress induced by high Fe is poorly studied. Here we present preliminary results of a new mechanism of Fe mineralization in oxygenic prokaryotes, the effect of Fe on the generation of PPi bodies in CB, as well as preliminary analysis of the diversity and phylogeny of proteins involved in the prevention of oxidative stress in phototrophs inhabiting IDHS.

  13. Primate phylogenetic relationships and divergence dates inferred from complete mitochondrial genomes

    PubMed Central

    Hodgson, Jason A.; Burrell, Andrew S.; Sterner, Kirstin N.; Raaum, Ryan L.; Disotell, Todd R.

    2014-01-01

    The origins and the divergence times of the most basal lineages within primates have been difficult to resolve mainly due to the incomplete sampling of early fossil taxa. The main source of contention is related to the discordance between molecular and fossil estimates: while there are no crown primate fossils older than 56 Ma, most molecule-based estimates extend the origins of crown primates into the Cretaceous. Here we present a comprehensive mitogenomic study of primates. We assembled 87 mammalian mitochondrial genomes, including 62 primate species representing all the families of the order. We newly sequenced eleven mitochondrial genomes, including eight Old World monkeys and three strepsirrhines. Phylogenetic analyses support a strong topology, confirming the monophyly for all the major primate clades. In contrast to previous mitogenomic studies, the positions of tarsiers and colugos relative to strepsirrhines and anthropoids are well resolved. In order to improve our understanding of how fossil calibrations affect age estimates within primates, we explore the effect of seventeen fossil calibrations across primates and other mammalian groups and we select a subset of calibrations to date our mitogenomic tree. The divergence date estimates of the Strepsirrhine/Haplorhine split support an origin of crown primates in the Late Cretaceous, at around 74 Ma. This result supports a short fuse model of primate origins, whereby relatively little time passed between the origin of the order and the diversification of its major clades. It also suggests that the early primate fossil record is likely poorly sampled. PMID:24583291

  14. Inferring the choreography of parental genomes during fertilization from ultralarge-scale whole-transcriptome analysis.

    PubMed

    Park, Sung-Joon; Komata, Makiko; Inoue, Fukashi; Yamada, Kaori; Nakai, Kenta; Ohsugi, Miho; Shirahige, Katsuhiko

    2013-12-15

    Fertilization precisely choreographs parental genomes by using gamete-derived cellular factors and activating genome regulatory programs. However, the mechanism remains elusive owing to the technical difficulties of preparing large numbers of high-quality preimplantation cells. Here, we collected >14 × 10^4 high-quality mouse metaphase II oocytes and used these to establish detailed transcriptional profiles for four early embryo stages and parthenogenetic development. By combining these profiles with other public resources, we found evidence that gene silencing appeared to be mediated in part by noncoding RNAs and that this was a prerequisite for post-fertilization development. Notably, we identified 817 genes that were differentially expressed in embryos after fertilization compared with parthenotes. The regulation of these genes was distinctly different from those expressed in parthenotes, suggesting functional specialization of particular transcription factors prior to first cell cleavage. We identified five transcription factors that were potentially necessary for developmental progression: Foxd1, Nkx2-5, Sox18, Myod1, and Runx1. Our very large-scale whole-transcriptome profile of early mouse embryos yielded a novel and valuable resource for studies in developmental biology and stem cell research. The database is available at http://dbtmee.hgc.jp.

  15. Karyotypic evolution of the family Sciuridae: inferences from the genome organizations of ground squirrels.

    PubMed

    Li, T; Wang, J; Su, W; Nie, W; Yang, F

    2006-01-01

    Cross-species chromosome painting has made a great contribution to our understanding of the evolution of karyotypes and genome organizations of mammals. Several recent papers of comparative painting between tree and flying squirrels have shed some light on the evolution of the family Sciuridae and the order Rodentia. In the present study we have extended the comparative painting to the Himalayan marmot (Marmota himalayana) and the African ground squirrel (Xerus cf. erythropus), i.e. representative species from another important squirrel group, the ground squirrels, and have established genome-wide comparative chromosome maps between human, eastern gray squirrel, and these two ground squirrels. The results show that 1) the squirrels so far studied all have conserved karyotypes that resemble the ancestral karyotype of the order Rodentia; 2) the African ground squirrels could have retained the ancestral karyotype of the family Sciuridae. Furthermore, we have mapped the evolutionary rearrangements onto a molecular-based consensus phylogenetic tree of the family Sciuridae. 2006 S. Karger AG, Basel.

  16. King penguin demography since the last glaciation inferred from genome-wide data.

    PubMed

    Trucchi, Emiliano; Gratton, Paolo; Whittington, Jason D; Cristofari, Robin; Le Maho, Yvon; Stenseth, Nils Chr; Le Bohec, Céline

    2014-07-22

    How natural climate cycles, such as past glacial/interglacial patterns, have shaped species distributions at the high-latitude regions of the Southern Hemisphere is still largely unclear. Here, we show how the post-glacial warming following the Last Glacial Maximum (ca 18 000 years ago), allowed the (re)colonization of the fragmented sub-Antarctic habitat by an upper-level marine predator, the king penguin Aptenodytes patagonicus. Using restriction site-associated DNA sequencing and standard mitochondrial data, we tested the behaviour of subsets of anonymous nuclear loci in inferring past demography through coalescent-based and allele frequency spectrum analyses. Our results show that the king penguin population breeding on Crozet archipelago steeply increased in size, closely following the Holocene warming recorded in the Epica Dome C ice core. The following population growth can be explained by a threshold model in which the ecological requirements of this species (year-round ice-free habitat for breeding and access to a major source of food such as the Antarctic Polar Front) were met on Crozet soon after the Pleistocene/Holocene climatic transition. © 2014 The Author(s) Published by the Royal Society. All rights reserved.

  17. King penguin demography since the last glaciation inferred from genome-wide data

    PubMed Central

    Trucchi, Emiliano; Gratton, Paolo; Whittington, Jason D.; Cristofari, Robin; Le Maho, Yvon; Stenseth, Nils Chr; Le Bohec, Céline

    2014-01-01

    How natural climate cycles, such as past glacial/interglacial patterns, have shaped species distributions at the high-latitude regions of the Southern Hemisphere is still largely unclear. Here, we show how the post-glacial warming following the Last Glacial Maximum (ca 18 000 years ago), allowed the (re)colonization of the fragmented sub-Antarctic habitat by an upper-level marine predator, the king penguin Aptenodytes patagonicus. Using restriction site-associated DNA sequencing and standard mitochondrial data, we tested the behaviour of subsets of anonymous nuclear loci in inferring past demography through coalescent-based and allele frequency spectrum analyses. Our results show that the king penguin population breeding on Crozet archipelago steeply increased in size, closely following the Holocene warming recorded in the Epica Dome C ice core. The following population growth can be explained by a threshold model in which the ecological requirements of this species (year-round ice-free habitat for breeding and access to a major source of food such as the Antarctic Polar Front) were met on Crozet soon after the Pleistocene/Holocene climatic transition. PMID:24920481

  18. Integrating gene and protein expression data with genome-scale metabolic networks to infer functional pathways.

    PubMed

    Pey, Jon; Valgepea, Kaspar; Rubio, Angel; Beasley, John E; Planes, Francisco J

    2013-12-08

    The study of cellular metabolism in the context of high-throughput -omics data has allowed us to decipher novel mechanisms of importance in biotechnology and health. To continue with this progress, it is essential to efficiently integrate experimental data into metabolic modeling. We present here an in-silico framework to infer relevant metabolic pathways for a particular phenotype under study based on its gene/protein expression data. This framework is based on the Carbon Flux Path (CFP) approach, a mixed-integer linear program that expands classical path finding techniques by considering additional biophysical constraints. In particular, the objective function of the CFP approach is amended to account for gene/protein expression data and influence obtained paths. This approach is termed integrative Carbon Flux Path (iCFP). We show that gene/protein expression data also influences the stoichiometric balancing of CFPs, which provides a more accurate picture of active metabolic pathways. This is illustrated in both a theoretical and real scenario. Finally, we apply this approach to find novel pathways relevant in the regulation of acetate overflow metabolism in Escherichia coli. As a result, several targets which could be relevant for better understanding of the phenomenon leading to impaired acetate overflow are proposed. A novel mathematical framework that determines functional pathways based on gene/protein expression data is presented and validated. We show that our approach is able to provide new insights into complex biological scenarios such as acetate overflow in Escherichia coli.

  19. Integrating gene and protein expression data with genome-scale metabolic networks to infer functional pathways

    PubMed Central

    2013-01-01

    Background: The study of cellular metabolism in the context of high-throughput -omics data has allowed us to decipher novel mechanisms of importance in biotechnology and health. To continue with this progress, it is essential to efficiently integrate experimental data into metabolic modeling. Results: We present here an in-silico framework to infer relevant metabolic pathways for a particular phenotype under study based on its gene/protein expression data. This framework is based on the Carbon Flux Path (CFP) approach, a mixed-integer linear program that expands classical path finding techniques by considering additional biophysical constraints. In particular, the objective function of the CFP approach is amended to account for gene/protein expression data and influence obtained paths. This approach is termed integrative Carbon Flux Path (iCFP). We show that gene/protein expression data also influences the stoichiometric balancing of CFPs, which provides a more accurate picture of active metabolic pathways. This is illustrated in both a theoretical and real scenario. Finally, we apply this approach to find novel pathways relevant in the regulation of acetate overflow metabolism in Escherichia coli. As a result, several targets which could be relevant for better understanding of the phenomenon leading to impaired acetate overflow are proposed. Conclusions: A novel mathematical framework that determines functional pathways based on gene/protein expression data is presented and validated. We show that our approach is able to provide new insights into complex biological scenarios such as acetate overflow in Escherichia coli. PMID:24314206

  20. Child Development and Structural Variation in the Human Genome

    ERIC Educational Resources Information Center

    Zhang, Ying; Haraksingh, Rajini; Grubert, Fabian; Abyzov, Alexej; Gerstein, Mark; Weissman, Sherman; Urban, Alexander E.

    2013-01-01

    Structural variation of the human genome sequence is the insertion, deletion, or rearrangement of stretches of DNA sequence sized from around 1,000 to millions of base pairs. Over the past few years, structural variation has been shown to be far more common in human genomes than previously thought. Very little is currently known about the effects…

  2. Accurate Inference of Subtle Population Structure (and Other Genetic Discontinuities) Using Principal Coordinates

    PubMed Central

    Reeves, Patrick A.; Richards, Christopher M.

    2009-01-01

    Background: Accurate inference of genetic discontinuities between populations is an essential component of intraspecific biodiversity and evolution studies, as well as associative genetics. The most widely-used methods to infer population structure are model-based, Bayesian MCMC procedures that minimize Hardy-Weinberg and linkage disequilibrium within subpopulations. These methods are useful, but suffer from large computational requirements and a dependence on modeling assumptions that may not be met in real data sets. Here we describe the development of a new approach, PCO-MC, which couples principal coordinate analysis to a clustering procedure for the inference of population structure from multilocus genotype data. Methodology/Principal Findings: PCO-MC uses data from all principal coordinate axes simultaneously to calculate a multidimensional “density landscape”, from which the number of subpopulations, and the membership within subpopulations, is determined using a valley-seeking algorithm. Using extensive simulations, we show that this approach outperforms a Bayesian MCMC procedure when many loci (e.g. 100) are sampled, but that the Bayesian procedure is marginally superior with few loci (e.g. 10). When presented with sufficient data, PCO-MC accurately delineated subpopulations with population Fst values as low as 0.03 (G'st>0.2), whereas the limit of resolution of the Bayesian approach was Fst = 0.05 (G'st>0.35). Conclusions/Significance: We draw a distinction between population structure inference for describing biodiversity as opposed to Type I error control in associative genetics. We suggest that discrete assignments, like those produced by PCO-MC, are appropriate for circumscribing units of biodiversity whereas expression of population structure as a continuous variable is more useful for case-control correction in structured association studies. PMID:19172174
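
    To give a feel for how subtle the Fst values quoted in this abstract are, the sketch below computes Wright's Fst for a single biallelic locus from subpopulation allele frequencies, as (Ht - Hs) / Ht with equally weighted subpopulations. This is a textbook illustration, not the estimator or the PCO-MC procedure used in the paper; the function name and the example frequencies are hypothetical.

```python
def fst_biallelic(freqs):
    """Wright's Fst at one biallelic locus.

    freqs: frequency of one allele in each subpopulation (equal weights assumed).
    Ht is the expected heterozygosity of the pooled population,
    Hs the mean expected heterozygosity within subpopulations.
    """
    p_bar = sum(freqs) / len(freqs)
    h_t = 2.0 * p_bar * (1.0 - p_bar)
    h_s = sum(2.0 * p * (1.0 - p) for p in freqs) / len(freqs)
    return (h_t - h_s) / h_t if h_t > 0 else 0.0


# Two subpopulations whose allele frequencies differ by 0.2.
print(fst_biallelic([0.4, 0.6]))
```

    Under these assumptions, allele frequencies of 0.4 and 0.6 give an Fst of about 0.04, which lies between the two resolution limits reported above.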

  3. Inferring structural connectivity using Ising couplings in models of neuronal networks.

    PubMed

    Kadirvelu, Balasundaram; Hayashi, Yoshikatsu; Nasuto, Slawomir J

    2017-08-15

    Functional connectivity metrics have been widely used to infer the underlying structural connectivity in neuronal networks. Maximum entropy based Ising models have been suggested to discount the effect of indirect interactions and give good results in inferring the true anatomical connections. However, no benchmarking is currently available to assess the performance of Ising couplings against other functional connectivity metrics in the microscopic scale of neuronal networks through a wide set of network conditions and network structures. In this paper, we study the performance of the Ising model couplings to infer the synaptic connectivity in in silico networks of neurons and compare its performance against partial and cross-correlations for different correlation levels, firing rates, network sizes, network densities, and topologies. Our results show that the relative performance amongst the three functional connectivity metrics depends primarily on the network correlation levels. Ising couplings detected the most structural links at very weak network correlation levels, and partial correlations outperformed Ising couplings and cross-correlations at strong correlation levels. The result was consistent across varying firing rates, network sizes, and topologies. The findings of this paper serve as a guide in choosing the right functional connectivity tool to reconstruct the structural connectivity.
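
    As a rough illustration of why partial correlation can discount indirect interactions, the sketch below computes a first-order partial correlation between two signals while conditioning on a single third signal. This is a simplification assumed here for brevity: the study conditions on the rest of the network, and the function names and input values are hypothetical, not taken from the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def partial_from_correlations(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z.

    Assumption: only one conditioning variable z (first-order partial correlation).
    """
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def partial_correlation(x, y, z):
    """Convenience wrapper that works directly on three equal-length signals."""
    return partial_from_correlations(pearson(x, y), pearson(x, z), pearson(y, z))

# Hypothetical values: x and y are correlated only because both follow z,
# so the raw correlation (0.42) vanishes once z is accounted for.
indirect_only = partial_from_correlations(r_xy=0.42, r_xz=0.6, r_yz=0.7)
```

    When the raw correlation between x and y equals the product of their correlations with z, the partial correlation is zero, which is the sense in which an indirect link is discounted.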

  4. Evolutionary landscape of amphibians emerging from ancient freshwater fish inferred from complete mitochondrial genomes.

    PubMed

    Wang, Xiao-Tong; Zhang, Yan-Feng; Wu, Qian; Zhang, Hao

    2012-05-04

    It is very interesting that the only extant marine amphibian is the marine frog, Fejervarya cancrivora. This study investigated the reasons for this apparent rarity by conducting a phylogenetic tree analysis of the complete mitochondrial genomes from 14 amphibians, 67 freshwater fishes, four migratory fishes, 35 saltwater fishes, and one hemichordate. The results showed that amphibians, living fossil fishes, and the common ancestors of modern fishes are phylogenetically separated. In general, amphibians, living fossil fishes, saltwater fishes, and freshwater fishes are clustered in different clades. This suggests that the ancestor of living amphibians arose from a type of primordial freshwater fish, rather than the coelacanth, lungfish, or modern saltwater fish. Modern freshwater fish and modern saltwater fish were probably separated from a common ancestor by a single event, caused by crustal movement. Copyright © 2012 Elsevier Inc. All rights reserved.

  5. Inverse Bayesian inference as a key of consciousness featuring a macroscopic quantum logical structure.

    PubMed

    Gunji, Yukio-Pegio; Shinohara, Shuji; Haruna, Taichi; Basios, Vasileios

    2017-02-01

    To overcome the dualism between mind and matter and to implement consciousness in science, a physical entity has to be embedded with a measurement process. Although quantum mechanics has been regarded as a candidate for implementing consciousness, nature at its macroscopic level is inconsistent with quantum mechanics. We propose a measurement-oriented inference system comprising Bayesian and inverse Bayesian inferences. While Bayesian inference contracts probability space, the newly defined inverse one relaxes the space. These two inferences allow an agent to make a decision corresponding to an immediate change in their environment. They generate a particular pattern of joint probability for data and hypotheses, comprising multiple diagonal and noisy matrices. This is expressed as a nondistributive orthomodular lattice equivalent to quantum logic. We also show that an orthomodular lattice can reveal information generated by inverse syllogism as well as the solutions to the frame and symbol-grounding problems. Our model is the first to connect macroscopic cognitive processes with the mathematical structure of quantum mechanics with no additional assumptions.

  6. Phylogeography of the fire-bellied toads Bombina: independent Pleistocene histories inferred from mitochondrial genomes.

    PubMed

    Hofman, Sebastian; Spolsky, Christina; Uzzell, Thomas; Cogălniceanu, Dan; Babik, Wiesław; Szymura, Jacek M

    2007-06-01

    The fire-bellied toads Bombina bombina and Bombina variegata interbreed in a long, narrow zone maintained by a balance between selection and dispersal. Hybridization takes place between local, genetically differentiated groups. To quantify divergence between these groups and reconstruct their history and demography, we analysed nucleotide variation at the mitochondrial cytochrome b gene (1096 bp) in 364 individuals from 156 sites representing the entire range of both species. Three distinct clades with high sequence divergence (K2P = 8-11%) were distinguished. One clade grouped B. bombina haplotypes; the two other clades grouped B. variegata haplotypes. One B. variegata clade included only Carpathian individuals; the other represented B. variegata from the southwestern parts of its distribution: Southern and Western Europe (Balkano-Western lineage), Apennines, and the Rhodope Mountains. Differentiation between the Carpathian and Balkano-Western lineages, K2P approximately 8%, approached interspecific divergence. Deep divergence among European Bombina lineages suggests their preglacial origin, and implies long and largely independent evolutionary histories of the species. Multiple glacial refugia were identified in the lowlands adjoining the Black Sea, in the Carpathians, in the Balkans, and in the Apennines. The results of the nested clade and demographic analyses suggest drastic reductions of population sizes during the last glacial period, and significant demographic growth related to postglacial colonization. Inferred history, supported by fossil evidence, demonstrates that Bombina ranges underwent repeated contractions and expansions. Geographical concordance between morphology, allozymes, and mtDNA shows that previous episodes of interspecific hybridization have left no detectable mtDNA introgression. Either the admixed populations went extinct, or selection against hybrids hindered mtDNA gene flow in ancient hybrid zones.
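
    The K2P figures above are Kimura two-parameter distances, which correct the observed proportion of differing sites for multiple substitutions while treating transitions and transversions separately. The following is a minimal sketch of the standard formula for one pair of aligned sequences; the function name and the toy sequences are hypothetical and do not come from the study.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned DNA sequences.

    d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q)), where P and Q are the
    proportions of transitional and transversional differences.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError for saturated pairs, where the distance is undefined.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))

# Toy pair: one transition in ten sites (uncorrected p-distance 0.10).
toy_distance = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")
```

    For the toy pair the corrected distance comes out slightly above the raw 10% difference, which is the expected direction of the correction.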

  7. Conflicting genomic signals affect phylogenetic inference in four species of North American pines.

    PubMed

    Koralewski, Tomasz E; Mateos, Mariana; Krutovsky, Konstantin V

    2016-01-01

    Adaptive evolutionary processes in plants may be accompanied by episodes of introgression, parallel evolution and incomplete lineage sorting that pose challenges in untangling species evolutionary history. Genus Pinus (pines) is one of the most abundant and most studied groups among gymnosperms, and a good example of a lineage where these phenomena have been observed. Pines are among the most ecologically and economically important plant species. Some, such as the pines of the southeastern USA (southern pines in subsection Australes), are subjects of intensive breeding programmes. Despite numerous published studies, the evolutionary history of Australes remains ambiguous and often controversial. We studied the phylogeny of four major southern pine species: shortleaf (Pinus echinata), slash (P. elliottii), longleaf (P. palustris) and loblolly (P. taeda), using sequences from 11 nuclear loci and maximum likelihood and Bayesian methods. Our analysis encountered resolution difficulties similar to earlier published studies. Although incomplete lineage sorting and introgression are two phenomena presumptively underlying our results, the phylogenetic inferences seem to be also influenced by the genes examined, with certain topologies supported by sets of genes sharing common putative functionalities. For example, genes involved in wood formation supported the clade echinata-taeda, genes linked to plant defence supported the clade echinata-elliottii and genes linked to water management properties supported the clade echinata-palustris. The support for these clades was very high and consistent across methods. We discuss the potential factors that could underlie these observations, including incomplete lineage sorting, hybridization and parallel or adaptive evolution. Our results likely reflect the relatively short evolutionary history of the subsection that is thought to have begun during the middle Miocene and has been influenced by climate fluctuations. Published by Oxford

  8. Conflicting genomic signals affect phylogenetic inference in four species of North American pines

    PubMed Central

    Koralewski, Tomasz E.; Mateos, Mariana; Krutovsky, Konstantin V.

    2016-01-01

    Adaptive evolutionary processes in plants may be accompanied by episodes of introgression, parallel evolution and incomplete lineage sorting that pose challenges in untangling species evolutionary history. Genus Pinus (pines) is one of the most abundant and most studied groups among gymnosperms, and a good example of a lineage where these phenomena have been observed. Pines are among the most ecologically and economically important plant species. Some, such as the pines of the southeastern USA (southern pines in subsection Australes), are subjects of intensive breeding programmes. Despite numerous published studies, the evolutionary history of Australes remains ambiguous and often controversial. We studied the phylogeny of four major southern pine species: shortleaf (Pinus echinata), slash (P. elliottii), longleaf (P. palustris) and loblolly (P. taeda), using sequences from 11 nuclear loci and maximum likelihood and Bayesian methods. Our analysis encountered resolution difficulties similar to earlier published studies. Although incomplete lineage sorting and introgression are two phenomena presumptively underlying our results, the phylogenetic inferences seem to be also influenced by the genes examined, with certain topologies supported by sets of genes sharing common putative functionalities. For example, genes involved in wood formation supported the clade echinata–taeda, genes linked to plant defence supported the clade echinata–elliottii and genes linked to water management properties supported the clade echinata–palustris. The support for these clades was very high and consistent across methods. We discuss the potential factors that could underlie these observations, including incomplete lineage sorting, hybridization and parallel or adaptive evolution. Our results likely reflect the relatively short evolutionary history of the subsection that is thought to have begun during the middle Miocene and has been influenced by climate fluctuations. PMID

  9. PyClone: statistical inference of clonal population structure in cancer.

    PubMed

    Roth, Andrew; Khattra, Jaswinder; Yap, Damian; Wan, Adrian; Laks, Emma; Biele, Justina; Ha, Gavin; Aparicio, Samuel; Bouchard-Côté, Alexandre; Shah, Sohrab P

    2014-04-01

    We introduce PyClone, a statistical model for inference of clonal population structures in cancers. PyClone is a Bayesian clustering method for grouping sets of deeply sequenced somatic mutations into putative clonal clusters while estimating their cellular prevalences and accounting for allelic imbalances introduced by segmental copy-number changes and normal-cell contamination. Single-cell sequencing validation demonstrates PyClone's accuracy.

  10. Module Anchored Network Inference: A Sequential Module-Based Approach to Novel Gene Network Construction from Genomic Expression Data on Human Disease Mechanism

    PubMed Central

    Keller, Susanna R.; Lee, Jae K.

    2017-01-01

    Different computational approaches have been examined and compared for inferring network relationships from time-series genomic data on human disease mechanisms under the recent Dialogue on Reverse Engineering Assessment and Methods (DREAM) challenge. Many of these approaches infer all possible relationships among all candidate genes, often resulting in extremely crowded candidate network relationships with many more False Positives than True Positives. To overcome this limitation, we introduce a novel approach, Module Anchored Network Inference (MANI), that constructs networks by analyzing sequentially small adjacent building blocks (modules). Using MANI, we inferred a 7-gene adipogenesis network based on time-series gene expression data during adipocyte differentiation. MANI was also applied to infer two 10-gene networks based on time-course perturbation datasets from DREAM3 and DREAM4 challenges. MANI well inferred and distinguished serial, parallel, and time-dependent gene interactions and network cascades in these applications showing a superior performance to other in silico network inference techniques for discovering and reconstructing gene network relationships. PMID:28197408

  11. Inference and Analysis of Population Structure Using Genetic Data and Network Theory.

    PubMed

    Greenbaum, Gili; Templeton, Alan R; Bar-David, Shirli

    2016-04-01

    Clustering individuals to subpopulations based on genetic data has become commonplace in many genetic studies. Inference about population structure is most often done by applying model-based approaches, aided by visualization using distance-based approaches such as multidimensional scaling. While existing distance-based approaches suffer from a lack of statistical rigor, model-based approaches entail assumptions of prior conditions such as that the subpopulations are at Hardy-Weinberg equilibria. Here we present a distance-based approach for inference about population structure using genetic data by defining population structure using network theory terminology and methods. A network is constructed from a pairwise genetic-similarity matrix of all sampled individuals. The community partition, a partition of a network to dense subgraphs, is equated with population structure, a partition of the population to genetically related groups. Community-detection algorithms are used to partition the network into communities, interpreted as a partition of the population to subpopulations. The statistical significance of the structure can be estimated by using permutation tests to evaluate the significance of the partition's modularity, a network theory measure indicating the quality of community partitions. To further characterize population structure, a new measure of the strength of association (SA) for an individual to its assigned community is presented. The strength of association distribution (SAD) of the communities is analyzed to provide additional population structure characteristics, such as the relative amount of gene flow experienced by the different subpopulations and identification of hybrid individuals. Human genetic data and simulations are used to demonstrate the applicability of the analyses. The approach presented here provides a novel, computationally efficient model-free method for inference about population structure that does not entail assumption of
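
    The modularity measure and permutation test mentioned above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the network is a small hypothetical similarity matrix, the community labels are supplied by hand instead of by a community-detection algorithm, and the null distribution is built by shuffling labels among individuals, which may differ from the permutation scheme used in the paper.

```python
import random

def modularity(weights, communities):
    """Modularity Q of a partition of a weighted, undirected network.

    weights: symmetric matrix (list of lists) with a zero diagonal.
    communities: one community label per node.
    """
    n = len(weights)
    strength = [sum(row) for row in weights]
    total = sum(strength)  # equals 2m, twice the total edge weight
    if total == 0:
        raise ValueError("network has no edges")
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                q += weights[i][j] - strength[i] * strength[j] / total
    return q / total

def permutation_p_value(weights, communities, n_perm=999, seed=0):
    """Share of label shuffles whose modularity reaches the observed value.

    Assumption: labels are permuted among nodes, keeping community sizes fixed.
    """
    rng = random.Random(seed)
    observed = modularity(weights, communities)
    labels = list(communities)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if modularity(weights, labels) >= observed:
            at_least += 1
    return (at_least + 1) / (n_perm + 1)

# Hypothetical network: two triangles (nodes 0-2 and 3-5) joined by one edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
toy_weights = [[0.0] * 6 for _ in range(6)]
for a, b in edges:
    toy_weights[a][b] = toy_weights[b][a] = 1.0
toy_q = modularity(toy_weights, [0, 0, 0, 1, 1, 1])
toy_p = permutation_p_value(toy_weights, [0, 0, 0, 1, 1, 1])
```

    Splitting the toy network into its two triangles gives Q = 5/14, and placing every node in a single community gives Q = 0, so the measure rewards partitions with more within-community weight than the node strengths alone would predict.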

  12. Structure of the germline genome of Tetrahymena thermophila and relationship to the massively rearranged somatic genome.

    PubMed

    Hamilton, Eileen P; Kapusta, Aurélie; Huvos, Piroska E; Bidwell, Shelby L; Zafar, Nikhat; Tang, Haibao; Hadjithomas, Michalis; Krishnakumar, Vivek; Badger, Jonathan H; Caler, Elisabet V; Russ, Carsten; Zeng, Qiandong; Fan, Lin; Levin, Joshua Z; Shea, Terrance; Young, Sarah K; Hegarty, Ryan; Daza, Riza; Gujja, Sharvari; Wortman, Jennifer R; Birren, Bruce W; Nusbaum, Chad; Thomas, Jainy; Carey, Clayton M; Pritham, Ellen J; Feschotte, Cédric; Noto, Tomoko; Mochizuki, Kazufumi; Papazyan, Romeo; Taverna, Sean D; Dear, Paul H; Cassidy-Hanley, Donna M; Xiong, Jie; Miao, Wei; Orias, Eduardo; Coyne, Robert S

    2016-11-28

    The germline genome of the binucleated ciliate Tetrahymena thermophila undergoes programmed chromosome breakage and massive DNA elimination to generate the somatic genome. Here, we present a complete sequence assembly of the germline genome and analyze multiple features of its structure and its relationship to the somatic genome, shedding light on the mechanisms of genome rearrangement as well as the evolutionary history of this remarkable germline/soma differentiation. Our results strengthen the notion that a complex, dynamic, and ongoing interplay between mobile DNA elements and the host genome have shaped Tetrahymena chromosome structure, locally and globally. Non-standard outcomes of rearrangement events, including the generation of short-lived somatic chromosomes and excision of DNA interrupting protein-coding regions, may represent novel forms of developmental gene regulation. We also compare Tetrahymena's germline/soma differentiation to that of other characterized ciliates, illustrating the wide diversity of adaptations that have occurred within this phylum.

  13. Inferring regulatory elements from a whole genome. An analysis of Helicobacter pylori sigma(80) family of promoter signals.

    PubMed

    Vanet, A; Marsan, L; Labigne, A; Sagot, M F

    2000-03-24

    Helicobacter pylori is adapted to life in a unique niche, the gastric epithelium of primates. Its promoters may therefore be different from those of other bacteria. Here, we determine motifs possibly involved in the recognition of such promoter sequences by the RNA polymerase using a new motif identification method. An important feature of this method is that the motifs are sought with the least possible assumptions about what they may look like. The method starts by considering the whole genome of H. pylori and attempts to infer directly from it a description for a family of promoters. Thus, this approach differs from searching for such promoters with a previously established description. The two algorithms are based on the idea of inferring motifs by flexibly comparing words in the sequences with an external object, instead of between themselves. The first algorithm infers single motifs, the second a combination of two motifs separated from one another by strictly defined, sterically constrained distances. Besides independently finding motifs known to be present in other bacteria, such as the Shine-Dalgarno sequence and the TATA-box, this approach suggests the existence in H. pylori of a new, combined motif, TTAAGC, followed optimally 21 bp downstream by TATAAT. Between these two motifs, there is in some cases another, TTTTAA or, less frequently, a repetition of TTAAGC separated optimally from the TATA-box by 12 bp. The combined motif TTAAGCx(21+/-2)TATAAT is present with no errors immediately upstream from the only two copies of the ribosomal 23 S-5 S RNA genes in H. pylori, and with one error upstream from the only two copies of the ribosomal 16 S RNA genes. The operons of both ribosomal RNA molecules are strongly expressed, representing an encouraging sign of the pertinence of the motifs found by the algorithms. In 25 cases out of a possible 30, the combined motif is found with no more than three substitutions immediately upstream from ribosomal proteins, or
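
    Once a combined motif such as TTAAGCx(21+/-2)TATAAT has been inferred, scanning a sequence for exact occurrences is straightforward. The sketch below is not the inference method described in the abstract, which searches without a prior description; it only looks for error-free copies of the reported motif. It assumes that x(21+/-2) denotes a spacer of 19 to 23 arbitrary bases between the two hexamers, and the function name and test sequence are hypothetical.

```python
import re

# Assumption: x(21+/-2) is read as 19 to 23 unconstrained bases between the hexamers.
# The lookahead lets matches that overlap one another be reported as well.
COMBINED_MOTIF = re.compile(r"(?=(TTAAGC([ACGT]{19,23})TATAAT))")

def find_combined_motif(sequence):
    """Return (start position, spacer length) for each exact motif occurrence.

    Positions are 0-based. Only one spacer length is reported per start
    position, and substitutions within the hexamers are not tolerated.
    """
    hits = []
    for match in COMBINED_MOTIF.finditer(sequence.upper()):
        hits.append((match.start(), len(match.group(2))))
    return hits

# Hypothetical sequence carrying one copy of the motif with a 21-base spacer.
toy_sequence = "GG" + "TTAAGC" + "A" * 21 + "TATAAT" + "CC"
toy_hits = find_combined_motif(toy_sequence)
```

    Allowing the substitutions mentioned in the abstract would require an approximate matcher, for example counting mismatches against each hexamer over a sliding window, instead of a plain regular expression.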

  14. Morphological homoplasy, life history evolution, and historical biogeography of plethodontid salamanders inferred from complete mitochondrial genomes

    PubMed Central

    Mueller, Rachel Lockridge; Macey, J. Robert; Jaekel, Martin; Wake, David B.; Boore, Jeffrey L.

    2004-01-01

    The evolutionary history of the largest salamander family (Plethodontidae) is characterized by extreme morphological homoplasy. Analysis of the mechanisms generating such homoplasy requires an independent molecular phylogeny. To this end, we sequenced 24 complete mitochondrial genomes (22 plethodontids and two outgroup taxa), added data for three species from GenBank, and performed partitioned and unpartitioned Bayesian, maximum likelihood, and maximum parsimony phylogenetic analyses. We explored four dataset partitioning strategies to account for evolutionary process heterogeneity among genes and codon positions, all of which yielded increased model likelihoods and decreased numbers of supported nodes in the topologies (Bayesian posterior probability >0.95) relative to the unpartitioned analysis. Our phylogenetic analyses yielded congruent trees that contrast with the traditional morphology-based taxonomy; the monophyly of three of four major groups is rejected. Reanalysis of current hypotheses in light of these evolutionary relationships suggests that (i) a larval life history stage reevolved from a direct-developing ancestor multiple times; (ii) there is no phylogenetic support for the “Out of Appalachia” hypothesis of plethodontid origins; and (iii) novel scenarios must be reconstructed for the convergent evolution of projectile tongues, reduction in toe number, and specialization for defensive tail loss. Some of these scenarios imply morphological transformation series that proceed in the opposite direction than was previously thought. In addition, they suggest surprising evolutionary lability in traits previously interpreted to be conservative. PMID:15365171

  15. Origins of the Moken Sea Gypsies inferred from mitochondrial hypervariable region and whole genome sequences.

    PubMed

    Dancause, Kelsey Needham; Chan, Chim W; Arunotai, Narumon Hinshiranan; Lum, J Koji

    2009-02-01

    The origins of the Moken 'Sea Gypsies,' a group of traditionally boat-dwelling nomadic foragers, remain speculative despite previous examinations from linguistic, sociocultural and genetic perspectives. We explored Moken origin(s) and affinities by comparing whole mitochondrial genome and hypervariable segment I sequences from 12 Moken individuals, sampled from four islands of the Mergui Archipelago, to other mainland Asian, Island Southeast Asian (ISEA) and Oceanic populations. These analyses revealed a major (11/12) and a minor (1/12) haplotype in the population, indicating low mitochondrial diversity likely resulting from historically low population sizes, isolation and consequent genetic drift. Phylogenetic analyses revealed close relationships between the major lineage (MKN1) and ISEA, mainland Asian and aboriginal Malay populations, and of the minor lineage (MKN2) to populations from ISEA. MKN1 belongs to a recently defined subclade of the ancient yet localized M21 haplogroup. MKN2 is not closely related to any previously sampled lineages, but has been tentatively assigned to the basal M46 haplogroup that possibly originated among the original inhabitants of ISEA. Our analyses suggest that MKN1 originated within coastal mainland SEA and dispersed into ISEA and rapidly into the Mergui Archipelago within the past few thousand years as a result of climate change induced population pressure.

  16. Morphological homoplasy, life history evolution, and historical biogeography of plethodontid salamanders inferred from complete mitochondrial genomes

    SciTech Connect

    Mueller, Rachel Lockridge; Macey, J. Robert; Jaekel, Martin; Wake, David B.; Boore, Jeffrey L.

    2004-08-01

    The evolutionary history of the largest salamander family (Plethodontidae) is characterized by extreme morphological homoplasy. Analysis of the mechanisms generating such homoplasy requires an independent, molecular phylogeny. To this end, we sequenced 24 complete mitochondrial genomes (22 plethodontids and two outgroup taxa), added data for three species from GenBank, and performed partitioned and unpartitioned Bayesian, ML, and MP phylogenetic analyses. We explored four dataset partitioning strategies to account for evolutionary process heterogeneity among genes and codon positions, all of which yielded increased model likelihoods and decreased numbers of supported nodes in the topologies (PP > 0.95) relative to the unpartitioned analysis. Our phylogenetic analyses yielded congruent trees that contrast with the traditional morphology-based taxonomy; the monophyly of three out of four major groups is rejected. Reanalysis of current hypotheses in light of these new evolutionary relationships suggests that (1) a larval life history stage re-evolved from a direct-developing ancestor multiple times, (2) there is no phylogenetic support for the ''Out of Appalachia'' hypothesis of plethodontid origins, and (3) novel scenarios must be reconstructed for the convergent evolution of projectile tongues, reduction in toe number, and specialization for defensive tail loss. Some of these novel scenarios imply morphological transformation series that proceed in the opposite direction than was previously thought. In addition, they suggest surprising evolutionary lability in traits previously interpreted to be conservative.

  17. Gorilla genome structural variation reveals evolutionary parallelisms with chimpanzee

    PubMed Central

    Ventura, Mario; Catacchio, Claudia R.; Alkan, Can; Marques-Bonet, Tomas; Sajjadian, Saba; Graves, Tina A.; Hormozdiari, Fereydoun; Navarro, Arcadi; Malig, Maika; Baker, Carl; Lee, Choli; Turner, Emily H.; Chen, Lin; Kidd, Jeffrey M.; Archidiacono, Nicoletta; Shendure, Jay; Wilson, Richard K.; Eichler, Evan E.

    2011-01-01

    Structural variation has played an important role in the evolutionary restructuring of human and great ape genomes. Recent analyses have suggested that the genomes of chimpanzee and human have been particularly enriched for this form of genetic variation. Here, we set out to assess the extent of structural variation in the gorilla lineage by generating 10-fold genomic sequence coverage from a western lowland gorilla and integrating these data into a physical and cytogenetic framework of structural variation. We discovered and validated over 7665 structural changes within the gorilla lineage, including sequence resolution of inversions, deletions, duplications, and mobile element insertions. A comparison with human and other ape genomes shows that the gorilla genome has been subjected to the highest rate of segmental duplication. We show that both the gorilla and chimpanzee genomes have experienced independent yet convergent patterns of structural mutation that have not occurred in humans, including the formation of subtelomeric heterochromatic caps, the hyperexpansion of segmental duplications, and bursts of retroviral integrations. Our analysis suggests that the chimpanzee and gorilla genomes are structurally more derived than either orangutan or human genomes. PMID:21685127

  18. Gorilla genome structural variation reveals evolutionary parallelisms with chimpanzee.

    PubMed

    Ventura, Mario; Catacchio, Claudia R; Alkan, Can; Marques-Bonet, Tomas; Sajjadian, Saba; Graves, Tina A; Hormozdiari, Fereydoun; Navarro, Arcadi; Malig, Maika; Baker, Carl; Lee, Choli; Turner, Emily H; Chen, Lin; Kidd, Jeffrey M; Archidiacono, Nicoletta; Shendure, Jay; Wilson, Richard K; Eichler, Evan E

    2011-10-01

    Structural variation has played an important role in the evolutionary restructuring of human and great ape genomes. Recent analyses have suggested that the genomes of chimpanzee and human have been particularly enriched for this form of genetic variation. Here, we set out to assess the extent of structural variation in the gorilla lineage by generating 10-fold genomic sequence coverage from a western lowland gorilla and integrating these data into a physical and cytogenetic framework of structural variation. We discovered and validated over 7665 structural changes within the gorilla lineage, including sequence resolution of inversions, deletions, duplications, and mobile element insertions. A comparison with human and other ape genomes shows that the gorilla genome has been subjected to the highest rate of segmental duplication. We show that both the gorilla and chimpanzee genomes have experienced independent yet convergent patterns of structural mutation that have not occurred in humans, including the formation of subtelomeric heterochromatic caps, the hyperexpansion of segmental duplications, and bursts of retroviral integrations. Our analysis suggests that the chimpanzee and gorilla genomes are structurally more derived than either orangutan or human genomes.

  19. Minimum message length inference of secondary structure from protein coordinate data

    PubMed Central

    Konagurthu, Arun S.; Lesk, Arthur M.; Allison, Lloyd

    2012-01-01

    Motivation: Secondary structure underpins the folding pattern and architecture of most proteins. Accurate assignment of the secondary structure elements is therefore an important problem. Although many approximate solutions of the secondary structure assignment problem exist, the statement of the problem has resisted a consistent and mathematically rigorous definition. A variety of comparative studies have highlighted major disagreements in the way the available methods define and assign secondary structure to coordinate data. Results: We report a new method to infer secondary structure based on the Bayesian method of minimum message length inference. It treats assignments of secondary structure as hypotheses that explain the given coordinate data. The method seeks to maximize the joint probability of a hypothesis and the data. There is a natural null hypothesis and any assignment that cannot better it is unacceptable. We developed a program SST based on this approach and compared it with popular programs, such as DSSP and STRIDE among others. Our evaluation suggests that SST gives reliable assignments even on low-resolution structures. Availability: http://www.csse.monash.edu.au/~karun/sst Contact: arun.konagurthu@monash.edu (or lloyd.allison@monash.edu) PMID:22689785

  20. SHIPS: Spectral Hierarchical clustering for the Inference of Population Structure in genetic studies.

    PubMed

    Bouaziz, Matthieu; Paccard, Caroline; Guedj, Mickael; Ambroise, Christophe

    2012-01-01

    Inferring the structure of populations has many applications for genetic research. In addition to providing information for evolutionary studies, it can be used to account for the bias induced by population stratification in association studies. To this end, many algorithms have been proposed to cluster individuals into genetically homogeneous sub-populations. The parametric algorithms, such as Structure, are very popular but their underlying complexity and their high computational cost led to the development of faster parametric alternatives such as Admixture. Alternatives to these methods are the non-parametric approaches. Among this category, AWclust has proven efficient but fails to properly identify population structure for complex datasets. We present in this article a new clustering algorithm called Spectral Hierarchical clustering for the Inference of Population Structure (SHIPS), based on a divisive hierarchical clustering strategy, allowing a progressive investigation of population structure. This method takes genetic data as input to cluster individuals into homogeneous sub-populations and with the use of the gap statistic estimates the optimal number of such sub-populations. SHIPS was applied to a set of simulated discrete and admixed datasets and to real SNP datasets, that are data from the HapMap and Pan-Asian SNP consortium. The programs Structure, Admixture, AWclust and PCAclust were also investigated in a comparison study. SHIPS and the parametric approach Structure were the most accurate when applied to simulated datasets both in terms of individual assignments and estimation of the correct number of clusters. The analysis of the results on the real datasets highlighted that the clusterings of SHIPS were the more consistent with the population labels or those produced by the Admixture program. The performances of SHIPS when applied to SNP data, along with its relatively low computational cost and its ease of use make this method a promising

  1. SHIPS: Spectral Hierarchical Clustering for the Inference of Population Structure in Genetic Studies

    PubMed Central

    Bouaziz, Matthieu; Paccard, Caroline; Guedj, Mickael; Ambroise, Christophe

    2012-01-01

    Inferring the structure of populations has many applications for genetic research. In addition to providing information for evolutionary studies, it can be used to account for the bias induced by population stratification in association studies. To this end, many algorithms have been proposed to cluster individuals into genetically homogeneous sub-populations. The parametric algorithms, such as Structure, are very popular but their underlying complexity and their high computational cost led to the development of faster parametric alternatives such as Admixture. Alternatives to these methods are the non-parametric approaches. In this category, AWclust has proven efficient but fails to properly identify population structure for complex datasets. We present in this article a new clustering algorithm called Spectral Hierarchical clustering for the Inference of Population Structure (SHIPS), based on a divisive hierarchical clustering strategy, allowing a progressive investigation of population structure. This method takes genetic data as input to cluster individuals into homogeneous sub-populations and, with the use of the gap statistic, estimates the optimal number of such sub-populations. SHIPS was applied to a set of simulated discrete and admixed datasets and to real SNP datasets, namely data from the HapMap and Pan-Asian SNP consortium. The programs Structure, Admixture, AWclust and PCAclust were also investigated in a comparison study. SHIPS and the parametric approach Structure were the most accurate when applied to simulated datasets both in terms of individual assignments and estimation of the correct number of clusters. The analysis of the results on the real datasets highlighted that the clusterings of SHIPS were the most consistent with the population labels or those produced by the Admixture program. The performances of SHIPS when applied to SNP data, along with its relatively low computational cost and its ease of use make this method a promising

  2. Netter: re-ranking gene network inference predictions using structural network properties.

    PubMed

    Ruyssinck, Joeri; Demeester, Piet; Dhaene, Tom; Saeys, Yvan

    2016-02-09

    Many algorithms have been developed to infer the topology of gene regulatory networks from gene expression data. These methods typically produce a ranking of links between genes with associated confidence scores, after which a certain threshold is chosen to produce the inferred topology. However, the structural properties of the predicted network do not resemble those typical for a gene regulatory network, as most algorithms only take into account connections found in the data and do not include known graph properties in their inference process. This lowers the prediction accuracy of these methods, limiting their usability in practice. We propose a post-processing algorithm which is applicable to any confidence ranking of regulatory interactions obtained from a network inference method and which can use, inter alia, graphlets and several graph-invariant properties to re-rank the links into a more accurate prediction. To demonstrate the potential of our approach, we re-rank predictions of six different state-of-the-art algorithms using three simple network properties as optimization criteria and show that Netter can improve the predictions made on both artificially generated data and the DREAM4 and DREAM5 benchmarks. Additionally, the DREAM5 E. coli community prediction inferred from real expression data is further improved. Furthermore, Netter compares favorably to other post-processing algorithms and is not restricted to correlation-like predictions. Lastly, we demonstrate that the performance increase is robust for a wide range of parameter settings. Netter is available at http://bioinformatics.intec.ugent.be. Network inference from high-throughput data is a long-standing challenge. In this work, we present Netter, which can further refine network predictions based on a set of user-defined graph properties. Netter is a flexible system which can be applied in unison with any method producing a ranking from omics data. It can be tailored to specific prior

  3. Phylogeny and genetic history of the Siberian salamander (Salamandrella keyserlingii, Dybowski, 1870) inferred from complete mitochondrial genomes.

    PubMed

    Malyarchuk, Boris; Derenko, Miroslava; Denisova, Galina

    2013-05-01

    We assessed phylogeny of the Siberian salamander (Salamandrella keyserlingii, Dybowski, 1870), the most northern ectothermic, terrestrial vertebrate in Eurasia, by sequence analysis of complete mitochondrial genomes in 26 specimens from different localities (China, Khabarovsk region, Sakhalin, Yakutia, Magadan region, Chukotka, Kamchatka, Ural, European part of Russia). In addition, a complete mitochondrial genome of the Schrenck salamander, Salamandrella schrenckii, was determined for the first time. Bayesian phylogenetic analysis of the entire mtDNA genomes of S. keyserlingii demonstrates that two haplotype clades, AB and C, radiated about 1.4 million years ago (Mya). Bayesian skyline plots of population size change through time show an expansion around 250 thousand years ago (kya) and then a decline around the Last Glacial Maximum (25 kya) with subsequent restoration of population size. Climatic changes during the Quaternary period have dramatically affected the population genetic structure of the Siberian salamanders. In addition, complete mtDNA sequence analysis allowed us to recognize that the vast area of Northern Eurasia was colonized only by the Siberian salamander clade C1b during the last 150 kya. Meanwhile, we were unable to find evidence of molecular adaptation in this clade by analyzing the whole mitochondrial genomes of the Siberian salamanders.

  4. Parallel computation of genome-scale RNA secondary structure to detect structural constraints on human genome.

    PubMed

    Kawaguchi, Risa; Kiryu, Hisanori

    2016-05-06

    RNA secondary structure around splice sites is known to assist normal splicing by promoting spliceosome recognition. However, analyzing the structural properties of entire intronic regions or pre-mRNA sequences has been difficult hitherto, owing to serious experimental and computational limitations, such as low read coverage and numerical problems. Our novel software, "ParasoR", is designed to run on a computer cluster and enables the exact computation of various structural features of long RNA sequences under the constraint of maximal base-pairing distance. ParasoR divides dynamic programming (DP) matrices into smaller pieces, such that each piece can be computed by a separate computer node without losing the connectivity information between the pieces. ParasoR directly computes the ratios of DP variables to avoid the reduction of numerical precision caused by the cancellation of a large number of Boltzmann factors. The structural preferences of mRNAs computed by ParasoR show a high concordance with those determined by high-throughput sequencing analyses. Using ParasoR, we investigated the global structural preferences of transcribed regions in the human genome. A genome-wide folding simulation indicated that transcribed regions are significantly more structural than intergenic regions after removing repeat sequences and k-mer frequency bias. In particular, we observed a highly significant preference for base pairing over entire intronic regions as compared to their antisense sequences, as well as to intergenic regions. A comparison between pre-mRNAs and mRNAs showed that coding regions become more accessible after splicing, indicating constraints for translational efficiency. Such changes are correlated with gene expression levels, as well as GC content, and are enriched among genes associated with cytoskeleton and kinase functions. We have shown that ParasoR is very useful for analyzing the structural properties of long RNA sequences such as mRNAs, pre

  5. Phylogeny and biogeography of the family Salamandridae (Amphibia: Caudata) inferred from complete mitochondrial genomes.

    PubMed

    Zhang, Peng; Papenfuss, Theodore J; Wake, Marvalee H; Qu, Lianghu; Wake, David B

    2008-11-01

    Phylogenetic relationships of members of the salamander family Salamandridae were examined using complete mitochondrial genomes collected from 42 species representing all 20 salamandrid genera and five outgroup taxa. Weighted maximum parsimony, partitioned maximum likelihood, and partitioned Bayesian approaches all produce an identical, well-resolved phylogeny; most branches are strongly supported with greater than 90% bootstrap values and 1.0 Bayesian posterior probabilities. Our results support recent taxonomic changes in finding the traditional genera Mertensiella, Euproctus, and Triturus to be non-monophyletic species assemblages. We successfully resolved the current polytomy at the base of the salamandrid tree: the Italian newt genus Salamandrina is sister to all remaining salamandrids. Beyond Salamandrina, a clade comprising all remaining newts is separated from a clade containing the true salamanders. Among these newts, the branching orders of well-supported clades are: primitive newts (Echinotriton, Pleurodeles, and Tylototriton), New World newts (Notophthalmus-Taricha), Corsica-Sardinia newts (Euproctus), and modern European newts (Calotriton, Lissotriton, Mesotriton, Neurergus, Ommatotriton, and Triturus) plus modern Asian newts (Cynops, Pachytriton, and Paramesotriton). Two alternative sets of calibration points and two Bayesian dating methods (BEAST and MultiDivTime) were used to estimate timescales for salamandrid evolution. The estimation difference by dating methods is slight and we propose two sets of timescales based on different calibration choices. The two timescales suggest that the initial diversification of extant salamandrids took place in Europe about 97 or 69 Ma. North American salamandrids were derived from their European ancestors by dispersal through North Atlantic Land Bridges in the Late Cretaceous (approximately 69 Ma) or Middle Eocene (approximately 43 Ma). Ancestors of Asian salamandrids most probably dispersed to the eastern Asia

  6. Phylogenetic Diversity of the Enteric Pathogen Salmonella enterica subsp. enterica Inferred from Genome-Wide Reference-Free SNP Characters

    PubMed Central

    Timme, Ruth E.; Pettengill, James B.; Allard, Marc W.; Strain, Errol; Barrangou, Rodolphe; Wehnes, Chris; Van Kessel, JoAnn S.; Karns, Jeffrey S.; Musser, Steven M.; Brown, Eric W.

    2013-01-01

    The enteric pathogen Salmonella enterica is one of the leading causes of foodborne illness in the world. The species is extremely diverse, containing more than 2,500 named serovars that are designated for their unique antigen characters and pathogenicity profiles; some are known to be virulent pathogens, while others are not. Questions regarding the evolution of pathogenicity, significance of antigen characters, diversity of clustered regularly interspaced short palindromic repeat (CRISPR) loci, among others, will remain elusive until a strong evolutionary framework is established. We present the first large-scale S. enterica subsp. enterica phylogeny inferred from a new reference-free k-mer approach of gathering single nucleotide polymorphisms (SNPs) from whole genomes. The phylogeny of 156 isolates representing 78 serovars (102 were newly sequenced) reveals two major lineages, each with many strongly supported sublineages. One of these lineages is the S. Typhi group, well nested within the phylogeny. Lineage-through-time analyses suggest there have been two instances of accelerated rates of diversification within the subspecies. We also found that antigen characters and CRISPR loci reveal different evolutionary patterns than that of the phylogeny, suggesting that a horizontal gene transfer or possibly a shared environmental acquisition might have influenced the present character distribution. Our study also shows the ability to extract reference-free SNPs from a large set of genomes and then to use these SNPs for phylogenetic reconstruction. This automated, annotation-free approach is an important step forward for bacterial disease tracking and in efficiently elucidating the evolutionary history of highly clonal organisms. PMID:24158624
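
    The abstract above does not spell out how the reference-free k-mer approach works. The block below is only a minimal sketch of one common way to find SNP candidates without a reference: group odd-length k-mers by their flanking bases and report contexts where the middle base differs between genomes. The function name kmer_snps, the toy sequences and k=5 are assumptions made for illustration, not the authors' pipeline, and the sketch ignores reverse complements and sequencing error.

```python
from collections import defaultdict


def kmer_snps(genomes, k=5):
    """Return {(left_flank, right_flank): {genome_name: middle_base}}.

    Hypothetical helper: a context is reported only if it occurs in every
    genome with exactly one middle base per genome, and the middle base
    is not the same in all genomes.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so that there is a middle base")
    half = k // 2
    # context -> genome name -> set of middle bases seen in that genome
    sites = defaultdict(lambda: defaultdict(set))
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            context = (kmer[:half], kmer[half + 1:])
            sites[context][name].add(kmer[half])

    snps = {}
    for context, per_genome in sites.items():
        if len(per_genome) != len(genomes):
            continue  # context missing from at least one genome
        if any(len(bases) != 1 for bases in per_genome.values()):
            continue  # ambiguous (repeated) context within a genome
        calls = {name: next(iter(bases)) for name, bases in per_genome.items()}
        if len(set(calls.values())) > 1:
            snps[context] = calls
    return snps


if __name__ == "__main__":
    toy = {"g1": "ACGTACCTGA", "g2": "ACGTAGCTGA"}  # differ at one position
    print(kmer_snps(toy, k=5))
```

    In this toy input the only reported site is the context ("TA", "CT"), where g1 carries C and g2 carries G; the resulting per-genome base calls could then be concatenated into an alignment for tree building.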

  7. Inference of hazel grouse population structure using multilocus data: a landscape genetic approach.

    PubMed

    Sahlsten, J; Thörngren, H; Höglund, J

    2008-12-01

    In conservation and management of species it is important to make inferences about gene flow, dispersal and population structure. In this study, we used 613 georeferenced tissue samples from hazel grouse (Bonasa bonasia) where each individual was genotyped at 12 microsatellite loci to make inference on population genetic structure, gene flow and dispersal in northern Sweden. Observed levels of genetic diversity suggest that Swedish hazel grouse do not suffer loss of genetic diversity compared with other grouse species. We found significant F(IS) (deviation from Hardy-Weinberg expectations) over the entire sample using jack-knifed estimators over loci, which is most likely explained by a Wahlund effect. With the use of spatial autocorrelation methods, we detected significant isolation by distance among individuals. Neighbourhood size was estimated in the order of 62-158 individuals corresponding to a dispersal distance of 950-1500 m. Using a spatial statistical model for landscape genetics to infer the number of populations and the spatial location of genetic discontinuities between these populations we found indications that Swedish hazel grouse are divided into a northern and a southern population. We could not find a sharp border between these two populations and none of the observed borders appeared to coincide with any potential geographical barriers. These results imply that gene flow appears somewhat unrestricted in the boreal taiga forests of northern Sweden and that the two populations of hazel grouse in Sweden may be explained by the post-glacial reinvasion history of the Scandinavian Peninsula.
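
    To illustrate why a Wahlund effect can produce a significant F(IS), the sketch below computes the textbook single-locus quantity F_IS = 1 - H_obs / H_exp and applies it to two made-up subpopulations that are each in Hardy-Weinberg proportions but differ in allele frequency. The function name f_is and the toy genotype counts are assumptions for illustration; the study itself used jack-knifed multilocus estimators, which this does not reproduce.

```python
from collections import Counter


def f_is(genotypes):
    """Single-locus F_IS = 1 - H_obs / H_exp for diploid genotypes.

    genotypes: list of (allele, allele) tuples. Hypothetical helper using
    the plain definition, with no small-sample correction.
    """
    n = len(genotypes)
    h_obs = sum(a != b for a, b in genotypes) / n
    counts = Counter(allele for pair in genotypes for allele in pair)
    total = 2 * n
    h_exp = 1.0 - sum((c / total) ** 2 for c in counts.values())
    return 1.0 - h_obs / h_exp


# Two toy subpopulations, each in Hardy-Weinberg proportions (p = 0.9 and p = 0.1).
north = [("A", "A")] * 81 + [("A", "a")] * 18 + [("a", "a")] * 1
south = [("A", "A")] * 1 + [("A", "a")] * 18 + [("a", "a")] * 81

if __name__ == "__main__":
    print(f_is(north), f_is(south))  # both approximately 0
    print(f_is(north + south))       # approximately 0.64 when pooled
```

    Each subpopulation alone shows no heterozygote deficit, but pooling them halves the allele frequency difference into p = 0.5 while observed heterozygosity stays at 0.18, so the pooled sample appears strongly deficient in heterozygotes.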

  8. Process-Driven Inference of Biological Network Structure: Feasibility, Minimality, and Multiplicity

    PubMed Central

    Wang, Guanyu; Rong, Yongwu; Chen, Hao; Pearson, Carl; Du, Chenghang; Simha, Rahul; Zeng, Chen

    2012-01-01

    A common problem in molecular biology is to use experimental data, such as microarray data, to infer knowledge about the structure of interactions between important molecules in subsystems of the cell. By approximating the state of each molecule as “on” or “off”, it becomes possible to simplify the problem, and exploit the tools of Boolean analysis for such inference. Amongst Boolean techniques, the process-driven approach has shown promise in being able to identify putative network structures, as well as stability and modularity properties. This paper examines the process-driven approach more formally, and makes four contributions about the computational complexity of the inference problem, under the “dominant inhibition” assumption of molecular interactions. The first is a proof that the feasibility problem (does there exist a network that explains the data?) can be solved in polynomial-time. Second, the minimality problem (what is the smallest network that explains the data?) is shown to be NP-hard, and therefore unlikely to result in a polynomial-time algorithm. Third, a simple polynomial-time heuristic is shown to produce near-minimal solutions, as demonstrated by simulation. Fourth, the theoretical framework explains how multiplicity (the number of network solutions to realize a given biological process), which can take exponential-time to compute, can instead be accurately estimated by a fast, polynomial-time heuristic. PMID:22815739

  9. Bayesian inference of the initial conditions from large-scale structure surveys

    NASA Astrophysics Data System (ADS)

    Leclercq, Florent

    2016-10-01

    Analysis of three-dimensional cosmological surveys has the potential to answer outstanding questions on the initial conditions from which structure appeared, and therefore on the very high energy physics at play in the early Universe. We report on recently proposed statistical data analysis methods designed to study the primordial large-scale structure via physical inference of the initial conditions in a fully Bayesian framework, and applications to the Sloan Digital Sky Survey data release 7. We illustrate how this approach led to a detailed characterization of the dynamic cosmic web underlying the observed galaxy distribution, based on the tidal environment.

  10. A Bayesian approach to inferring the phylogenetic structure of communities from metagenomic data.

    PubMed

    O'Brien, John D; Didelot, Xavier; Iqbal, Zamin; Amenga-Etego, Lucas; Ahiska, Bartu; Falush, Daniel

    2014-07-01

    Metagenomics provides a powerful new tool set for investigating evolutionary interactions with the environment. However, an absence of model-based statistical methods means that researchers are often not able to make full use of this complex information. We present a Bayesian method for inferring the phylogenetic relationship among related organisms found within metagenomic samples. Our approach exploits variation in the frequency of taxa among samples to simultaneously infer each lineage haplotype, the phylogenetic tree connecting them, and their frequency within each sample. Applications of the algorithm to simulated data show that our method can recover a substantial fraction of the phylogenetic structure even in the presence of high rates of migration among sample sites. We provide examples of the method applied to data from green sulfur bacteria recovered from an Antarctic lake, plastids from mixed Plasmodium falciparum infections, and virulent Neisseria meningitidis samples. Copyright © 2014 by the Genetics Society of America.

  11. Phylogenetic relationships and divergence dates of softshell turtles (Testudines: Trionychidae) inferred from complete mitochondrial genomes.

    PubMed

    Li, Haifeng; Liu, Juanjuan; Xiong, Lei; Zhang, Huanhuan; Zhou, Huaxing; Yin, Huazong; Jing, Wanxing; Li, Jun; Shi, Qiong; Wang, Yuqin; Liu, Jianjun; Nie, Liuwang

    2017-03-15

    The softshell turtles (Trionychidae) are one of the most widely distributed reptile groups in the world, and fossils have been found on all continents except Antarctica. The phylogenetic relationships among members of this group have been previously studied; however, there are disagreements regarding its taxonomy, and its phylogeography and divergence times are still poorly understood. Here we present a comprehensive mitogenomic study of softshell turtles. We sequenced the complete mitochondrial genomes of 10 softshell turtles, in addition to the GenBank sequences of Dogania subplana, Lissemys punctata, and Trionyx triunguis, which cover all extant genera within Trionychidae except for Cyclanorbis and Cycloderma. These data were combined with other mitogenomes of turtles for phylogenetic analyses. Divergence time-calibration and ancestral reconstruction were calculated using BEAST and RASP software, respectively. Our phylogenetic analyses indicate that Trionychidae is the sister taxon of Carettochelyidae, and support the monophyly of Trionychinae and Cyclanorbinae, which is consistent with morphological data and molecular analysis. Our phylogenetic analyses have established a sister taxon relationship between the Asian Rafetus and the Asian Palea + Pelodiscus + Dogania + Nilssonia + Amyda, whereas a previous study grouped the Asian Rafetus with the American Apalone. The results of divergence time estimates and area ancestral reconstruction show that extant Trionychidae originated in Asia at around 108 million years ago (Ma), and radiations mainly occurred during two warm periods, namely, Late Cretaceous-Early Eocene and Oligocene. By combining the estimated divergence time and the reconstructed ancestral area of softshell turtles, we determined that the dispersal of softshell turtles out of Asia may have taken three routes. Furthermore, the times of dispersal seem to be in agreement with the time of the India-Asia collision and opening of the Bering Strait, which

  12. Higher-level salamander relationships and divergence dates inferred from complete mitochondrial genomes.

    PubMed

    Zhang, Peng; Wake, David B

    2009-11-01

    Phylogenetic relationships among the salamander families have been difficult to resolve, largely because the window of time in which major lineages diverged was very short relative to the subsequently long evolutionary history of each family. We present seven new complete mitochondrial genomes representing five salamander families that have no or few mitogenome records in GenBank in order to assess the phylogenetic relationships of all salamander families from a mitogenomic perspective. Phylogenetic analyses of two data sets (one combining the entire mitogenome sequence except for the D-loop, and the other combining the deduced amino acid sequences of all 13 mitochondrial protein-coding genes) produce nearly identical well-resolved topologies. The monophyly of each family is supported, including the controversial Proteidae. The internally fertilizing salamanders are demonstrated to be a clade, concordant with recent results using nuclear genes. The internally fertilizing salamanders include two well-supported clades: one is composed of Ambystomatidae, Dicamptodontidae, and Salamandridae, the other Proteidae, Rhyacotritonidae, Amphiumidae, and Plethodontidae. In contrast to results from nuclear loci, our results support the conventional morphological hypothesis that Sirenidae is the sister-group to all other salamanders and they statistically reject the hypothesis from nuclear genes that the suborder Cryptobranchoidea (Cryptobranchidae+Hynobiidae) branched earlier than the Sirenidae. Using recently recommended fossil calibration points and a "soft bound" calibration strategy, we recalculated evolutionary timescales for tetrapods with an emphasis on living salamanders, under a Bayesian framework with and without a rate-autocorrelation assumption. Our dating results indicate: (i) the widely used rate-autocorrelation assumption in relaxed clock analyses is problematic and the accuracy of molecular dating for early lissamphibian evolution is questionable; (ii) the initial

  13. Multiple genome alignment for identifying the core structure among moderately related microbial genomes.

    PubMed

    Uchiyama, Ikuo

    2008-10-31

    Identifying the set of intrinsically conserved genes, or the genomic core, among related genomes is crucial for understanding prokaryotic genomes where horizontal gene transfers are common. Although core genome identification appears to be obvious among very closely related genomes, it becomes more difficult when more distantly related genomes are compared. Here, we consider the core structure as a set of sufficiently long segments in which gene orders are conserved so that they are likely to have been inherited mainly through vertical transfer, and developed a method for identifying the core structure by finding the order of pre-identified orthologous groups (OGs) that maximally retains the conserved gene orders. The method was applied to genome comparisons of two well-characterized families, Bacillaceae and Enterobacteriaceae, and identified their core structures comprising 1438 and 2125 OGs, respectively. The core sets contained most of the essential genes and their related genes, which were primarily included in the intersection of the two core sets comprising around 700 OGs. The definition of the genomic core based on gene order conservation was demonstrated to be more robust than the simpler approach based only on gene conservation. We also investigated the core structures in terms of G+C content homogeneity and phylogenetic congruence, and found that the core genes primarily exhibited the expected characteristic, i.e., being indigenous and sharing the same history, more than the non-core genes. The results demonstrate that our strategy of genome alignment based on gene order conservation can provide an effective approach to identify the genomic core among moderately related microbial genomes.

  14. Multiple genome alignment for identifying the core structure among moderately related microbial genomes

    PubMed Central

    Uchiyama, Ikuo

    2008-01-01

    Background: Identifying the set of intrinsically conserved genes, or the genomic core, among related genomes is crucial for understanding prokaryotic genomes where horizontal gene transfers are common. Although core genome identification appears to be obvious among very closely related genomes, it becomes more difficult when more distantly related genomes are compared. Here, we consider the core structure as a set of sufficiently long segments in which gene orders are conserved so that they are likely to have been inherited mainly through vertical transfer, and developed a method for identifying the core structure by finding the order of pre-identified orthologous groups (OGs) that maximally retains the conserved gene orders. Results: The method was applied to genome comparisons of two well-characterized families, Bacillaceae and Enterobacteriaceae, and identified their core structures comprising 1438 and 2125 OGs, respectively. The core sets contained most of the essential genes and their related genes, which were primarily included in the intersection of the two core sets comprising around 700 OGs. The definition of the genomic core based on gene order conservation was demonstrated to be more robust than the simpler approach based only on gene conservation. We also investigated the core structures in terms of G+C content homogeneity and phylogenetic congruence, and found that the core genes primarily exhibited the expected characteristic, i.e., being indigenous and sharing the same history, more than the non-core genes. Conclusion: The results demonstrate that our strategy of genome alignment based on gene order conservation can provide an effective approach to identify the genomic core among moderately related microbial genomes. PMID:18976470
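
    The idea of keeping orthologous groups whose order is conserved can be illustrated, for just two genomes, with a longest common subsequence over OG labels. This is a deliberately simplified stand-in and not the multiple-genome alignment method of the paper: the function name conserved_order and the toy OG labels are hypothetical, and real genomes would also need handling of strand, circularity and a minimum segment length.

```python
def conserved_order(a, b):
    """Longest common subsequence of two OG orderings (classic O(len(a)*len(b)) DP).

    a, b: lists of orthologous-group labels in genome order.
    Returns one longest list of labels that appears in the same relative
    order in both genomes.
    """
    m, n = len(a), len(b)
    # dp[i][j] = LCS length of a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk the table from the start to recover one optimal subsequence.
    i = j = 0
    core = []
    while i < m and j < n:
        if a[i] == b[j]:
            core.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return core


if __name__ == "__main__":
    genome_a = ["og1", "og2", "og3", "og4", "og5"]
    genome_b = ["og1", "og3", "og2", "og4", "og5"]  # og2 and og3 swapped
    print(conserved_order(genome_a, genome_b))
```

    With the swapped pair in the toy input, four of the five groups can be kept in a shared order; which of og2 or og3 is dropped depends on the tie-breaking in the traceback.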

  15. Evolution of genomic structural variation and genomic architecture in the adaptive radiations of African cichlid fishes

    PubMed Central

    Fan, Shaohua; Meyer, Axel

    2014-01-01

    African cichlid fishes are an ideal system for studying explosive rates of speciation and the origin of diversity in adaptive radiation. Within the last few million years, more than 2000 species have evolved in the Great Lakes of East Africa, the largest adaptive radiation in vertebrates. These young species show spectacular diversity in their coloration, morphology and behavior. However, little is known about the genomic basis of this astonishing diversity. Recently, five African cichlid genomes were sequenced, including that of the Nile Tilapia (Oreochromis niloticus), a basal and only relatively moderately diversified lineage, and the genomes of four representative endemic species of the adaptive radiations, Neolamprologus brichardi, Astatotilapia burtoni, Metriaclima zebra, and Pundamilia nyererei. Using the Tilapia genome as a reference genome, we generated a high-resolution genomic variation map, consisting of single nucleotide polymorphisms (SNPs), short insertions and deletions (indels), inversions and deletions. In total, around 18.8, 17.7, 17.0, and 17.0 million SNPs, 2.3, 2.2, 1.4, and 1.9 million indels, 262, 306, 162, and 154 inversions, and 3509, 2705, 2710, and 2634 deletions were inferred to have evolved in N. brichardi, A. burtoni, P. nyererei, and M. zebra, respectively. Many of these variations affected the annotated gene regions in the genome. Different patterns of genetic variation were detected during the adaptive radiation of African cichlid fishes. For SNPs, the highest rate of evolution was detected in the common ancestor of N. brichardi, A. burtoni, P. nyererei, and M. zebra. However, for the evolution of inversions and deletions, we found that the rates at the terminal taxa are substantially higher than the rates at the ancestral lineages. The high-resolution map provides an ideal opportunity to understand the genomic bases of the adaptive radiation of African cichlid fishes. PMID:24917883

  16. Limitations to estimating bacterial cross-species transmission using genetic and genomic markers: inferences from simulation modeling

    PubMed Central

    Benavides, Julio A; Cross, Paul C; Luikart, Gordon; Creel, Scott

    2014-01-01

    Cross-species transmission (CST) of bacterial pathogens has major implications for human health, livestock, and wildlife management because it determines whether control actions in one species may have subsequent effects on other potential host species. The study of bacterial transmission has benefitted from methods measuring two types of genetic variation: variable number of tandem repeats (VNTRs) and single nucleotide polymorphisms (SNPs). However, it is unclear whether these data can distinguish between different epidemiological scenarios. We used a simulation model with two host species and known transmission rates (within and between species) to evaluate the utility of these markers for inferring CST. We found that CST estimates are biased for a wide range of parameters when based on VNTRs and a most parsimonious reconstructed phylogeny. However, estimations of CST rates lower than 5% can be achieved with relatively low bias using as few as 250 SNPs. CST estimates are sensitive to several parameters, including the number of mutations accumulated since introduction, stochasticity, the genetic difference of strains introduced, and the sampling effort. Our results suggest that, even with whole-genome sequences, unbiased estimates of CST will be difficult when sampling is limited, mutation rates are low, or for pathogens that were recently introduced. PMID:25469159

  17. Limitations to estimating bacterial cross-species transmission using genetic and genomic markers: inferences from simulation modeling.

    PubMed

    Benavides, Julio A; Cross, Paul C; Luikart, Gordon; Creel, Scott

    2014-08-01

    Cross-species transmission (CST) of bacterial pathogens has major implications for human health, livestock, and wildlife management because it determines whether control actions in one species may have subsequent effects on other potential host species. The study of bacterial transmission has benefitted from methods measuring two types of genetic variation: variable number of tandem repeats (VNTRs) and single nucleotide polymorphisms (SNPs). However, it is unclear whether these data can distinguish between different epidemiological scenarios. We used a simulation model with two host species and known transmission rates (within and between species) to evaluate the utility of these markers for inferring CST. We found that CST estimates are biased for a wide range of parameters when based on VNTRs and a most parsimonious reconstructed phylogeny. However, estimations of CST rates lower than 5% can be achieved with relatively low bias using as few as 250 SNPs. CST estimates are sensitive to several parameters, including the number of mutations accumulated since introduction, stochasticity, the genetic difference of strains introduced, and the sampling effort. Our results suggest that, even with whole-genome sequences, unbiased estimates of CST will be difficult when sampling is limited, mutation rates are low, or for pathogens that were recently introduced.

  18. A new statistic and its power to infer membership in a genome-wide association study using genotype frequencies.

    PubMed

    Jacobs, Kevin B; Yeager, Meredith; Wacholder, Sholom; Craig, David; Kraft, Peter; Hunter, David J; Paschal, Justin; Manolio, Teri A; Tucker, Margaret; Hoover, Robert N; Thomas, Gilles D; Chanock, Stephen J; Chatterjee, Nilanjan

    2009-11-01

    Aggregate results from genome-wide association studies (GWAS), such as genotype frequencies for cases and controls, were until recently often made available on public websites because they were thought to disclose negligible information concerning an individual's participation in a study. Homer et al. recently suggested that a method for forensic detection of an individual's contribution to an admixed DNA sample could be applied to aggregate GWAS data. Using a likelihood-based statistical framework, we developed an improved statistic that uses genotype frequencies and individual genotypes to infer whether a specific individual or any close relatives participated in the GWAS and, if so, what the participant's phenotype status is. Our statistic compares the logarithm of genotype frequencies, in contrast to that of Homer et al., which is based on differences in either SNP probe intensity or allele frequencies. We derive the theoretical power of our test statistics and explore the empirical performance in scenarios with varying numbers of randomly chosen or top-associated SNPs.
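
    As a rough illustration of what comparing logarithms of genotype frequencies can look like, the sketch below sums, over SNPs, the log ratio of a study-group genotype frequency to a reference-population frequency at the genotype an individual carries. This is not the authors' statistic: the function name `membership_score`, the frequency tables, and the use of a single reference population are all assumptions made for illustration, and a real analysis would still need a null distribution to calibrate the score.

```python
import math


def membership_score(genotypes, study_freqs, reference_freqs):
    """Sum of log(study frequency / reference frequency) at the carried genotype.

    genotypes       -- one genotype code per SNP (0, 1 or 2 copies of an allele)
    study_freqs     -- per SNP, a tuple of genotype frequencies in the study group
    reference_freqs -- per SNP, a tuple of genotype frequencies in a reference panel
    All inputs are hypothetical; frequencies must be strictly positive.
    """
    score = 0.0
    for g, study, reference in zip(genotypes, study_freqs, reference_freqs):
        score += math.log(study[g]) - math.log(reference[g])
    return score


# Hypothetical two-SNP example: the individual carries genotypes that are
# somewhat enriched in the study group relative to the reference panel.
example_score = membership_score(
    genotypes=[0, 2],
    study_freqs=[(0.50, 0.30, 0.20), (0.20, 0.40, 0.40)],
    reference_freqs=[(0.25, 0.50, 0.25), (0.25, 0.50, 0.25)],
)
print(example_score)
```

    A positive score means the carried genotypes are more frequent in the study group than in the reference panel; on its own that is weak evidence, which is why the number of SNPs matters so much in this kind of analysis.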

  19. Genome Structure of the Genus Azospirillum

    PubMed Central

    Martin-Didonet, Claudia C. G.; Chubatsu, Leda S.; Souza, Emanuel M.; Kleina, Margareth; Rego, Fabiane G. M.; Rigo, Liu U.; Yates, M. Geoffrey; Pedrosa, Fabio O.

    2000-01-01

    Azospirillum species are plant-associated diazotrophs of the alpha subclass of Proteobacteria. The genomes of five of the six Azospirillum species were analyzed by pulsed-field gel electrophoresis. All strains possessed several megareplicons, some probably linear, and 16S ribosomal DNA hybridization indicated multiple chromosomes in genomes ranging in size from 4.8 to 9.7 Mbp. The nifHDK operon was identified in the largest replicon. PMID:10869094

  20. Graphic analysis of population structure on genome-wide rheumatoid arthritis data.

    PubMed

    Zhang, Jun; Weng, Chunhua; Niyogi, Partha

    2009-12-15

    Principal-component analysis (PCA) has been used for decades to summarize the human genetic variation across geographic regions and to infer population migration history. Reduction of spurious associations due to population structure is crucial for the success of disease association studies. Recently, PCA has also become a popular method for detecting population structure and correction of population stratification in disease association studies. Inspired by manifold learning, we propose a novel method based on spectral graph theory. Regarding each study subject as a node with suitably defined weights for its edges to close neighbors, one can form a weighted graph. We suggest using the spectrum of the associated graph Laplacian operator, namely, Laplacian eigenfunctions, to infer population structures instead of principal components (PCs). For the whole genome-wide association data for the North American Rheumatoid Arthritis Consortium (NARAC) provided by Genetic Analysis Workshop 16, Laplacian eigenfunctions revealed more meaningful structures of the underlying population than PCA. The proposed method has a connection to PCA, and it naturally includes PCA as a special case. Our simple method is computationally fast and is suitable for disease studies at the genome-wide scale.
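
    The graph construction described above can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: it assumes Gaussian edge weights between all pairs of subjects (so distant subjects get weights near zero, rather than an explicit nearest-neighbor cutoff), forms the unnormalized Laplacian L = D - W, and approximates the eigenvector of the smallest nonzero eigenvalue by power iteration. The names `laplacian` and `fiedler_vector`, the `scale` parameter, and the toy coordinates are all hypothetical.

```python
import math


def laplacian(points, scale=1.0):
    """Unnormalized graph Laplacian L = D - W with Gaussian edge weights."""
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            weights[i][j] = weights[j][i] = math.exp(-d2 / scale)
    lap = [[-weights[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        lap[i][i] = sum(weights[i])  # node degree on the diagonal
    return lap


def _center_and_normalize(x):
    """Remove the constant component and scale to unit length."""
    mean = sum(x) / len(x)
    x = [v - mean for v in x]
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def fiedler_vector(lap, iterations=500):
    """Approximate the eigenvector of the smallest nonzero Laplacian eigenvalue.

    Power iteration on (c*I - L), with the constant vector projected out at
    every step; c is twice the largest degree, an upper bound on the spectrum.
    """
    n = len(lap)
    c = 2.0 * max(lap[i][i] for i in range(n))
    x = [float(i) for i in range(n)]
    for _ in range(iterations):
        x = _center_and_normalize(x)
        x = [c * x[i] - sum(lap[i][j] * x[j] for j in range(n)) for i in range(n)]
    return _center_and_normalize(x)


# Toy data: two well-separated groups of three "subjects" each.
toy_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
toy_vector = fiedler_vector(laplacian(toy_points))
print([round(v, 3) for v in toy_vector])
```

    On data like this, the sign of each entry separates the two groups, which is the sense in which a Laplacian eigenfunction can expose population structure. A production analysis would use a sparse eigensolver rather than dense power iteration.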

  1. [Research progress on mitochondrial genome structure in the phylum apicomplexa].

    PubMed

    Li, Xue-mei; Li, Xiao-bing; Huang, Wei

    2014-10-01

    Mitochondria are ubiquitous organelles in eukaryotic cells and are essential for a series of cellular processes and signal transduction. The phylum Apicomplexa includes a series of unicellular eukaryotes, some of which are clinically or economically important parasites. Recent studies have demonstrated that the mitochondrial genomes of apicomplexan parasites exhibit remarkably diverse structures, making them ideal biological models for understanding the evolution of mitochondrial genomes. This paper summarizes the mitochondrial genome structure of some representative apicomplexans, highlights their structural characteristics and evolution, and briefly describes their nuclear mitochondrial DNA and nuclear plastid DNA.

  2. Rapid detection of structural variation in a human genome using nanochannel-based genome mapping technology.

    PubMed

    Cao, Hongzhi; Hastie, Alex R; Cao, Dandan; Lam, Ernest T; Sun, Yuhui; Huang, Haodong; Liu, Xiao; Lin, Liya; Andrews, Warren; Chan, Saki; Huang, Shujia; Tong, Xin; Requa, Michael; Anantharaman, Thomas; Krogh, Anders; Yang, Huanming; Cao, Han; Xu, Xun

    2014-01-01

    Structural variants (SVs) are less common than single nucleotide polymorphisms and indels in the population, but collectively account for a significant fraction of genetic polymorphism and diseases. Base pair differences arising from SVs are on a much higher order (>100 fold) than point mutations; however, none of the current detection methods are comprehensive, and currently available methodologies are incapable of providing sufficient resolution and unambiguous information across complex regions in the human genome. To address these challenges, we applied a high-throughput, cost-effective genome mapping technology to comprehensively discover genome-wide SVs and characterize complex regions of the YH genome using long single molecules (>150 kb) in a global fashion. Utilizing nanochannel-based genome mapping technology, we obtained 708 insertions/deletions and 17 inversions larger than 1 kb. Excluding the 59 SVs (54 insertions/deletions, 5 inversions) that overlap with N-base gaps in the reference assembly hg19, 666 non-gap SVs remained, and 396 of them (60%) were verified by paired-end data from whole-genome sequencing-based re-sequencing or de novo assembly sequence from fosmid data. Of the remaining 270 SVs, 260 are insertions and 213 overlap known SVs in the Database of Genomic Variants. Overall, 609 out of 666 (90%) variants were supported by experimental orthogonal methods or historical evidence in public databases. At the same time, genome mapping also provides valuable information for complex regions with haplotypes in a straightforward fashion. In addition, with long single-molecule labeling patterns, exogenous viral sequences were mapped on a whole-genome scale, and sample heterogeneity was analyzed at a new level. Our study highlights genome mapping technology as a comprehensive and cost-effective method for detecting structural variation and studying complex regions in the human genome, as well as deciphering viral integration into the host genome.

  3. Comparative genomics reveals 104 candidate structured RNAs from bacteria, archaea, and their metagenomes

    PubMed Central

    2010-01-01

    Background: Structured noncoding RNAs perform many functions that are essential for protein synthesis, RNA processing, and gene regulation. Structured RNAs can be detected by comparative genomics, in which homologous sequences are identified and inspected for mutations that conserve RNA secondary structure. Results: By applying a comparative genomics-based approach to genome and metagenome sequences from bacteria and archaea, we identified 104 candidate structured RNAs and inferred putative functions for many of these. Twelve candidate metabolite-binding RNAs were identified, three of which were validated, including one reported herein that binds the coenzyme S-adenosylmethionine. Newly identified cis-regulatory RNAs are implicated in photosynthesis or nitrogen regulation in cyanobacteria, purine and one-carbon metabolism, stomach infection by Helicobacter, and many other physiological processes. A candidate riboswitch termed crcB is represented in both bacteria and archaea. Another RNA motif may control gene expression from 3'-untranslated regions of mRNAs, which is unusual for bacteria. Many noncoding RNAs that likely act in trans are also revealed, and several of the noncoding RNA candidates are found mostly or exclusively in metagenome DNA sequences. Conclusions: This work greatly expands the variety of highly structured noncoding RNAs known to exist in bacteria and archaea and provides a starting point for biochemical and genetic studies needed to validate their biologic functions. Given the sustained rate of RNA discovery over several similar projects, we expect that far more structured RNAs remain to be discovered from bacterial and archaeal organisms. PMID:20230605

  4. An Estimation of Correlation Structure of Hydraulic Parameters using Bayesian Inference Approach

    NASA Astrophysics Data System (ADS)

    Pan, F.; Zhu, J.; Ye, M.; Yu, Z.

    2008-12-01

    Accuracy of flow and contaminant transport predictions largely depends on proper characterization of hydraulic parameter fields, especially spatial variability of the parameters. However, it is difficult to estimate the correlation structure accurately using traditional geostatistical approaches (e.g., the sample variogram) when field measurements of the parameters are sparse. This problem is resolved in this study using the Bayesian inference approach, and probability distributions of the correlation lengths are estimated from sparse measurements and expert judgment. The prior distribution of the correlation length is inferred from literature, expert judgment, and study of similar hydrogeologic units. The prior distribution is updated to yield the posterior distribution by the likelihood function estimated from the on-site measurements. The mean correlation length of the posterior distribution is used for subsequent random field generation. We applied the Bayesian inference method to the unsaturated zone (UZ) of Yucca Mountain (YM). For each hydrogeologic layer of the UZ, the horizontal and vertical correlation lengths of permeability and porosity are estimated, and multiple realizations of the parameters are generated for evaluating flow and transport uncertainty in the UZ of YM. The results indicate that the estimated correlation lengths of hydraulic parameters can improve the accuracy of flow and transport predictions and reduce the predictive uncertainty in the UZ of YM.
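
    The prior-to-posterior update for a correlation length can be illustrated on a grid. The sketch below is deliberately simplified and is not the study's procedure: it assumes standardized measurements grouped into independent pairs, an exponential correlation model rho = exp(-h / length), and a discrete grid of candidate lengths. The function names, the data, and the uniform prior are hypothetical.

```python
import math


def pair_loglik(z1, z2, rho):
    """Log density of a standard bivariate normal pair with correlation rho."""
    one_minus = 1.0 - rho * rho
    quad = (z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2) / one_minus
    return -math.log(2.0 * math.pi) - 0.5 * math.log(one_minus) - 0.5 * quad


def posterior_over_lengths(pairs, lengths, prior):
    """Posterior probability of each candidate correlation length.

    pairs   -- (separation h, value z1, value z2) tuples, assumed independent
    lengths -- grid of candidate correlation lengths (same units as h)
    prior   -- prior probability of each grid value (strictly positive)
    """
    log_post = []
    for length, p in zip(lengths, prior):
        loglik = sum(pair_loglik(z1, z2, math.exp(-h / length)) for h, z1, z2 in pairs)
        log_post.append(math.log(p) + loglik)
    top = max(log_post)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def posterior_mean(lengths, posterior):
    return sum(length * p for length, p in zip(lengths, posterior))


# Hypothetical sparse data: four measurement pairs, each 1 unit apart.
pairs = [(1.0, 1.0, 1.0), (1.0, -0.5, -0.5), (1.0, 0.3, 0.3), (1.0, -1.2, -1.2)]
lengths = [float(k) for k in range(1, 21)]
prior = [1.0 / len(lengths)] * len(lengths)
posterior = posterior_over_lengths(pairs, lengths, prior)
print(posterior_mean(lengths, posterior))
```

    In the spirit of the abstract, the posterior mean would then feed the random field generator; an informative prior drawn from literature or expert judgment would replace the uniform one used here.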

  5. Mediation Analysis With Intermediate Confounding: Structural Equation Modeling Viewed Through the Causal Inference Lens

    PubMed Central

    De Stavola, Bianca L.; Daniel, Rhian M.; Ploubidis, George B.; Micali, Nadia

    2015-01-01

    The study of mediation has a long tradition in the social sciences and a relatively more recent one in epidemiology. The first school is linked to path analysis and structural equation models (SEMs), while the second is related mostly to methods developed within the potential outcomes approach to causal inference. By giving model-free definitions of direct and indirect effects and clear assumptions for their identification, the latter school has formalized notions intuitively developed in the former and has greatly increased the flexibility of the models involved. However, through its predominant focus on nonparametric identification, the causal inference approach to effect decomposition via natural effects is limited to settings that exclude intermediate confounders. Such confounders are naturally dealt with (albeit with the caveats of informality and modeling inflexibility) in the SEM framework. Therefore, it seems pertinent to revisit SEMs with intermediate confounders, armed with the formal definitions and (parametric) identification assumptions from causal inference. Here we investigate: 1) how identification assumptions affect the specification of SEMs, 2) whether the more restrictive SEM assumptions can be relaxed, and 3) whether existing sensitivity analyses can be extended to this setting. Data from the Avon Longitudinal Study of Parents and Children (1990–2005) are used for illustration. PMID:25504026

  6. Mediation analysis with intermediate confounding: structural equation modeling viewed through the causal inference lens.

    PubMed

    De Stavola, Bianca L; Daniel, Rhian M; Ploubidis, George B; Micali, Nadia

    2015-01-01

    The study of mediation has a long tradition in the social sciences and a relatively more recent one in epidemiology. The first school is linked to path analysis and structural equation models (SEMs), while the second is related mostly to methods developed within the potential outcomes approach to causal inference. By giving model-free definitions of direct and indirect effects and clear assumptions for their identification, the latter school has formalized notions intuitively developed in the former and has greatly increased the flexibility of the models involved. However, through its predominant focus on nonparametric identification, the causal inference approach to effect decomposition via natural effects is limited to settings that exclude intermediate confounders. Such confounders are naturally dealt with (albeit with the caveats of informality and modeling inflexibility) in the SEM framework. Therefore, it seems pertinent to revisit SEMs with intermediate confounders, armed with the formal definitions and (parametric) identification assumptions from causal inference. Here we investigate: 1) how identification assumptions affect the specification of SEMs, 2) whether the more restrictive SEM assumptions can be relaxed, and 3) whether existing sensitivity analyses can be extended to this setting. Data from the Avon Longitudinal Study of Parents and Children (1990-2005) are used for illustration.

  7. Population structure of Atlantic mackerel inferred from RAD-seq-derived SNP markers: effects of sequence clustering parameters and hierarchical SNP selection.

    PubMed

    Rodríguez-Ezpeleta, Naiara; Bradbury, Ian R; Mendibil, Iñaki; Álvarez, Paula; Cotano, Unai; Irigoien, Xabier

    2016-07-01

    Restriction-site-associated DNA sequencing (RAD-seq) and related methods are revolutionizing the field of population genomics in nonmodel organisms as they allow generating an unprecedented number of single nucleotide polymorphisms (SNPs) even when no genomic information is available. Yet, RAD-seq data analyses rely on assumptions about the nature and number of nucleotide variants present in a single locus, the choice of which may lead to an under- or overestimated number of SNPs and/or to incorrectly called genotypes. Using the Atlantic mackerel (Scomber scombrus L.) and a close relative, the Atlantic chub mackerel (Scomber colias), as a case study, here we explore the sensitivity of population structure inferences to two crucial aspects in RAD-seq data analysis: the maximum number of mismatches allowed to merge reads into a locus and the relatedness of the individuals used for genotype calling and SNP selection. Our study resolves the population structure of the Atlantic mackerel, but, most importantly, provides insights into the effects of alternative RAD-seq data analysis strategies on population structure inferences that are directly applicable to other species.

  8. Structural mapping in statistical word problems: A relational reasoning approach to Bayesian inference.

    PubMed

    Johnson, Eric D; Tubau, Elisabet

    2016-09-27

    Presenting natural frequencies facilitates Bayesian inferences relative to using percentages. Nevertheless, many people, including highly educated and skilled reasoners, still fail to provide Bayesian responses to these computationally simple problems. We show that the complexity of relational reasoning (e.g., the structural mapping between the presented and requested relations) can help explain the remaining difficulties. With a non-Bayesian inference that required identical arithmetic but afforded a more direct structural mapping, performance was universally high. Furthermore, reducing the relational demands of the task through questions that directed reasoners to use the presented statistics, as compared with questions that prompted the representation of a second, similar sample, also significantly improved reasoning. Distinct error patterns were also observed between these presented- and similar-sample scenarios, which suggested differences in relational-reasoning strategies. On the other hand, while higher numeracy was associated with better Bayesian reasoning, higher-numerate reasoners were not immune to the relational complexity of the task. Together, these findings validate the relational-reasoning view of Bayesian problem solving and highlight the importance of considering not only the presented task structure, but also the complexity of the structural alignment between the presented and requested relations.
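
    A small worked example may help show why these problems are called computationally simple. The numbers below are hypothetical and are not taken from the study: out of 1,000 people, 10 have a condition, 8 of those 10 test positive, and 95 of the 990 without the condition also test positive. The sketch computes the same posterior from the natural frequencies and from the equivalent rates.

```python
from fractions import Fraction


def posterior_from_counts(true_positives, false_positives):
    """Natural-frequency format: positives with the condition over all positives."""
    return Fraction(true_positives, true_positives + false_positives)


def posterior_from_rates(prior, sensitivity, false_positive_rate):
    """Probability format: Bayes' rule applied to rates."""
    numerator = prior * sensitivity
    return numerator / (numerator + (1 - prior) * false_positive_rate)


# Hypothetical scenario described in the paragraph above.
by_counts = posterior_from_counts(8, 95)
by_rates = posterior_from_rates(Fraction(10, 1000), Fraction(8, 10), Fraction(95, 990))
print(by_counts, by_rates)
```

    Both routes give the same fraction; the natural-frequency route needs one addition and one division, whereas the rate route requires first mapping each stated percentage onto the right term of Bayes' rule, which is the kind of structural mapping the abstract points to.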

  9. An improved algorithm for generalized community structure inference in complex networks

    NASA Astrophysics Data System (ADS)

    Qu, Yingfei; Shi, Weiren; Shi, Xin

    2017-07-01

    In recent years, research on community detection has addressed not only structures that are densely connected internally but also more general patterns, such as heterogeneous, overlapping, and core-periphery structures. In this paper, we build a network model based on random graph models and propose an improved algorithm to infer generalized community structures. We achieve this by introducing generalized Bernstein polynomials and computing the latent parameters of vertices. The algorithm is tested on both computer-generated benchmark networks and real-world networks. Results show that the algorithm converges faster and is able to discover latent continuous structures in networks.

  10. Testing the phylogenetic position of a parasitic plant (Cuscuta, Convolvulaceae, asteridae): Bayesian inference and the parametric bootstrap on data drawn from three genomes.

    PubMed

    Stefanović, Sasa; Olmstead, Richard G

    2004-06-01

    Previous findings on structural rearrangements in the chloroplast genome of Cuscuta (dodder), the only parasitic genus in the morning-glory family, Convolvulaceae, were attributed to its parasitic life style, but without proper comparison to related nonparasitic members of the family. Before molecular evolutionary questions regarding genome evolution can be answered, the phylogenetic problems within the family need to be resolved. However, the phylogenetic position of parasitic angiosperms and their precise relationship to nonparasitic relatives are difficult to infer. Problems are encountered with both morphological and molecular evidence. Molecular data have been used in numerous studies to elucidate relationships of parasitic taxa, despite accelerated rates of sequence evolution. To address the question of the position of the genus Cuscuta within Convolvulaceae, we generated a new molecular data set consisting of mitochondrial (atpA) and nuclear (RPB2) genes, and analyzed these data together with an existing chloroplast data matrix (rbcL, atpB, trnL-F, and psbE-J), to which an additional chloroplast gene (rpl2) was added. This data set was analyzed with an array of phylogenetic methods, including Bayesian analysis, maximum likelihood, and maximum parsimony. Further exploration of data was done by using methods of phylogeny hypothesis testing. At least two nonparasitic lineages are shown to diverge within the Convolvulaceae before Cuscuta. However, the exact sister group of Cuscuta could not be ascertained, even though many alternatives were rejected with confidence. Caution is therefore warranted when interpreting the causes of molecular evolution in Cuscuta. Detailed comparisons with nonparasitic Convolvulaceae are necessary before firm conclusions can be reached regarding the effects of the parasitic mode of life on patterns of molecular evolution in Cuscuta.

  11. Inference of Gene Regulatory Networks with Sparse Structural Equation Models Exploiting Genetic Perturbations

    PubMed Central

    Cai, Xiaodong; Bazerque, Juan Andrés; Giannakis, Georgios B.

    2013-01-01

    Integrating genetic perturbations with gene expression data not only improves accuracy of regulatory network topology inference, but also enables learning of causal regulatory relations between genes. Although a number of methods have been developed to integrate both types of data, the need for efficient and powerful algorithms remains. In this paper, sparse structural equation models (SEMs) are employed to integrate both gene expression data and cis-expression quantitative trait loci (cis-eQTL), for modeling gene regulatory networks in accordance with biological evidence about genes regulating or being regulated by a small number of genes. A systematic inference method named sparsity-aware maximum likelihood (SML) is developed for SEM estimation. Using simulated directed acyclic or cyclic networks, the SML performance is compared with that of two state-of-the-art algorithms: the adaptive Lasso (AL) based scheme, and the QTL-directed dependency graph (QDG) method. Computer simulations demonstrate that the novel SML algorithm offers significantly better performance than the AL-based and QDG algorithms across all sample sizes from 100 to 1,000, in terms of detection power and false discovery rate, in all the cases tested that include acyclic or cyclic networks of 10, 30 and 300 genes. The SML method is further applied to infer a network of 39 human genes that are related to the immune function and are chosen to have a reliable eQTL per gene. The resulting network consists of 9 genes and 13 edges. Most of the edges represent interactions reasonably expected from experimental evidence, while the remaining edges may indicate new interactions. The sparse SEM and efficient SML algorithm provide an effective means of exploiting both gene expression and perturbation data to infer gene regulatory networks. An open-source computer program implementing the SML algorithm is freely available upon request. PMID:23717196

  12. Structural Genomics of Minimal Organisms: Pipeline and Results

    SciTech Connect

    Kim, Sung-Hou; Shin, Dong-Hae; Kim, Rosalind; Adams, Paul; Chandonia, John-Marc

    2007-09-14

    The initial objective of the Berkeley Structural Genomics Center was to obtain near-complete three-dimensional (3D) structural information for all soluble proteins of two minimal organisms, the closely related pathogens Mycoplasma genitalium and M. pneumoniae. The former has fewer than 500 genes and the latter has fewer than 700 genes. A semiautomated structural genomics pipeline was set up covering target selection, cloning, expression, purification, and ultimately structure determination. At the time of this writing, structural information for more than 93 percent of all soluble proteins of M. genitalium is available. This chapter summarizes the approaches taken by the authors' center.

  13. Computational inference of grammars for larger-than-gene structures from annotated gene sequences.

    PubMed

    Tsafnat, Guy; Schaeffer, Jaron; Clayphan, Andrew; Iredell, Jon R; Partridge, Sally R; Coiera, Enrico

    2011-03-15

    Larger than gene structures (LGS) are DNA segments that include at least one gene and often other segments such as inverted repeats and gene promoters. Mobile genetic elements (MGE) such as integrons are LGS that play an important role in horizontal gene transfer, primarily in Gram-negative organisms. Known LGS have a profound effect on organism virulence, antibiotic resistance and other properties of the organism due to the number of genes involved. Expert-compiled grammars have been shown to be an effective computational representation of LGS, well suited to automating annotation, and supporting de novo gene discovery. However, development of LGS grammars by experts is labour intensive and restricted to known LGS. This study uses computational grammar inference methods to automate LGS discovery. We compare the ability of six algorithms to infer LGS grammars from DNA sequences annotated with genes and other short sequences. We compared the predictive power of learned grammars against an expert-developed grammar for gene cassette arrays found in Class 1, 2 and 3 integrons, which are modular LGS containing up to 9 of about 240 cassette types. Using a Bayesian generalization algorithm our inferred grammar was able to predict > 95% of MGE structures in a corpus of 1760 sequences obtained from Genbank (F-score 75%). Even with 100% noise added to the training and test sets, we obtained an F-score of 68%, indicating that the method is robust and has the potential to predict de novo LGS structures when the underlying gene features are known. http://www2.chi.unsw.edu.au/attacca.

  14. Assessing structural variation in a personal genome-towards a human reference diploid genome.

    PubMed

    English, Adam C; Salerno, William J; Hampton, Oliver A; Gonzaga-Jauregui, Claudia; Ambreth, Shruthi; Ritter, Deborah I; Beck, Christine R; Davis, Caleb F; Dahdouli, Mahmoud; Ma, Singer; Carroll, Andrew; Veeraraghavan, Narayanan; Bruestle, Jeremy; Drees, Becky; Hastie, Alex; Lam, Ernest T; White, Simon; Mishra, Pamela; Wang, Min; Han, Yi; Zhang, Feng; Stankiewicz, Pawel; Wheeler, David A; Reid, Jeffrey G; Muzny, Donna M; Rogers, Jeffrey; Sabo, Aniko; Worley, Kim C; Lupski, James R; Boerwinkle, Eric; Gibbs, Richard A

    2015-04-11

    Characterizing large genomic variants is essential to expanding the research and clinical applications of genome sequencing. While multiple data types and methods are available to detect these structural variants (SVs), they remain less characterized than smaller variants because of SV diversity, complexity, and size. These challenges are exacerbated by the experimental and computational demands of SV analysis. Here, we characterize the SV content of a personal genome with Parliament, a publicly available consensus SV-calling infrastructure that merges multiple data types and SV detection methods. We demonstrate Parliament's efficacy via integrated analyses of data from whole-genome array comparative genomic hybridization, short-read next-generation sequencing, long-read (Pacific BioSciences RSII), long-insert (Illumina Nextera), and whole-genome architecture (BioNano Irys) data from the personal genome of a single subject (HS1011). From this genome, Parliament identified 31,007 genomic loci between 100 bp and 1 Mbp that are inconsistent with the hg19 reference assembly. Of these loci, 9,777 are supported as putative SVs by hybrid local assembly, long-read PacBio data, or multi-source heuristics. These SVs span 59 Mbp of the reference genome (1.8%) and include 3,801 events identified only with long-read data. The HS1011 data and complete Parliament infrastructure, including a BAM-to-SV workflow, are available on the cloud-based service DNAnexus. HS1011 SV analysis reveals the limits and advantages of multiple sequencing technologies, specifically the impact of long-read SV discovery. With the full Parliament infrastructure, the HS1011 data constitute a public resource for novel SV discovery, software calibration, and personal genome structural variation analysis.

  15. Evolutionary genomics and population structure of Entamoeba histolytica

    PubMed Central

    Das, Koushik; Ganguly, Sandipan

    2014-01-01

    Amoebiasis, caused by the gastrointestinal parasite Entamoeba histolytica, has diverse disease outcomes. Study of the genome and evolution of this fascinating parasite will help us to understand the basis of its virulence and explain why, when and how it causes disease. In this review, we summarize current knowledge regarding the evolutionary genomics of E. histolytica and discuss its association with parasite phenotypes and differential pathogenic behavior. How genetic diversity reveals parasite population structure is also discussed. Open questions concerning its evolution and population structure are also highlighted. This significantly large amount of genomic data will improve our knowledge about this pathogenic species of Entamoeba. PMID:25505504

  16. 3D genome structure modeling by Lorentzian objective function.

    PubMed

    Trieu, Tuan; Cheng, Jianlin

    2017-02-17

    The 3D structure of the genome plays a vital role in biological processes such as gene interaction, gene regulation, DNA replication and genome methylation. Advanced chromosomal conformation capture techniques, such as Hi-C and tethered conformation capture, can generate chromosomal contact data that can be used to computationally reconstruct 3D structures of the genome. We developed a novel restraint-based method that is capable of reconstructing 3D genome structures utilizing both intra- and inter-chromosomal contact data. Our method was robust to noise and performed well in comparison with a panel of existing methods on a controlled simulated data set. On a real Hi-C data set of the human genome, our method produced chromosome and genome structures that are consistent with 3D FISH data and existing knowledge about the human chromosome and genome, such as chromosome territories and the clustering of small chromosomes in the nucleus center, with the exception of chromosome 18. The tool and experimental data are available at https://missouri.box.com/v/LorDG.
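
    The abstract does not spell out the objective function, so the following is only a sketch of what a Lorentzian-type restraint score could look like. It assumes each pair of loci has a target distance (how targets are derived from contact counts is left out) and that each pair contributes c^2 / (c^2 + r^2), where r is the difference between the modeled and target distance. The function name, the constant `c`, and the toy coordinates are hypothetical.

```python
import math


def lorentzian_score(coords, targets, c=1.0):
    """Sum of Lorentzian terms c^2 / (c^2 + r^2) over restrained pairs.

    coords  -- list of (x, y, z) positions, one per locus
    targets -- dict mapping (i, j) index pairs to target distances
    Each term lies in (0, 1], so a single badly violated restraint
    cannot dominate the total the way a squared error would.
    """
    score = 0.0
    for (i, j), target in targets.items():
        residual = math.dist(coords[i], coords[j]) - target
        score += (c * c) / (c * c + residual * residual)
    return score


# Toy structure: three loci on a line, one unit apart.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
clean_targets = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 2.0}
noisy_targets = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 100.0}  # one corrupted restraint

print(lorentzian_score(coords, clean_targets))
print(lorentzian_score(coords, noisy_targets))
```

    Because every term is bounded, the corrupted restraint lowers the score by at most one unit, which gives some intuition for why a Lorentzian-type objective can tolerate noisy contact data. A full method would maximize such a score over the coordinates, for example by gradient ascent.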

  17. 3D genome structure modeling by Lorentzian objective function.

    PubMed

    Trieu, Tuan; Cheng, Jianlin

    2016-11-29

    The 3D structure of the genome plays a vital role in biological processes such as gene interaction, gene regulation, DNA replication and genome methylation. Advanced chromosomal conformation capture techniques, such as Hi-C and tethered conformation capture, can generate chromosomal contact data that can be used to computationally reconstruct 3D structures of the genome. We developed a novel restraint-based method that is capable of reconstructing 3D genome structures utilizing both intra- and inter-chromosomal contact data. Our method was robust to noise and performed well in comparison with a panel of existing methods on a controlled simulated data set. On a real Hi-C data set of the human genome, our method produced chromosome and genome structures that are consistent with 3D FISH data and existing knowledge about the human chromosome and genome, such as chromosome territories and the clustering of small chromosomes in the nucleus center, with the exception of chromosome 18. The tool and experimental data are available at https://missouri.box.com/v/LorDG.

  18. PHAISTOS: a framework for Markov chain Monte Carlo simulation and inference of protein structure.

    PubMed

    Boomsma, Wouter; Frellsen, Jes; Harder, Tim; Bottaro, Sandro; Johansson, Kristoffer E; Tian, Pengfei; Stovgaard, Kasper; Andreetta, Christian; Olsson, Simon; Valentin, Jan B; Antonov, Lubomir D; Christensen, Anders S; Borg, Mikael; Jensen, Jan H; Lindorff-Larsen, Kresten; Ferkinghoff-Borg, Jesper; Hamelryck, Thomas

    2013-07-15

    We present a new software framework for Markov chain Monte Carlo sampling for simulation, prediction, and inference of protein structure. The software package contains implementations of recent advances in Monte Carlo methodology, such as efficient local updates and sampling from probabilistic models of local protein structure. These models form a probabilistic alternative to the widely used fragment and rotamer libraries. Combined with an easily extendible software architecture, this makes PHAISTOS well suited for Bayesian inference of protein structure from sequence and/or experimental data. Currently, two force-fields are available within the framework: PROFASI and OPLS-AA/L, the latter including the generalized Born surface area solvent model. A flexible command-line and configuration-file interface allows users quickly to set up simulations with the desired configuration. PHAISTOS is released under the GNU General Public License v3.0. Source code and documentation are freely available from http://phaistos.sourceforge.net. The software is implemented in C++ and has been tested on Linux and OSX platforms.
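
    For readers unfamiliar with Markov chain Monte Carlo, the sketch below shows a generic random-walk Metropolis sampler on a one-dimensional target. It does not use or imitate the PHAISTOS interface, and it omits everything specific to proteins (local moves, force fields, probabilistic structure models); the function name, step size, and target density are assumptions chosen for illustration.

```python
import math
import random


def metropolis(log_density, start, steps, step_size, rng):
    """Random-walk Metropolis sampling from an unnormalized log density."""
    x = start
    log_p = log_density(x)
    samples = []
    for _ in range(steps):
        proposal = x + rng.gauss(0.0, step_size)
        log_p_new = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            x, log_p = proposal, log_p_new
        samples.append(x)
    return samples


def standard_normal_log_density(x):
    """Toy target: a standard normal, up to an additive constant."""
    return -0.5 * x * x


rng = random.Random(0)
samples = metropolis(standard_normal_log_density, 0.0, 20000, 1.0, rng)
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(sample_mean, sample_var)
```

    In a protein setting the state would be a conformation and the log density an energy or probabilistic model, but the accept/reject step has the same shape.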

  19. Inferring a District-Based Hierarchical Structure of Social Contacts from Census Data

    PubMed Central

    Yu, Zhiwen; Liu, Jiming; Zhu, Xianjun

    2015-01-01

    Researchers have recently paid attention to social contact patterns among individuals due to their useful applications in such areas as epidemic evaluation and control, public health decisions, chronic disease research and social network research. Although some studies have estimated social contact patterns from social networks and surveys, few have considered how to infer the hierarchical structure of social contacts directly from census data. In this paper, we focus on inferring an individual’s social contact patterns from detailed census data, and generate various types of social contact patterns such as hierarchical-district-structure-based, cross-district and age-district-based patterns. We evaluate newly generated contact patterns derived from detailed 2011 Hong Kong census data by incorporating them into a model and simulation of the 2009 Hong Kong H1N1 epidemic. We then compare the newly generated social contact patterns with the mixing patterns that are often used in the literature, and draw the following conclusions. First, the generation of social contact patterns based on a hierarchical district structure allows for simulations at different district levels. Second, the newly generated social contact patterns reflect individuals’ social contacts. Third, the newly generated social contact patterns improve the accuracy of the SEIR-based epidemic model. PMID:25679787
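
    The SEIR model mentioned in the abstract can be sketched as follows, assuming a single well-mixed population rather than the district- and age-structured contact patterns of the study. The function name `simulate_seir`, the Euler time step, and every parameter value are illustrative assumptions and are not taken from the paper.

```python
def simulate_seir(beta, sigma, gamma, s0, e0, i0, r0, days, dt=0.1):
    """Integrate a basic SEIR model with fixed-step Euler updates.

    beta  : transmission rate per day
    sigma : rate of leaving the exposed class (1 / latent period)
    gamma : recovery rate (1 / infectious period)
    Returns one (S, E, I, R) tuple per day, starting with day 0.
    """
    s, e, i, r = float(s0), float(e0), float(i0), float(r0)
    n = s + e + i + r
    steps_per_day = int(round(1.0 / dt))
    history = [(s, e, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / n * dt
            new_infectious = sigma * e * dt
            new_recovered = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_recovered
            r += new_recovered
        history.append((s, e, i, r))
    return history


# Hypothetical parameters: beta / gamma = 2.5, so an outbreak is expected.
history = simulate_seir(beta=0.5, sigma=0.25, gamma=0.2,
                        s0=999, e0=0, i0=1, r0=0, days=300)
final_s, final_e, final_i, final_r = history[-1]
print(f"final recovered: {final_r:.1f} of 1000")
```

    In a structured model of the kind the abstract describes, the single `beta * s * i / n` term would be replaced by a sum over groups weighted by a contact matrix; that extension is omitted here.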

  20. Inferring a district-based hierarchical structure of social contacts from census data.

    PubMed

    Yu, Z; Liu, J; Zhu, X

    2015-01-01

    Researchers have recently paid attention to social contact patterns among individuals due to their useful applications in such areas as epidemic evaluation and control, public health decisions, chronic disease research and social network research. Although some studies have estimated social contact patterns from social networks and surveys, few have considered how to infer the hierarchical structure of social contacts directly from census data. In this paper, we focus on inferring an individual's social contact patterns from detailed census data, and generate various types of social contact patterns such as hierarchical-district-structure-based, cross-district and age-district-based patterns. We evaluate newly generated contact patterns derived from detailed 2011 Hong Kong census data by incorporating them into a model and simulation of the 2009 Hong Kong H1N1 epidemic. We then compare the newly generated social contact patterns with the mixing patterns that are often used in the literature, and draw the following conclusions. First, the generation of social contact patterns based on a hierarchical district structure allows for simulations at different district levels. Second, the newly generated social contact patterns reflect individuals' social contacts. Third, the newly generated social contact patterns improve the accuracy of the SEIR-based epidemic model.

  1. Inferring the Clonal Structure of Viral Populations from Time Series Sequencing

    PubMed Central

    Chedom, Donatien F.; Murcia, Pablo R.; Greenman, Chris D.

    2015-01-01

    RNA virus populations undergo processes of mutation and selection, resulting in a mixed population of viral particles. High-throughput sequencing of a viral population consequently contains a mixed signal of the underlying clones. We would like to identify the underlying evolutionary structures. We utilize two sources of information to attempt this: within-segment linkage information and mutation prevalence. We demonstrate that clone haplotypes, their prevalence, and maximum parsimony reticulate evolutionary structures can be identified, although the solutions may not be unique, even for complete sets of information. This is applied to a chain of influenza infection, where we infer evolutionary structures, including reassortment, and demonstrate some of the difficulties of interpretation that arise from deep sequencing due to artifacts such as template switching during PCR amplification. PMID:26571026

  2. The Mutate-and-Map Protocol for Inferring Base Pairs in Structured RNA

    PubMed Central

    VanLang, Christopher C.; Das, Rhiju

    2014-01-01

    Chemical mapping is a widespread technique for structural analysis of nucleic acids in which a molecule’s reactivity to different probes is quantified at single nucleotide resolution and used to constrain structural modeling. This experimental framework has been extensively revisited in the past decade with new strategies for high-throughput readouts, chemical modification, and rapid data analysis. Recently, we have coupled the technique to high-throughput mutagenesis. Point mutations of a base paired nucleotide can lead to exposure of not only that nucleotide but also its interaction partner. Systematically carrying out the mutation and mapping for the entire system gives an experimental approximation of the molecule’s “contact map.” Here, we give our in-house protocol for this “mutate-and-map” (M2) strategy, based on 96-well capillary electrophoresis, and we provide practical tips on interpreting the data to infer nucleic acid structure. PMID:24136598

  3. Hebbian Wiring Plasticity Generates Efficient Network Structures for Robust Inference with Synaptic Weight Plasticity

    PubMed Central

    Hiratani, Naoki; Fukai, Tomoki

    2016-01-01

    In the adult mammalian cortex, a small fraction of spines are created and eliminated every day, and the resultant synaptic connection structure is highly nonrandom, even in local circuits. However, it remains unknown whether a particular synaptic connection structure is functionally advantageous in local circuits, and why creation and elimination of synaptic connections is necessary in addition to rich synaptic weight plasticity. To answer these questions, we studied an inference task model through theoretical and numerical analyses. We demonstrate that a robustly beneficial network structure naturally emerges by combining Hebbian-type synaptic weight plasticity and wiring plasticity. Especially in a sparsely connected network, wiring plasticity achieves reliable computation by enabling efficient information transmission. Furthermore, the proposed rule reproduces the experimentally observed correlation between spine dynamics and task performance. PMID:27303271

  4. Effects of vegetation canopy structure on remotely sensed canopy temperatures. [inferring plant water stress and yield]

    NASA Technical Reports Server (NTRS)

    Kimes, D. S.

    1979-01-01

    The effects of vegetation canopy structure on thermal infrared sensor response must be understood before vegetation surface temperatures of canopies with low percent ground cover can be accurately inferred. The response of a sensor is a function of vegetation geometric structure, the vertical surface temperature distribution of the canopy components, and sensor view angle. Large deviations between the nadir sensor effective radiant temperature (ERT) and vegetation ERT for a soybean canopy were observed throughout the growing season. The nadir sensor ERT of a soybean canopy with 35 percent ground cover deviated from the vegetation ERT by as much as 11 C during the mid-day. These deviations were quantitatively explained as a function of canopy structure and soil temperature. Remote sensing techniques which determine the vegetation canopy temperature(s) from the sensor response need to be studied.

  5. The mutate-and-map protocol for inferring base pairs in structured RNA.

    PubMed

    Cordero, Pablo; Kladwang, Wipapat; VanLang, Christopher C; Das, Rhiju

    2014-01-01

    Chemical mapping is a widespread technique for structural analysis of nucleic acids in which a molecule's reactivity to different probes is quantified at single nucleotide resolution and used to constrain structural modeling. This experimental framework has been extensively revisited in the past decade with new strategies for high-throughput readouts, chemical modification, and rapid data analysis. Recently, we have coupled the technique to high-throughput mutagenesis. Point mutations of a base paired nucleotide can lead to exposure of not only that nucleotide but also its interaction partner. Systematically carrying out the mutation and mapping for the entire system gives an experimental approximation of the molecule's "contact map." Here, we give our in-house protocol for this "mutate-and-map" (M2) strategy, based on 96-well capillary electrophoresis, and we provide practical tips on interpreting the data to infer nucleic acid structure.

  6. Phylogeny and biogeography of highly diverged freshwater fish species (Leuciscinae, Cyprinidae, Teleostei) inferred from mitochondrial genome analysis.

    PubMed

    Imoto, Junichi M; Saitoh, Kenji; Sasaki, Takeshi; Yonezawa, Takahiro; Adachi, Jun; Kartavtsev, Yuri P; Miya, Masaki; Nishida, Mutsumi; Hanzawa, Naoto

    2013-02-10

    The distribution of freshwater taxa is a good biogeographic model to study pattern and process of vicariance and dispersal. The subfamily Leuciscinae (Cyprinidae, Teleostei) consists of many species distributed widely in Eurasia and North America. Leuciscinae have been divided into two phyletic groups, leuciscin and phoxinin. The phylogenetic relationships between major clades within the subfamily are poorly understood, largely because of the overwhelming diversity of the group. The origin of the Far Eastern phoxinin is an interesting question regarding the evolutionary history of Leuciscinae. Here we present phylogenetic analysis of 31 species of Leuciscinae and outgroups based on complete mitochondrial genome sequences to clarify the phylogenetic relationships and to infer the evolutionary history of the subfamily. Phylogenetic analysis suggests that the Far Eastern phoxinin species comprised the monophyletic clades Tribolodon, Pseudaspius, Oreoleuciscus and Far Eastern Phoxinus. The Far Eastern phoxinin clade was independent of other Leuciscinae lineages and was closer to North American phoxinins than European leuciscins. All of our analyses also suggested that leuciscins and phoxinins each constituted monophyletic groups. Divergence time estimation suggested that Leuciscinae species diverged from outgroups such as Tincinae 83.3 million years ago (Mya) in the Late Cretaceous and that leuciscin and phoxinin shared a common ancestor 70.7 Mya. Radiation of Leuciscinae lineages occurred during the Late Cretaceous to Paleocene. This period also witnessed the radiation of tetrapods. Reconstruction of ancestral areas indicates Leuciscinae species originated within Europe. Leuciscin species evolved in Europe and the ancestor of phoxinin was distributed in North America. The Far Eastern phoxinins would have dispersed from North America to Far East across the Beringia land bridge. The present study suggests important roles for the continental rearrangements during the

  7. Structural genomics of eukaryotic targets at a laboratory scale.

    PubMed

    Busso, Didier; Poussin-Courmontagne, Pierre; Rosé, David; Ripp, Raymond; Litt, Alain; Thierry, Jean-Claude; Moras, Dino

    2005-01-01

    Structural genomics programs are distributed worldwide and funded by large institutions such as the NIH in the United States, RIKEN in Japan or the European Commission through the SPINE network in Europe. Such initiatives, essentially managed by large consortia, led to technology and method developments at the different steps required to produce biological samples compatible with structural studies. Besides specific applications, method developments resulted mainly from miniaturization and parallelization. The challenge that academic laboratories face in pursuing structural genomics programs is to produce protein samples at a higher rate. The Structural Biology and Genomics Department (IGBMC - Illkirch - France) is involved in a structural genomics program of higher eukaryotes whose goal is solving crystal structures of proteins and their complexes (including large complexes) related to human health and biotechnology. To achieve such a challenging goal, the Department has established a medium-throughput pipeline for producing protein samples suitable for structural biology studies. Here, we describe the setting up of our initiative from cloning to crystallization and we demonstrate that structural genomics may be manageable by academic laboratories through strategic investments in robotics and by adapting classical bench protocols and new developments, in particular in the field of protein expression, to parallelization.

  8. Genome Pool Strategy for Structural Coverage of Protein Families

    PubMed Central

    Jaroszewski, Lukasz; Slabinski, Lukasz; Wooley, John; Deacon, Ashley M.; Lesley, Scott A.; Wilson, Ian A.; Godzik, Adam

    2010-01-01

    As noticed by generations of structural biologists, closely homologous proteins may have substantially different crystallization properties and propensities. These observations can be used to systematically introduce additional dimensionality into crystallization trials by targeting homologous proteins from multiple genomes in a “genome pool” strategy. Through extensive use of our recently introduced “crystallization feasibility score” (Slabinski et al., 2007a), we can explain that the genome pool strategy works well because the crystallization feasibility scores are surprisingly broad within families of homologous proteins, with most families containing a range of optimal to very difficult targets. We also show that some families can be regarded as relatively “easy”, where a significant number of proteins are predicted to have optimal crystallization features, and others are “very difficult”, where almost none are predicted to result in a crystal structure. Thus, such variable distributions of “crystallizability” preferences lead to uneven structural coverage of known families, with “easier” or “optimal” families having several times more solved structures than “very difficult” ones. Nevertheless, this latter category can be successfully targeted by increasing the number of genomes that are used to select targets from a given family. On average, adding 10 new genomes to the “genome pool” provides more promising targets for 7 “very difficult” families. In contrast, our crystallization feasibility score does not indicate that any specific microbial genomes can be readily classified as “easier” or “very difficult” with respect to providing suitable candidates for crystallization and structure determination. Finally, our analyses show that specific physicochemical properties of the protein sequence favor successful outcomes for structure determination and, hence, the group of proteins with known 3D

  9. Statistical inference of seabed sound-speed structure in the Gulf of Oman Basin.

    PubMed

    Sagers, Jason D; Knobles, David P

    2014-06-01

    Addressed is the statistical inference of the sound-speed depth profile of a thick soft seabed from broadband sound propagation data recorded in the Gulf of Oman Basin in 1977. The acoustic data are in the form of time series signals recorded on a sparse vertical line array and generated by explosive sources deployed along a 280 km track. The acoustic data offer a unique opportunity to study a deep-water bottom-limited thickly sedimented environment because of the large number of time series measurements, very low seabed attenuation, and auxiliary measurements. A maximum entropy method is employed to obtain a conditional posterior probability distribution (PPD) for the sound-speed ratio and the near-surface sound-speed gradient. The multiple data samples allow for a determination of the average error constraint value required to uniquely specify the PPD for each data sample. Two complicating features of the statistical inference study are addressed: (1) the need to develop an error function that can both utilize the measured multipath arrival structure and mitigate the effects of data errors and (2) the effect of small bathymetric slopes on the structure of the bottom interacting arrivals.

  10. Fully Bayesian inference for structural MRI: application to segmentation and statistical analysis of T2-hypointensities.

    PubMed

    Schmidt, Paul; Schmid, Volker J; Gaser, Christian; Buck, Dorothea; Bührlen, Susanne; Förschler, Annette; Mühlau, Mark

    2013-01-01

    Aiming at iron-related T2-hypointensity, which is related to normal aging and neurodegenerative processes, we here present two practicable approaches, based on Bayesian inference, for preprocessing and statistical analysis of a complex set of structural MRI data. In particular, Markov Chain Monte Carlo methods were used to simulate posterior distributions. First, we rendered a segmentation algorithm that uses outlier detection based on model checking techniques within a Bayesian mixture model. Second, we rendered an analytical tool comprising a Bayesian regression model with smoothness priors (in the form of Gaussian Markov random fields) mitigating the necessity to smooth data prior to statistical analysis. For validation, we used simulated data and MRI data of 27 healthy controls (age: [Formula: see text]; range, [Formula: see text]). We first observed robust segmentation of both simulated T2-hypointensities and gray-matter regions known to be T2-hypointense. Second, simulated data and images of segmented T2-hypointensity were analyzed. We found not only robust identification of simulated effects but also a biologically plausible age-related increase of T2-hypointensity primarily within the dentate nucleus but also within the globus pallidus, substantia nigra, and red nucleus. Our results indicate that fully Bayesian inference can successfully be applied for preprocessing and statistical analysis of structural MRI data.

  11. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies.

    PubMed Central

    Falush, Daniel; Stephens, Matthew; Pritchard, Jonathan K

    2003-01-01

    We describe extensions to the method of Pritchard et al. for inferring population structure from multilocus genotype data. Most importantly, we develop methods that allow for linkage between loci. The new model accounts for the correlations between linked loci that arise in admixed populations ("admixture linkage disequilibrium"). This modification has several advantages, allowing (1) detection of admixture events farther back into the past, (2) inference of the population of origin of chromosomal regions, and (3) more accurate estimates of statistical uncertainty when linked loci are used. It is also of potential use for admixture mapping. In addition, we describe a new prior model for the allele frequencies within each population, which allows identification of subtle population subdivisions that were not detectable using the existing method. We present results applying the new methods to study admixture in African-Americans, recombination in Helicobacter pylori, and drift in populations of Drosophila melanogaster. The methods are implemented in a program, structure, version 2.0, which is available at http://pritch.bsd.uchicago.edu. PMID:12930761

  12. Inferring the population structure and demography of Drosophila ananassae from multilocus data.

    PubMed

    Das, Aparup; Mohanty, Sujata; Stephan, Wolfgang

    2004-12-01

    Inferring the origin, population structure, and demographic history of a species is a major objective of population genetics. Although many organisms have been analyzed, the genetic structures of subdivided populations are not well understood. Here we analyze Drosophila ananassae, a highly substructured, cosmopolitan, and human-commensal species distributed in the tropical, subtropical, and mildly temperate regions of the world. We adopt a multilocus approach (with 10 neutral loci) using 16 population samples covering almost the entire species range (Asia, Australia, and America). Analyzed with our recently developed Bayesian method, 5 populations in Southeast Asia are found to be central, while the other 11 are peripheral. These 5 central populations were sampled from localities that belonged to a single landmass ("Sundaland") during the late Pleistocene (approximately 18,000 years ago), when sea level was approximately 120 m below the present level. The inferred migration routes of D. ananassae out of Sundaland seem to parallel those of humans in this region. Strong evidence for a population size expansion is seen particularly in the ancestral populations.

  13. A Genome-Wide Survey of Switchgrass Genome Structure and Organization

    PubMed Central

    Sharma, Manoj K.; Sharma, Rita; Cao, Peijian; Jenkins, Jerry; Bartley, Laura E.; Qualls, Morgan; Grimwood, Jane; Schmutz, Jeremy; Rokhsar, Daniel; Ronald, Pamela C.

    2012-01-01

    The perennial grass, switchgrass (Panicum virgatum L.), is a promising bioenergy crop and the target of whole genome sequencing. We constructed two bacterial artificial chromosome (BAC) libraries from the AP13 clone of switchgrass to gain insight into the genome structure and organization, initiate functional and comparative genomic studies, and assist with genome assembly. Together representing 16 haploid genome equivalents of switchgrass, each library comprises 101,376 clones with average insert sizes of 144 (HindIII-generated) and 110 kb (BstYI-generated). A total of 330,297 high quality BAC-end sequences (BES) were generated, accounting for 263.2 Mbp (16.4%) of the switchgrass genome. Analysis of the BES identified 279,099 known repetitive elements, >50,000 SSRs, and 2,528 novel repeat elements, named switchgrass repetitive elements (SREs). Comparative mapping of 47 full-length BAC sequences and 330K BES revealed high levels of synteny with the grass genomes sorghum, rice, maize, and Brachypodium. Our data indicate that the sorghum genome has retained larger microsyntenous regions with switchgrass besides high gene order conservation with rice. The resources generated in this effort will be useful for a broad range of applications. PMID:22511929
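
    The figures quoted in the abstract can be cross-checked with a short calculation. This is a back-of-the-envelope sketch using only numbers stated above; the variable names are hypothetical, and the genome size is inferred here from the 263.2 Mbp / 16.4% statement rather than reported directly by the authors.

```python
# Numbers quoted in the abstract.
clones_per_library = 101_376
insert_kb_hindiii = 144          # average insert size, HindIII library (kb)
insert_kb_bstyi = 110            # average insert size, BstYI library (kb)
bes_count = 330_297              # high-quality BAC-end sequences
bes_total_mbp = 263.2            # total BES length (Mbp)
bes_genome_fraction = 0.164      # 16.4% of the genome

# Genome size implied by "263.2 Mbp (16.4%)" (an inference, not a quoted value).
genome_size_mbp = bes_total_mbp / bes_genome_fraction

# Library coverage in haploid genome equivalents (1 Mbp = 1000 kb).
coverage_hindiii = clones_per_library * insert_kb_hindiii / 1000 / genome_size_mbp
coverage_bstyi = clones_per_library * insert_kb_bstyi / 1000 / genome_size_mbp
total_equivalents = coverage_hindiii + coverage_bstyi

# Mean length of a BAC-end sequence in base pairs.
mean_bes_bp = bes_total_mbp * 1e6 / bes_count

print(f"implied genome size: {genome_size_mbp:.0f} Mbp")
print(f"coverage: {coverage_hindiii:.1f}x + {coverage_bstyi:.1f}x = {total_equivalents:.1f}x")
print(f"mean BES length: {mean_bes_bp:.0f} bp")
```

    Under these assumptions the two libraries together come to roughly 16 genome equivalents, which agrees with the figure given in the abstract.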

  14. Structural Information Inference from Lanthanoid Complexing Systems: Photoluminescence Studies on Isolated Ions

    NASA Astrophysics Data System (ADS)

    Greisch, Jean Francois; Harding, Michael E.; Chmela, Jiri; Klopper, Willem M.; Schooss, Detlef; Kappes, Manfred M.

    2016-06-01

    The application of lanthanoid complexes ranges from photovoltaics and light-emitting diodes to quantum memories and biological assays. Rationalization of their design requires a thorough understanding of intramolecular processes such as energy transfer, charge transfer, and non-radiative decay involving their subunits. Characterization of the excited states of such complexes considerably benefits from mass spectrometric methods since the associated optical transitions and processes are strongly affected by stoichiometry, symmetry, and overall charge state. We report herein spectroscopic measurements on ensembles of ions trapped in the gas phase and soft-landed in neon matrices. Their interpretation is considerably facilitated by direct comparison with computations. The combination of energy- and time-resolved measurements on isolated species with density functional as well as ligand-field and Franck-Condon computations enables us to infer structural as well as dynamical information about the species studied. The approach is first illustrated for sets of model lanthanoid complexes whose structure and electronic properties are systematically varied via the substitution of one component (lanthanoid or alkali/alkali-earth ion): (i) systematic dependence of ligand-centered phosphorescence on the lanthanoid(III) promotion energy and its impact on sensitization, and (ii) structural changes induced by the substitution of alkali or alkali-earth ions in relation to structures inferred using ion mobility spectroscopy. The temperature dependence of sensitization is briefly discussed. The focus is then shifted to measurements involving europium complexes with doxycycline, an antibiotic of the tetracycline family. Besides discussing the complexes' structural and electronic features, we report on their use to monitor enzymatic processes involving hydrogen peroxide or biologically relevant molecules such as adenosine triphosphate (ATP).

  15. New Inferences of Earth's Mantle Viscosity Structure and Implications for Long-wavelength Structure in the Lower Mantle

    NASA Astrophysics Data System (ADS)

    Rudolph, M. L.; Lekic, V.; Lithgow-Bertelloni, C. R.

    2015-12-01

    The viscosity structure of Earth's deep mantle affects the thermal evolution of Earth, the ascent of mantle plumes, settling of subducted oceanic lithosphere, and the mixing of compositional heterogeneities in the mantle. Modeling the long wavelength non-hydrostatic geoid provides a constraint on the radial viscosity structure of Earth's mantle. We carried out inversions for the radial mantle viscosity structure using a transdimensional, hierarchical Bayesian technique that allows us to obtain solutions without specifying at the outset the number or locations of viscosity changes within the mantle. We obtained a posterior probability distribution of mantle viscosity structures, which allowed us to assess our confidence in our inferences of the viscosity structure. We find robust evidence for an increase in viscosity at 800-1200 km depth, significantly deeper than the mineral phase transformations which define the mantle transition zone. The viscosity increase is coincident in depth with regions where tomographic models image slab stagnation, plume deflection, and changes in large-scale structure, manifested in the mantle radial correlation function for the lowest spherical harmonic degrees. Here, we present new results from 3D, spherical-shell geometry thermal and thermochemical mantle convection simulations with prescribed plate motions based on paleogeographic reconstructions. These simulations employ a range of admissible mantle viscosity structures from our geoid inversions. We find that by including the inferred increase in viscosity at 1000 km depth, we can better reproduce the long wavelength mantle radial correlation function observed in the latest tomographic models GAP-P4 and SEMUCB-WM1. The similarity of the modeled and observed radial correlation functions is sensitive to the choice of lower mantle viscosity and the inclusion of phase changes in the transition zone and the mid-mantle. We will also discuss the effect of these viscosity structures on

  16. Structural and operational complexity of the Geobacter sulfurreducens genome

    PubMed Central

    Qiu, Yu; Cho, Byung-Kwan; Park, Young Seoub; Lovley, Derek; Palsson, Bernhard Ø.; Zengler, Karsten

    2010-01-01

    Prokaryotic genomes can be annotated based on their structural, operational, and functional properties. These annotations provide the pivotal scaffold for understanding cellular functions on a genome-scale, such as metabolism and transcriptional regulation. Here, we describe a systems approach to simultaneously determine the structural and operational annotation of the Geobacter sulfurreducens genome. Integration of proteomics, transcriptomics, RNA polymerase, and sigma factor-binding information with deep-sequencing-based analysis of primary 5′-end transcripts allowed for a most precise annotation. The structural annotation is comprised of numerous previously undetected genes, noncoding RNAs, prevalent leaderless mRNA transcripts, and antisense transcripts. When compared with other prokaryotes, we found that the number of antisense transcripts reversely correlated with genome size. The operational annotation consists of 1453 operons, 22% of which have multiple transcription start sites that use different RNA polymerase holoenzymes. Several operons with multiple transcription start sites encoded genes with essential functions, giving insight into the regulatory complexity of the genome. The experimentally determined structural and operational annotations can be combined with functional annotation, yielding a new three-level annotation that greatly expands our understanding of prokaryotic genomes. PMID:20592237

  17. Structural and Operational Complexity of the Geobacter Sulfurreducens Genome

    SciTech Connect

    Qiu, Yu; Cho, Byung-Kwan; Park, Young S.; Lovley, Derek R.; Palsson, Bernhard O.; Zengler, Karsten

    2010-06-30

    Prokaryotic genomes can be annotated based on their structural, operational, and functional properties. These annotations provide the pivotal scaffold for understanding cellular functions on a genome-scale, such as metabolism and transcriptional regulation. Here, we describe a systems approach to simultaneously determine the structural and operational annotation of the Geobacter sulfurreducens genome. Integration of proteomics, transcriptomics, RNA polymerase, and sigma factor-binding information with deep-sequencing-based analysis of primary 5′-end transcripts allowed for a most precise annotation. The structural annotation is comprised of numerous previously undetected genes, noncoding RNAs, prevalent leaderless mRNA transcripts, and antisense transcripts. When compared with other prokaryotes, we found that the number of antisense transcripts reversely correlated with genome size. The operational annotation consists of 1453 operons, 22% of which have multiple transcription start sites that use different RNA polymerase holoenzymes. Several operons with multiple transcription start sites encoded genes with essential functions, giving insight into the regulatory complexity of the genome. The experimentally determined structural and operational annotations can be combined with functional annotation, yielding a new three-level annotation that greatly expands our understanding of prokaryotic genomes.

  18. Inference of 3-dimensional structure underlying large-scale coronal events observed by Yohkoh and Ulysses

    NASA Technical Reports Server (NTRS)

    Slater, G. L.; Freeland, S. L.; Hoeksema, T.; Zhao, X.; Hudson, H. S.

    1995-01-01

    The Yohkoh/SXT images provide full-disk coverage of the solar corona, usually extending before and after one of the large-scale eruptive events that occur in the polar crown. These produce large arcades of X-ray loops, often with a cusp-shaped coronal extension, and are known to be associated with coronal mass ejections. The Yohkoh prototype of such events occurred 12 Nov. 1991. This allows us to determine heights from the apparent rotation rates of these structures. In comparison with magnetic-field extrapolations from Wilcox Solar Observatory, we use this tool to infer the three-dimensional structure of the corona in particular cases: 24 Jan. 1992, 24 Feb. 1993, 14 Apr. 1994, and 13 Nov. 1994. The last event is a long-duration flare event.

  19. Mapping and sequencing of structural variation from eight human genomes

    PubMed Central

    Kidd, Jeffrey M.; Cooper, Gregory M.; Donahue, William F.; Hayden, Hillary S.; Sampas, Nick; Graves, Tina; Hansen, Nancy; Teague, Brian; Alkan, Can; Antonacci, Francesca; Haugen, Eric; Zerr, Troy; Yamada, N. Alice; Tsang, Peter; Newman, Tera L.; Tüzün, Eray; Cheng, Ze; Ebling, Heather M.; Tusneem, Nadeem; David, Robert; Gillett, Will; Phelps, Karen A.; Weaver, Molly; Saranga, David; Brand, Adrianne; Tao, Wei; Gustafson, Erik; McKernan, Kevin; Chen, Lin; Malig, Maika; Smith, Joshua D.; Korn, Joshua M.; McCarroll, Steven A.; Altshuler, David A.; Peiffer, Daniel A.; Dorschner, Michael; Stamatoyannopoulos, John; Schwartz, David; Nickerson, Deborah A.; Mullikin, James C.; Wilson, Richard K.; Bruhn, Laurakay; Olson, Maynard V.; Kaul, Rajinder; Smith, Douglas R.; Eichler, Evan E.

    2008-01-01

    Genetic variation among individual humans occurs on many different scales, ranging from gross alterations in the human karyotype to single nucleotide changes. Here we explore variation on an intermediate scale—particularly insertions, deletions and inversions affecting from a few thousand to a few million base pairs. We employed a clone-based method to interrogate this intermediate structural variation in eight individuals of diverse geographic ancestry. Our analysis provides a comprehensive overview of the normal pattern of structural variation present in these genomes, refining the location of 1,695 structural variants. We find that 50% were seen in more than one individual and that nearly half lay outside regions of the genome previously described as structurally variant. We discover 525 new insertion sequences that are not present in the human reference genome and show that many of these are variable in copy number between individuals. Complete sequencing of 261 structural variants reveals considerable locus complexity and provides insights into the different mutational processes that have shaped the human genome. These data provide the first high-resolution sequence map of human structural variation—a standard for genotyping platforms and a prelude to future individual genome sequencing projects. PMID:18451855

  20. Genome-wide patterns of population structure and admixture in West Africans and African Americans

    PubMed Central

    Bryc, Katarzyna; Auton, Adam; Nelson, Matthew R.; Oksenberg, Jorge R.; Hauser, Stephen L.; Williams, Scott; Froment, Alain; Bodo, Jean-Marie; Wambebe, Charles; Tishkoff, Sarah A.; Bustamante, Carlos D.

    2009-01-01

    Quantifying patterns of population structure in Africans and African Americans illuminates the history of human populations and is critical for undertaking medical genomic studies on a global scale. To obtain a fine-scale genome-wide perspective of ancestry, we analyze Affymetrix GeneChip 500K genotype data from African Americans (n = 365) and individuals with ancestry from West Africa (n = 203 from 12 populations) and Europe (n = 400 from 42 countries). We find that population structure within the West African sample reflects primarily language and secondarily geographical distance, echoing the Bantu expansion. Among African Americans, analysis of genomic admixture by a principal component-based approach indicates that the median proportion of European ancestry is 18.5% (25th–75th percentiles: 11.6–27.7%), with very large variation among individuals. In the African-American sample as a whole, few autosomal regions showed exceptionally high or low mean African ancestry, but the X chromosome showed elevated levels of African ancestry, consistent with a sex-biased pattern of gene flow with an excess of European male and African female ancestry. We also find that genomic profiles of individual African Americans afford personalized ancestry reconstructions differentiating ancient vs. recent European and African ancestry. Finally, patterns of genetic similarity among inferred African segments of African-American genomes and genomes of contemporary African populations included in this study suggest African ancestry is most similar to non-Bantu Niger-Kordofanian-speaking populations, consistent with historical documents of the African Diaspora and trans-Atlantic slave trade. PMID:20080753

  1. Genome-wide patterns of population structure and admixture in West Africans and African Americans.

    PubMed

    Bryc, Katarzyna; Auton, Adam; Nelson, Matthew R; Oksenberg, Jorge R; Hauser, Stephen L; Williams, Scott; Froment, Alain; Bodo, Jean-Marie; Wambebe, Charles; Tishkoff, Sarah A; Bustamante, Carlos D

    2010-01-12

    Quantifying patterns of population structure in Africans and African Americans illuminates the history of human populations and is critical for undertaking medical genomic studies on a global scale. To obtain a fine-scale genome-wide perspective of ancestry, we analyze Affymetrix GeneChip 500K genotype data from African Americans (n = 365) and individuals with ancestry from West Africa (n = 203 from 12 populations) and Europe (n = 400 from 42 countries). We find that population structure within the West African sample reflects primarily language and secondarily geographical distance, echoing the Bantu expansion. Among African Americans, analysis of genomic admixture by a principal component-based approach indicates that the median proportion of European ancestry is 18.5% (25th-75th percentiles: 11.6-27.7%), with very large variation among individuals. In the African-American sample as a whole, few autosomal regions showed exceptionally high or low mean African ancestry, but the X chromosome showed elevated levels of African ancestry, consistent with a sex-biased pattern of gene flow with an excess of European male and African female ancestry. We also find that genomic profiles of individual African Americans afford personalized ancestry reconstructions differentiating ancient vs. recent European and African ancestry. Finally, patterns of genetic similarity among inferred African segments of African-American genomes and genomes of contemporary African populations included in this study suggest African ancestry is most similar to non-Bantu Niger-Kordofanian-speaking populations, consistent with historical documents of the African Diaspora and trans-Atlantic slave trade.

  2. Inferring the structure of latent class models using a genetic algorithm.

    PubMed

    van der Maas, Han L J; Raijmakers, Maartje E J; Visser, Ingmar

    2005-05-01

    Present optimization techniques in latent class analysis apply the expectation maximization algorithm or the Newton-Raphson algorithm for optimizing the parameter values of a prespecified model. These techniques can be used to find maximum likelihood estimates of the parameters, given the specified structure of the model, which is defined by the number of classes and, possibly, fixation and equality constraints. The model structure is usually chosen on theoretical grounds. A large variety of structurally different latent class models can be compared using goodness-of-fit indices of the chi-square family, Akaike's information criterion, the Bayesian information criterion, and various other statistics. However, finding the optimal structure for a given goodness-of-fit index often requires a lengthy search in which all kinds of model structures are tested. Moreover, solutions may depend on the choice of initial values for the parameters. This article presents a new method by which one can simultaneously infer the model structure from the data and optimize the parameter values. The method consists of a genetic algorithm in which any goodness-of-fit index can be used as a fitness criterion. In a number of test cases in which data sets from the literature were used, it is shown that this method provides models that fit equally well as or better than the models suggested in the original articles.

  3. Electrical Structure Inferred by 3-D Lightning Mapping Observations During STEPS

    NASA Astrophysics Data System (ADS)

    Hamlin, T.; Krehbiel, P. R.; Zhang, Y.; Thomas, R. J.

    2002-12-01

    The Severe Thunderstorm Electrification and Precipitation Study (STEPS) provided numerous examples of storms which electrified anomalously, developing inverted tripole or quadrupole electrical structures. The storms were often supercells and cases where the lightning activity consisted primarily of IC flashes for substantial periods of time, only followed (if at all) much later by the onset of CG activity, were observed on several occasions. Radar comparisons for the tornadic storm of June 29 and the Bird City storm of June 3 during STEPS indicate that the main positive charge was localized in the precipitation core, but the electrification also had a definite horizontally extensive, multilayer structure extending away from the core. In these storms the upper positive charge region developed rapidly and produced intense lightning activity. The upper positive gradually evolved downward in altitude to become the dominant mid-level charge, forming an inverted tripole structure which appears to be stable for long periods of time. By assuming that a given polarity breakdown is moving into regions of opposite polarity charge (with exceptions) the total charge structure can be inferred and mapped based on information gleaned from the individual flashes; this allows use of the LMA data to detail the charge structure of storms. We take this approach to study the evolution of charge structures for storms during STEPS.

  4. Spectral entropy criteria for structural segmentation in genomic DNA sequences

    NASA Astrophysics Data System (ADS)

    Chechetkin, V. R.; Lobzin, V. V.

    2004-07-01

    The spectral entropy is calculated with Fourier structure factors and characterizes the level of structural ordering in a sequence of symbols. It may efficiently be applied to the assessment and reconstruction of the modular structure in genomic DNA sequences. We present the relevant spectral entropy criteria for the local and non-local structural segmentation in DNA sequences. The results are illustrated with the model examples and analysis of intervening exon-intron segments in the protein-coding regions.
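
    As a rough illustration of the quantity described above, the sketch below computes a spectral entropy for a short DNA string. It assumes one common definition (the Shannon entropy of the normalized Fourier power spectrum, summed over the four nucleotide indicator sequences); the normalization used by the authors may differ, and the function names are hypothetical.

```python
import cmath
import math


def power_spectrum(indicator):
    """Squared DFT magnitudes for harmonics n = 1 .. N//2 (zero frequency dropped)."""
    n_sites = len(indicator)
    spectrum = []
    for n in range(1, n_sites // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * n * m / n_sites)
                    for m, x in enumerate(indicator))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def spectral_entropy(sequence, alphabet="ACGT"):
    """Shannon entropy (in nats) of the normalized power spectrum.

    Assumed definition for illustration only: power is summed over the
    indicator sequences of all four nucleotides, then normalized to 1.
    """
    total = None
    for base in alphabet:
        indicator = [1.0 if ch == base else 0.0 for ch in sequence]
        spec = power_spectrum(indicator)
        total = spec if total is None else [a + b for a, b in zip(total, spec)]
    norm = sum(total) if total else 0.0
    if norm <= 1e-12:  # no periodic signal at all (e.g. a homopolymer)
        return 0.0
    probs = [p / norm for p in total]
    return -sum(p * math.log(p) for p in probs if p > 0)


periodic = "ACGT" * 8                               # strongly ordered, period 4
irregular = "ACGTTGCAAGCTTCGAGGATCCATGCAATGCC"      # same length, no obvious period
print(spectral_entropy(periodic), spectral_entropy(irregular))
```

    Under this definition a strictly periodic sequence concentrates its power in a few harmonics and gives a low entropy, while a less ordered sequence spreads power across harmonics and gives a higher one.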

  5. Residual dipolar couplings: synergy between NMR and structural genomics.

    PubMed

    Al-Hashimi, Hashim M; Patel, Dinshaw J

    2002-01-01

    Structural genomics is on a quest for the structure and function of a significant fraction of gene products. Current efforts are focusing on structure determination of single-domain proteins, which can readily be targeted by X-ray crystallography, NMR spectroscopy and computational homology modeling. However, comprehensive association of gene products with functions also requires systematic determination of more complex protein structures and other biomolecules participating in cellular processes such as nucleic acids, and characterization of biomolecular interactions and dynamics relevant to function. Such NMR investigations are becoming more feasible, not only due to recent advances in NMR methodology, but also because structural genomics is providing valuable structural information and new experimental and computational tools. The measurement of residual dipolar couplings in partially oriented systems and other new NMR methods will play an important role in this synergistic relationship between NMR and structural genomics. Both an expansion in the domain of NMR application, and important contributions to future structural genomics efforts can be anticipated.

  6. Estimating Patient’s Health State Using Latent Structure Inferred from Clinical Time Series and Text

    PubMed Central

    Zalewski, Aaron; Long, William; Johnson, Alistair E. W.; Mark, Roger G.; Lehman, Li-wei H.

    2017-01-01

    Modern intensive care units (ICUs) collect large volumes of data in monitoring critically ill patients. Clinicians in the ICUs face the challenge of interpreting large volumes of high-dimensional data to diagnose and treat patients. In this work, we explore the use of Hierarchical Dirichlet Processes (HDP) as a Bayesian nonparametric framework to infer patients’ states of health by combining multiple sources of data. In particular, we employ HDP to combine clinical time series and text from the nursing progress notes in a probabilistic topic modeling framework for patient risk stratification. Given a patient cohort, we use HDP to infer latent “topics” shared across multimodal patient data from the entire cohort. Each topic is modeled as a multinomial distribution over a vocabulary of codewords, defined over heterogeneous data sources. We evaluate the clinical utility of the learned topic structure using the first 24-hour ICU data from over 17,000 adult patients in the MIMIC-II database to estimate patients’ risks of in-hospital mortality. Our results demonstrate that our approach provides a viable framework for combining different data modalities to model patients’ states of health, and can potentially be used to generate alerts to identify patients at high risk of hospital mortality. PMID:28630952

  7. Population-based 3D genome structure analysis reveals driving forces in spatial genome organization

    PubMed Central

    Li, Wenyuan; Kalhor, Reza; Dai, Chao; Hao, Shengli; Gong, Ke; Zhou, Yonggang; Li, Haochen; Zhou, Xianghong Jasmine; Le Gros, Mark A.; Larabell, Carolyn A.; Chen, Lin; Alber, Frank

    2016-01-01

    Conformation capture technologies (e.g., Hi-C) chart physical interactions between chromatin regions on a genome-wide scale. However, the structural variability of the genome between cells poses a great challenge to interpreting ensemble-averaged Hi-C data, particularly for long-range and interchromosomal interactions. Here, we present a probabilistic approach for deconvoluting Hi-C data into a model population of distinct diploid 3D genome structures, which facilitates the detection of chromatin interactions likely to co-occur in individual cells. Our approach incorporates the stochastic nature of chromosome conformations and allows a detailed analysis of alternative chromatin structure states. For example, we predict and experimentally confirm the presence of large centromere clusters with distinct chromosome compositions varying between individual cells. The stability of these clusters varies greatly with their chromosome identities. We show that these chromosome-specific clusters can play a key role in the overall chromosome positioning in the nucleus and stabilizing specific chromatin interactions. By explicitly considering genome structural variability, our population-based method provides an important tool for revealing novel insights into the key factors shaping the spatial genome organization. PMID:26951677

  8. Population-based 3D genome structure analysis reveals driving forces in spatial genome organization

    DOE PAGES

    Tjong, Harianto; Li, Wenyuan; Kalhor, Reza; ...

    2016-03-07

    Conformation capture technologies (e.g., Hi-C) chart physical interactions between chromatin regions on a genome-wide scale. However, the structural variability of the genome between cells poses a great challenge to interpreting ensemble-averaged Hi-C data, particularly for long-range and interchromosomal interactions. Here, we present a probabilistic approach for deconvoluting Hi-C data into a model population of distinct diploid 3D genome structures, which facilitates the detection of chromatin interactions likely to co-occur in individual cells. Here, our approach incorporates the stochastic nature of chromosome conformations and allows a detailed analysis of alternative chromatin structure states. For example, we predict and experimentally confirm the presence of large centromere clusters with distinct chromosome compositions varying between individual cells. The stability of these clusters varies greatly with their chromosome identities. We show that these chromosome-specific clusters can play a key role in the overall chromosome positioning in the nucleus and stabilizing specific chromatin interactions. By explicitly considering genome structural variability, our population-based method provides an important tool for revealing novel insights into the key factors shaping the spatial genome organization.

  9. Advances in Genomic Profiling and Analysis of 3D Chromatin Structure and Interaction.

    PubMed

    Tang, Binhua; Cheng, Xiaolong; Xi, Yunlong; Chen, Zixin; Zhou, Yufan; Jin, Victor X

    2017-09-08

    Recent sequence-based profiling technologies such as high-throughput sequencing to detect fragment nucleotide sequence (Hi-C) and chromatin interaction analysis by paired-end tag sequencing (ChIA-PET) have revolutionized the field of three-dimensional (3D) chromatin architecture. It is now recognized that the human genome functions as folded 3D chromatin units and that the looping paradigm is the basic principle of gene regulation. To better interpret the 3D data that have accumulated dramatically in the past five years and to gain deep biological insights, huge efforts have been made in developing novel quantitative analysis methods. However, a full understanding of genome regulation requires thorough knowledge of both genomic technologies and their related data analyses. We summarize the recent advances in genomic technologies for identifying 3D chromatin structure and interaction, illustrate the quantitative analysis methods used to infer functional domains and chromatin interactions, further elucidate the emerging single-cell Hi-C technique and its computational analysis, and finally discuss future directions such as advances of 3D chromatin techniques in diseases.

  10. Crustal stress and structure at Kīlauea Volcano inferred from seismic anisotropy: Chapter 12

    USGS Publications Warehouse

    Johnson, Jessica H.; Swanson, Donald; Roman, Diana C.; Poland, Michael P.; Thelen, Weston A.; Carey, Rebecca; Cayol, Valérie; Poland, Michael P.; Weis, Dominique

    2015-01-01

    Seismic anisotropy, measured through shear wave splitting (SWS) analysis, can be indicative of the state of stress in Earth's crust. Changes in SWS at Kīlauea Volcano, Hawai‘i, associated with the onset of summit eruptive activity in 2008 hint at the potential of the technique for tracking volcanic activity. To use SWS observations as a monitoring tool, however, it is important to understand the cause of seismic anisotropy at the volcano throughout the eruptive cycle. To address this need, we analyzed SWS results from across Kīlauea in combination with macroscopic surface structures (mapped fractures, faults, and fissures) and stress orientations inferred from fault plane solutions. Seismic anisotropy seems to be due to pervasive aligned structures in most regions of the volcano. The upper East and Southwest Rift Zones, however, show a bimodality in stress and SWS, suggesting a stress discontinuity with depth, perhaps related to magma conduits that trend obliquely to the dominant structure. Other areas in and around Kīlauea Caldera display principal stresses of similar magnitudes, indicating that small stress perturbations can rotate the maximum horizontal compressive stress direction by up to 90°. In these locations, static structures generally control SWS, but dynamic conditions due to magmatic activity can override the structural control. Monitoring of SWS may therefore provide important signs of impending volcanism.

  11. Crustal stress and structure at Kīlauea Volcano inferred from seismic anisotropy

    USGS Publications Warehouse

    Johnson, Jessica H.; Swanson, Donald; Roman, Diana C.; Poland, Michael P.; Thelen, Weston A.

    2015-01-01

    Seismic anisotropy, measured through shear wave splitting (SWS) analysis, can be indicative of the state of stress in Earth's crust. Changes in SWS at Kīlauea Volcano, Hawai‘i, associated with the onset of summit eruptive activity in 2008 hint at the potential of the technique for tracking volcanic activity. To use SWS observations as a monitoring tool, however, it is important to understand the cause of seismic anisotropy at the volcano throughout the eruptive cycle. To address this need, we analyzed SWS results from across Kīlauea in combination with macroscopic surface structures (mapped fractures, faults, and fissures) and stress orientations inferred from fault plane solutions. Seismic anisotropy seems to be due to pervasive aligned structures in most regions of the volcano. The upper East and Southwest Rift Zones, however, show a bimodality in stress and SWS, suggesting a stress discontinuity with depth, perhaps related to magma conduits that trend obliquely to the dominant structure. Other areas in and around Kīlauea Caldera display principal stresses of similar magnitudes, indicating that small stress perturbations can rotate the maximum horizontal compressive stress direction by up to 90°. In these locations, static structures generally control SWS, but dynamic conditions due to magmatic activity can override the structural control. Monitoring of SWS may therefore provide important signs of impending volcanism.

  12. Inferring the mesoscale structure of layered, edge-valued, and time-varying networks

    NASA Astrophysics Data System (ADS)

    Peixoto, Tiago P.

    2015-10-01

    Many network systems are composed of interdependent but distinct types of interactions, which cannot be fully understood in isolation. These different types of interactions are often represented as layers, attributes on the edges, or as a time dependence of the network structure. Although they are crucial for a more comprehensive scientific understanding, these representations offer substantial challenges. Namely, it is an open problem how to precisely characterize the large or mesoscale structure of network systems in relation to these additional aspects. Furthermore, the direct incorporation of these features invariably increases the effective dimension of the network description, and hence aggravates the problem of overfitting, i.e., the use of overly complex characterizations that mistake purely random fluctuations for actual structure. In this work, we propose a robust and principled method to tackle these problems, by constructing generative models of modular network structure, incorporating layered, attributed and time-varying properties, as well as a nonparametric Bayesian methodology to infer the parameters from data and select the most appropriate model according to statistical evidence. We show that the method is capable of revealing hidden structure in layered, edge-valued, and time-varying networks, and that the most appropriate level of granularity with respect to the additional dimensions can be reliably identified. We illustrate our approach on a variety of empirical systems, including a social network of physicians, the voting correlations of deputies in the Brazilian national congress, the global airport network, and a proximity network of high-school students.

  13. Structure of the germline genome of Tetrahymena thermophila and relationship to the massively rearranged somatic genome

    PubMed Central

    Hamilton, Eileen P; Kapusta, Aurélie; Huvos, Piroska E; Bidwell, Shelby L; Zafar, Nikhat; Tang, Haibao; Hadjithomas, Michalis; Krishnakumar, Vivek; Badger, Jonathan H; Caler, Elisabet V; Russ, Carsten; Zeng, Qiandong; Fan, Lin; Levin, Joshua Z; Shea, Terrance; Young, Sarah K; Hegarty, Ryan; Daza, Riza; Gujja, Sharvari; Wortman, Jennifer R; Birren, Bruce W; Nusbaum, Chad; Thomas, Jainy; Carey, Clayton M; Pritham, Ellen J; Feschotte, Cédric; Noto, Tomoko; Mochizuki, Kazufumi; Papazyan, Romeo; Taverna, Sean D; Dear, Paul H; Cassidy-Hanley, Donna M; Xiong, Jie; Miao, Wei; Orias, Eduardo; Coyne, Robert S

    2016-01-01

    The germline genome of the binucleated ciliate Tetrahymena thermophila undergoes programmed chromosome breakage and massive DNA elimination to generate the somatic genome. Here, we present a complete sequence assembly of the germline genome and analyze multiple features of its structure and its relationship to the somatic genome, shedding light on the mechanisms of genome rearrangement as well as the evolutionary history of this remarkable germline/soma differentiation. Our results strengthen the notion that a complex, dynamic, and ongoing interplay between mobile DNA elements and the host genome has shaped Tetrahymena chromosome structure, locally and globally. Non-standard outcomes of rearrangement events, including the generation of short-lived somatic chromosomes and excision of DNA interrupting protein-coding regions, may represent novel forms of developmental gene regulation. We also compare Tetrahymena’s germline/soma differentiation to that of other characterized ciliates, illustrating the wide diversity of adaptations that have occurred within this phylum. DOI: http://dx.doi.org/10.7554/eLife.19090.001 PMID:27892853

  14. Allelic genome structural variations in maize detected by array comparative genome hybridization.

    PubMed

    Beló, André; Beatty, Mary K; Hondred, David; Fengler, Kevin A; Li, Bailin; Rafalski, Antoni

    2010-01-01

    DNA polymorphisms such as insertion/deletions and duplications affecting genome segments larger than 1 kb are known as copy-number variations (CNVs) or structural variations (SVs). They have been recently studied in animals and humans by using array-comparative genome hybridization (aCGH), and have been associated with several human diseases. Their presence and phenotypic effects in plants have not been investigated on a genomic scale, although individual structural variations affecting traits have been described. We used aCGH to investigate the presence of CNVs in maize by comparing the genome of 13 maize inbred lines to B73. Analysis of hybridization signal ratios of 60,472 60-mer oligonucleotide probes between inbreds in relation to their location in the reference genome (B73) allowed us to identify clusters of probes that deviated from the ratio expected for equal copy-numbers. We found CNVs distributed along the maize genome in all chromosome arms. They occur with appreciable frequency in different germplasm subgroups, suggesting ancient origin. Validation of several CNV regions showed both insertion/deletions and copy-number differences. The nature of CNVs detected suggests CNVs might have a considerable impact on plant phenotypes, including disease response and heterosis.
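
    The idea of finding clusters of probes whose hybridization ratio deviates from the equal-copy-number expectation can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the log2 threshold of 0.5, the minimum run of three probes, and the function name are all assumptions made for the example, and probes are assumed to be already ordered by position in the reference genome.

```python
import math


def find_cnv_clusters(ratios, threshold=0.5, min_probes=3):
    """Return (first_index, last_index, 'gain' | 'loss') for each run of
    consecutive probes whose log2(test/reference) ratio deviates from 0.

    ratios must be positive signal ratios ordered by genomic position.
    threshold and min_probes are illustrative values, not published settings.
    """
    clusters = []
    run_start, run_sign = None, 0
    # A sentinel ratio of 1.0 (log2 = 0) closes a run that reaches the last probe.
    for i, ratio in enumerate(list(ratios) + [1.0]):
        log_ratio = math.log2(ratio)
        if log_ratio >= threshold:
            sign = 1
        elif log_ratio <= -threshold:
            sign = -1
        else:
            sign = 0
        if sign != run_sign:
            if run_sign != 0 and i - run_start >= min_probes:
                label = "gain" if run_sign > 0 else "loss"
                clusters.append((run_start, i - 1, label))
            run_start, run_sign = i, sign
    return clusters


example = [1.0, 1.1, 2.0, 2.2, 1.9, 1.0, 0.5, 0.4, 0.45, 0.5, 1.05, 2.1, 1.0]
print(find_cnv_clusters(example))
```

    Requiring several adjacent probes to agree is one simple way to avoid calling a variant from a single noisy probe; real analyses typically use a dedicated segmentation method instead.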

  15. Structural Genomics and Drug Discovery for Infectious Diseases

    SciTech Connect

    Anderson, W.F.

    2010-09-03

    The application of structural genomics methods and approaches to proteins from organisms causing infectious diseases is making available the three dimensional structures of many proteins that are potential drug targets and laying the groundwork for structure aided drug discovery efforts. There are a number of structural genomics projects with a focus on pathogens that have been initiated worldwide. The Center for Structural Genomics of Infectious Diseases (CSGID) was recently established to apply state-of-the-art high throughput structural biology technologies to the characterization of proteins from the National Institute for Allergy and Infectious Diseases (NIAID) category A-C pathogens and organisms causing emerging, or re-emerging infectious diseases. The target selection process emphasizes potential biomedical benefits. Selected proteins include known drug targets and their homologs, essential enzymes, virulence factors and vaccine candidates. The Center also provides a structure determination service for the infectious disease scientific community. The ultimate goal is to generate a library of structures that are available to the scientific community and can serve as a starting point for further research and structure aided drug discovery for infectious diseases. To achieve this goal, the CSGID will determine protein crystal structures of 400 proteins and protein-ligand complexes using proven, rapid, highly integrated, and cost-effective methods for such determination, primarily by X-ray crystallography. High throughput crystallographic structure determination is greatly aided by frequent, convenient access to high-performance beamlines at third-generation synchrotron X-ray sources.

  16. Leveraging enzyme structure-function relationships for functional inference and experimental design: the structure-function linkage database.

    PubMed

    Pegg, Scott C-H; Brown, Shoshana D; Ojha, Sunil; Seffernick, Jennifer; Meng, Elaine C; Morris, John H; Chang, Patricia J; Huang, Conrad C; Ferrin, Thomas E; Babbitt, Patricia C

    2006-02-28

    The study of mechanistically diverse enzyme superfamilies (collections of enzymes that perform different overall reactions but share both a common fold and a distinct mechanistic step performed by key conserved residues) helps elucidate the structure-function relationships of enzymes. We have developed a resource, the structure-function linkage database (SFLD), to analyze these structure-function relationships. Unique to the SFLD is its hierarchical classification scheme based on linking the specific partial reactions (or other chemical capabilities) that are conserved at the superfamily, subgroup, and family levels with the conserved structural elements that mediate them. We present the results of analyses using the SFLD in correcting misannotations, guiding protein engineering experiments, and elucidating the function of recently solved enzyme structures from the structural genomics initiative. The SFLD is freely accessible at http://sfld.rbvi.ucsf.edu.

  17. Genomic structural variation in psychiatric disorders.

    PubMed

    Rucker, James J H; McGuffin, Peter

    2012-11-01

    Copy number variants (CNVs) are submicroscopic deletions and duplications of genomic material that were previously thought to be rare phenomena. They have now been robustly associated with a variety of disorders such as autism, schizophrenia, and attention-deficit/hyperactivity disorder through an emerging research base in affective disorders. A complex picture is emerging of a polygenic, heterogeneous model of disease, with CNVs conferring broad susceptibility to a variety of neurodevelopmental disorders, rather than specific disorders per se. Although the insights gleaned thus far only represent a small piece of a much larger puzzle, progress has been rapid and new technologies promise even more insights into these hitherto opaque brain disorders. We will discuss CNVs, the current state of evidence for their role in the pathogenesis of classical psychiatric disorders, and the application of such knowledge in clinical settings.

  18. Primary structure of the herpesvirus saimiri genome.

    PubMed Central

    Albrecht, J C; Nicholas, J; Biller, D; Cameron, K R; Biesinger, B; Newman, C; Wittmann, S; Craxton, M A; Coleman, H; Fleckenstein, B

    1992-01-01

    This report describes the complete nucleotide sequence of the genome of herpesvirus saimiri, the prototype of gammaherpesvirus subgroup 2 (rhadinoviruses). The unique low-G + C-content DNA region has 112,930 bp with an average base composition of 34.5% G + C and is flanked by about 35 noncoding high-G + C-content DNA repeats of 1,444 bp (70.8% G + C) in tandem orientation. We identified 76 major open reading frames and a set of seven U-RNA genes for a total of 83 potential genes. The genes are closely arranged, with only a few regions of sizable noncoding sequences. For 60 of the predicted proteins, homologous sequences are found in other herpesviruses. Genes conserved between herpesvirus saimiri and Epstein-Barr virus (gammaherpesvirus subgroup 1) show that their genomes are generally collinear, although conserved gene blocks are separated by unique genes that appear to determine the particular phenotype of these viruses. Several deduced protein sequences of herpesvirus saimiri without counterparts in most of the other sequenced herpesviruses exhibited significant homology with cellular proteins of known function. These include thymidylate synthase, dihydrofolate reductase, complement control proteins, the cell surface antigen CD59, cyclins, and G protein-coupled receptors. Searching for functional protein motifs revealed that the virus may encode a cytosine-specific methylase and a tyrosine-specific protein kinase. Several herpesvirus saimiri genes are potential candidates to cooperate with the gene for saimiri transformation-associated protein of subgroup A (STP-A) in T-lymphocyte growth stimulation. PMID:1321287
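
    The contrast reported above between the low-G + C unique region (34.5%) and the high-G + C terminal repeats (70.8%) rests on a simple base-composition measure. A minimal sketch of that calculation, with hypothetical function names and a sliding window added for illustration, might look like this:

```python
def gc_content(sequence):
    """Fraction of G and C among unambiguous bases (A, C, G, T) in a sequence."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def windowed_gc(sequence, window, step):
    """G + C fraction in successive windows, e.g. to locate composition shifts."""
    return [gc_content(sequence[i:i + window])
            for i in range(0, len(sequence) - window + 1, step)]


print(gc_content("GGCCAATT"))           # 0.5
print(windowed_gc("GGGGAAAA", 4, 4))    # [1.0, 0.0]
```

    Scanning a genome with such a window is one way a boundary between regions of different composition would show up as an abrupt change in the returned values.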

  19. Structural Determinants and Mechanism of HIV-1 Genome Packaging

    PubMed Central

    Lu, Kun; Heng, Xiao; Summers, Michael F.

    2011-01-01

    Like all retroviruses, the Human Immunodeficiency Virus (HIV) selectively packages two copies of its unspliced RNA genome, both of which are utilized for strand-transfer mediated recombination during reverse transcription – a process that enables rapid evolution under environmental and chemotherapeutic pressures. The viral RNA appears to be selected for packaging as a dimer, and there is evidence that dimerization and packaging are mechanistically coupled. Both processes are mediated by interactions between the nucleocapsid (NC) domains of a small number of assembling viral Gag polyproteins and RNA elements within the 5′-untranslated region (5′-UTR) of the genome. A number of secondary structures have been predicted for regions of the genome that are responsible for packaging, and high-resolution structures have been determined for a few small RNA fragments and protein-RNA complexes. However, major questions remain open regarding the RNA structures, and potentially the structural changes, that are responsible for dimeric genome selection. Here we review efforts that have been made to identify the molecular determinants and mechanism of HIV-1 genome packaging. PMID:21762803

  20. Insights into archaeal evolution and symbiosis from the genomes of a nanoarchaeon and its inferred crenarchaeal host from Obsidian Pool, Yellowstone National Park.

    PubMed

    Podar, Mircea; Makarova, Kira S; Graham, David E; Wolf, Yuri I; Koonin, Eugene V; Reysenbach, Anna-Louise

    2013-04-22

    A single cultured marine organism, Nanoarchaeum equitans, represents the Nanoarchaeota branch of symbiotic Archaea, with a highly reduced genome and unusual features such as multiple split genes. The first terrestrial hyperthermophilic member of the Nanoarchaeota was collected from Obsidian Pool, a thermal feature in Yellowstone National Park, separated by single cell isolation, and sequenced together with its putative host, a Sulfolobales archaeon. Both the new Nanoarchaeota (Nst1) and N. equitans lack most biosynthetic capabilities, and phylogenetic analysis of ribosomal RNA and protein sequences indicates that the two form a deep-branching archaeal lineage. However, the Nst1 genome is more than 20% larger, and encodes a complete gluconeogenesis pathway as well as the full complement of archaeal flagellum proteins. With a larger genome, a smaller repertoire of split protein encoding genes and no split non-contiguous tRNAs, Nst1 appears to have experienced less severe genome reduction than N. equitans. These findings imply that, rather than representing ancestral characters, the extremely compact genomes and multiple split genes of Nanoarchaeota are derived characters associated with their symbiotic or parasitic lifestyle. The inferred host of Nst1 is potentially autotrophic, with a streamlined genome and simplified central and energetic metabolism as compared to other Sulfolobales. Comparison of the N. equitans and Nst1 genomes suggests that the marine and terrestrial lineages of Nanoarchaeota share a common ancestor that was already a symbiont of another archaeon. The two distinct Nanoarchaeota-host genomic data sets offer novel insights into the evolution of archaeal symbiosis and parasitism, enabling further studies of the cellular and molecular mechanisms of these relationships. This article was reviewed by Patrick Forterre, Bettina Siebers (nominated by Michael Galperin) and Purification Lopez-Garcia.

  1. Insights into archaeal evolution and symbiosis from the genomes of a nanoarchaeon and its inferred crenarchaeal host from Obsidian Pool, Yellowstone National Park

    PubMed Central

    2013-01-01

    Background: A single cultured marine organism, Nanoarchaeum equitans, represents the Nanoarchaeota branch of symbiotic Archaea, with a highly reduced genome and unusual features such as multiple split genes. Results: The first terrestrial hyperthermophilic member of the Nanoarchaeota was collected from Obsidian Pool, a thermal feature in Yellowstone National Park, separated by single cell isolation, and sequenced together with its putative host, a Sulfolobales archaeon. Both the new Nanoarchaeota (Nst1) and N. equitans lack most biosynthetic capabilities, and phylogenetic analysis of ribosomal RNA and protein sequences indicates that the two form a deep-branching archaeal lineage. However, the Nst1 genome is more than 20% larger, and encodes a complete gluconeogenesis pathway as well as the full complement of archaeal flagellum proteins. With a larger genome, a smaller repertoire of split protein encoding genes and no split non-contiguous tRNAs, Nst1 appears to have experienced less severe genome reduction than N. equitans. These findings imply that, rather than representing ancestral characters, the extremely compact genomes and multiple split genes of Nanoarchaeota are derived characters associated with their symbiotic or parasitic lifestyle. The inferred host of Nst1 is potentially autotrophic, with a streamlined genome and simplified central and energetic metabolism as compared to other Sulfolobales. Conclusions: Comparison of the N. equitans and Nst1 genomes suggests that the marine and terrestrial lineages of Nanoarchaeota share a common ancestor that was already a symbiont of another archaeon. The two distinct Nanoarchaeota-host genomic data sets offer novel insights into the evolution of archaeal symbiosis and parasitism, enabling further studies of the cellular and molecular mechanisms of these relationships. Reviewers: This article was reviewed by Patrick Forterre, Bettina Siebers (nominated by Michael Galperin) and Purification Lopez-Garcia. PMID:23607440

  2. Geographic population structure analysis of worldwide human populations infers their biogeographical origins.

    PubMed

    Elhaik, Eran; Tatarinova, Tatiana; Chebotarev, Dmitri; Piras, Ignazio S; Maria Calò, Carla; De Montis, Antonella; Atzori, Manuela; Marini, Monica; Tofanelli, Sergio; Francalacci, Paolo; Pagani, Luca; Tyler-Smith, Chris; Xue, Yali; Cucca, Francesco; Schurr, Theodore G; Gaieski, Jill B; Melendez, Carlalynne; Vilar, Miguel G; Owings, Amanda C; Gómez, Rocío; Fujita, Ricardo; Santos, Fabrício R; Comas, David; Balanovsky, Oleg; Balanovska, Elena; Zalloua, Pierre; Soodyall, Himla; Pitchappan, Ramasamy; Ganeshprasad, Arunkumar; Hammer, Michael; Matisoo-Smith, Lisa; Wells, R Spencer

    2014-04-29

    The search for a method that utilizes biological information to predict humans' place of origin has occupied scientists for millennia. Over the past four decades, scientists have employed genetic data in an effort to achieve this goal but with limited success. While biogeographical algorithms using next-generation sequencing data have achieved an accuracy of 700 km in Europe, they were inaccurate elsewhere. Here we describe the Geographic Population Structure (GPS) algorithm and demonstrate its accuracy with three data sets using 40,000-130,000 SNPs. GPS placed 83% of worldwide individuals in their country of origin. Applied to over 200 Sardinian villagers, GPS placed a quarter of them in their villages and most of the rest within 50 km of their villages. GPS's accuracy and power to infer the biogeography of worldwide individuals down to their country or, in some cases, village of origin underscore the promise of admixture-based methods for biogeography and have ramifications for genetic ancestry testing.

  3. The Impact of Structural Genomics: Expectations and Outcomes

    SciTech Connect

    Chandonia, John-Marc; Brenner, Steven E.

    2005-12-21

    Structural Genomics (SG) projects aim to expand our structural knowledge of biological macromolecules, while lowering the average costs of structure determination. We quantitatively analyzed the novelty, cost, and impact of structures solved by SG centers, and contrasted these results with traditional structural biology. The first structure from a protein family is particularly important to reveal the fold and ancient relationships to other proteins. In the last year, approximately half of such structures were solved at a SG center rather than in a traditional laboratory. Furthermore, the cost of solving a structure at the most efficient U.S. center has now dropped to one-quarter the estimated cost of solving a structure by traditional methods. However, top structural biology laboratories are much more efficient than the average, and comparable to SG centers despite working on very challenging structures. Moreover, traditional structural biology papers are cited significantly more often, suggesting greater current impact.

  4. The use of structural modelling to infer structure and function in biocontrol agents.

    PubMed

    Berry, Colin; Board, Jason

    2017-01-01

    Homology modelling can provide important insights into the structures of proteins when a related protein structure has already been solved. However, for many proteins, including a number of invertebrate-active toxins and accessory proteins, no such templates exist. In these cases, techniques of ab initio, template-independent modelling can be employed to generate models that may give insight into structure and function. In this overview, examples of both the problems and the potential benefits of ab initio techniques are illustrated. Consistent modelling results may indicate useful approximations to actual protein structures and can thus allow the generation of hypotheses regarding activity that can be tested experimentally.

  5. Life-history traits of the Miocene Hipparion concudense (Spain) inferred from bone histological structure.

    PubMed

    Martinez-Maza, Cayetana; Alberdi, Maria Teresa; Nieto-Diaz, Manuel; Prado, José Luis

    2014-01-01

    Histological analyses of fossil bones have provided clues on the growth patterns and life history traits of several extinct vertebrates that would be unavailable for classical morphological studies. We analyzed the bone histology of Hipparion to infer features of its life history traits and growth pattern. Microscope analysis of thin sections of a large sample of humeri, femora, tibiae and metapodials of Hipparion concudense from the upper Miocene site of Los Valles de Fuentidueña (Segovia, Spain) has shown that the number of growth marks is similar among the different limb bones, suggesting that equivalent skeletochronological inferences for this Hipparion population might be achieved by means of any of the elements studied. Considering their abundance, we conducted a skeletochronological study based on the large sample of third metapodials from Los Valles de Fuentidueña together with another large sample from the Upper Miocene locality of Concud (Teruel, Spain). The data obtained enabled us to distinguish four age groups in both samples and to determine that Hipparion concudense tended to reach skeletal maturity during its third year of life. Integration of bone microstructure and skeletochronological data allowed us to identify ontogenetic changes in bone structure and growth rate and to distinguish three histologic ontogenetic stages corresponding to immature, subadult and adult individuals. Data on secondary osteon density revealed an increase in bone remodeling throughout the ontogenetic stages and a lesser degree thereof in the Concud population, which indicates different biomechanical stresses in the two populations, likely due to environmental differences. Several individuals showed atypical growth patterns in the Concud sample, which may also reflect environmental differences between the two localities. Finally, classification of the specimens' age within groups enabled us to characterize the age structure of both samples, which is typical of

  6. Life-History Traits of the Miocene Hipparion concudense (Spain) Inferred from Bone Histological Structure

    PubMed Central

    Martinez-Maza, Cayetana; Alberdi, Maria Teresa; Nieto-Diaz, Manuel; Prado, José Luis

    2014-01-01

    Histological analyses of fossil bones have provided clues on the growth patterns and life history traits of several extinct vertebrates that would be unavailable for classical morphological studies. We analyzed the bone histology of Hipparion to infer features of its life history traits and growth pattern. Microscope analysis of thin sections of a large sample of humeri, femora, tibiae and metapodials of Hipparion concudense from the upper Miocene site of Los Valles de Fuentidueña (Segovia, Spain) has shown that the number of growth marks is similar among the different limb bones, suggesting that equivalent skeletochronological inferences for this Hipparion population might be achieved by means of any of the elements studied. Considering their abundance, we conducted a skeletochronological study based on the large sample of third metapodials from Los Valles de Fuentidueña together with another large sample from the Upper Miocene locality of Concud (Teruel, Spain). The data obtained enabled us to distinguish four age groups in both samples and to determine that Hipparion concudense tended to reach skeletal maturity during its third year of life. Integration of bone microstructure and skeletochronological data allowed us to identify ontogenetic changes in bone structure and growth rate and to distinguish three histologic ontogenetic stages corresponding to immature, subadult and adult individuals. Data on secondary osteon density revealed an increase in bone remodeling throughout the ontogenetic stages and a lesser degree thereof in the Concud population, which indicates different biomechanical stresses in the two populations, likely due to environmental differences. Several individuals showed atypical growth patterns in the Concud sample, which may also reflect environmental differences between the two localities. Finally, classification of the specimens’ age within groups enabled us to characterize the age structure of both samples, which is typical of

  7. Phylogeny of Oedogoniales, Chaetophorales and Chaetopeltidales (Chlorophyceae): inferences from sequence-structure analysis of ITS2

    PubMed Central

    Buchheim, Mark A.; Sutherland, Danica M.; Schleicher, Tina; Förster, Frank; Wolf, Matthias

    2012-01-01

    Background and Aims: The green algal class Chlorophyceae comprises five orders (Chlamydomonadales, Sphaeropleales, Chaetophorales, Chaetopeltidales and Oedogoniales). Attempts to resolve the relationships among these groups have met with limited success. Studies of single genes (18S rRNA, 26S rRNA, rbcL or atpB) have largely failed to unambiguously resolve the relative positions of Oedogoniales, Chaetophorales and Chaetopeltidales (the OCC taxa). In contrast, recent genomics analyses of plastid data from OCC exemplars provided a robust phylogenetic analysis that supports a monophyletic OCC alliance. Methods: An ITS2 data set was assembled to independently test the OCC hypothesis and to evaluate the performance of these data in assessing green algal phylogeny at the ordinal or class level. Sequence-structure analysis designed for use with ITS2 data was employed for phylogenetic reconstruction. Key Results: Results of this study yielded trees that were, in general, topologically congruent with the results from the genomic analyses, including support for the monophyly of the OCC alliance. Conclusions: Not all nodes from the ITS2 analyses exhibited robust support, but our investigation demonstrates that sequence-structure analyses of ITS2 provide a taxon-rich means of testing phylogenetic hypotheses at high taxonomic levels. Thus, the ITS2 data, in the context of sequence-structure analysis, provide an economical supplement or alternative to the single-marker approaches used in green algal phylogeny. PMID:22028463

  8. Multi-scale structural community organisation of the human genome.

    PubMed

    Boulos, Rasha E; Tremblay, Nicolas; Arneodo, Alain; Borgnat, Pierre; Audit, Benjamin

    2017-04-11

    Structural interaction frequency matrices between all genome loci are now experimentally achievable thanks to high-throughput chromosome conformation capture technologies. This raises a new methodological challenge for computational biology, which consists in objectively extracting from these data the structural motifs characteristic of genome organisation. We deployed the fast multi-scale community mining algorithm based on spectral graph wavelets to characterise the networks of intra-chromosomal interactions in human cell lines. We observed that there exist structural domains of all sizes up to chromosome length and demonstrated that the set of structural communities forms a hierarchy of chromosome segments. Hence, at all scales, chromosome folding predominantly involves interactions between neighbouring sites rather than the formation of links between distant loci. Multi-scale structural decomposition of human chromosomes provides an original framework to question structural organisation and its relationship to functional regulation across the scales. By construction, the proposed methodology is independent of the precise assembly of the reference genome and is thus directly applicable to genomes whose assembly is not fully determined.

  9. Structural Genomics of Bacterial Virulence Factors

    DTIC Science & Technology

    2004-05-01

    drug design. In this first year of funding we have focused our attention on plasmid annotation, target selection, protein expression, purification and crystallization of proteins encoded by the Bacillus anthracis pXO1 plasmid. We have cloned and expressed a total of 35 new proteins, and structural analysis of several of these is underway. Currently, 3 new crystal structures are essentially complete, and 6 crystal structures of anthrax Lethal Factor in complex with small molecule inhibitors provided by our collaborators have been determined, and lodged in the public data

  10. Benefits of Structural Genomics for Drug Discovery Research

    SciTech Connect

    Grabowski, M.; Chruszcz, M; Zimmerman, M; Kirillova, O; Minor, W

    2009-01-01

    While three dimensional structures have long been used to search for new drug targets, only a fraction of new drugs coming to the market has been developed with the use of a structure-based drug discovery approach. However, the recent years have brought not only an avalanche of new macromolecular structures, but also significant advances in the protein structure determination methodology only now making their way into structure-based drug discovery. In this paper, we review recent developments resulting from the Structural Genomics (SG) programs, focusing on the methods and results most likely to improve our understanding of the molecular foundation of human diseases. SG programs have been around for almost a decade, and in that time, have contributed a significant part of the structural coverage of both the genomes of pathogens causing infectious diseases and structurally uncharacterized biological processes in general. Perhaps most importantly, SG programs have developed new methodology at all steps of the structure determination process, not only to determine new structures highly efficiently, but also to screen protein/ligand interactions. We describe the methodologies, experience and technologies developed by SG, which range from improvements to cloning protocols to improved procedures for crystallographic structure solution that may be applied in 'traditional' structural biology laboratories particularly those performing drug discovery. We also discuss the conditions that must be met to convert the present high-throughput structure determination pipeline into a high-output structure-based drug discovery system.

  11. Benefits of Structural Genomics for Drug Discovery Research

    PubMed Central

    Grabowski, Marek; Chruszcz, Maksymilian; Zimmerman, Matthew D.; Kirillova, Olga; Minor, Wladek

    2010-01-01

    While three dimensional structures have long been used to search for new drug targets, only a fraction of new drugs coming to the market has been developed with the use of a structure-based drug discovery approach. However, the recent years have brought not only an avalanche of new macromolecular structures, but also significant advances in the protein structure determination methodology only now making their way into structure-based drug discovery. In this paper, we review recent developments resulting from the Structural Genomics (SG) programs, focusing on the methods and results most likely to improve our understanding of the molecular foundation of human diseases. SG programs have been around for almost a decade, and in that time, have contributed a significant part of the structural coverage of both the genomes of pathogens causing infectious diseases and structurally uncharacterized biological processes in general. Perhaps most importantly, SG programs have developed new methodology at all steps of the structure determination process, not only to determine new structures highly efficiently, but also to screen protein/ligand interactions. We describe the methodologies, experience and technologies developed by SG, which range from improvements to cloning protocols to improved procedures for crystallographic structure solution that may be applied in “traditional” structural biology laboratories particularly those performing drug discovery. We also discuss the conditions that must be met to convert the present high-throughput structure determination pipeline into a high-output structure-based drug discovery system. PMID:19594422

  12. Megabase replication domains along the human genome: relation to chromatin structure and genome organisation.

    PubMed

    Audit, Benjamin; Zaghloul, Lamia; Baker, Antoine; Arneodo, Alain; Chen, Chun-Long; d'Aubenton-Carafa, Yves; Thermes, Claude

    2013-01-01

    In higher eukaryotes, the absence of specific sequence motifs marking the origins of replication has been a serious hindrance to the understanding of (i) the mechanisms that regulate the spatio-temporal replication program, and (ii) the links between origin activation, chromatin structure and transcription. In this chapter, we review the partitioning of the human genome into megabase-sized replication domains delineated as N-shaped motifs in the strand compositional asymmetry profiles. They collectively span 28.3% of the genome and are bordered by more than 1,000 putative replication origins. We recapitulate the comparison of this partition of the human genome with high-resolution experimental data that confirms that replication domain borders are likely to be preferential replication initiation zones in the germline. In addition, we highlight the specific distribution of experimental and numerical chromatin marks along replication domains. Domain borders correspond to particular open chromatin regions, possibly encoded in the DNA sequence, and around which replication and transcription are highly coordinated. These regions also present a high evolutionary breakpoint density, suggesting that susceptibility to breakage might be linked to local open chromatin fiber state. Altogether, this chapter presents a compartmentalization of the human genome into replication domains that are landmarks of the human genome organization and are likely to play a key role in genome dynamics during evolution and in pathological situations.

  13. Mitochondrial Genome of Palpitomonas bilix: Derived Genome Structure and Ancestral System for Cytochrome c Maturation

    PubMed Central

    Nishimura, Yuki; Tanifuji, Goro; Kamikawa, Ryoma; Yabuki, Akinori; Hashimoto, Tetsuo; Inagaki, Yuji

    2016-01-01

    Here we report the mitochondrial (mt) genome of one of the heterotrophic microeukaryotes related to cryptophytes, Palpitomonas bilix. The P. bilix mt genome was found to be a linear molecule composed of a “single copy region” (∼16 kb) and repeat regions (∼30 kb) arranged in an inverse manner at both ends of the genome. Linear mt genomes with large inverted repeats are known for three distantly related eukaryotes (including P. bilix), suggesting that this particular mt genome structure has emerged at least three times in the eukaryotic tree of life. The P. bilix mt genome contains 47 protein-coding genes including ccmA, ccmB, ccmC, and ccmF, which encode protein subunits involved in the system for cytochrome c maturation inherited from a bacterium (System I). We present data indicating that the phylogenetic relatives of P. bilix, namely, cryptophytes, goniomonads, and kathablepharids, utilize an alternative system for cytochrome c maturation, which has most likely emerged during the evolution of eukaryotes (System III). To explain the distribution of Systems I and III in P. bilix and its phylogenetic relatives, two scenarios are possible: (i) System I was replaced by System III on the branch leading to the common ancestor of cryptophytes, goniomonads, and kathablepharids, and (ii) the two systems co-existed in their common ancestor and were lost differentially among the four descendants. PMID:27604877

  14. Evaluating the Influence of the Microsatellite Marker Set on the Genetic Structure Inferred in Pyrus communis L.

    PubMed Central

    Urrestarazu, Jorge; Royo, José B.; Santesteban, Luis G.; Miranda, Carlos

    2015-01-01

    Fingerprinting information can be used to elucidate in a robust manner the genetic structure of germplasm collections, allowing a more rational and fine assessment of genetic resources. Bayesian model-based approaches are now widely preferred to infer genetic structure, but it is still largely unresolved how marker sets should be built in order to obtain a robust inference. The objective was to evaluate, in Pyrus germplasm collections, the influence of the SSR marker set size on the genetic structure inferred, also evaluating the influence of the criterion used to select those markers. Inferences were performed considering an increasing number of SSR markers that ranged from just two up to 25, incorporated one at a time into the analysis. The influence of the number of SSR markers used was evaluated by comparing the number of populations and the strength of the signal detected, and also the similarity of the genotype assignments to populations between analyses. In order to test whether those results were influenced by the criterion used to select the SSRs, several selection scenarios based on the discrimination power or the fixation index values of the SSRs were tested. Our results indicate that population structure could be inferred accurately once a certain SSR number threshold was reached, which depended on the underlying structure within the genotypes, but the method used to select the markers included in each set appeared not to be very relevant. The minimum number of SSRs required to provide robust structure inferences and adequate measurements of the differentiation, even when low differentiation levels exist within populations, proved similar to that of the complete list of recommended markers for fingerprinting. When an SSR set size similar to the minimum marker sets recommended for fingerprinting is used, only major divisions or moderate (FST>0.05) differentiation of the germplasm are detected. PMID:26382618

  15. Genomic Alteration in Head and Neck Squamous Cell Carcinoma (HNSCC) Cell Lines Inferred from Karyotyping, Molecular Cytogenetics, and Array Comparative Genomic Hybridization.

    PubMed

    Singchat, Worapong; Hitakomate, Ekarat; Rerkarmnuaychoke, Budsaba; Suntronpong, Aorarat; Fu, Beiyuan; Bodhisuwan, Winai; Peyachoknagul, Surin; Yang, Fengtang; Koontongkaew, Sittichai; Srikulnath, Kornsorn

    2016-01-01

    Genomic alteration in head and neck squamous cell carcinoma (HNSCC) was studied in two cell line pairs (HN30-HN31 and HN4-HN12) using conventional C-banding, multiplex fluorescence in situ hybridization (M-FISH), and array comparative genomic hybridization (array CGH). HN30 and HN4 were derived from primary lesions in the pharynx and base of tongue, respectively, and HN31 and HN12 were derived from lymph-node metastatic lesions belonging to the same patients. Gains of chromosomes 1, 7, and 11 were shared in almost all cell lines. Hierarchical clustering revealed that HN31 was closely related to HN4, which shared eight chromosome alteration cases. Large C-positive heterochromatins were found in the centromeric region of chromosome 9 in HN31 and HN4, which suggests complex structural amplification of the repetitive sequence. Array CGH revealed amplification of 7p22.3p11.2, 8q11.23q12.1, and 14q32.33 in all cell lines involved with tumorigenesis and inflammation genes. The amplification of 2p21 (SIX3), 11p15.5 (H19), and 11q21q22.3 (MAML2, PGR, TRPC6, and MMP family) regions, and deletion of 9p23 (PTPRD) and 16q23.1 (WWOX) regions were identified in HN31 and HN12. Interestingly, partial loss of PTPRD (9p23) and WWOX (16q23.1) genes was identified in HN31 and HN12, and PTPRD expression tended to be down-regulated, with no detectable expression of the WWOX gene. This suggests that the scarcity of PTPRD and WWOX genes might have played an important role in progression of HNSCC, and could be considered as a target for cancer therapy or a biomarker in molecular pathology.

  16. Genomic Alteration in Head and Neck Squamous Cell Carcinoma (HNSCC) Cell Lines Inferred from Karyotyping, Molecular Cytogenetics, and Array Comparative Genomic Hybridization

    PubMed Central

    Rerkarmnuaychoke, Budsaba; Suntronpong, Aorarat; Fu, Beiyuan; Bodhisuwan, Winai; Peyachoknagul, Surin; Yang, Fengtang; Koontongkaew, Sittichai; Srikulnath, Kornsorn

    2016-01-01

    Genomic alteration in head and neck squamous cell carcinoma (HNSCC) was studied in two cell line pairs (HN30-HN31 and HN4-HN12) using conventional C-banding, multiplex fluorescence in situ hybridization (M-FISH), and array comparative genomic hybridization (array CGH). HN30 and HN4 were derived from primary lesions in the pharynx and base of tongue, respectively, and HN31 and HN12 were derived from lymph-node metastatic lesions belonging to the same patients. Gains of chromosomes 1, 7, and 11 were shared in almost all cell lines. Hierarchical clustering revealed that HN31 was closely related to HN4, which shared eight chromosome alteration cases. Large C-positive heterochromatins were found in the centromeric region of chromosome 9 in HN31 and HN4, which suggests complex structural amplification of the repetitive sequence. Array CGH revealed amplification of 7p22.3p11.2, 8q11.23q12.1, and 14q32.33 in all cell lines involved with tumorigenesis and inflammation genes. The amplification of 2p21 (SIX3), 11p15.5 (H19), and 11q21q22.3 (MAML2, PGR, TRPC6, and MMP family) regions, and deletion of 9p23 (PTPRD) and 16q23.1 (WWOX) regions were identified in HN31 and HN12. Interestingly, partial loss of PTPRD (9p23) and WWOX (16q23.1) genes was identified in HN31 and HN12, and PTPRD expression tended to be down-regulated, with no detectable expression of the WWOX gene. This suggests that the scarcity of PTPRD and WWOX genes might have played an important role in progression of HNSCC, and could be considered as a target for cancer therapy or a biomarker in molecular pathology. PMID:27501229

  17. Characteristics of de novo structural changes in the human genome

    PubMed Central

    Kloosterman, Wigard P.; Francioli, Laurent C.; Hormozdiari, Fereydoun; Marschall, Tobias; Hehir-Kwa, Jayne Y.; Abdellaoui, Abdel; Lameijer, Eric-Wubbo; Moed, Matthijs H.; Koval, Vyacheslav; Renkens, Ivo; van Roosmalen, Markus J.; Arp, Pascal; Karssen, Lennart C.; Coe, Bradley P.; Handsaker, Robert E.; Suchiman, Eka D.; Cuppen, Edwin; Thung, Djie Tjwan; McVey, Mitch; Wendl, Michael C.; Uitterlinden, André; van Duijn, Cornelia M.; Swertz, Morris A.; Wijmenga, Cisca; van Ommen, GertJan B.; Slagboom, P. Eline; Boomsma, Dorret I.; Schönhuth, Alexander; Eichler, Evan E.; de Bakker, Paul I.W.; Ye, Kai; Guryev, Victor

    2015-01-01

    Small insertions and deletions (indels) and large structural variations (SVs) are major contributors to human genetic diversity and disease. However, mutation rates and characteristics of de novo indels and SVs in the general population have remained largely unexplored. We report 332 validated de novo structural changes identified in whole genomes of 250 families, including complex indels, retrotransposon insertions, and interchromosomal events. These data indicate a mutation rate of 2.94 indels (1–20 bp) and 0.16 SVs (>20 bp) per generation. De novo structural changes affect on average 4.1 kbp of genomic sequence and 29 coding bases per generation, which is 91 and 52 times more nucleotides than de novo substitutions, respectively. This contrasts with the equal genomic footprint of inherited SVs and substitutions. An excess of structural changes originated on paternal haplotypes. Additionally, we observed a nonuniform distribution of de novo SVs across offspring. These results reveal the importance of different mutational mechanisms to changes in human genome structure across generations. PMID:25883321
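
    The ratios quoted in this abstract can be inverted to see what they imply about de novo substitutions. The following is a back-of-envelope sketch under the assumption that the rounded values in the abstract can be divided directly; the derived numbers are illustrative and are not reported in the abstract, and all variable names are hypothetical.

```python
# Per-generation figures quoted in the abstract.
INDEL_RATE = 2.94            # de novo indels (1-20 bp) per generation
SV_RATE = 0.16               # de novo SVs (>20 bp) per generation
STRUCTURAL_BP = 4100.0       # genomic sequence affected by structural changes (4.1 kbp)
STRUCTURAL_CODING_BP = 29.0  # coding bases affected by structural changes
GENOME_FOLD = 91.0           # structural vs. substitution footprint, genome-wide
CODING_FOLD = 52.0           # structural vs. substitution footprint, coding bases

# Combined rate of de novo structural changes per generation.
structural_events = INDEL_RATE + SV_RATE

# Substitution footprints implied by the quoted fold differences (derived, not reported).
implied_substitution_bp = STRUCTURAL_BP / GENOME_FOLD
implied_coding_substitution_bp = STRUCTURAL_CODING_BP / CODING_FOLD

print(f"structural events per generation: {structural_events:.2f}")
print(f"implied substituted bases per generation: {implied_substitution_bp:.1f}")
print(f"implied substituted coding bases per generation: {implied_coding_substitution_bp:.2f}")
```

    Read this way, about three structural events per generation would alter far more bases than the few dozen substituted bases implied by the quoted ratio.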

  18. RNA structure inference through chemical mapping after accidental or intentional mutations.

    PubMed

    Cheng, Clarence Y; Kladwang, Wipapat; Yesselman, Joseph D; Das, Rhiju

    2017-09-12

    Despite the critical roles RNA structures play in regulating gene expression, sequencing-based methods for experimentally determining RNA base pairs have remained inaccurate. Here, we describe a multidimensional chemical-mapping method called "mutate-and-map read out through next-generation sequencing" (M2-seq) that takes advantage of sparsely mutated nucleotides to induce structural perturbations at partner nucleotides and then detects these events through dimethyl sulfate (DMS) probing and mutational profiling. In special cases, fortuitous errors introduced during DNA template preparation and RNA transcription are sufficient to give M2-seq helix signatures; these signals were previously overlooked or mistaken for correlated double-DMS events. When mutations are enhanced through error-prone PCR, in vitro M2-seq experimentally resolves 33 of 68 helices in diverse structured RNAs including ribozyme domains, riboswitch aptamers, and viral RNA domains with a single false positive. These inferences do not require energy minimization algorithms and can be made by either direct visual inspection or by a neural-network-inspired algorithm called M2-net. Measurements on the P4-P6 domain of the Tetrahymena group I ribozyme embedded in Xenopus egg extract demonstrate the ability of M2-seq to detect RNA helices in a complex biological environment.

  19. Hippocampal Structure Predicts Statistical Learning and Associative Inference Abilities during Development.

    PubMed

    Schlichting, Margaret L; Guarino, Katharine F; Schapiro, Anna C; Turk-Browne, Nicholas B; Preston, Alison R

    2017-01-01

    Despite the importance of learning and remembering across the lifespan, little is known about how the episodic memory system develops to support the extraction of associative structure from the environment. Here, we relate individual differences in volumes along the hippocampal long axis to performance on statistical learning and associative inference tasks (both of which require encoding associations that span multiple episodes) in a developmental sample ranging from ages 6 to 30 years. Relating age to volume, we found dissociable patterns across the hippocampal long axis, with opposite nonlinear volume changes in the head and body. These structural differences were paralleled by performance gains across the age range on both tasks, suggesting improvements in the cross-episode binding ability from childhood to adulthood. Controlling for age, we also found that smaller hippocampal heads were associated with superior behavioral performance on both tasks, consistent with this region's hypothesized role in forming generalized codes spanning events. Collectively, these results highlight the importance of examining hippocampal development as a function of position along the hippocampal axis and suggest that the hippocampal head is particularly important in encoding associative structure across development.

  20. Developing JSequitur to Study the Hierarchical Structure of Biological Sequences in a Grammatical Inference Framework of String Compression Algorithms.

    PubMed

    Galbadrakh, Bulgan; Lee, Kyung-Eun; Park, Hyun-Seok

    2012-12-01

    Grammatical inference methods are expected to find grammatical structures hidden in biological sequences. One hopes that studies of grammar serve as an appropriate tool for theory formation. Thus, we have developed JSequitur for automatically generating the grammatical structure of biological sequences in an inference framework of string compression algorithms. Our original motivation was to find any grammatical traits of several cancer genes that can be detected by string compression algorithms. Through this research, we have not yet found any meaningful unique traits of the cancer genes, but we did observe some interesting traits with regard to the relationship among gene length, similarity of sequences, the patterns of the generated grammar, and compression rate.
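
    To make the idea of grammar inference by string compression concrete, here is a minimal Python sketch. It is not JSequitur and does not implement the Sequitur algorithm; it uses a simpler greedy digram-replacement scheme (in the style of Re-Pair), chosen only because it is short. The function names, the rule labels (R1, R2, ...), and the toy sequence are all hypothetical.

```python
from collections import Counter


def infer_grammar(sequence: str):
    """Greedy digram replacement: repeatedly turn the most frequent adjacent pair into a rule."""
    symbols = list(sequence)
    rules = {}  # nonterminal name -> (left symbol, right symbol)
    while True:
        # Counts adjacent pairs, including overlapping ones (e.g. 'AAA' counts 'AA' twice).
        pair_counts = Counter(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        pair, count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so no further compression is possible
        name = f"R{len(rules) + 1}"
        rules[name] = pair
        rewritten, i = [], 0
        while i < len(symbols):  # left-to-right, non-overlapping replacement
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                rewritten.append(name)
                i += 2
            else:
                rewritten.append(symbols[i])
                i += 1
        symbols = rewritten
    return symbols, rules


def expand(symbols, rules) -> str:
    """Expand nonterminals recursively to recover the original string."""
    return "".join(expand(rules[s], rules) if s in rules else s for s in symbols)


toy_sequence = "ATGATGATGCCATG"  # hypothetical toy input
start, rules = infer_grammar(toy_sequence)
# Grammar size = symbols in the start string plus two symbols per rule.
compression = (len(start) + 2 * len(rules)) / len(toy_sequence)
print(start, rules, round(compression, 2))
```

    The rules form a hierarchy (later rules refer to earlier ones), which is the kind of structure a grammar-based compressor exposes; the ratio of grammar size to sequence length is one simple way to express a compression rate.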

  1. On the Structure of Cortical Microcircuits Inferred from Small Sample Sizes.

    PubMed

    Vegué, Marina; Perin, Rodrigo; Roxin, Alex

    2017-08-30

    The structure in cortical microcircuits deviates from what would be expected in a purely random network, which has been seen as evidence of clustering. To address this issue, we sought to reproduce the nonrandom features of cortical circuits by considering several distinct classes of network topology, including clustered networks, networks with distance-dependent connectivity, and those with broad degree distributions. To our surprise, we found that all of these qualitatively distinct topologies could account equally well for all reported nonrandom features despite being easily distinguishable from one another at the network level. This apparent paradox was a consequence of estimating network properties given only small sample sizes. In other words, networks that differ markedly in their global structure can look quite similar locally. This makes inferring network structure from small sample sizes, a necessity given the technical difficulty inherent in simultaneous intracellular recordings, problematic. We found that a network statistic called the sample degree correlation (SDC) overcomes this difficulty. The SDC depends only on parameters that can be estimated reliably given small sample sizes and is an accurate fingerprint of every topological family. We applied the SDC criterion to data from rat visual and somatosensory cortex and discovered that the connectivity was not consistent with any of these main topological classes. However, we were able to fit the experimental data with a more general network class, of which all previous topologies were special cases. The resulting network topology could be interpreted as a combination of physical spatial dependence and nonspatial, hierarchical clustering. SIGNIFICANCE STATEMENT: The connectivity of cortical microcircuits exhibits features that are inconsistent with a simple random network. Here, we show that several classes of network models can account for this nonrandom structure despite qualitative differences in

  2. Phylogenetic inference and SSR characterization of tropical woody bamboos tribe Bambuseae (Poaceae: Bambusoideae) based on complete plastid genome sequences.

    PubMed

    Vieira, Leila do Nascimento; Dos Anjos, Karina Goulart; Faoro, Helisson; Fraga, Hugo Pacheco de Freitas; Greco, Thiago Machado; Pedrosa, Fábio de Oliveira; de Souza, Emanuel Maltempi; Rogalski, Marcelo; de Souza, Robson Francisco; Guerra, Miguel Pedro

    2016-05-01

    Complete plastome sequencing is an efficient option for increasing phylogenetic resolution in evolutionary studies and may greatly facilitate the use of plastid DNA markers in plant population genetic studies. Merostachys and Guadua stand out as the most common bamboos indigenous to Brazil and those with the highest potential for utilization. Here, we sequenced the complete plastomes of the Brazilian Guadua chacoensis and Merostachys sp. to perform a full plastome phylogeny and to characterize the occurrence, type, and distribution of SSRs using 20 Bambuseae species. The plastome sequences of Merostachys sp. and G. chacoensis are 136,334 and 135,403 bp in size, respectively, with identical gene content and the typical quadripartite structure consisting of a pair of IRs separated by the LSC and SSC regions. The Maximum Likelihood and Bayesian Inference analyses produced phylogenomic trees identical in topology. These trees supported monophyly of the Paleotropical and Neotropical bamboo clades. The Neotropical bamboos segregated into three well-supported lineages, Chusqueinae, Guaduinae, and Arthrostylidiinae, with the last two forming a well-supported sister relationship. Paleotropical bamboos segregated into two well-supported lineages, Hickeliinae and Bambusinae + Melocanninae. We identified an average of 141.8 cpSSRs in Bambuseae plastomes and a lower value (38.15) in plastome coding sequences. Among them, we identified 16 polymorphic SSR loci, with the number of alleles varying from 3 to 10. These 16 polymorphic cpSSR loci in the Bambuseae plastome can be assessed for intraspecific polymorphism, leading to innovative, highly sensitive phylogeographic and population genetics studies for this tribe.
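
    Characterizing the occurrence, type, and distribution of SSRs amounts to scanning a sequence for short motifs repeated in tandem. The Python sketch below shows one simple way to do this with regular expressions. The minimum repeat thresholds, the function names, and the toy sequence are illustrative assumptions; the abstract does not state which tool or settings the authors used.

```python
import re

# Minimum repeat counts per motif length (1-6 bp); illustrative assumptions only.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repetition of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n))


def find_ssrs(seq: str):
    """Return (start, motif, repeat_count) for perfect tandem repeats, sorted by position."""
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # A k-bp motif followed by at least (min_rep - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):  # skip e.g. 'AA' repeats already reported as 'A'
                hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


# Hypothetical toy sequence containing a mono-, a di-, and a trinucleotide repeat.
toy_seq = "GGC" + "A" * 12 + "CGC" + "AT" * 7 + "GCA" + "TTC" * 5 + "G"
hits = find_ssrs(toy_seq)
print(hits)  # [(3, 'A', 12), (18, 'AT', 7), (35, 'TTC', 5)]
```

    Positions are 0-based. Counting hits per motif length gives the "type" distribution, and intersecting the positions with gene annotations would separate coding from non-coding SSRs.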

  3. Coevolution of the Organization and Structure of Prokaryotic Genomes.

    PubMed

    Touchon, Marie; Rocha, Eduardo P C

    2016-01-04

    The cytoplasm of prokaryotes contains many molecular machines interacting directly with the chromosome. These vital interactions depend on the chromosome structure, as a molecule, and on the genome organization, as a unit of genetic information. Strong selection for the organization of the genetic elements implicated in these interactions drives replicon ploidy, gene distribution, operon conservation, and the formation of replication-associated traits. The genomes of prokaryotes are also very plastic with high rates of horizontal gene transfer and gene loss. The evolutionary conflicts between plasticity and organization lead to the formation of regions with high genetic diversity whose impact on chromosome structure is poorly understood. Prokaryotic genomes are remarkable documents of natural history because they carry the imprint of all of these selective and mutational forces. Their study allows a better understanding of molecular mechanisms, their impact on microbial evolution, and how they can be tinkered with in synthetic biology.

  4. Phylogeography and population structure of the biologically invasive phytopathogen Erwinia amylovora inferred using minisatellites.

    PubMed

    Bühlmann, Andreas; Dreo, Tanja; Rezzonico, Fabio; Pothier, Joël F; Smits, Theo H M; Ravnikar, Maja; Frey, Jürg E; Duffy, Brion

    2014-07-01

    Erwinia amylovora causes a major disease of pome fruit trees worldwide, and is regulated as a quarantine organism in many countries. While some diversity of isolates has been observed, molecular epidemiology of this bacterium is hindered by a lack of simple molecular typing techniques with sufficiently high resolution. We report a molecular typing system of E. amylovora based on variable number of tandem repeats (VNTR) analysis. Repeats in the E. amylovora genome were identified with comparative genomic tools, and VNTR markers were developed and validated. A Multiple-Locus VNTR Analysis (MLVA) was applied to E. amylovora isolates from bacterial collections representing global and regional distribution of the pathogen. Based on six repeats, MLVA allowed the distinction of 227 haplotypes among a collection of 833 isolates of worldwide origin. Three geographically separated groups were recognized among global isolates using Bayesian clustering methods. Analysis of regional outbreaks confirmed the presence of diverse haplotypes but also high representation of certain haplotypes during outbreaks. MLVA is a practical method for epidemiological studies of E. amylovora, identifying previously unresolved population structure within outbreaks. Knowledge of such structure can increase our understanding of how plant diseases emerge and spread over a given geographical region.
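
    In an MLVA scheme, each isolate is described by its repeat counts at a fixed set of VNTR loci, and isolates with identical profiles share a haplotype. A minimal Python sketch of haplotype counting follows; the isolate names and repeat counts are hypothetical toy values, not data from the study.

```python
from collections import Counter


def count_haplotypes(profiles: dict) -> Counter:
    """Map each distinct repeat-count profile (haplotype) to the number of isolates carrying it."""
    return Counter(tuple(profile) for profile in profiles.values())


# Hypothetical toy data: isolate -> repeat counts at six VNTR loci.
toy_profiles = {
    "iso1": (4, 7, 3, 5, 9, 2),
    "iso2": (4, 7, 3, 5, 9, 2),
    "iso3": (4, 8, 3, 5, 9, 2),
    "iso4": (5, 7, 3, 6, 9, 2),
}

haplotypes = count_haplotypes(toy_profiles)
print(len(haplotypes))            # 3 distinct haplotypes among 4 isolates
print(haplotypes.most_common(1))  # [((4, 7, 3, 5, 9, 2), 2)]
```

    The most frequent entries in the counter correspond to the over-represented haplotypes that the abstract describes within outbreaks.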

  5. Structures of the CRISPR genome integration complex.

    PubMed

    Wright, Addison V; Liu, Jun-Jie; Knott, Gavin J; Doxzen, Kevin W; Nogales, Eva; Doudna, Jennifer A

    2017-09-15

    CRISPR-Cas systems depend on the Cas1-Cas2 integrase to capture and integrate short foreign DNA fragments into the CRISPR locus, enabling adaptation to new viruses. We present crystal structures of Cas1-Cas2 bound to both donor and target DNA in intermediate and product integration complexes, as well as a cryo-electron microscopy structure of the full CRISPR locus integration complex, including the accessory protein IHF (integration host factor). The structures show unexpectedly that indirect sequence recognition dictates integration site selection by favoring deformation of the repeat and the flanking sequences. IHF binding bends the DNA sharply, bringing an upstream recognition motif into contact with Cas1 to increase both the specificity and efficiency of integration. These results explain how the Cas1-Cas2 CRISPR integrase recognizes a sequence-dependent DNA structure to ensure site-selective CRISPR array expansion during the initial step of bacterial adaptive immunity. Copyright © 2017, American Association for the Advancement of Science.

  6. Inferring phenotypic causal structures among meat quality traits and the application of a structural equation model in Japanese Black cattle.

    PubMed

    Inoue, K; Valente, B D; Shoji, N; Honda, T; Oyama, K; Rosa, G J M

    2016-10-01

    Meat quality is one of the most important traits determining carcass price in the Japanese beef market. Optimized breeding goals and management practices for the improvement of meat quality traits require knowledge regarding any potential functional relationships between them. In this context, the objective of this research was to infer phenotypic causal networks involving beef marbling score (BMS), beef color score (BCL), firmness of beef (FIR), texture of beef (TEX), beef fat color score (BFS), and the ratio of MUFA to SFA (MUS) from 11,855 Japanese Black cattle. The inductive causation (IC) algorithm was implemented to search for causal links among these traits and was applied to their joint distribution conditional on genetic effects. This information was obtained from the posterior distribution of the residual (co)variance matrix of a standard Bayesian multiple trait model (MTM). Apart from BFS, the IC algorithm implemented with 95% highest posterior density (HPD) intervals detected only undirected links among the traits. However, as a result of the application of 80% HPD intervals, more links were recovered and the undirected links were changed into directed ones, except between FIR and TEX. Therefore, 2 competing causal networks resulting from the IC algorithm, with either the arrow FIR → TEX or the arrow FIR ← TEX, were fitted using a structural equation model (SEM) to infer causal structure coefficients between the selected traits. Results indicated similar genetic and residual variances as well as genetic correlation estimates from both structural equation models. The genetic variances in BMS, FIR, and TEX from the structural equation models were smaller than those obtained from the MTM. In contrast, the variances in BCL, BFS, and MUS, which were not conditioned on any of the other traits in the causal structures, had no significant differences between the structural equation model and MTM. The structural coefficient for the path from MUS (BCL) to BMS

  7. Symbolic extensions applied to multiscale structure of genomes.

    PubMed

    Downarowicz, Tomasz; Travisany, Dante; Montecino, Martin; Maass, Alejandro

    2014-06-01

    A genome of a living organism consists of a long string of symbols over a finite alphabet carrying critical information for the organism. This includes its ability to control postnatal growth, homeostasis, adaptation to changes in the surrounding environment, or to biochemically respond at the cellular level to various specific regulatory signals. In this sense, a genome represents a symbolic encoding of a highly organized system of information whose functioning may be revealed as a natural multilayer structure in terms of complexity and prominence. In this paper we use the mathematical theory of symbolic extensions as a framework to shed light onto how this multilayer organization is reflected in the symbolic coding of the genome. The distribution of data in an element of a standard symbolic extension of a dynamical system has a specific form: the symbolic sequence is divided into several subsequences (which we call layers) encoding the dynamics on various "scales". We propose that a similar structure resides within the genomes, building our analogy on some of the most recent findings in the field of regulation of genomic DNA functioning.

  8. Structure and variation of the mitochondrial genome of fishes.

    PubMed

    Satoh, Takashi P; Miya, Masaki; Mabuchi, Kohji; Nishida, Mutsumi

    2016-09-07

    The mitochondrial (mt) genome has been used as an effective tool for phylogenetic and population genetic analyses in vertebrates. However, the structure and variability of the vertebrate mt genome are not well understood. A potential strategy for improving our understanding is to conduct a comprehensive comparative study of large mt genome data. The aim of this study was to characterize the structure and variability of the fish mt genome through comparative analysis of large datasets. An analysis of the secondary structure of proteins for 250 fish species (248 ray-finned and 2 cartilaginous fishes) illustrated that cytochrome c oxidase subunits (COI, COII, and COIII) and a cytochrome bc1 complex subunit (Cyt b) had substantial amino acid conservation. Among the four proteins, COI was the most conserved, as more than half of all amino acid sites were invariable among the 250 species. Our models identified 43 and 58 stems within 12S rRNA and 16S rRNA, respectively, with larger numbers than proposed previously for vertebrates. The models also identified 149 and 319 invariable sites in 12S rRNA and 16S rRNA, respectively, in all fishes. In particular, the present result verified that a region corresponding to the peptidyl transferase center in prokaryotic 23S rRNA, which is homologous to mt 16S rRNA, is also conserved in fish mt 16S rRNA. Concerning the gene order, we found 35 variations (in 32 families) that deviated from the common gene order in vertebrates. These gene rearrangements were mostly observed in the area spanning the ND5 gene to the control region as well as two tRNA gene cluster regions (IQM and WANCY regions). Although many of such gene rearrangements were unique to a specific taxon, some were shared polyphyletically between distantly related species. Through a large-scale comparative analysis of 250 fish species mt genomes, we elucidated various structural aspects of the fish mt genome and the encoded genes. The present results will be important for

  9. The Evolutionary History of Plasmodium vivax as Inferred from Mitochondrial Genomes: Parasite Genetic Diversity in the Americas

    PubMed Central

    Taylor, Jesse E.; Pacheco, M. Andreína; Bacon, David J.; Beg, Mohammad A.; Machado, Ricardo Luiz; Fairhurst, Rick M.; Herrera, Socrates; Kim, Jung-Yeon; Menard, Didier; Póvoa, Marinete Marins; Villegas, Leopoldo; Mulyanto; Snounou, Georges; Cui, Liwang; Zeyrek, Fadile Yildiz; Escalante, Ananias A.

    2013-01-01

    Plasmodium vivax is the most prevalent human malaria parasite in the Americas. Previous studies have contrasted the genetic diversity of parasite populations in the Americas with those in Asia and Oceania, concluding that New World populations exhibit low genetic diversity consistent with a recent introduction. Here we used an expanded sample of complete mitochondrial genome sequences to investigate the diversity of P. vivax in the Americas as well as in other continental populations. We show that the diversity of P. vivax in the Americas is comparable to that in Asia and Oceania, and we identify several divergent clades circulating in South America that may have resulted from independent introductions. In particular, we show that several haplotypes sampled in Venezuela and northeastern Brazil belong to a clade that diverged from the other P. vivax lineages at least 30,000 years ago, albeit not necessarily in the Americas. We propose that, unlike in Asia where human migration increases local genetic diversity, the combined effects of the geographical structure and the low incidence of vivax malaria in the Americas have resulted in patterns of low local but high regional genetic diversity. This could explain previous views that P. vivax in the Americas has low genetic diversity because these were based on studies carried out in limited areas. Further elucidation of the complex geographical pattern of P. vivax variation will be important both for diversity assessments of genes encoding candidate vaccine antigens and in the formulation of control and surveillance measures aimed at malaria elimination. PMID:23733143

  10. The evolutionary history of Plasmodium vivax as inferred from mitochondrial genomes: parasite genetic diversity in the Americas.

    PubMed

    Taylor, Jesse E; Pacheco, M Andreína; Bacon, David J; Beg, Mohammad A; Machado, Ricardo Luiz; Fairhurst, Rick M; Herrera, Socrates; Kim, Jung-Yeon; Menard, Didier; Póvoa, Marinete Marins; Villegas, Leopoldo; Mulyanto; Snounou, Georges; Cui, Liwang; Zeyrek, Fadile Yildiz; Escalante, Ananias A

    2013-09-01

    Plasmodium vivax is the most prevalent human malaria parasite in the Americas. Previous studies have contrasted the genetic diversity of parasite populations in the Americas with those in Asia and Oceania, concluding that New World populations exhibit low genetic diversity consistent with a recent introduction. Here we used an expanded sample of complete mitochondrial genome sequences to investigate the diversity of P. vivax in the Americas as well as in other continental populations. We show that the diversity of P. vivax in the Americas is comparable to that in Asia and Oceania, and we identify several divergent clades circulating in South America that may have resulted from independent introductions. In particular, we show that several haplotypes sampled in Venezuela and northeastern Brazil belong to a clade that diverged from the other P. vivax lineages at least 30,000 years ago, albeit not necessarily in the Americas. We propose that, unlike in Asia where human migration increases local genetic diversity, the combined effects of the geographical structure and the low incidence of vivax malaria in the Americas have resulted in patterns of low local but high regional genetic diversity. This could explain previous views that P. vivax in the Americas has low genetic diversity because these were based on studies carried out in limited areas. Further elucidation of the complex geographical pattern of P. vivax variation will be important both for diversity assessments of genes encoding candidate vaccine antigens and in the formulation of control and surveillance measures aimed at malaria elimination.

  11. Unleashing the power of meta-threading for evolution/structure-based function inference of proteins.

    PubMed

    Brylinski, Michal

    2013-01-01

    Protein threading is widely used in the prediction of protein structure and the subsequent functional annotation. Most threading approaches employ similar criteria for the template identification for use in both protein structure and function modeling. Using structure similarity alone might result in a high false positive rate in protein function inference, which suggests that selecting functional templates should be subject to a different set of constraints. In this study, we extend the functionality of eThread, a recently developed approach to meta-threading, focusing on the optimal selection of functional templates. We optimized the selection of template proteins to cover a broad spectrum of protein molecular function: ligand, metal, inorganic cluster, protein, and nucleic acid binding. In large-scale benchmarks, we demonstrate that the recognition rates in identifying templates that bind molecular partners in similar locations are very high, typically 70-80%, at the expense of a relatively low false positive rate. eThread also provides useful insights into the chemical properties of binding molecules and the structural features of binding. For instance, the sensitivity in recognizing similar protein-binding interfaces is 58% at only 18% false positive rate. Furthermore, in comparative analysis, we demonstrate that meta-threading supported by machine learning outperforms single-threading approaches in functional template selection. We show that meta-threading effectively detects many facets of protein molecular function, even in a low-sequence identity regime. The enhanced version of eThread is freely available as a webserver and stand-alone software at http://www.brylinski.org/ethread.
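
    The benchmark figures quoted above (sensitivity and false positive rate) follow from standard confusion-matrix definitions. The short Python sketch below states those definitions; the counts are hypothetical values chosen only so that the results match the 58% and 18% quoted for protein-binding interfaces, and they are not taken from the eThread benchmarks.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: fraction of actual positives that are recognized."""
    return tp / (tp + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of actual negatives that are wrongly reported as positives."""
    return fp / (fp + tn)


# Hypothetical counts for 100 true and 100 false template matches.
tp, fn, fp, tn = 58, 42, 18, 82
print(f"sensitivity = {sensitivity(tp, fn):.2f}")          # sensitivity = 0.58
print(f"false positive rate = {false_positive_rate(fp, tn):.2f}")  # false positive rate = 0.18
```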

  12. Function inferences from a molecular structural model of bacterial ParE toxin.

    PubMed

    Barbosa, Luiz Carlos Bertucci; Garrido, Saulo Santesso; Garcia, Anderson; Delfino, Davi Barbosa; Marchetto, Reinaldo

    2010-04-30

    Toxin-antitoxin (TA) systems contribute to plasmid stability by a mechanism that relies on the differential stabilities of the toxin and antitoxin proteins and leads to the killing of daughter bacteria that did not receive a plasmid copy at cell division. ParE is the toxic component of a TA system and, along with RelE, constitutes an important class of bacterial toxins called the RelE/ParE superfamily. For the ParE toxin, no crystallographic structure is available so far, and the few in vitro studies available demonstrated that the target of toxin activity is E. coli DNA gyrase. Here, a 3D model of the E. coli ParE toxin was built by homology modeling using MODELLER, a program for comparative modeling. The model was energy minimized with CHARMM and validated using the PROCHECK and VERIFY3D programs. Ramachandran plot analysis showed that the proportion of residues falling into the most favored and allowed regions was 96.8%. A structural similarity search using the DALI server returned the RelE and YoeB families as the best matches. The model also showed similarities with other microbial ribonucleases, but with low scores. A possible homologous deep cleft active site was identified in the model using the CASTp program. Additional studies are needed to investigate the nuclease activity in members of the ParE family and to confirm the replication-inhibitory activity. The predicted model allows initial inferences about the unexplored 3D structure of the ParE toxin and may be further used in the rational design of molecules for structure-function studies.

  13. Function inferences from a molecular structural model of bacterial ParE toxin

    PubMed Central

    Barbosa, Luiz Carlos Bertucci; Garrido, Saulo Santesso; Garcia, Anderson; Delfino, Davi Barbosa; Marchetto, Reinaldo

    2010-01-01

    Toxin-antitoxin (TA) systems contribute to plasmid stability by a mechanism that relies on the differential stabilities of the toxin and antitoxin proteins and leads to the killing of daughter bacteria that did not receive a plasmid copy at cell division. ParE is the toxic component of a TA system and, along with RelE, constitutes an important class of bacterial toxins called the RelE/ParE superfamily. For the ParE toxin, no crystallographic structure is available so far, and the few in vitro studies available demonstrated that the target of toxin activity is E. coli DNA gyrase. Here, a 3D model of the E. coli ParE toxin was built by homology modeling using MODELLER, a program for comparative modeling. The model was energy minimized with CHARMM and validated using the PROCHECK and VERIFY3D programs. Ramachandran plot analysis showed that the proportion of residues falling into the most favored and allowed regions was 96.8%. A structural similarity search using the DALI server returned the RelE and YoeB families as the best matches. The model also showed similarities with other microbial ribonucleases, but with low scores. A possible homologous deep cleft active site was identified in the model using the CASTp program. Additional studies are needed to investigate the nuclease activity in members of the ParE family and to confirm the replication-inhibitory activity. The predicted model allows initial inferences about the unexplored 3D structure of the ParE toxin and may be further used in the rational design of molecules for structure-function studies. PMID:20975905

  14. Inferring the population structure of Myzus persicae in diverse agroecosystems using microsatellite markers.

    PubMed

    Sanchez, Juan Antonio; La-Spina, Michelangelo; Guirao, Pedro; Cánovas, Fernando

    2013-08-01

    Diverse agroecosystems offer phytophagous insects a wide choice of host plants. Myzus persicae is a polyphagous aphid common in moderate climates. During its life cycle it alternates between primary and secondary hosts. A spatial genetic population structure may arise due to environmental factors and reproduction modes. The aim of this work was to determine the spatial and temporal genetic population structure of M. persicae in relation to host plants and climatic conditions. For this, 923 individuals of M. persicae collected from six plant families between 2005 and 2008 in south-eastern Spain were genotyped for eight microsatellite loci. The population structure was inferred by neighbour-joining, analysis of molecular variance (AMOVA) and Bayesian analyses. Moderate polymorphism was observed for the eight loci in almost al