Sample records for multiple inference methods

  1. Network inference from multimodal data: A review of approaches from infectious disease transmission.

    PubMed

    Ray, Bisakha; Ghedin, Elodie; Chunara, Rumi

    2016-12-01

    Network inference problems are commonly found in multiple biomedical subfields such as genomics, metagenomics, neuroscience, and epidemiology. Networks are useful for representing a wide range of complex interactions, ranging from those between molecular biomarkers, neurons, and microbial communities to those found in human or animal populations. Recent technological advances have resulted in an increasing amount of healthcare data in multiple modalities, increasing the prevalence of network inference problems. Multi-domain data can now be used to improve the robustness and reliability of networks recovered from unimodal data. For infectious diseases in particular, a body of knowledge has focused on combining multiple pieces of linked information. Combining or analyzing disparate modalities in concert has demonstrated greater insight into disease transmission than could be obtained from any single modality in isolation. This has been particularly helpful in understanding incidence and transmission at early stages of infections that have pandemic potential. Novel pieces of linked information in the form of spatial, temporal, and other covariates, including high-throughput sequence data, clinical visits, social network information, pharmaceutical prescriptions, and clinical symptoms (reported as free-text data), also encourage further investigation of these methods. The purpose of this review is to provide an in-depth analysis of multimodal infectious disease transmission network inference methods, with a specific focus on Bayesian inference. We focus on analytical Bayesian inference-based methods because they enable recovering multiple parameters simultaneously, for example, not just the disease transmission network but also the parameters of epidemic dynamics. Our review examines their assumptions, key inference parameters, and limitations, and ultimately provides insights for improving future network inference methods in multiple applications. Copyright © 2016 Elsevier Inc. All rights reserved.

  2. A new fast method for inferring multiple consensus trees using k-medoids.

    PubMed

    Tahiri, Nadia; Willems, Matthieu; Makarenkov, Vladimir

    2018-04-05

    Gene trees carry important information about specific evolutionary patterns which characterize the evolution of the corresponding gene families. However, a reliable species consensus tree cannot be inferred from a multiple sequence alignment of a single gene family or from the concatenation of alignments corresponding to gene families having different evolutionary histories. These evolutionary histories can be quite different due to horizontal transfer events or to ancient gene duplications which cause the emergence of paralogs within a genome. Many methods have been proposed to infer a single consensus tree from a collection of gene trees. Still, the application of these tree merging methods can lead to the loss of specific evolutionary patterns which characterize some gene families or some groups of gene families. Thus, the problem of inferring multiple consensus trees from a given set of gene trees becomes relevant. We describe a new fast method for inferring multiple consensus trees from a given set of phylogenetic trees (i.e. additive trees or X-trees) defined on the same set of species (i.e. objects or taxa). The traditional consensus approach yields a single consensus tree. We use the popular k-medoids partitioning algorithm to divide a given set of trees into several clusters of trees. We propose novel versions of the well-known Silhouette and Caliński-Harabasz cluster validity indices that are adapted for tree clustering with k-medoids. The efficiency of the new method was assessed using both synthetic and real data, such as a well-known phylogenetic dataset consisting of 47 gene trees inferred for 14 archaeal organisms. The method described here allows inference of multiple consensus trees from a given set of gene trees. It can be used to identify groups of gene trees having similar intragroup and different intergroup evolutionary histories. The main advantage of our method is that it is much faster than the existing tree clustering approaches, while providing similar or better clustering results in most cases. This makes it particularly well suited for the analysis of large genomic and phylogenetic datasets.
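
    A minimal sketch of the partitioning step described above: k-medoids over a precomputed tree-to-tree distance matrix (e.g. Robinson-Foulds distances from a phylogenetics library). This is a generic illustration, not the paper's implementation; the adapted Silhouette and Caliński-Harabasz indices used to choose the number of clusters are not shown.

    ```python
    import random

    def k_medoids(dist, k, n_iter=100, seed=0):
        """Partition n items into k clusters given an n x n distance matrix."""
        rng = random.Random(seed)
        n = len(dist)
        medoids = rng.sample(range(n), k)
        for _ in range(n_iter):
            # Assignment step: attach every tree to its nearest medoid.
            clusters = {m: [] for m in medoids}
            for i in range(n):
                clusters[min(medoids, key=lambda m: dist[i][m])].append(i)
            # Update step: each medoid becomes the member with the smallest
            # total distance to its cluster (assumes no cluster goes empty).
            new_medoids = [min(members, key=lambda c: sum(dist[c][j] for j in members))
                           for members in clusters.values()]
            if set(new_medoids) == set(medoids):
                break
            medoids = new_medoids
        return list(clusters.values())
    ```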

  3. Data Analysis Techniques for Physical Scientists

    NASA Astrophysics Data System (ADS)

    Pruneau, Claude A.

    2017-10-01

    Preface; How to read this book; 1. The scientific method; Part I. Foundation in Probability and Statistics: 2. Probability; 3. Probability models; 4. Classical inference I: estimators; 5. Classical inference II: optimization; 6. Classical inference III: confidence intervals and statistical tests; 7. Bayesian inference; Part II. Measurement Techniques: 8. Basic measurements; 9. Event reconstruction; 10. Correlation functions; 11. The multiple facets of correlation functions; 12. Data correction methods; Part III. Simulation Techniques: 13. Monte Carlo methods; 14. Collision and detector modeling; List of references; Index.

  4. A group LASSO-based method for robustly inferring gene regulatory networks from multiple time-course datasets.

    PubMed

    Liu, Li-Zhi; Wu, Fang-Xiang; Zhang, Wen-Jun

    2014-01-01

    As an abstract mapping of the gene regulations in the cell, the gene regulatory network is important to both biological research and practical applications. The reverse engineering of gene regulatory networks from microarray gene expression data is a challenging research problem in systems biology. With the development of biological technologies, multiple time-course gene expression datasets might be collected for a specific gene network under different circumstances. The inference of a gene regulatory network can be improved by integrating these multiple datasets. It is also known that gene expression data may be contaminated with large errors or outliers, which may affect the inference results. A novel method, Huber group LASSO, is proposed to infer the same underlying network topology from multiple time-course gene expression datasets while accounting for robustness to large errors or outliers. To solve the optimization problem involved in the proposed method, an efficient algorithm which combines the ideas of auxiliary function minimization and block descent is developed. A stability selection method is adapted to our method to find a network topology consisting of edges with confidence scores. The proposed method is applied to both simulated datasets and real experimental datasets. The results show that Huber group LASSO outperforms the group LASSO in terms of both the areas under receiver operating characteristic curves and the areas under precision-recall curves. The convergence analysis of the algorithm theoretically shows that the sequence generated by the algorithm converges to the optimal solution of the problem. The simulation and real data examples demonstrate the effectiveness of the Huber group LASSO in integrating multiple time-course gene expression datasets and improving the resistance to large errors or outliers.
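
    The objective here combines a Huber data-fit term with a group-lasso penalty. Below is a minimal proximal-gradient sketch of that objective, assuming predefined groups and hand-picked lam and delta; the paper's own algorithm uses auxiliary function minimization with block descent instead.

    ```python
    import numpy as np

    def huber_grad(r, delta=1.0):
        # Derivative of the Huber loss: quadratic near zero, linear in the tails,
        # which is what blunts the influence of outliers.
        return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

    def huber_group_lasso(X, y, groups, lam=0.1, delta=1.0, n_iter=500):
        """Minimize (1/n) * sum huber(y - X @ beta) + lam * sum_g ||beta_g||_2."""
        n, p = X.shape
        step = 1.0 / np.linalg.norm(X, 2) ** 2   # conservative step size
        beta = np.zeros(p)
        for _ in range(n_iter):
            grad = -X.T @ huber_grad(y - X @ beta, delta) / n
            beta = beta - step * grad
            for g in groups:                      # prox: block soft-thresholding
                norm_g = np.linalg.norm(beta[g])
                beta[g] = 0.0 if norm_g <= step * lam else (1 - step * lam / norm_g) * beta[g]
        return beta
    ```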

  5. NIMEFI: gene regulatory network inference using multiple ensemble feature importance algorithms.

    PubMed

    Ruyssinck, Joeri; Huynh-Thu, Vân Anh; Geurts, Pierre; Dhaene, Tom; Demeester, Piet; Saeys, Yvan

    2014-01-01

    One of the long-standing open challenges in computational systems biology is the topology inference of gene regulatory networks from high-throughput omics data. Recently, two community-wide efforts, DREAM4 and DREAM5, have been established to benchmark network inference techniques using gene expression measurements. In these challenges the overall top performer was the GENIE3 algorithm. This method decomposes the network inference task into separate regression problems for each gene in the network in which the expression values of a particular target gene are predicted using all other genes as possible predictors. Next, using tree-based ensemble methods, an importance measure for each predictor gene is calculated with respect to the target gene and a high feature importance is considered as putative evidence of a regulatory link existing between both genes. The contribution of this work is twofold. First, we generalize the regression decomposition strategy of GENIE3 to other feature importance methods. We compare the performance of support vector regression, the elastic net, random forest regression, symbolic regression and their ensemble variants in this setting to the original GENIE3 algorithm. To create the ensemble variants, we propose a subsampling approach which allows us to cast any feature selection algorithm that produces a feature ranking into an ensemble feature importance algorithm. We demonstrate that the ensemble setting is key to the network inference task, as only ensemble variants achieve top performance. As a second contribution, we explore the effect of using rankwise averaged predictions of multiple ensemble algorithms as opposed to only one. We name this approach NIMEFI (Network Inference using Multiple Ensemble Feature Importance algorithms) and show that this approach outperforms all individual methods in general, although on a specific network a single method can perform better. An implementation of NIMEFI has been made publicly available.
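
    A minimal sketch of the GENIE3-style decomposition this abstract builds on: one random-forest regression per target gene, with feature importances read off as putative regulatory links. Names and parameters are illustrative; NIMEFI itself additionally subsamples and rank-averages several such ensemble methods.

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def genie3_like(expr, n_trees=100, seed=0):
        """expr: samples x genes matrix. Returns scores[j, i] for edge j -> i."""
        n_genes = expr.shape[1]
        scores = np.zeros((n_genes, n_genes))
        for i in range(n_genes):
            predictors = np.delete(np.arange(n_genes), i)   # all other genes
            rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
            rf.fit(expr[:, predictors], expr[:, i])
            scores[predictors, i] = rf.feature_importances_  # importance as link evidence
        return scores
    ```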

  6. Wisdom of crowds for robust gene network inference

    PubMed Central

    Marbach, Daniel; Costello, James C.; Küffner, Robert; Vega, Nicci; Prill, Robert J.; Camacho, Diogo M.; Allison, Kyle R.; Kellis, Manolis; Collins, James J.; Stolovitzky, Gustavo

    2012-01-01

    Reconstructing gene regulatory networks from high-throughput data is a long-standing problem. Through the DREAM project (Dialogue on Reverse Engineering Assessment and Methods), we performed a comprehensive blind assessment of over thirty network inference methods on Escherichia coli, Staphylococcus aureus, Saccharomyces cerevisiae, and in silico microarray data. We characterize performance, data requirements, and inherent biases of different inference approaches, offering guidelines for both algorithm application and development. We observe that no single inference method performs optimally across all datasets. In contrast, integration of predictions from multiple inference methods shows robust and high performance across diverse datasets. Thereby, we construct high-confidence networks for E. coli and S. aureus, each comprising ~1700 transcriptional interactions at an estimated precision of 50%. We experimentally test 53 novel interactions in E. coli, of which 23 were supported (43%). Our results establish community-based methods as a powerful and robust tool for the inference of transcriptional gene regulatory networks. PMID:22796662
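
    The integration step described here amounts to averaging the rank each method assigns to each candidate edge. A minimal sketch, assuming every method outputs a gene x gene score matrix (higher = more confident):

    ```python
    import numpy as np
    from scipy.stats import rankdata

    def community_ranks(score_matrices):
        """Average per-method edge ranks; low average rank = strong community support."""
        ranks = [rankdata(-m.ravel()).reshape(m.shape) for m in score_matrices]
        return np.mean(ranks, axis=0)
    ```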

  7. Semi-Supervised Multi-View Learning for Gene Network Reconstruction

    PubMed Central

    Ceci, Michelangelo; Pio, Gianvito; Kuzmanovski, Vladimir; Džeroski, Sašo

    2015-01-01

    The task of gene regulatory network reconstruction from high-throughput data is receiving increasing attention in recent years. As a consequence, many inference methods for solving this task have been proposed in the literature. It has been recently observed, however, that no single inference method performs optimally across all datasets. It has also been shown that the integration of predictions from multiple inference methods is more robust and shows high performance across diverse datasets. Inspired by this research, in this paper, we propose a machine learning solution which learns to combine predictions from multiple inference methods. While this approach adds additional complexity to the inference process, we expect it would also carry substantial benefits. These would come from the automatic adaptation to patterns in the outputs of individual inference methods, so that it is possible to identify regulatory interactions more reliably when these patterns occur. This article demonstrates the benefits (in terms of accuracy of the reconstructed networks) of the proposed method, which exploits an iterative, semi-supervised ensemble-based algorithm. The algorithm learns to combine the interactions predicted by many different inference methods in the multi-view learning setting. The empirical evaluation of the proposed algorithm on a prokaryotic model organism (E. coli) and on a eukaryotic model organism (S. cerevisiae) clearly shows improved performance over state-of-the-art methods. The results indicate that gene regulatory network reconstruction for the real datasets is more difficult for S. cerevisiae than for E. coli. The software, all the datasets used in the experiments and all the results are available for download at the following link: http://figshare.com/articles/Semi_supervised_Multi_View_Learning_for_Gene_Network_Reconstruction/1604827. PMID:26641091

  8. A method of alignment masking for refining the phylogenetic signal of multiple sequence alignments.

    PubMed

    Rajan, Vaibhav

    2013-03-01

    Inaccurate inference of positional homologies in multiple sequence alignments and systematic errors introduced by alignment heuristics obfuscate phylogenetic inference. Alignment masking, the elimination of phylogenetically uninformative or misleading sites from an alignment before phylogenetic analysis, is a common practice in phylogenetic analysis. Although masking is often done manually, automated methods are necessary to handle the much larger data sets being prepared today. In this study, we introduce the concept of subsplits and demonstrate their use in extracting phylogenetic signal from alignments. We design a clustering approach for alignment masking where each cluster contains similar columns, with similarity defined on the basis of compatible subsplits; our approach then identifies noisy clusters and eliminates them. Trees inferred from the columns in the retained clusters are found to be topologically closer to the reference trees. We test our method on numerous standard benchmarks (both synthetic and biological data sets) and compare its performance with other methods of alignment masking. We find that our method can eliminate sites more accurately than other methods, particularly on divergent data, and can improve the topologies of the inferred trees in likelihood-based analyses. Software available upon request from the author.
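
    For illustration only, a crude column-masking heuristic (gap-fraction plus conservation thresholds); it is a stand-in for, not an implementation of, the subsplit-compatibility clustering described above, and the thresholds are arbitrary.

    ```python
    from collections import Counter

    def mask_columns(msa, max_gap_frac=0.5, min_conservation=0.25):
        """Drop alignment columns that are gap-rich or nearly random.

        msa: dict of name -> aligned (gapped) sequence, all the same length."""
        seqs = list(msa.values())
        n = len(seqs)
        keep = []
        for c in range(len(seqs[0])):
            col = [s[c] for s in seqs]
            gap_frac = col.count('-') / n
            if gap_frac == 1.0:
                continue                         # all-gap column: always drop
            top_frac = Counter(r for r in col if r != '-').most_common(1)[0][1] / n
            if gap_frac <= max_gap_frac and top_frac >= min_conservation:
                keep.append(c)
        return {name: ''.join(s[c] for c in keep) for name, s in msa.items()}
    ```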

  9. Ensemble stacking mitigates biases in inference of synaptic connectivity.

    PubMed

    Chambers, Brendan; Levy, Maayan; Dechery, Joseph B; MacLean, Jason N

    2018-01-01

    A promising alternative to directly measuring the anatomical connections in a neuronal population is inferring the connections from the activity. We employ simulated spiking neuronal networks to compare and contrast commonly used inference methods that identify likely excitatory synaptic connections using statistical regularities in spike timing. We find that simple adjustments to standard algorithms improve inference accuracy: A signing procedure improves the power of unsigned mutual-information-based approaches and a correction that accounts for differences in mean and variance of background timing relationships, such as those expected to be induced by heterogeneous firing rates, increases the sensitivity of frequency-based methods. We also find that different inference methods reveal distinct subsets of the synaptic network and each method exhibits different biases in the accurate detection of reciprocity and local clustering. To correct for errors and biases specific to single inference algorithms, we combine methods into an ensemble. Ensemble predictions, generated as a linear combination of multiple inference algorithms, are more sensitive than the best individual measures alone, and are more faithful to ground-truth statistics of connectivity, mitigating biases specific to single inference methods. These weightings generalize across simulated datasets, emphasizing the potential for the broad utility of ensemble-based approaches.
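
    The ensemble step can be sketched as supervised stacking: fit a linear combination of the per-pair scores produced by the individual inference methods against known connections. A minimal sketch with hypothetical inputs (a pairs x methods score matrix and a binary ground-truth vector from the simulations):

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_stacker(method_scores, known_edges):
        """Learn weights for combining connectivity measures.

        method_scores: n_pairs x n_methods array of inference scores.
        known_edges:   binary vector; 1 where a synapse truly exists."""
        stacker = LogisticRegression(max_iter=1000)
        stacker.fit(method_scores, known_edges)
        return stacker   # stacker.predict_proba(scores)[:, 1] = ensemble score
    ```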

  10. Concerns regarding a call for pluralism of information theory and hypothesis testing

    USGS Publications Warehouse

    Lukacs, P.M.; Thompson, W.L.; Kendall, W.L.; Gould, W.R.; Doherty, P.F.; Burnham, K.P.; Anderson, D.R.

    2007-01-01

    1. Stephens et al. (2005) argue for 'pluralism' in statistical analysis, combining null hypothesis testing and information-theoretic (I-T) methods. We show that I-T methods are more informative even in single variable problems and we provide an ecological example. 2. I-T methods allow inferences to be made from multiple models simultaneously. We believe multimodel inference is the future of data analysis, which cannot be achieved with null hypothesis-testing approaches. 3. We argue for a stronger emphasis on critical thinking in science in general and less reliance on exploratory data analysis and data dredging. Deriving alternative hypotheses is central to science; deriving a single interesting science hypothesis and then comparing it to a default null hypothesis (e.g. 'no difference') is not an efficient strategy for gaining knowledge. We think this single-hypothesis strategy has been relied upon too often in the past. 4. We clarify misconceptions presented by Stephens et al. (2005). 5. We think inference should be made about models, directly linked to scientific hypotheses, and their parameters conditioned on data, Prob(Hj | data). I-T methods provide a basis for this inference. Null hypothesis testing merely provides a probability statement about the data conditioned on a null model, Prob(data | H0). 6. Synthesis and applications. I-T methods provide a more informative approach to inference. I-T methods provide a direct measure of evidence for or against hypotheses and a means to consider simultaneously multiple hypotheses as a basis for rigorous inference. Progress in our science can be accelerated if modern methods can be used intelligently; this includes various I-T and Bayesian methods.
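
    The multimodel machinery referred to here boils down to Akaike weights: relative model likelihoods normalized over the candidate set. A minimal sketch, assuming the maximized log-likelihoods and parameter counts are already in hand:

    ```python
    import numpy as np

    def akaike_weights(log_likelihoods, n_params):
        """AIC and Akaike weights (evidence for each model, summing to 1)."""
        aic = -2 * np.asarray(log_likelihoods, float) + 2 * np.asarray(n_params, float)
        delta = aic - aic.min()
        w = np.exp(-0.5 * delta)
        return aic, w / w.sum()
    ```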

  11. NIMEFI: Gene Regulatory Network Inference using Multiple Ensemble Feature Importance Algorithms

    PubMed Central

    Ruyssinck, Joeri; Huynh-Thu, Vân Anh; Geurts, Pierre; Dhaene, Tom; Demeester, Piet; Saeys, Yvan

    2014-01-01

    One of the long-standing open challenges in computational systems biology is the topology inference of gene regulatory networks from high-throughput omics data. Recently, two community-wide efforts, DREAM4 and DREAM5, have been established to benchmark network inference techniques using gene expression measurements. In these challenges the overall top performer was the GENIE3 algorithm. This method decomposes the network inference task into separate regression problems for each gene in the network in which the expression values of a particular target gene are predicted using all other genes as possible predictors. Next, using tree-based ensemble methods, an importance measure for each predictor gene is calculated with respect to the target gene and a high feature importance is considered as putative evidence of a regulatory link existing between both genes. The contribution of this work is twofold. First, we generalize the regression decomposition strategy of GENIE3 to other feature importance methods. We compare the performance of support vector regression, the elastic net, random forest regression, symbolic regression and their ensemble variants in this setting to the original GENIE3 algorithm. To create the ensemble variants, we propose a subsampling approach which allows us to cast any feature selection algorithm that produces a feature ranking into an ensemble feature importance algorithm. We demonstrate that the ensemble setting is key to the network inference task, as only ensemble variants achieve top performance. As second contribution, we explore the effect of using rankwise averaged predictions of multiple ensemble algorithms as opposed to only one. We name this approach NIMEFI (Network Inference using Multiple Ensemble Feature Importance algorithms) and show that this approach outperforms all individual methods in general, although on a specific network a single method can perform better. An implementation of NIMEFI has been made publicly available. PMID:24667482

  12. Multi-chain Markov chain Monte Carlo methods for computationally expensive models

    NASA Astrophysics Data System (ADS)

    Huang, M.; Ray, J.; Ren, H.; Hou, Z.; Bao, J.

    2017-12-01

    Markov chain Monte Carlo (MCMC) methods are used to infer model parameters from observational data. The parameters are inferred as probability densities, thus capturing estimation error due to sparsity of the data and the shortcomings of the model. Multiple communicating chains executing the MCMC method have the potential to explore the parameter space better and conceivably accelerate the convergence to the final distribution. We present results from tests conducted with the multi-chain method to show when the acceleration occurs: for loose convergence tolerances, the multiple chains do not make much of a difference. The ensemble of chains also seems to have the ability to accelerate the convergence of a few chains that might start from suboptimal starting points. Finally, we show the performance of the chains in the estimation of O(10) parameters using computationally expensive forward models such as the Community Land Model, where the sampling burden is distributed over multiple chains.
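
    A toy version of the setup: several independently started Metropolis chains on the same posterior, with the Gelman-Rubin statistic as the convergence check. This is a generic sketch (standard normal target, hand-picked proposal width), not the authors' Community Land Model experiments.

    ```python
    import numpy as np

    def metropolis_chain(logpost, x0, n_steps, prop_sd=0.5, seed=0):
        rng = np.random.default_rng(seed)
        x, lp, samples = x0, logpost(x0), []
        for _ in range(n_steps):
            cand = x + prop_sd * rng.standard_normal()
            lp_cand = logpost(cand)
            if np.log(rng.random()) < lp_cand - lp:   # accept/reject
                x, lp = cand, lp_cand
            samples.append(x)
        return np.array(samples)

    def gelman_rubin(chains):
        """Potential scale reduction factor; values near 1 suggest convergence."""
        chains = np.asarray(chains)              # n_chains x n_samples
        m, n = chains.shape
        B = n * chains.mean(axis=1).var(ddof=1)  # between-chain variance
        W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
        return np.sqrt(((n - 1) / n * W + B / n) / W)

    chains = [metropolis_chain(lambda v: -0.5 * v ** 2, x0, 2000, seed=i)
              for i, x0 in enumerate([-5.0, 0.0, 5.0])]   # dispersed starts
    print(gelman_rubin([c[500:] for c in chains]))        # drop burn-in
    ```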

  13. Benchmarking Relatedness Inference Methods with Genome-Wide Data from Thousands of Relatives.

    PubMed

    Ramstetter, Monica D; Dyer, Thomas D; Lehman, Donna M; Curran, Joanne E; Duggirala, Ravindranath; Blangero, John; Mezey, Jason G; Williams, Amy L

    2017-09-01

    Inferring relatedness from genomic data is an essential component of genetic association studies, population genetics, forensics, and genealogy. While numerous methods exist for inferring relatedness, thorough evaluation of these approaches in real data has been lacking. Here, we report an assessment of 12 state-of-the-art pairwise relatedness inference methods using a data set with 2485 individuals contained in several large pedigrees that span up to six generations. We find that all methods have high accuracy (92-99%) when detecting first- and second-degree relationships, but their accuracy dwindles to <43% for seventh-degree relationships. However, most identical-by-descent (IBD) segment-based methods inferred seventh-degree relatives to within one degree of relatedness for >76% of relative pairs. Overall, the most accurate methods are Estimation of Recent Shared Ancestry (ERSA) and approaches that compute total IBD sharing using the output from GERMLINE and Refined IBD to infer relatedness. Combining information from the most accurate methods provides little accuracy improvement, indicating that novel approaches, such as new methods that leverage relatedness signals from multiple samples, are needed to achieve a sizeable jump in performance. Copyright © 2017 Ramstetter et al.
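
    As a back-of-envelope illustration of why accuracy dwindles with degree: expected kinship roughly halves with each degree (about 2**-(d+1)), so separating degree d from d+1 hinges on ever-smaller differences in shared genome. A hypothetical mapping for intuition only, not one of the benchmarked tools:

    ```python
    import math

    def degree_from_kinship(phi):
        """Rough relatedness degree from an estimated kinship coefficient."""
        if phi <= 0:
            return None                         # no detectable sharing
        return max(1, round(-math.log2(phi) - 1))

    print(degree_from_kinship(0.25))            # 1: parent-offspring / full siblings
    print(degree_from_kinship(0.0039))          # ~7: signal nearly gone
    ```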

  14. Evaluation of a new inference method for estimating ammonia volatilisation from multiple agronomic plots

    NASA Astrophysics Data System (ADS)

    Loubet, Benjamin; Carozzi, Marco; Voylokov, Polina; Cohan, Jean-Pierre; Trochard, Robert; Génermont, Sophie

    2018-06-01

    Tropospheric ammonia (NH3) is a threat to the environment and human health and is mainly emitted by agriculture. Ammonia volatilisation following application of nitrogen in the field accounts for more than 40 % of the total NH3 emissions in France. This represents a major loss of nitrogen use efficiency which needs to be reduced by appropriate agricultural practices. In this study we evaluate a novel method to infer NH3 volatilisation from small agronomic plots consisting of multiple treatments with repetition. The method is based on the combination of a set of NH3 diffusion sensors exposed for durations of 3 h to 1 week and a short-range atmospheric dispersion model, used to retrieve the emissions from each plot. The method is evaluated by mimicking NH3 emissions from an ensemble of nine plots with a resistance analogue-compensation point-surface exchange scheme over a yearly meteorological database separated into 28-day periods. A multifactorial simulation scheme is used to test the effects of sensor numbers and heights, plot dimensions, source strengths, and background concentrations on the quality of the inference method. We further demonstrate by theoretical considerations in the case of an isolated plot that inferring emissions with diffusion sensors integrating over daily periods will always lead to underestimations due to correlations between emissions and atmospheric transfer. We evaluated these underestimations as -8 % ± 6 % of the emissions for a typical western European climate. For multiple plots, we find that this method would lead to median underestimations of -16 % with an interquartile range of [-22 %, -8 %] for two treatments differing by a factor of up to 20 and a control treatment with no emissions. We further evaluate the methodology for varying background concentrations and NH3 emission patterns and demonstrate the low sensitivity of the method to these factors. The method was also tested in a real case and proved to provide sound evaluations of NH3 losses from surface-applied and incorporated slurry. We hence showed that this novel method should be robust and suitable for estimating NH3 emissions from agronomic plots. We believe that the method could be further improved by using Bayesian inference and by inferring surface concentrations rather than surface fluxes. Validation against a controlled source also remains a challenge.

  15. Time-dependent summary receiver operating characteristics for meta-analysis of prognostic studies.

    PubMed

    Hattori, Satoshi; Zhou, Xiao-Hua

    2016-11-20

    Prognostic studies are widely conducted to examine whether biomarkers are associated with patients' prognoses and play important roles in medical decisions. Because findings from one prognostic study may be very limited, meta-analyses may be useful to obtain sound evidence. However, prognostic studies are often analyzed by relying on a study-specific cut-off value, which can lead to difficulty in applying the standard meta-analysis techniques. In this paper, we propose two methods to estimate a time-dependent version of the summary receiver operating characteristics curve for meta-analyses of prognostic studies with a right-censored time-to-event outcome. We introduce a bivariate normal model for the pair of time-dependent sensitivity and specificity and propose a method to form inferences based on summary statistics reported in published papers. This method provides a valid inference asymptotically. In addition, we consider a bivariate binomial model. To draw inferences from this bivariate binomial model, we introduce a multiple imputation method. The multiple imputation is found to be approximately proper multiple imputation, and thus the standard Rubin's variance formula is justified from a Bayesian viewpoint. Our simulation study and application to a real dataset revealed that both methods work well with a moderate or large number of studies and the bivariate binomial model coupled with the multiple imputation outperforms the bivariate normal model with a small number of studies. Copyright © 2016 John Wiley & Sons, Ltd.
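
    The multiple-imputation step pools per-imputation estimates with Rubin's rules, which is what the variance-formula justification above refers to. A minimal sketch (point estimate and total variance only; the degrees-of-freedom adjustment is omitted):

    ```python
    import numpy as np

    def pool_rubin(estimates, variances):
        """Combine estimates from m imputed datasets with Rubin's rules."""
        q = np.asarray(estimates, float)
        u = np.asarray(variances, float)
        m = len(q)
        qbar = q.mean()                              # pooled point estimate
        t = u.mean() + (1 + 1 / m) * q.var(ddof=1)   # within + between variance
        return qbar, t
    ```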

  16. Unraveling multiple changes in complex climate time series using Bayesian inference

    NASA Astrophysics Data System (ADS)

    Berner, Nadine; Trauth, Martin H.; Holschneider, Matthias

    2016-04-01

    Change points in time series are perceived as heterogeneities in the statistical or dynamical characteristics of observations. Unraveling such transitions yields essential information for the understanding of the observed system. The precise detection and basic characterization of underlying changes is therefore of particular importance in environmental sciences. We present a kernel-based Bayesian inference approach to investigate direct as well as indirect climate observations for multiple generic transition events. In order to develop a diagnostic approach designed to capture a variety of natural processes, the basic statistical features of central tendency and dispersion are used to locally approximate a complex time series by a generic transition model. A Bayesian inversion approach is developed to robustly infer the location and the generic patterns of such a transition. To systematically investigate time series for multiple changes occurring at different temporal scales, the Bayesian inversion is extended to a kernel-based inference approach. By introducing basic kernel measures, the kernel inference results are composed into a proxy for the posterior distribution of multiple transitions. Thus, based on a generic transition model, a probability expression is derived that is capable of indicating multiple changes within a complex time series. We discuss the method's performance by investigating direct and indirect climate observations. The approach is applied to environmental time series (about 100 years) from the weather station in Tuscaloosa, Alabama, and confirms documented instrumentation changes. Moreover, the approach is used to investigate a set of complex terrigenous dust records from ODP sites 659, 721/722 and 967, interpreted as climate indicators of the African region during the Plio-Pleistocene (about 5 Ma). The detailed inference unravels multiple transitions underlying the indirect climate observations, coinciding with established global climate events.
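
    A deliberately simplified relative of this approach: the posterior over the location of a single mean shift in Gaussian noise, with known sigma and flat priors. The kernel-based method above generalizes far beyond this (multiple transitions, dispersion changes, multiple scales), but the sketch shows the core inversion idea.

    ```python
    import numpy as np

    def change_point_posterior(x, sigma=1.0):
        """Posterior over the index of a single mean shift (uniform location prior)."""
        n = len(x)
        loglik = np.full(n, -np.inf)
        for t in range(2, n - 1):                # at least two points per segment
            left, right = x[:t], x[t:]
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            loglik[t] = -rss / (2 * sigma ** 2)  # segment means plugged in
        post = np.exp(loglik - loglik.max())
        return post / post.sum()
    ```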

  17. Efficient inference of population size histories and locus-specific mutation rates from large-sample genomic variation data.

    PubMed

    Bhaskar, Anand; Wang, Y X Rachel; Song, Yun S

    2015-02-01

    With the recent increase in study sample sizes in human genetics, there has been growing interest in inferring historical population demography from genomic variation data. Here, we present an efficient inference method that can scale up to very large samples, with tens or hundreds of thousands of individuals. Specifically, by utilizing analytic results on the expected frequency spectrum under the coalescent and by leveraging the technique of automatic differentiation, which allows us to compute gradients exactly, we develop a very efficient algorithm to infer piecewise-exponential models of the historical effective population size from the distribution of sample allele frequencies. Our method is orders of magnitude faster than previous demographic inference methods based on the frequency spectrum. In addition to inferring demography, our method can also accurately estimate locus-specific mutation rates. We perform extensive validation of our method on simulated data and show that it can accurately infer multiple recent epochs of rapid exponential growth, a signal that is difficult to pick up with small sample sizes. Lastly, we use our method to analyze data from recent sequencing studies, including a large-sample exome-sequencing data set of tens of thousands of individuals assayed at a few hundred genic regions. © 2015 Bhaskar et al.; Published by Cold Spring Harbor Laboratory Press.

  18. Inferring human population size and separation history from multiple genome sequences.

    PubMed

    Schiffels, Stephan; Durbin, Richard

    2014-08-01

    The availability of complete human genome sequences from populations across the world has given rise to new population genetic inference methods that explicitly model ancestral relationships under recombination and mutation. So far, application of these methods to evolutionary history more recent than 20,000-30,000 years ago and to population separations has been limited. Here we present a new method that overcomes these shortcomings. The multiple sequentially Markovian coalescent (MSMC) analyzes the observed pattern of mutations in multiple individuals, focusing on the first coalescence between any two individuals. Results from applying MSMC to genome sequences from nine populations across the world suggest that the genetic separation of non-African ancestors from African Yoruban ancestors started long before 50,000 years ago and give information about human population history as recent as 2,000 years ago, including the bottleneck in the peopling of the Americas and separations within Africa, East Asia and Europe.

  19. A prior-based integrative framework for functional transcriptional regulatory network inference

    PubMed Central

    Siahpirani, Alireza F.

    2017-01-01

    Transcriptional regulatory networks specify regulatory proteins controlling the context-specific expression levels of genes. Inference of genome-wide regulatory networks is central to understanding gene regulation, but remains an open challenge. Expression-based network inference is among the most popular methods to infer regulatory networks, however, networks inferred from such methods have low overlap with experimentally derived (e.g. ChIP-chip and transcription factor (TF) knockouts) networks. Currently we have a limited understanding of this discrepancy. To address this gap, we first develop a regulatory network inference algorithm, based on probabilistic graphical models, to integrate expression with auxiliary datasets supporting a regulatory edge. Second, we comprehensively analyze our and other state-of-the-art methods on different expression perturbation datasets. Networks inferred by integrating sequence-specific motifs with expression have substantially greater agreement with experimentally derived networks, while remaining more predictive of expression than motif-based networks. Our analysis suggests natural genetic variation as the most informative perturbation for network inference, and, identifies core TFs whose targets are predictable from expression. Multiple reasons make the identification of targets of other TFs difficult, including network architecture and insufficient variation of TF mRNA level. Finally, we demonstrate the utility of our inference algorithm to infer stress-specific regulatory networks and for regulator prioritization. PMID:27794550

  20. Alternative Multiple Imputation Inference for Mean and Covariance Structure Modeling

    ERIC Educational Resources Information Center

    Lee, Taehun; Cai, Li

    2012-01-01

    Model-based multiple imputation has become an indispensable method in the educational and behavioral sciences. Mean and covariance structure models are often fitted to multiply imputed data sets. However, the presence of multiple random imputations complicates model fit testing, which is an important aspect of mean and covariance structure…

  1. In-depth analysis of protein inference algorithms using multiple search engines and well-defined metrics.

    PubMed

    Audain, Enrique; Uszkoreit, Julian; Sachsenberg, Timo; Pfeuffer, Julianus; Liang, Xiao; Hermjakob, Henning; Sanchez, Aniel; Eisenacher, Martin; Reinert, Knut; Tabb, David L; Kohlbacher, Oliver; Perez-Riverol, Yasset

    2017-01-06

    In mass spectrometry-based shotgun proteomics, protein identifications are usually the desired result. However, most of the analytical methods are based on the identification of reliable peptides and not the direct identification of intact proteins. Thus, assembling peptides identified from tandem mass spectra into a list of proteins, referred to as protein inference, is a critical step in proteomics research. Currently, different protein inference algorithms and tools are available for the proteomics community. Here, we evaluated five software tools for protein inference (PIA, ProteinProphet, Fido, ProteinLP, MSBayesPro) using three popular database search engines: Mascot, X!Tandem, and MS-GF+. All the algorithms were evaluated using a highly customizable KNIME workflow using four different public datasets with varying complexities (different sample preparation, species and analytical instruments). We defined a set of quality control metrics to evaluate the performance of each combination of search engines, protein inference algorithm, and parameters on each dataset. We show that the results for complex samples vary not only regarding the actual numbers of reported protein groups but also concerning the actual composition of groups. Furthermore, the robustness of reported proteins when using databases of differing complexities is strongly dependent on the applied inference algorithm. Finally, merging the identifications of multiple search engines does not necessarily increase the number of reported proteins, but does increase the number of peptides per protein and thus can generally be recommended. Protein inference is one of the major challenges in MS-based proteomics nowadays. Currently, there is a vast number of protein inference algorithms and implementations available for the proteomics community. Protein assembly impacts the final results of the research, the quantitation values, and the final claims in the research manuscript. Even though protein inference is a crucial step in proteomics data analysis, a comprehensive evaluation of the many different inference methods has never been performed. The Journal of Proteomics has previously published multiple benchmark studies of bioinformatics algorithms in proteomics (PMID: 26585461; PMID: 22728601), making clear the importance of such studies for the proteomics community and the journal audience. This manuscript presents a new bioinformatics solution based on the KNIME/OpenMS platform that aims at providing a fair comparison of protein inference algorithms (https://github.com/KNIME-OMICS). Five algorithms - PIA, ProteinProphet, Fido, ProteinLP and MSBayesPro - were evaluated using the highly customizable workflow on four public datasets with varying complexities. Three popular database search engines - Mascot, X!Tandem and MS-GF+ - and combinations thereof were evaluated for every protein inference tool. In total, >186 protein lists were analyzed and carefully compared using three metrics for quality assessment of the protein inference results: 1) the number of reported proteins, 2) peptides per protein, and 3) the number of uniquely reported proteins per inference method, to address the quality of each inference method. We also examined how many proteins were reported by choosing each combination of search engines, protein inference algorithms and parameters on each dataset. The results show that: 1) PIA or Fido seems to be a good choice when studying the results of the analyzed workflow, regarding not only the reported proteins and high-quality identifications but also the required runtime. 2) Merging the identifications of multiple search engines almost always gives more confident results and increases the number of peptides per protein group. 3) The use of databases containing not only canonical proteins but also known isoforms has a small impact on the number of reported proteins; depending on the question behind the study, the detection of specific isoforms could compensate for the slightly shorter parsimonious reports. 4) The current workflow can be easily extended to support new algorithms and search engine combinations. Copyright © 2016. Published by Elsevier B.V.

  2. Integration of multi-omics data for integrative gene regulatory network inference.

    PubMed

    Zarayeneh, Neda; Ko, Euiseong; Oh, Jung Hun; Suh, Sang; Liu, Chunyu; Gao, Jean; Kim, Donghyun; Kang, Mingon

    2017-01-01

    Gene regulatory networks provide comprehensive insights and in-depth understanding of complex biological processes. In most research, the molecular interactions of gene regulatory networks are inferred from a single type of genomic data, e.g., gene expression data. However, gene expression is a product of sequential interactions of multiple biological processes, such as DNA sequence variations, copy number variations, histone modifications, transcription factors, and DNA methylations. The recent rapid advances of high-throughput omics technologies enable one to measure multiple types of omics data, called 'multi-omics data', that represent the various biological processes. In this paper, we propose an Integrative Gene Regulatory Network inference method (iGRN) that incorporates multi-omics data and their interactions in gene regulatory networks. In addition to gene expression, copy number variations and DNA methylations were considered as multi-omics data in this paper. Intensive experiments were carried out with simulation data to assess iGRN's capability to infer the integrative gene regulatory network. Through the experiments, iGRN shows better performance in model representation and interpretation than other integrative methods for gene regulatory network inference. iGRN was also applied to a human brain dataset of psychiatric disorders, and the biological network of psychiatric disorders was analysed.

  3. Integration of multi-omics data for integrative gene regulatory network inference

    PubMed Central

    Zarayeneh, Neda; Ko, Euiseong; Oh, Jung Hun; Suh, Sang; Liu, Chunyu; Gao, Jean; Kim, Donghyun

    2017-01-01

    Gene regulatory networks provide comprehensive insights and in-depth understanding of complex biological processes. In most research, the molecular interactions of gene regulatory networks are inferred from a single type of genomic data, e.g., gene expression data. However, gene expression is a product of sequential interactions of multiple biological processes, such as DNA sequence variations, copy number variations, histone modifications, transcription factors, and DNA methylations. The recent rapid advances of high-throughput omics technologies enable one to measure multiple types of omics data, called ‘multi-omics data’, that represent the various biological processes. In this paper, we propose an Integrative Gene Regulatory Network inference method (iGRN) that incorporates multi-omics data and their interactions in gene regulatory networks. In addition to gene expression, copy number variations and DNA methylations were considered as multi-omics data in this paper. Intensive experiments were carried out with simulation data to assess iGRN's capability to infer the integrative gene regulatory network. Through the experiments, iGRN shows better performance in model representation and interpretation than other integrative methods for gene regulatory network inference. iGRN was also applied to a human brain dataset of psychiatric disorders, and the biological network of psychiatric disorders was analysed. PMID:29354189
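
    In the simplest reading, integrating the modalities means letting a target gene's expression be explained jointly by regulator expression, its own copy number, and its promoter methylation. A linear stand-in for that idea with hypothetical inputs (iGRN's actual model and interaction structure are richer):

    ```python
    import numpy as np
    from sklearn.linear_model import LinearRegression

    def multi_omics_fit(reg_expr, cnv, meth, target_expr):
        """reg_expr: samples x regulators; cnv, meth, target_expr: per-sample vectors."""
        X = np.column_stack([reg_expr, cnv, meth])    # all modalities in one design
        model = LinearRegression().fit(X, target_expr)
        return model.coef_                             # last two: CNV and methylation effects
    ```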

  4. A new method for inferring carbon monoxide concentrations from gas filter radiometer data

    NASA Technical Reports Server (NTRS)

    Wallio, H. A.; Reichle, H. G., Jr.; Casas, J. C.; Gormsen, B. B.

    1981-01-01

    A method for inferring carbon monoxide concentrations from gas filter radiometer data is presented. The technique can closely approximate the results of more costly line-by-line radiative transfer calculations over a wide range of altitudes, ground temperatures, and carbon monoxide concentrations. The technique can also be used over a larger range of conditions than those used for the regression analysis. Because evaluating the influence of the carbon monoxide mixing ratio requires only addition, multiplication, and a minimum of logic, the method can be implemented on very small computers or microprocessors.

  5. Assimilating Flow Data into Complex Multiple-Point Statistical Facies Models Using Pilot Points Method

    NASA Astrophysics Data System (ADS)

    Ma, W.; Jafarpour, B.

    2017-12-01

    We develop a new pilot points method for conditioning discrete multiple-point statistical (MPS) facies simulation on dynamic flow data. While conditioning MPS simulation on static hard data is straightforward, calibration against nonlinear flow data is nontrivial. The proposed method generates conditional models from a conceptual model of geologic connectivity, known as a training image (TI), by strategically placing and estimating pilot points. To place pilot points, a score map is generated based on three sources of information: (i) the uncertainty in facies distribution, (ii) the model response sensitivity information, and (iii) the observed flow data. Once the pilot points are placed, the facies values at these points are inferred from production data and are used, along with available hard data at well locations, to simulate a new set of conditional facies realizations. While facies estimation at the pilot points can be performed using different inversion algorithms, in this study the ensemble smoother (ES) and its multiple data assimilation variant (ES-MDA) are adopted to update permeability maps from production data, which are then used to statistically infer facies types at the pilot point locations. The developed method combines the information in the flow data and the TI by using the former to infer facies values at select locations away from the wells and the latter to ensure consistent facies structure and connectivity away from measurement locations. Several numerical experiments are used to evaluate the performance of the developed method and to discuss its important properties.

  6. Efficient computation of the joint sample frequency spectra for multiple populations.

    PubMed

    Kamm, John A; Terhorst, Jonathan; Song, Yun S

    2017-01-01

    A wide range of studies in population genetics have employed the sample frequency spectrum (SFS), a summary statistic which describes the distribution of mutant alleles at a polymorphic site in a sample of DNA sequences and provides a highly efficient dimensional reduction of large-scale population genomic variation data. Recently, there has been much interest in analyzing the joint SFS data from multiple populations to infer parameters of complex demographic histories, including variable population sizes, population split times, migration rates, admixture proportions, and so on. SFS-based inference methods require accurate computation of the expected SFS under a given demographic model. Although much methodological progress has been made, existing methods suffer from numerical instability and high computational complexity when multiple populations are involved and the sample size is large. In this paper, we present new analytic formulas and algorithms that enable accurate, efficient computation of the expected joint SFS for thousands of individuals sampled from hundreds of populations related by a complex demographic model with arbitrary population size histories (including piecewise-exponential growth). Our results are implemented in a new software package called momi (MOran Models for Inference). Through an empirical study we demonstrate our improvements to numerical stability and computational complexity.

  7. Efficient computation of the joint sample frequency spectra for multiple populations

    PubMed Central

    Kamm, John A.; Terhorst, Jonathan; Song, Yun S.

    2016-01-01

    A wide range of studies in population genetics have employed the sample frequency spectrum (SFS), a summary statistic which describes the distribution of mutant alleles at a polymorphic site in a sample of DNA sequences and provides a highly efficient dimensional reduction of large-scale population genomic variation data. Recently, there has been much interest in analyzing the joint SFS data from multiple populations to infer parameters of complex demographic histories, including variable population sizes, population split times, migration rates, admixture proportions, and so on. SFS-based inference methods require accurate computation of the expected SFS under a given demographic model. Although much methodological progress has been made, existing methods suffer from numerical instability and high computational complexity when multiple populations are involved and the sample size is large. In this paper, we present new analytic formulas and algorithms that enable accurate, efficient computation of the expected joint SFS for thousands of individuals sampled from hundreds of populations related by a complex demographic model with arbitrary population size histories (including piecewise-exponential growth). Our results are implemented in a new software package called momi (MOran Models for Inference). Through an empirical study we demonstrate our improvements to numerical stability and computational complexity. PMID:28239248

  8. Solving the problem of comparing whole bacterial genomes across different sequencing platforms.

    PubMed

    Kaas, Rolf S; Leekitcharoenphon, Pimlapas; Aarestrup, Frank M; Lund, Ole

    2014-01-01

    Whole genome sequencing (WGS) shows great potential for real-time monitoring and identification of infectious disease outbreaks. However, rapid and reliable comparison of data generated in multiple laboratories and using multiple technologies is essential. So far, studies have focused on using one technology, because each technology has a systematic bias that makes integration of data generated from different platforms difficult. We developed two different procedures for identifying variable sites and inferring phylogenies in WGS data across multiple platforms. The methods were evaluated on three bacterial data sets sequenced on three different platforms (Illumina, 454, Ion Torrent). We show that the methods are able to overcome the systematic biases caused by the sequencers and infer the expected phylogenies. It is concluded that the success of these new procedures is due to the validation of all informative sites included in the analysis. The procedures are available as web tools.

  9. Minimal-assumption inference from population-genomic data

    NASA Astrophysics Data System (ADS)

    Weissman, Daniel; Hallatschek, Oskar

    Samples of multiple complete genome sequences contain vast amounts of information about the evolutionary history of populations, much of it in the associations among polymorphisms at different loci. Current methods that take advantage of this linkage information rely on models of recombination and coalescence, limiting the sample sizes and populations that they can analyze. We introduce a method, Minimal-Assumption Genomic Inference of Coalescence (MAGIC), that reconstructs key features of the evolutionary history, including the distribution of coalescence times, by integrating information across genomic length scales without using an explicit model of recombination, demography or selection. Using simulated data, we show that MAGIC's performance is comparable to PSMC' on single diploid samples generated with standard coalescent and recombination models. More importantly, MAGIC can also analyze arbitrarily large samples and is robust to changes in the coalescent and recombination processes. Using MAGIC, we show that the inferred coalescence time histories of samples of multiple human genomes exhibit inconsistencies with a description in terms of an effective population size based on single-genome data.

  10. Kernel canonical-correlation Granger causality for multiple time series

    NASA Astrophysics Data System (ADS)

    Wu, Guorong; Duan, Xujun; Liao, Wei; Gao, Qing; Chen, Huafu

    2011-04-01

    Canonical-correlation analysis as a multivariate statistical technique has been applied to multivariate Granger causality analysis to infer information flow in complex systems. It shows unique appeal and great superiority over the traditional vector autoregressive method, due to the simplified procedure that detects causal interaction between multiple time series, and the avoidance of potential model estimation problems. However, it is limited to the linear case. Here, we extend the framework of canonical correlation to include the estimation of multivariate nonlinear Granger causality for drawing inference about directed interaction. Its feasibility and effectiveness are verified on simulated data.
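
    The linear special case that the kernel method generalizes is easy to state: x Granger-causes y if adding lags of x shrinks the residual variance of an autoregression of y. A minimal sketch of that baseline (the kernel canonical-correlation extension itself is not shown):

    ```python
    import numpy as np

    def lagged(v, lag):
        """Design matrix whose row for time t holds v[t-1], ..., v[t-lag]."""
        n = len(v)
        return np.column_stack([v[lag - k: n - k] for k in range(1, lag + 1)])

    def rss(Z, Y):
        A = np.column_stack([np.ones(len(Y)), Z])      # intercept + predictors
        beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
        return ((Y - A @ beta) ** 2).sum()

    def granger_log_ratio(x, y, lag=2):
        """log(RSS_restricted / RSS_full); > 0 suggests x helps predict y."""
        Y = y[lag:]
        restricted = rss(lagged(y, lag), Y)
        full = rss(np.hstack([lagged(y, lag), lagged(x, lag)]), Y)
        return np.log(restricted / full)
    ```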

  11. Measuring the distance between multiple sequence alignments.

    PubMed

    Blackburne, Benjamin P; Whelan, Simon

    2012-02-15

    Multiple sequence alignment (MSA) is a core method in bioinformatics. The accuracy of such alignments may influence the success of downstream analyses such as phylogenetic inference, protein structure prediction, and functional prediction. The importance of MSA has led to the proliferation of MSA methods, with different objective functions and heuristics to search for the optimal MSA. Different methods of inferring MSAs produce different results in all but the most trivial cases. By measuring the differences between inferred alignments, we may be able to develop an understanding of how these differences (i) relate to the objective functions and heuristics used in MSA methods, and (ii) affect downstream analyses. We introduce four metrics to compare MSAs, which include the position in a sequence where a gap occurs or the location on a phylogenetic tree where an insertion or deletion (indel) event occurs. We use both real and synthetic data to explore the information given by these metrics and demonstrate how the different metrics in combination can yield more information about MSA methods and the differences between them. MetAl is a free software implementation of these metrics in Haskell. Source and binaries for Windows, Linux and Mac OS X are available from http://kumiho.smith.man.ac.uk/whelan/software/metal/.
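
    One simple member of this family of metrics scores two alignments by the homology statements (residue pairs placed in the same column) they share; MetAl's metrics refine this with gap placement and tree information. A sketch of the basic pair-set comparison:

    ```python
    def aligned_pairs(msa):
        """Residue pairs co-aligned in some column; a residue is (name, ungapped index).

        msa: dict of name -> gapped sequence, all the same length."""
        names = list(msa)
        pos = {n: -1 for n in names}
        pairs = set()
        for c in range(len(next(iter(msa.values())))):
            residues = []
            for n in names:
                if msa[n][c] != '-':
                    pos[n] += 1
                    residues.append((n, pos[n]))
            pairs.update((a, b) for i, a in enumerate(residues) for b in residues[i + 1:])
        return pairs

    def msa_distance(msa1, msa2):
        """Fraction of homology pairs on which the two alignments disagree."""
        p1, p2 = aligned_pairs(msa1), aligned_pairs(msa2)
        return 1 - len(p1 & p2) / len(p1 | p2)
    ```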

  12. DendroBLAST: approximate phylogenetic trees in the absence of multiple sequence alignments.

    PubMed

    Kelly, Steven; Maini, Philip K

    2013-01-01

    The rapidly growing availability of genome information has created considerable demand for both fast and accurate phylogenetic inference algorithms. We present a novel method called DendroBLAST for reconstructing phylogenetic dendrograms/trees from protein sequences using BLAST. This method differs from other methods by incorporating a simple model of sequence evolution to test the effect of introducing sequence changes on the reliability of the bipartitions in the inferred tree. Using realistic simulated sequence data we demonstrate that this method produces phylogenetic trees that are more accurate than other commonly used distance-based methods, though not as accurate as maximum likelihood methods from good-quality multiple sequence alignments. In addition to tests on simulated data, we use DendroBLAST to generate input trees for a supertree reconstruction of the phylogeny of the Archaea. This independent analysis produces an approximate phylogeny of the Archaea that has both high precision and recall when compared to previously published analyses of the same dataset using conventional methods. Taken together these results demonstrate that approximate phylogenetic trees can be produced in the absence of multiple sequence alignments, and we propose that these trees will provide a platform for improving and informing downstream bioinformatic analysis. A web implementation of the DendroBLAST method is freely available for use at http://www.dendroblast.com/.

  13. Efficient methods for joint estimation of multiple fundamental frequencies in music signals

    NASA Astrophysics Data System (ADS)

    Pertusa, Antonio; Iñesta, José M.

    2012-12-01

    This study presents efficient techniques for multiple fundamental frequency estimation in music signals. The proposed methodology can infer harmonic patterns from a mixture considering interactions with other sources and evaluate them in a joint estimation scheme. For this purpose, a set of fundamental frequency candidates are first selected at each frame, and several hypothetical combinations of them are generated. Combinations are independently evaluated, and the most likely is selected taking into account the intensity and spectral smoothness of its inferred patterns. The method is extended to consider adjacent frames in order to smooth the detection in time, and a pitch tracking stage is finally performed to increase the temporal coherence. The proposed algorithms were evaluated in MIREX contests, yielding state-of-the-art results with a very low computational burden.
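
    The intensity cue used when scoring a candidate can be sketched as harmonic salience: sum the spectrum at a candidate's first few harmonics. This toy version scores single candidates only; the joint evaluation of hypothetical combinations and the spectral smoothness measure are not shown.

    ```python
    import numpy as np

    def harmonic_salience(spectrum, sr, f0, n_harmonics=8):
        """Sum of magnitudes at the first harmonics of candidate f0."""
        n_bins = len(spectrum)                  # rfft of N samples: N/2 + 1 bins
        hz_per_bin = sr / (2 * (n_bins - 1))
        bins = (int(round(h * f0 / hz_per_bin)) for h in range(1, n_harmonics + 1))
        return sum(spectrum[b] for b in bins if b < n_bins)

    def rank_candidates(frame, sr, candidates):
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
        return sorted(candidates, key=lambda f: -harmonic_salience(spectrum, sr, f))
    ```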

  14. RENT+: an improved method for inferring local genealogical trees from haplotypes with recombination

    PubMed Central

    Mirzaei, Sajad; Wu, Yufeng

    2017-01-01

    Motivation: Haplotypes from one or multiple related populations share a common genealogical history. If this shared genealogy can be inferred from haplotypes, it can be very useful for many population genetics problems. However, with the presence of recombination, the genealogical history of haplotypes is complex and cannot be represented by a single genealogical tree. Therefore, inference of genealogical history with recombination is much more challenging than the case of no recombination. Results: In this paper, we present a new approach called RENT+ for the inference of local genealogical trees from haplotypes with the presence of recombination. RENT+ builds on a previous genealogy inference approach called RENT, which infers a set of related genealogical trees at different genomic positions. RENT+ represents a significant improvement over RENT in that it is more effective at extracting the information about the underlying genealogy contained in the haplotype data. The key components of RENT+ are several greatly enhanced genealogy inference rules. Through simulation, we show that RENT+ is more efficient and accurate than several existing genealogy inference methods. As an application, we apply RENT+ in the inference of population demographic history from haplotypes, which outperforms several existing methods. Availability and implementation: RENT+ is implemented in Java, and is freely available for download from: https://github.com/SajadMirzaei/RentPlus. Contact: sajad@engr.uconn.edu or ywu@engr.uconn.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:28065901

  15. Inference regarding multiple structural changes in linear models with endogenous regressors

    PubMed Central

    Hall, Alastair R.; Han, Sanggohn; Boldea, Otilia

    2012-01-01

    This paper considers the linear model with endogenous regressors and multiple changes in the parameters at unknown times. It is shown that minimization of a Generalized Method of Moments criterion yields inconsistent estimators of the break fractions, but minimization of the Two Stage Least Squares (2SLS) criterion yields consistent estimators of these parameters. We develop a methodology for estimation and inference of the parameters of the model based on 2SLS. The analysis covers the cases where the reduced form is either stable or unstable. The methodology is illustrated via an application to the New Keynesian Phillips Curve for the US. PMID:23805021
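
    A minimal sketch of the consistent-estimation recipe for a single break, using invented toy data: the break fraction is estimated by minimizing the sum of Two Stage Least Squares (2SLS) residual sums of squares over candidate break points, reflecting the paper's result that the 2SLS criterion (unlike GMM) yields consistent break-fraction estimates.

    ```python
    import numpy as np

    def tsls_ssr(y, x, z):
        """2SLS sum of squared residuals of y on x with instruments z."""
        beta1, *_ = np.linalg.lstsq(z, x, rcond=None)   # first stage
        xhat = z @ beta1
        beta2, *_ = np.linalg.lstsq(xhat, y, rcond=None)  # second stage
        resid = y - xhat @ beta2
        return float(resid @ resid)

    def estimate_break(y, x, z, trim=0.15):
        n = len(y)
        lo, hi = int(n * trim), int(n * (1 - trim))
        best = min(range(lo, hi),
                   key=lambda k: tsls_ssr(y[:k], x[:k], z[:k])
                               + tsls_ssr(y[k:], x[k:], z[k:]))
        return best / n  # estimated break fraction

    # Toy data: coefficient changes from 1 to 3 at fraction 0.5;
    # x is endogenous (correlated with the error), z is a valid instrument.
    rng = np.random.default_rng(0)
    n = 400
    z = rng.normal(size=(n, 2))
    e = rng.normal(size=n)
    x = (z @ np.array([1.0, -0.5]) + 0.8 * e + rng.normal(size=n)).reshape(-1, 1)
    beta = np.where(np.arange(n) < n // 2, 1.0, 3.0)
    y = beta * x[:, 0] + e
    print(estimate_break(y, x, z))  # close to 0.5
    ```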

  16. Quantitative Proteomics via High Resolution MS Quantification: Capabilities and Limitations

    PubMed Central

    Higgs, Richard E.; Butler, Jon P.; Han, Bomie; Knierman, Michael D.

    2013-01-01

    Recent improvements in the mass accuracy and resolution of mass spectrometers have led to renewed interest in label-free quantification using data from the primary mass spectrum (MS1) acquired from data-dependent proteomics experiments. The capacity for higher specificity quantification of peptides from samples enriched for proteins of biological interest offers distinct advantages for hypothesis-generating experiments relative to immunoassay detection methods or prespecified peptide ions measured by multiple reaction monitoring (MRM) approaches. Here we describe an evaluation of different methods to post-process peptide level quantification information to support protein level inference. We characterize the methods by examining their ability to recover a known dilution of a standard protein in background matrices of varying complexity. Additionally, the MS1 quantification results are compared to a standard targeted MRM approach on the same samples under equivalent instrument conditions. We show the existence of multiple peptides with MS1 quantification sensitivity similar to the best MRM peptides for each of the background matrices studied. Based on these results we provide recommendations on preferred approaches to leveraging quantitative measurements of multiple peptides to improve protein level inference. PMID:23710359
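
    One simple peptide-to-protein post-processing choice of the kind evaluated here can be sketched as follows (the top-k rule, k, and all intensities are illustrative, not the paper's recommended method):

    ```python
    import numpy as np

    def protein_estimate(peptide_intensities, k=3):
        """Summarize one protein by the mean log2 intensity of its top-k peptides."""
        logs = np.log2(np.asarray(peptide_intensities, dtype=float))
        top_k = np.sort(logs)[-k:]      # k most intense peptides
        return float(np.mean(top_k))    # protein-level log2 abundance

    sample_a = [1.2e6, 8.0e5, 3.0e5, 9.0e4]   # spiked standard, higher dilution
    sample_b = [6.1e5, 4.2e5, 1.4e5, 5.0e4]   # lower dilution of same protein
    ratio = 2 ** (protein_estimate(sample_a) - protein_estimate(sample_b))
    print(f"estimated fold change: {ratio:.2f}")
    ```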

  17. Reinforce: An Ensemble Approach for Inferring PPI Network from AP-MS Data.

    PubMed

    Tian, Bo; Duan, Qiong; Zhao, Can; Teng, Ben; He, Zengyou

    2017-05-17

    Affinity Purification-Mass Spectrometry (AP-MS) is one of the most important technologies for constructing protein-protein interaction (PPI) networks. In this paper, we propose an ensemble method, Reinforce, for inferring PPI networks from AP-MS data sets. Reinforce is based on rank aggregation and false discovery rate control. Under the null hypothesis that the interaction scores from different scoring methods are randomly generated, Reinforce follows three steps to integrate multiple ranking results from different algorithms or different data sets. The experimental results show that Reinforce yields more stable and accurate inference results than existing algorithms. The source codes of Reinforce and the data sets used in the experiments are available at: https://sourceforge.net/projects/reinforce/.
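
    A conceptual sketch of rank aggregation with a permutation null, in the spirit of (but much simpler than) Reinforce: interactions scored by several methods are combined by average rank, and significance is assessed against randomly permuted rankings. A procedure such as Benjamini-Hochberg could then be applied to the resulting p-values for FDR control.

    ```python
    import numpy as np

    def aggregate_ranks(score_lists):
        """score_lists: list of arrays, each scoring the same n interactions."""
        ranks = np.array([np.argsort(np.argsort(-s)) for s in score_lists])
        return ranks.mean(axis=0)  # lower = more consistently top-ranked

    def permutation_pvalues(agg, n_methods, n_perm=2000, seed=0):
        """P(null aggregate rank <= observed) under independent random ranks."""
        rng = np.random.default_rng(seed)
        n = len(agg)
        null = np.array([
            np.mean([rng.permutation(n) for _ in range(n_methods)], axis=0)
            for _ in range(n_perm)
        ]).ravel()
        return np.array([np.mean(null <= a) for a in agg])

    scores1 = np.array([9.0, 1.0, 5.0, 0.5])   # scores from one method
    scores2 = np.array([8.5, 0.7, 6.0, 1.0])   # scores from another method
    agg = aggregate_ranks([scores1, scores2])
    print(agg, permutation_pvalues(agg, n_methods=2))
    ```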

  18. On using summary statistics from an external calibration sample to correct for covariate measurement error.

    PubMed

    Guo, Ying; Little, Roderick J; McConnell, Daniel S

    2012-01-01

    Covariate measurement error is common in epidemiologic studies. Current methods for correcting measurement error with information from external calibration samples are insufficient to provide valid adjusted inferences. We consider the problem of estimating the regression of an outcome Y on covariates X and Z, where Y and Z are observed, X is unobserved, but a variable W that measures X with error is observed. Information about measurement error is provided in an external calibration sample where data on X and W (but not Y and Z) are recorded. We describe a method that uses summary statistics from the calibration sample to create multiple imputations of the missing values of X in the regression sample, so that the regression coefficients of Y on X and Z and associated standard errors can be estimated using simple multiple imputation combining rules, yielding valid statistical inferences under the assumption of a multivariate normal distribution. The proposed method is shown by simulation to provide better inferences than existing methods, namely the naive method, classical calibration, and regression calibration, particularly for correction for bias and achieving nominal confidence levels. We also illustrate our method with an example using linear regression to examine the relation between serum reproductive hormone concentrations and bone mineral density loss in midlife women in the Michigan Bone Health and Metabolism Study. Existing methods fail to adjust appropriately for bias due to measurement error in the regression setting, particularly when measurement error is substantial. The proposed method corrects this deficiency.
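
    The "simple multiple imputation combining rules" referred to here are Rubin's rules; a minimal sketch, with invented estimates:

    ```python
    import numpy as np

    def rubins_rules(estimates, variances):
        """Combine an estimate and its squared standard error from each
        completed data set into a pooled estimate and total variance."""
        estimates = np.asarray(estimates, dtype=float)
        variances = np.asarray(variances, dtype=float)
        m = len(estimates)
        qbar = estimates.mean()              # pooled point estimate
        ubar = variances.mean()              # within-imputation variance
        b = estimates.var(ddof=1)            # between-imputation variance
        return qbar, ubar + (1 + 1 / m) * b  # estimate, total variance

    # e.g. regression coefficients of Y on X from m = 5 imputed data sets:
    est = [1.95, 2.10, 2.02, 1.88, 2.05]
    var = [0.04, 0.05, 0.04, 0.05, 0.04]
    qbar, tvar = rubins_rules(est, var)
    print(f"combined estimate {qbar:.3f}, std. error {np.sqrt(tvar):.3f}")
    ```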

  19. Is multiple-sequence alignment required for accurate inference of phylogeny?

    PubMed

    Höhl, Michael; Ragan, Mark A

    2007-04-01

    The process of inferring phylogenetic trees from molecular sequences almost always starts with a multiple alignment of these sequences but can also be based on methods that do not involve multiple sequence alignment. Very little is known about the accuracy with which such alignment-free methods recover the correct phylogeny or about the potential for increasing their accuracy. We conducted a large-scale comparison of ten alignment-free methods, among them one new approach that does not calculate distances and a faster variant of our pattern-based approach; all distance-based alignment-free methods are freely available from http://www.bioinformatics.org.au (as Python package decaf+py). We show that most methods exhibit a higher overall reconstruction accuracy in the presence of high among-site rate variation. Under all conditions that we considered, variants of the pattern-based approach were significantly better than the other alignment-free methods. The new pattern-based variant achieved a speed-up of an order of magnitude in the distance calculation step, accompanied by a small loss of tree reconstruction accuracy. A method of Bayesian inference from k-mers did not improve on classical alignment-free (and distance-based) methods but may still offer other advantages due to its Bayesian nature. We found the optimal word length k of word-based methods to be stable across various data sets, and we provide parameter ranges for two different alphabets. The influence of these alphabets was analyzed to reveal a trade-off in reconstruction accuracy between long and short branches. We have mapped the phylogenetic accuracy for many alignment-free methods, among them several recently introduced ones, and increased our understanding of their behavior in response to biologically important parameters. In all experiments, the pattern-based approach emerged as superior, at the expense of higher resource consumption. Nonetheless, no alignment-free method that we examined recovers the correct phylogeny as accurately as does an approach based on maximum-likelihood distance estimates of multiply aligned sequences.
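
    A minimal example of a word-based (alignment-free) comparison of the general kind studied here, assuming a cosine-type distance on k-mer counts (the paper's pattern-based method is more involved):

    ```python
    from collections import Counter
    from itertools import combinations
    import math

    def kmer_counts(seq, k):
        return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    def kmer_distance(s1, s2, k=4):
        c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
        shared = set(c1) | set(c2)
        dot = sum(c1[w] * c2[w] for w in shared)
        n1 = math.sqrt(sum(v * v for v in c1.values()))
        n2 = math.sqrt(sum(v * v for v in c2.values()))
        return 1.0 - dot / (n1 * n2)  # 0 = identical k-mer profiles

    seqs = {
        'taxonA': 'ATGGCGTACGATCGTACGGATCGATCGA',
        'taxonB': 'ATGGCGTACGATCGTTCGGATCGATCGA',  # one substitution vs A
        'taxonC': 'TTACGGGGCATTACGCAGCATCAGCATC',
    }
    for a, b in combinations(seqs, 2):
        print(a, b, round(kmer_distance(seqs[a], seqs[b]), 3))
    ```

    Such a distance matrix would then feed a distance-based tree method, for example neighbor joining.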

  20. Multiple imputation methods for nonparametric inference on cumulative incidence with missing cause of failure

    PubMed Central

    Lee, Minjung; Dignam, James J.; Han, Junhee

    2014-01-01

    We propose a nonparametric approach for cumulative incidence estimation when causes of failure are unknown or missing for some subjects. Under the missing at random assumption, we estimate the cumulative incidence function using multiple imputation methods. We develop asymptotic theory for the cumulative incidence estimators obtained from multiple imputation methods. We also discuss how to construct confidence intervals for the cumulative incidence function and perform a test for comparing the cumulative incidence functions in two samples with missing cause of failure. Through simulation studies, we show that the proposed methods perform well. The methods are illustrated with data from a randomized clinical trial in early stage breast cancer. PMID:25043107

  1. A mixed methods study of multiple health behaviors among individuals with stroke.

    PubMed

    Plow, Matthew; Moore, Shirley M; Sajatovic, Martha; Katzan, Irene

    2017-01-01

    Individuals with stroke often have multiple cardiovascular risk factors that necessitate promoting engagement in multiple health behaviors. However, observational studies of individuals with stroke have typically focused on promoting a single health behavior. Thus, there is a poor understanding of linkages between healthy behaviors and the circumstances in which factors, such as stroke impairments, may influence a single or multiple health behaviors. We conducted a mixed methods convergent parallel study of 25 individuals with stroke to examine the relationships between stroke impairments and physical activity, sleep, and nutrition. Our goal was to gain further insight into possible strategies to promote multiple health behaviors among individuals with stroke. This study focused on physical activity, sleep, and nutrition because of their importance in achieving energy balance, maintaining a healthy weight, and reducing cardiovascular risks. Qualitative and quantitative data were collected concurrently, with the former being prioritized over the latter. Qualitative data were prioritized in order to develop a conceptual model of engagement in multiple health behaviors among individuals with stroke. Qualitative and quantitative data were analyzed independently and then were integrated during the inference stage to develop meta-inferences. The 25 individuals with stroke completed closed-ended questionnaires on healthy behaviors and physical function. They also participated in face-to-face focus groups and one-to-one phone interviews. We found statistically significant and moderate correlations between hand function and healthy eating habits (r = 0.45), sleep disturbances and limitations in activities of daily living (r = -0.55), BMI and limitations in activities of daily living (r = -0.49), physical activity and limitations in activities of daily living (r = 0.41), mobility impairments and BMI (r = -0.41), sleep disturbances and physical activity (r = -0.48), sleep disturbances and BMI (r = 0.48), and physical activity and BMI (r = -0.45). We identified five qualitative themes: (1) Impairments: reduced autonomy, (2) Environmental forces: caregivers and information, (3) Re-evaluation: priorities and attributions, (4) Resiliency: finding motivation and solutions, and (5) Negative affectivity: stress and self-consciousness. Three meta-inferences and a conceptual model described circumstances in which factors could influence single or multiple health behaviors. This is the first mixed methods study of individuals with stroke to elaborate on relationships between multiple health behaviors, BMI, and physical function. A conceptual model illustrates addressing sleep disturbances, activity limitations, self-image, and emotions to promote multiple health behaviors. We discuss the relevance of the meta-inferences in designing multiple behavior change interventions for individuals with stroke.

  2. A simulation-based evaluation of methods for inferring linear barriers to gene flow

    Treesearch

    Christopher Blair; Dana E. Weigel; Matthew Balazik; Annika T. H. Keeley; Faith M. Walker; Erin Landguth; Sam Cushman; Melanie Murphy; Lisette Waits; Niko Balkenhol

    2012-01-01

    Different analytical techniques used on the same data set may lead to different conclusions about the existence and strength of genetic structure. Therefore, reliable interpretation of the results from different methods depends on the efficacy and reliability of different statistical methods. In this paper, we evaluated the performance of multiple analytical methods to...

  3. From cacti to carnivores: Improved phylotranscriptomic sampling and hierarchical homology inference provide further insight into the evolution of Caryophyllales.

    PubMed

    Walker, Joseph F; Yang, Ya; Feng, Tao; Timoneda, Alfonso; Mikenas, Jessica; Hutchison, Vera; Edwards, Caroline; Wang, Ning; Ahluwalia, Sonia; Olivieri, Julia; Walker-Hale, Nathanael; Majure, Lucas C; Puente, Raúl; Kadereit, Gudrun; Lauterbach, Maximilian; Eggli, Urs; Flores-Olvera, Hilda; Ochoterena, Helga; Brockington, Samuel F; Moore, Michael J; Smith, Stephen A

    2018-03-01

    The Caryophyllales contain ~12,500 species and are known for their cosmopolitan distribution, convergence of trait evolution, and extreme adaptations. Some relationships within the Caryophyllales, like those of many large plant clades, remain unclear, and phylogenetic studies often recover alternative hypotheses. We explore the utility of broad and dense transcriptome sampling across the order for resolving evolutionary relationships in Caryophyllales. We generated 84 transcriptomes and combined these with 224 publicly available transcriptomes to perform a phylogenomic analysis of Caryophyllales. To overcome the computational challenge of ortholog detection in such a large data set, we developed an approach for clustering gene families that allowed us to analyze >300 transcriptomes and genomes. We then inferred the species relationships using multiple methods and performed gene-tree conflict analyses. Our phylogenetic analyses resolved many clades with strong support, but also showed significant gene-tree discordance. This discordance is not only a common feature of phylogenomic studies, but also represents an opportunity to understand processes that have structured phylogenies. We also found taxon sampling influences species-tree inference, highlighting the importance of more focused studies with additional taxon sampling. Transcriptomes are useful both for species-tree inference and for uncovering evolutionary complexity within lineages. Through analyses of gene-tree conflict and multiple methods of species-tree inference, we demonstrate that phylogenomic data can provide unparalleled insight into the evolutionary history of Caryophyllales. We also discuss a method for overcoming computational challenges associated with homolog clustering in large data sets. © 2018 The Authors. American Journal of Botany is published by Wiley Periodicals, Inc. on behalf of the Botanical Society of America.

  4. RENT+: an improved method for inferring local genealogical trees from haplotypes with recombination.

    PubMed

    Mirzaei, Sajad; Wu, Yufeng

    2017-04-01

    Motivation: Haplotypes from one or multiple related populations share a common genealogical history. If this shared genealogy can be inferred from haplotypes, it can be very useful for many population genetics problems. However, with the presence of recombination, the genealogical history of haplotypes is complex and cannot be represented by a single genealogical tree. Therefore, inference of genealogical history with recombination is much more challenging than the case of no recombination. Results: In this paper, we present a new approach called RENT+ for the inference of local genealogical trees from haplotypes with the presence of recombination. RENT+ builds on a previous genealogy inference approach called RENT, which infers a set of related genealogical trees at different genomic positions. RENT+ represents a significant improvement over RENT in that it is more effective at extracting the information about the underlying genealogy contained in the haplotype data. The key components of RENT+ are several greatly enhanced genealogy inference rules. Through simulation, we show that RENT+ is more efficient and accurate than several existing genealogy inference methods. As an application, we apply RENT+ in the inference of population demographic history from haplotypes, which outperforms several existing methods. Availability and implementation: RENT+ is implemented in Java, and is freely available for download from: https://github.com/SajadMirzaei/RentPlus. Contact: sajad@engr.uconn.edu or ywu@engr.uconn.edu. Supplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

  5. Discovering time-lagged rules from microarray data using gene profile classifiers

    PubMed Central

    2011-01-01

    Background: Gene regulatory networks have an essential role in every process of life. In this regard, the amount of genome-wide time series data is becoming increasingly available, providing the opportunity to discover the time-delayed gene regulatory networks that govern the majority of these molecular processes. Results: This paper aims at reconstructing gene regulatory networks from multiple genome-wide microarray time series datasets. To this end, a new model-free algorithm called GRNCOP2 (Gene Regulatory Network inference by Combinatorial OPtimization 2), which is a significant evolution of the GRNCOP algorithm, was developed using combinatorial optimization of gene profile classifiers. The method is capable of inferring potential time-delay relationships with any span of time between genes from various time series datasets given as input. The proposed algorithm was applied to time series data composed of twenty yeast genes that are highly relevant for the cell-cycle study, and the results were compared against several related approaches. The outcomes have shown that GRNCOP2 outperforms the contrasted methods in terms of the proposed metrics, and that the results are consistent with previous biological knowledge. Additionally, a genome-wide study on multiple publicly available time series data was performed. In this case, the experimentation exhibited the soundness and scalability of the new method, which inferred highly related, statistically significant gene associations. Conclusions: A novel method for inferring time-delayed gene regulatory networks from genome-wide time series datasets is proposed in this paper. The method was carefully validated with several publicly available data sets. The results have demonstrated that the algorithm constitutes a usable model-free approach capable of predicting meaningful relationships between genes, revealing the time-trends of gene regulation. PMID:21524308
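
    The time-lagged idea can be sketched in a few lines (this is a toy stand-in I wrote, not the GRNCOP2 classifier-optimization machinery):

    ```python
    import numpy as np

    def lagged_agreement(regulator, target, max_lag=3):
        """Profiles are +1/-1 (up/down) per time point; returns the lag at
        which the regulator best predicts the target, with its score."""
        scores = {}
        for lag in range(1, max_lag + 1):
            r, t = regulator[:-lag], target[lag:]
            scores[lag] = float(np.mean(r == t))  # fraction of agreeing points
        return max(scores.items(), key=lambda kv: kv[1])

    rng = np.random.default_rng(2)
    reg = rng.choice([-1, 1], size=30)
    tgt = np.roll(reg, 2)             # target follows regulator with lag 2
    tgt[:2] = rng.choice([-1, 1], 2)  # the first points carry no signal
    print(lagged_agreement(reg, tgt))  # expect (2, ~1.0)
    ```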

  6. Network inference using informative priors.

    PubMed

    Mukherjee, Sach; Speed, Terence P

    2008-09-23

    Recent years have seen much interest in the study of systems characterized by multiple interacting components. A class of statistical models called graphical models, in which graphs are used to represent probabilistic relationships between variables, provides a framework for formal inference regarding such systems. In many settings, the object of inference is the network structure itself. This problem of "network inference" is well known to be a challenging one. However, in scientific settings there is very often existing information regarding network connectivity. A natural idea then is to take account of such information during inference. This article addresses the question of incorporating prior information into network inference. We focus on directed models called Bayesian networks, and use Markov chain Monte Carlo to draw samples from posterior distributions over network structures. We introduce prior distributions on graphs capable of capturing information regarding network features including edges, classes of edges, degree distributions, and sparsity. We illustrate our approach in the context of systems biology, applying our methods to network inference in cancer signaling.
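
    A schematic toy version of the two ingredients, a structure prior and MCMC over graphs, might look like this (all names and numbers are mine; the likelihood is a placeholder, and a real Bayesian-network sampler would score the data via a marginal likelihood and enforce acyclicity):

    ```python
    import numpy as np

    def log_prior(adj, believed_edges, kappa=2.0, lam=1.0):
        """Favor edges with prior support; penalize overall edge count."""
        support = sum(kappa for e in believed_edges if adj[e])
        return support - lam * adj.sum()

    def log_likelihood(adj, data):
        return 0.0  # placeholder: plug in a network scoring function here

    def mcmc_graphs(n_nodes, believed_edges, data, n_iter=5000, seed=3):
        rng = np.random.default_rng(seed)
        adj = np.zeros((n_nodes, n_nodes), dtype=bool)
        samples = []
        for _ in range(n_iter):
            i, j = rng.integers(n_nodes, size=2)
            if i == j:
                continue
            proposal = adj.copy()
            proposal[i, j] = not proposal[i, j]    # toggle one edge
            log_ratio = (log_prior(proposal, believed_edges)
                         + log_likelihood(proposal, data)
                         - log_prior(adj, believed_edges)
                         - log_likelihood(adj, data))
            if np.log(rng.random()) < log_ratio:   # Metropolis-Hastings step
                adj = proposal
            samples.append(adj.copy())
        return samples

    samples = mcmc_graphs(4, believed_edges=[(0, 1), (2, 3)], data=None)
    freq = np.mean(samples[1000:], axis=0)  # posterior edge frequencies
    print(np.round(freq, 2))  # believed edges appear most often
    ```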

  7. Non-parametric combination and related permutation tests for neuroimaging.

    PubMed

    Winkler, Anderson M; Webster, Matthew A; Brooks, Jonathan C; Tracey, Irene; Smith, Stephen M; Nichols, Thomas E

    2016-04-01

    In this work, we show how permutation methods can be applied to combination analyses such as those that include multiple imaging modalities, multiple data acquisitions of the same modality, or simply multiple hypotheses on the same data. Using the well-known definition of union-intersection tests and closed testing procedures, we use synchronized permutations to correct for such multiplicity of tests, allowing flexibility to integrate imaging data with different spatial resolutions, surface and/or volume-based representations of the brain, including non-imaging data. For the problem of joint inference, we propose and evaluate a modification of the recently introduced non-parametric combination (NPC) methodology, such that instead of a two-phase algorithm and large data storage requirements, the inference can be performed in a single phase, with reasonable computational demands. The method compares favorably to classical multivariate tests (such as MANCOVA), even when the latter is assessed using permutations. We also evaluate, in the context of permutation tests, various combining methods that have been proposed in the past decades, and identify those that provide the best control over error rate and power across a range of situations. We show that one of these, the method of Tippett, provides a link between correction for the multiplicity of tests and their combination. Finally, we discuss how the correction can solve certain problems of multiple comparisons in one-way ANOVA designs, and how the combination is distinguished from conjunctions, even though both can be assessed using permutation tests. We also provide a common algorithm that accommodates combination and correction. © 2016 The Authors Human Brain Mapping Published by Wiley Periodicals, Inc.
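
    A minimal sketch of the NPC idea for two one-sample tests on the same subjects, assuming sign-flipping as the permutation scheme and Fisher's combining function (Tippett's method would use the smallest partial p-value instead); the same relabelling is applied to both partial tests, preserving their dependence:

    ```python
    import numpy as np

    def npc_fisher(y1, y2, n_perm=2000, seed=4):
        rng = np.random.default_rng(seed)
        n = len(y1)
        t = lambda x: x.mean() / (x.std(ddof=1) / np.sqrt(n))
        # Row 0 is the observed (identity) relabelling.
        signs = np.vstack([np.ones(n),
                           rng.choice([-1.0, 1.0], size=(n_perm, n))])
        t1 = np.array([t(s * y1) for s in signs])   # synchronized flips:
        t2 = np.array([t(s * y2) for s in signs])   # same s for both tests
        # Partial p-values for every relabelling, from the joint null.
        p1 = np.array([np.mean(t1 >= v) for v in t1])
        p2 = np.array([np.mean(t2 >= v) for v in t2])
        fisher = -2 * (np.log(p1) + np.log(p2))     # combining function
        return np.mean(fisher >= fisher[0])         # combined p-value

    rng = np.random.default_rng(5)
    effect = rng.normal(0.4, 1.0, size=30)          # shared subject effect
    y1 = effect + rng.normal(size=30)               # modality 1
    y2 = 0.8 * effect + rng.normal(size=30)         # modality 2 (correlated)
    print(npc_fisher(y1, y2))
    ```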

  8. Sensitivity to imputation models and assumptions in receiver operating characteristic analysis with incomplete data

    PubMed Central

    Karakaya, Jale; Karabulut, Erdem; Yucel, Recai M.

    2015-01-01

    Modern statistical methods using incomplete data have been increasingly applied in a wide variety of substantive problems. Similarly, receiver operating characteristic (ROC) analysis, a method used in evaluating diagnostic tests or biomarkers in medical research, has also become increasingly popular in both its development and application. While missing-data methods have been applied in ROC analysis, the impact of model mis-specification and/or assumptions (e.g. missing at random) underlying the missing data has not been thoroughly studied. In this work, we study the performance of multiple imputation (MI) inference in ROC analysis. Particularly, we investigate parametric and non-parametric techniques for MI inference under common missingness mechanisms. Depending on the coherency of the imputation model with the underlying data generation mechanism, our results show that MI generally leads to well-calibrated inferences under ignorable missingness mechanisms. PMID:26379316

  9. Inferring Phylogenetic Networks Using PhyloNet.

    PubMed

    Wen, Dingqiao; Yu, Yun; Zhu, Jiafan; Nakhleh, Luay

    2018-07-01

    PhyloNet was released in 2008 as a software package for representing and analyzing phylogenetic networks. At the time of its release, the main functionalities in PhyloNet consisted of measures for comparing network topologies and a single heuristic for reconciling gene trees with a species tree. Since then, PhyloNet has grown significantly. The software package now includes a wide array of methods for inferring phylogenetic networks from data sets of unlinked loci while accounting for both reticulation (e.g., hybridization) and incomplete lineage sorting. In particular, PhyloNet now allows for maximum parsimony, maximum likelihood, and Bayesian inference of phylogenetic networks from gene tree estimates. Furthermore, Bayesian inference directly from sequence data (sequence alignments or biallelic markers) is implemented. Maximum parsimony is based on an extension of the "minimizing deep coalescences" criterion to phylogenetic networks, whereas maximum likelihood and Bayesian inference are based on the multispecies network coalescent. All methods allow for multiple individuals per species. As computing the likelihood of a phylogenetic network is computationally hard, PhyloNet allows for evaluation and inference of networks using a pseudolikelihood measure. PhyloNet summarizes the results of the various analyses and generates phylogenetic networks in the extended Newick format that is readily viewable by existing visualization software.

  10. Does polyandry really pay off? The effects of multiple mating and number of fathers on morphological traits and survival in clutches of nesting green turtles at Tortuguero.

    PubMed

    Alfaro-Núñez, Alonzo; Jensen, Michael P; Abreu-Grobois, F Alberto

    2015-01-01

    Despite the long debate over whether or not multiple mating benefits the offspring, studies still show contradictory results. Multiple mating takes time and energy. Thus, if females can fertilize their eggs with a single mating, why mate more than once? We investigated and inferred paternal identity and the number of sires in 12 clutches (240 hatchlings) from green turtle (Chelonia mydas) nests at Tortuguero, Costa Rica. Paternal alleles were inferred through comparison of maternal and hatchling genotypes, and indicated multiple paternity in at least 11 of the clutches (92%). The inferred average number of fathers was three (ranging from 1 to 5). Moreover, regression analyses were used to investigate correlations of inferred clutch paternity with morphological fitness traits of hatchlings (emergence success, length, weight and crawling speed), the size of the mother, and an environmental variable (incubation temperature). We suggest and propose two different comparative approaches for evaluating morphological traits and clutch paternity in order to infer greater offspring survival: first, clutches coded by the exact number of fathers, and second, clutches coded by the dominant paternal contribution (the father siring the greatest proportion of the offspring per nest). We found significant differences (P < 0.05) in clutches coded by the exact number of fathers for all morphological traits. A general tendency of higher values in offspring sired by two to three fathers was observed for the length and weight traits. However, emergence success and crawling speed showed different trends, which prevented us from reaching any further conclusion. The second approach, analysing the paternal contribution, showed no significant difference (P > 0.05) for any of the traits. We conclude that multiple paternity does not provide any extra benefit in the morphological fitness traits or the survival of the offspring when analysed following the proposed comparative statistical methods.

  11. In defence of model-based inference in phylogeography

    PubMed Central

    Beaumont, Mark A.; Nielsen, Rasmus; Robert, Christian; Hey, Jody; Gaggiotti, Oscar; Knowles, Lacey; Estoup, Arnaud; Panchal, Mahesh; Corander, Jukka; Hickerson, Mike; Sisson, Scott A.; Fagundes, Nelson; Chikhi, Lounès; Beerli, Peter; Vitalis, Renaud; Cornuet, Jean-Marie; Huelsenbeck, John; Foll, Matthieu; Yang, Ziheng; Rousset, Francois; Balding, David; Excoffier, Laurent

    2017-01-01

    Recent papers have promoted the view that model-based methods in general, and those based on Approximate Bayesian Computation (ABC) in particular, are flawed in a number of ways, and are therefore inappropriate for the analysis of phylogeographic data. These papers further argue that Nested Clade Phylogeographic Analysis (NCPA) offers the best approach in statistical phylogeography. In order to remove the confusion and misconceptions introduced by these papers, we justify and explain the reasoning behind model-based inference. We argue that ABC is a statistically valid approach, alongside other computational statistical techniques that have been successfully used to infer parameters and compare models in population genetics. We also examine the NCPA method and highlight numerous deficiencies, either when used with single or multiple loci. We further show that the ages of clades are carelessly used to infer ages of demographic events, that these ages are estimated under a simple model of panmixia and population stationarity but are then used under different and unspecified models to test hypotheses, a usage that invalidates these testing procedures. We conclude by encouraging researchers to study and use model-based inference in population genetics. PMID:29284924
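
    For readers unfamiliar with ABC, a bare-bones rejection sampler illustrates the model-based logic being defended (the toy model and its Poisson coalescent summary are a standard textbook approximation, not taken from any specific study): simulate data under parameters drawn from the prior and keep the parameters whose simulated summary statistics fall close to the observed ones.

    ```python
    import numpy as np

    def simulate_segregating_sites(theta, n_samples, rng):
        # Under the standard neutral coalescent, E[S] = theta * a_n with
        # a_n = sum_{i=1}^{n-1} 1/i; a Poisson draw is a common approximation.
        a_n = sum(1.0 / i for i in range(1, n_samples))
        return rng.poisson(theta * a_n)

    def abc_rejection(s_obs, n_samples, n_sims=100000, tol=2, seed=6):
        rng = np.random.default_rng(seed)
        theta_prior = rng.uniform(0.0, 20.0, size=n_sims)   # flat prior
        s_sim = np.array([simulate_segregating_sites(t, n_samples, rng)
                          for t in theta_prior])
        accepted = theta_prior[np.abs(s_sim - s_obs) <= tol]
        return accepted  # draws from the approximate posterior

    posterior = abc_rejection(s_obs=25, n_samples=10)
    print(posterior.mean(), np.percentile(posterior, [2.5, 97.5]))
    ```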

  12. Boolean dynamics of genetic regulatory networks inferred from microarray time series data

    DOE PAGES

    Martin, Shawn; Zhang, Zhaoduo; Martino, Anthony; ...

    2007-01-31

    Methods available for the inference of genetic regulatory networks strive to produce a single network, usually by optimizing some quantity to fit the experimental observations. In this paper we investigate the possibility that multiple networks can be inferred, all resulting in similar dynamics. This idea is motivated by theoretical work which suggests that biological networks are robust and adaptable to change, and that the overall behavior of a genetic regulatory network might be captured in terms of dynamical basins of attraction. We have developed and implemented a method for inferring genetic regulatory networks from time series microarray data. Our method first clusters and discretizes the gene expression data using k-means and support vector regression. We then enumerate Boolean activation–inhibition networks to match the discretized data. Finally, the dynamics of the Boolean networks are examined. We have tested our method on two immunology microarray datasets: an IL-2-stimulated T cell response dataset and a LPS-stimulated macrophage response dataset. In both cases, we discovered that many networks matched the data, and that most of these networks had similar dynamics.
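
    A toy version of the enumeration step (the rule family and data are invented; the actual method discretizes with k-means and support vector regression): for each target gene, every activation/inhibition assignment over a small set of candidate regulators is tried, and those that reproduce the whole discretized time series are kept. On larger datasets, many distinct rules typically match the same data, which is the paper's point.

    ```python
    from itertools import product

    def rule_output(state, activators, inhibitors):
        """Gene turns on iff some activator is on and no inhibitor is on."""
        act = any(state[a] for a in activators)
        inh = any(state[i] for i in inhibitors)
        return int(act and not inh)

    def matching_rules(series, target, candidate_regulators):
        """Enumerate regulator assignments consistent with the whole series."""
        matches = []
        for labels in product([None, 'act', 'inh'],
                              repeat=len(candidate_regulators)):
            acts = [g for g, l in zip(candidate_regulators, labels) if l == 'act']
            inhs = [g for g, l in zip(candidate_regulators, labels) if l == 'inh']
            if not acts:
                continue
            ok = all(rule_output(series[t], acts, inhs) == series[t + 1][target]
                     for t in range(len(series) - 1))
            if ok:
                matches.append((tuple(acts), tuple(inhs)))
        return matches

    # Discretized (0/1) expression of genes 0..2 over five time points.
    series = [(1, 0, 0), (1, 1, 0), (1, 1, 1), (1, 1, 1), (1, 1, 1)]
    print(matching_rules(series, target=2, candidate_regulators=[0, 1]))
    ```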

  13. Synthetic observations of protostellar multiple systems

    NASA Astrophysics Data System (ADS)

    Lomax, O.; Whitworth, A. P.

    2018-04-01

    Observations of protostars are often compared with synthetic observations of models in order to infer the underlying physical properties of the protostars. The majority of these models have a single protostar, attended by a disc and an envelope. However, observational and numerical evidence suggests that a large fraction of protostars form as multiple systems. This means that fitting models of single protostars to observations may be inappropriate. We produce synthetic observations of protostellar multiple systems undergoing realistic, non-continuous accretion. These systems consist of multiple protostars with episodic luminosities, embedded self-consistently in discs and envelopes. We model the gas dynamics of these systems using smoothed particle hydrodynamics and we generate synthetic observations by post-processing the snapshots using the SPAMCART Monte Carlo radiative transfer code. We present simulation results of three model protostellar multiple systems. For each of these, we generate 4 × 10⁴ synthetic spectra at different points in time and from different viewing angles. We propose a Bayesian method, using similar calculations to those presented here, but in greater numbers, to infer the physical properties of protostellar multiple systems from observations.

  14. Non-monophyly and intricate morphological evolution within the avian family Cettiidae revealed by multilocus analysis of a taxonomically densely sampled dataset

    PubMed Central

    2011-01-01

    Background: The avian family Cettiidae, including the genera Cettia, Urosphena, Tesia, Abroscopus and Tickellia and Orthotomus cucullatus, has recently been proposed based on analysis of a small number of loci and species. The close relationship of most of these taxa was unexpected, and called for a comprehensive study based on multiple loci and dense taxon sampling. In the present study, we infer the relationships of all except one of the species in this family using one mitochondrial and three nuclear loci. We use traditional gene tree methods (Bayesian inference, maximum likelihood bootstrapping, parsimony bootstrapping), as well as a recently developed Bayesian species tree approach (*BEAST) that accounts for lineage sorting processes that might produce discordance between gene trees. We also analyse mitochondrial DNA for a larger sample, comprising multiple individuals and a large number of subspecies of polytypic species. Results: There are many topological incongruences among the single-locus trees, although none of these is strongly supported. The multi-locus tree inferred using concatenated sequences and the species tree agree well with each other, and are overall well resolved and well supported by the data. The main discrepancy between these trees concerns the most basal split. Both methods infer the genus Cettia to be highly non-monophyletic, as it is scattered across the entire family tree. Deep intraspecific divergences are revealed, and one or two species and one subspecies are inferred to be non-monophyletic (differences between methods). Conclusions: The molecular phylogeny presented here is strongly inconsistent with the traditional, morphology-based classification. The remarkably high degree of non-monophyly in the genus Cettia is likely to be one of the most extraordinary examples of misconceived relationships in an avian genus. The phylogeny suggests instances of parallel evolution, as well as highly unequal rates of morphological divergence in different lineages. This complex morphological evolution apparently misled earlier taxonomists. These results underscore the well-known but still often neglected problem of basing classifications on overall morphological similarity. Based on the molecular data, a revised taxonomy is proposed. Although the traditional and species tree methods inferred much the same tree in the present study, the assumption by species tree methods that all species are monophyletic is a limitation in these methods, as some currently recognized species might have more complex histories. PMID:22142197

  15. Multiple Illuminant Colour Estimation via Statistical Inference on Factor Graphs.

    PubMed

    Mutimbu, Lawrence; Robles-Kelly, Antonio

    2016-08-31

    This paper presents a method to recover a spatially varying illuminant colour estimate from scenes lit by multiple light sources. Starting with the image formation process, we formulate the illuminant recovery problem in a statistically data-driven setting. To do this, we use a factor graph defined across the scale space of the input image. In the graph, we utilise a set of illuminant prototypes computed using a data-driven approach. As a result, our method delivers a pixelwise illuminant colour estimate without requiring illuminant libraries or user input. The use of a factor graph also allows the illuminant estimates to be recovered via a maximum a posteriori (MAP) inference process. Moreover, we compute the probability marginals by performing a Delaunay triangulation on our factor graph. We illustrate the utility of our method for pixelwise illuminant colour recovery on widely available datasets and compare against a number of alternatives. We also show sample colour correction results on real-world images.

  16. The HCUP SID Imputation Project: Improving Statistical Inferences for Health Disparities Research by Imputing Missing Race Data.

    PubMed

    Ma, Yan; Zhang, Wei; Lyman, Stephen; Huang, Yihe

    2018-06-01

    To identify the most appropriate imputation method for missing data in the HCUP State Inpatient Databases (SID) and assess the impact of different missing data methods on racial disparities research. HCUP SID. A novel simulation study compared four imputation methods (random draw, hot deck, joint multiple imputation [MI], conditional MI) for missing values for multiple variables, including race, gender, admission source, median household income, and total charges. The simulation was built on real data from the SID to retain their hierarchical data structures and missing data patterns. Additional predictive information from the U.S. Census and American Hospital Association (AHA) database was incorporated into the imputation. Conditional MI prediction was equivalent or superior to the best performing alternatives for all missing data structures and substantially outperformed each of the alternatives in various scenarios. Conditional MI substantially improved statistical inferences for racial health disparities research with the SID. © Health Research and Educational Trust.

  17. Gaussian process surrogates for failure detection: A Bayesian experimental design approach

    NASA Astrophysics Data System (ADS)

    Wang, Hongqiao; Lin, Guang; Li, Jinglai

    2016-05-01

    An important task of uncertainty quantification is to identify the probability of undesired events, in particular, system failures, caused by various sources of uncertainties. In this work we consider the construction of Gaussian process surrogates for failure detection and failure probability estimation. In particular, we consider the situation that the underlying computer models are extremely expensive, and in this setting, determining the sampling points in the state space is of essential importance. We formulate the problem as an optimal experimental design for Bayesian inferences of the limit state (i.e., the failure boundary) and propose an efficient numerical scheme to solve the resulting optimization problem. In particular, the proposed limit-state inference method is capable of determining multiple sampling points at a time, and thus it is well suited for problems where multiple computer simulations can be performed in parallel. The accuracy and performance of the proposed method is demonstrated by both academic and practical examples.
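
    A hedged sketch of such a design loop, assuming a simple U-type criterion for picking points where the surrogate is least certain about the sign of the limit-state function (the paper's criterion and its batch selection of multiple points per iteration differ; one point per iteration and the stand-in model g are mine):

    ```python
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    def g(x):                      # stand-in for an expensive computer model
        return x**2 - 2.0          # "failure" where g > 0

    rng = np.random.default_rng(7)
    X = rng.uniform(-3, 3, size=(5, 1))          # initial design
    y = g(X[:, 0])
    grid = np.linspace(-3, 3, 601).reshape(-1, 1)

    for _ in range(10):
        gp = GaussianProcessRegressor(kernel=RBF(1.0), alpha=1e-6).fit(X, y)
        mu, sd = gp.predict(grid, return_std=True)
        u = np.abs(mu) / np.maximum(sd, 1e-12)   # small u = uncertain sign
        x_new = grid[np.argmin(u)]               # refine near the boundary
        X = np.vstack([X, x_new])
        y = np.append(y, g(x_new[0]))

    # Failure probability of a standard-normal input, via surrogate Monte Carlo.
    samples = rng.normal(size=100000).reshape(-1, 1)
    pf = np.mean(gp.predict(samples) > 0)
    print(f"estimated failure probability: {pf:.4f}")  # true value ~0.157
    ```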

  18. Stochastic determination of matrix determinants

    NASA Astrophysics Data System (ADS)

    Dorn, Sebastian; Enßlin, Torsten A.

    2015-07-01

    Matrix determinants play an important role in data analysis, in particular when Gaussian processes are involved. Due to currently exploding data volumes, linear operations—matrices—acting on the data are often not accessible directly but are only represented indirectly in form of a computer routine. Such a routine implements the transformation a data vector undergoes under matrix multiplication. While efficient probing routines to estimate a matrix's diagonal or trace, based solely on such computationally affordable matrix-vector multiplications, are well known and frequently used in signal inference, there is no stochastic estimate for its determinant. We introduce a probing method for the logarithm of a determinant of a linear operator. Our method rests upon a reformulation of the log-determinant by an integral representation and the transformation of the involved terms into stochastic expressions. This stochastic determinant determination enables large-size applications in Bayesian inference, in particular evidence calculations, model comparison, and posterior determination.
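
    The trace identity behind such estimators, log det A = tr(log A), can be probed with random vectors, since E[zᵀ M z] = tr(M) when E[zzᵀ] = I. The sketch below forms log A densely for clarity, which is precisely what the paper's matrix-free integral representation avoids; it illustrates the probing idea only.

    ```python
    import numpy as np
    from scipy.linalg import logm

    def stochastic_logdet(A, n_probes=200, seed=8):
        rng = np.random.default_rng(seed)
        L = logm(A)                 # dense stand-in for the log-operator
        est = 0.0
        for _ in range(n_probes):
            z = rng.choice([-1.0, 1.0], size=A.shape[0])  # Rademacher probe
            est += z @ L @ z        # each term estimates tr(log A)
        return est / n_probes

    rng = np.random.default_rng(9)
    B = rng.normal(size=(50, 50))
    A = B @ B.T + 50 * np.eye(50)   # symmetric positive definite
    print(stochastic_logdet(A), np.linalg.slogdet(A)[1])  # should be close
    ```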

  19. Stochastic determination of matrix determinants.

    PubMed

    Dorn, Sebastian; Ensslin, Torsten A

    2015-07-01

    Matrix determinants play an important role in data analysis, in particular when Gaussian processes are involved. Due to currently exploding data volumes, linear operations (matrices) acting on the data are often not accessible directly but are only represented indirectly in form of a computer routine. Such a routine implements the transformation a data vector undergoes under matrix multiplication. While efficient probing routines to estimate a matrix's diagonal or trace, based solely on such computationally affordable matrix-vector multiplications, are well known and frequently used in signal inference, there is no stochastic estimate for its determinant. We introduce a probing method for the logarithm of a determinant of a linear operator. Our method rests upon a reformulation of the log-determinant by an integral representation and the transformation of the involved terms into stochastic expressions. This stochastic determinant determination enables large-size applications in Bayesian inference, in particular evidence calculations, model comparison, and posterior determination.

  20. A method for detecting IBD regions simultaneously in multiple individuals—with applications to disease genetics

    PubMed Central

    Moltke, Ida; Albrechtsen, Anders; Hansen, Thomas v.O.; Nielsen, Finn C.; Nielsen, Rasmus

    2011-01-01

    All individuals in a finite population are related if traced back long enough and will, therefore, share regions of their genomes identical by descent (IBD). Detection of such regions has several important applications—from answering questions about human evolution to locating regions in the human genome containing disease-causing variants. However, IBD regions can be difficult to detect, especially in the common case where no pedigree information is available. In particular, all existing non-pedigree based methods can only infer IBD sharing between two individuals. Here, we present a new Markov Chain Monte Carlo method for detection of IBD regions, which does not rely on any pedigree information. It is based on a probabilistic model applicable to unphased SNP data. It can take inbreeding, allele frequencies, genotyping errors, and genomic distances into account. And most importantly, it can simultaneously infer IBD sharing among multiple individuals. Through simulations, we show that the simultaneous modeling of multiple individuals makes the method more powerful and accurate than several other non-pedigree based methods. We illustrate the potential of the method by applying it to data from individuals with breast and/or ovarian cancer, and show that a known disease-causing mutation can be mapped to a 2.2-Mb region using SNP data from only five seemingly unrelated affected individuals. This would not be possible using classical linkage mapping or association mapping. PMID:21493780

  1. Identifying and exploiting trait-relevant tissues with multiple functional annotations in genome-wide association studies

    PubMed Central

    Zhang, Shujun

    2018-01-01

    Genome-wide association studies (GWASs) have identified many disease associated loci, the majority of which have unknown biological functions. Understanding the mechanism underlying trait associations requires identifying trait-relevant tissues and investigating associations in a trait-specific fashion. Here, we extend the widely used linear mixed model to incorporate multiple SNP functional annotations from omics studies with GWAS summary statistics to facilitate the identification of trait-relevant tissues, with which to further construct powerful association tests. Specifically, we rely on a generalized estimating equation based algorithm for parameter inference, a mixture modeling framework for trait-tissue relevance classification, and a weighted sequence kernel association test constructed based on the identified trait-relevant tissues for powerful association analysis. We refer to our analytic procedure as the Scalable Multiple Annotation integration for trait-Relevant Tissue identification and usage (SMART). With extensive simulations, we show how our method can make use of multiple complementary annotations to improve the accuracy for identifying trait-relevant tissues. In addition, our procedure allows us to make use of the inferred trait-relevant tissues, for the first time, to construct more powerful SNP set tests. We apply our method for an in-depth analysis of 43 traits from 28 GWASs using tissue-specific annotations in 105 tissues derived from ENCODE and Roadmap. Our results reveal new trait-tissue relevance, pinpoint important annotations that are informative of trait-tissue relationship, and illustrate how we can use the inferred trait-relevant tissues to construct more powerful association tests in the Wellcome Trust Case Control Consortium study. PMID:29377896

  2. Pilot points method for conditioning multiple-point statistical facies simulation on flow data

    NASA Astrophysics Data System (ADS)

    Ma, Wei; Jafarpour, Behnam

    2018-05-01

    We propose a new pilot points method for conditioning discrete multiple-point statistical (MPS) facies simulation on dynamic flow data. While conditioning MPS simulation on static hard data is straightforward, their calibration against nonlinear flow data is nontrivial. The proposed method generates conditional models from a conceptual model of geologic connectivity, known as a training image (TI), by strategically placing and estimating pilot points. To place pilot points, a score map is generated based on three sources of information: (i) the uncertainty in facies distribution, (ii) the model response sensitivity information, and (iii) the observed flow data. Once the pilot points are placed, the facies values at these points are inferred from production data and then are used, along with available hard data at well locations, to simulate a new set of conditional facies realizations. While facies estimation at the pilot points can be performed using different inversion algorithms, in this study the ensemble smoother (ES) is adopted to update permeability maps from production data, which are then used to statistically infer facies types at the pilot point locations. The developed method combines the information in the flow data and the TI by using the former to infer facies values at selected locations away from the wells and the latter to ensure consistent facies structure and connectivity away from measurement locations. Several numerical experiments are used to evaluate the performance of the developed method and to discuss its important properties.
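
    The score map can be sketched as a weighted combination of normalized inputs; everything below (weights, inputs, array shapes) is an invented placeholder for illustration, not the paper's actual scoring.

    ```python
    import numpy as np

    def score_map(facies_ensemble, sensitivity, mismatch, w=(1.0, 1.0, 1.0)):
        p = facies_ensemble.mean(axis=0)          # P(facies = 1) per cell
        uncertainty = p * (1 - p)                 # maximal at p = 0.5
        norm = lambda a: (a - a.min()) / (np.ptp(a) + 1e-12)
        return (w[0] * norm(uncertainty)
                + w[1] * norm(sensitivity)
                + w[2] * norm(mismatch))

    rng = np.random.default_rng(10)
    ens = rng.integers(0, 2, size=(100, 20, 20))  # 100 facies realizations
    sens = rng.random((20, 20))                   # e.g., response sensitivities
    mis = rng.random((20, 20))                    # e.g., mapped data residuals
    scores = score_map(ens, sens, mis)
    top = np.argsort(scores, axis=None)[-5:]      # five highest-scoring cells
    pilot_cells = np.column_stack(np.unravel_index(top, scores.shape))
    print(pilot_cells)
    ```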

  3. Network inference using informative priors

    PubMed Central

    Mukherjee, Sach; Speed, Terence P.

    2008-01-01

    Recent years have seen much interest in the study of systems characterized by multiple interacting components. A class of statistical models called graphical models, in which graphs are used to represent probabilistic relationships between variables, provides a framework for formal inference regarding such systems. In many settings, the object of inference is the network structure itself. This problem of “network inference” is well known to be a challenging one. However, in scientific settings there is very often existing information regarding network connectivity. A natural idea then is to take account of such information during inference. This article addresses the question of incorporating prior information into network inference. We focus on directed models called Bayesian networks, and use Markov chain Monte Carlo to draw samples from posterior distributions over network structures. We introduce prior distributions on graphs capable of capturing information regarding network features including edges, classes of edges, degree distributions, and sparsity. We illustrate our approach in the context of systems biology, applying our methods to network inference in cancer signaling. PMID:18799736

  4. Integrating Information in Biological Ontologies and Molecular Networks to Infer Novel Terms.

    PubMed

    Li, Le; Yip, Kevin Y

    2016-12-15

    Currently most terms and term-term relationships in Gene Ontology (GO) are defined manually, which creates cost, consistency and completeness issues. Recent studies have demonstrated the feasibility of inferring GO automatically from biological networks, which represents an important complementary approach to GO construction. These methods (NeXO and CliXO) are unsupervised, which means 1) they cannot use the information contained in existing GO, 2) the way they integrate biological networks may not optimize the accuracy, and 3) they are not customized to infer the three different sub-ontologies of GO. Here we present a semi-supervised method called Unicorn that extends these previous methods to tackle the three problems. Unicorn uses a sub-tree of an existing GO sub-ontology as training data to learn parameters for integrating multiple networks. Cross-validation results show that Unicorn reliably inferred the left-out parts of each specific GO sub-ontology. In addition, by training Unicorn with an old version of GO together with biological networks, it successfully re-discovered some terms and term-term relationships present only in a new version of GO. Unicorn also successfully inferred some novel terms that were not contained in GO but have biological meanings well-supported by the literature. Source code of Unicorn is available at http://yiplab.cse.cuhk.edu.hk/unicorn/.

  5. Using Genotype Abundance to Improve Phylogenetic Inference

    PubMed Central

    Mesin, Luka; Victora, Gabriel D; Minin, Vladimir N; Matsen, Frederick A

    2018-01-01

    Abstract Modern biological techniques enable very dense genetic sampling of unfolding evolutionary histories, and thus frequently sample some genotypes multiple times. This motivates strategies to incorporate genotype abundance information in phylogenetic inference. In this article, we synthesize a stochastic process model with standard sequence-based phylogenetic optimality, and show that tree estimation is substantially improved by doing so. Our method is validated with extensive simulations and an experimental single-cell lineage tracing study of germinal center B cell receptor affinity maturation. PMID:29474671

  6. Non‐parametric combination and related permutation tests for neuroimaging

    PubMed Central

    Webster, Matthew A.; Brooks, Jonathan C.; Tracey, Irene; Smith, Stephen M.; Nichols, Thomas E.

    2016-01-01

    Abstract In this work, we show how permutation methods can be applied to combination analyses such as those that include multiple imaging modalities, multiple data acquisitions of the same modality, or simply multiple hypotheses on the same data. Using the well‐known definition of union‐intersection tests and closed testing procedures, we use synchronized permutations to correct for such multiplicity of tests, allowing flexibility to integrate imaging data with different spatial resolutions, surface and/or volume‐based representations of the brain, including non‐imaging data. For the problem of joint inference, we propose and evaluate a modification of the recently introduced non‐parametric combination (NPC) methodology, such that instead of a two‐phase algorithm and large data storage requirements, the inference can be performed in a single phase, with reasonable computational demands. The method compares favorably to classical multivariate tests (such as MANCOVA), even when the latter is assessed using permutations. We also evaluate, in the context of permutation tests, various combining methods that have been proposed in the past decades, and identify those that provide the best control over error rate and power across a range of situations. We show that one of these, the method of Tippett, provides a link between correction for the multiplicity of tests and their combination. Finally, we discuss how the correction can solve certain problems of multiple comparisons in one‐way ANOVA designs, and how the combination is distinguished from conjunctions, even though both can be assessed using permutation tests. We also provide a common algorithm that accommodates combination and correction. Hum Brain Mapp 37:1486‐1511, 2016. © 2016 Wiley Periodicals, Inc. PMID:26848101

  7. iASeq: integrative analysis of allele-specificity of protein-DNA interactions in multiple ChIP-seq datasets

    PubMed Central

    2012-01-01

    Background: ChIP-seq provides new opportunities to study allele-specific protein-DNA binding (ASB). However, detecting allelic imbalance from a single ChIP-seq dataset often has low statistical power since only sequence reads mapped to heterozygote SNPs are informative for discriminating two alleles. Results: We develop a new method iASeq to address this issue by jointly analyzing multiple ChIP-seq datasets. iASeq uses a Bayesian hierarchical mixture model to learn correlation patterns of allele-specificity among multiple proteins. Using the discovered correlation patterns, the model allows one to borrow information across datasets to improve detection of allelic imbalance. Application of iASeq to 77 ChIP-seq samples from 40 ENCODE datasets and 1 genomic DNA sample in GM12878 cells reveals that allele-specificity of multiple proteins are highly correlated, and demonstrates the ability of iASeq to improve allelic inference compared to analyzing each individual dataset separately. Conclusions: iASeq illustrates the value of integrating multiple datasets in the allele-specificity inference and offers a new tool to better analyze ASB. PMID:23194258
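
    The per-dataset evidence that such a model integrates can be illustrated with a simple binomial test of allelic imbalance at one heterozygous SNP (the read counts are invented; iASeq's contribution is the hierarchical sharing of imbalance patterns across datasets, not this test):

    ```python
    from scipy.stats import binomtest

    counts = [(14, 4), (11, 5), (9, 8)]   # (ref reads, alt reads) in 3 datasets
    for ref, alt in counts:
        p = binomtest(ref, ref + alt, 0.5).pvalue  # test against 50/50 null
        print(f"ref={ref:3d} alt={alt:3d}  two-sided p = {p:.3f}")
    ```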

  8. Terminal location planning in intermodal transportation with Bayesian inference method.

    DOT National Transportation Integrated Search

    2014-01-01

    In this project, we consider the planning of terminal locations for intermodal transportation systems. For a given number of potential terminals and coexisted multiple service pairs, we find the set of appropriate terminals and their locations that p...

  9. P values are only an index to evidence: 20th- vs. 21st-century statistical science.

    PubMed

    Burnham, K P; Anderson, D R

    2014-03-01

    Early statistical methods focused on pre-data probability statements (i.e., data as random variables) such as P values; these are not really inferences, nor are P values evidential. Statistical science clung to these principles throughout much of the 20th century as a wide variety of methods were developed for special cases. Looking back, it is clear that the underlying paradigm (i.e., testing and P values) was weak. As Kuhn (1970) suggests, new paradigms have taken the place of earlier ones: this is a goal of good science. New methods have been developed and older methods extended and these allow proper measures of strength of evidence and multimodel inference. It is time to move forward with sound theory and practice for the difficult practical problems that lie ahead. Given data, the useful foundation shifts to post-data probability statements such as model probabilities (Akaike weights) or related quantities such as odds ratios and likelihood intervals. These new methods allow formal inference from multiple models in the a priori set. These quantities are properly evidential. The past century was aimed at finding the "best" model and making inferences from it. The goal in the 21st century is to base inference on all the models weighted by their model probabilities (model averaging). Estimates of precision can include model selection uncertainty leading to variances conditional on the model set. The 21st century will be about the quantification of information, proper measures of evidence, and multi-model inference. Nelder (1999:261) concludes, "The most important task before us in developing statistical science is to demolish the P-value culture, which has taken root to a frightening extent in many areas of both pure and applied science and technology".
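
    A worked example of the advocated post-data quantities, Akaike weights and a model-averaged estimate (the AIC values and per-model estimates are illustrative numbers, not from any real analysis):

    ```python
    import numpy as np

    def akaike_weights(aics):
        aics = np.asarray(aics, dtype=float)
        delta = aics - aics.min()        # AIC differences from the best model
        w = np.exp(-0.5 * delta)
        return w / w.sum()               # model probabilities

    aic = [102.4, 103.1, 107.9, 115.0]   # four candidate models
    theta_hat = [0.62, 0.58, 0.49, 0.77] # each model's estimate of a parameter
    w = akaike_weights(aic)
    print("weights:", np.round(w, 3))
    print("model-averaged estimate:", np.dot(w, theta_hat))
    ```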

  10. Biological intuition in alignment-free methods: response to Posada.

    PubMed

    Ragan, Mark A; Chan, Cheong Xin

    2013-08-01

    A recent editorial in Journal of Molecular Evolution highlights opportunities and challenges facing molecular evolution in the era of next-generation sequencing. Abundant sequence data should allow more-complex models to be fit at higher confidence, making phylogenetic inference more reliable and improving our understanding of evolution at the molecular level. However, concern that approaches based on multiple sequence alignment may be computationally infeasible for large datasets is driving the development of so-called alignment-free methods for sequence comparison and phylogenetic inference. The recent editorial characterized these approaches as model-free, not based on the concept of homology, and lacking in biological intuition. We argue here that alignment-free methods have not abandoned models or homology, and can be biologically intuitive.
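
    As a rough illustration of the flavor of alignment-free comparison (a generic k-mer profile distance, not any specific method discussed in the editorial):

        from collections import Counter
        from math import sqrt

        def kmer_profile(seq, k=3):
            # Count all overlapping k-mers in the sequence.
            return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

        def cosine_distance(p, q):
            # 1 minus the cosine similarity of two k-mer count vectors.
            keys = set(p) | set(q)
            dot = sum(p[key] * q[key] for key in keys)
            norm_p = sqrt(sum(v * v for v in p.values()))
            norm_q = sqrt(sum(v * v for v in q.values()))
            return 1 - dot / (norm_p * norm_q)

        a = kmer_profile("ATGGCGTGCA")
        b = kmer_profile("ATGGCTTGCA")
        print(cosine_distance(a, b))   # small distance for similar sequences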

  11. Presenting evidence and summary measures to best inform societal decisions when comparing multiple strategies.

    PubMed

    Eckermann, Simon; Willan, Andrew R

    2011-07-01

    Multiple strategy comparisons in health technology assessment (HTA) are becoming increasingly important, with multiple alternative therapeutic actions, combinations of therapies and diagnostic and genetic testing alternatives. Comparison under uncertainty of incremental cost, effects and cost effectiveness across more than two strategies is conceptually and practically very different from that for two strategies, where all evidence can be summarized in a single bivariate distribution on the incremental cost-effectiveness plane. Alternative methods for comparing multiple strategies in HTA have been developed in (i) presenting cost and effects on the cost-disutility plane and (ii) summarizing evidence with multiple strategy cost-effectiveness acceptability (CEA) and expected net loss (ENL) curves and frontiers. However, critical questions remain for the analyst and decision maker of how these techniques can be best employed across multiple strategies to (i) inform clinical and cost inference in presenting evidence, and (ii) summarize evidence of cost effectiveness to inform societal reimbursement decisions where preferences may be risk neutral or somewhat risk-averse under the Arrow-Lind theorem. We critically consider how evidence across multiple strategies can be best presented and summarized to inform inference and societal reimbursement decisions, given currently available methods. In the process, we make a number of important original findings. First, in presenting evidence for multiple strategies, the joint distribution of costs and effects on the cost-disutility plane, with associated flexible comparators varying across replicates for cost and effect axes, ensures full cost and effect inference. Such inference is usually confounded on the cost-effectiveness plane with comparison relative to a fixed origin and axes. Second, in summarizing evidence for risk-neutral societal decision making, ENL curves and frontiers are shown to have advantages over the CEA frontier in directly presenting differences in expected net benefit (ENB). The CEA frontier, while identifying strategies that maximize ENB, only presents their probability of maximizing net benefit (NB) and, hence, fails to explain why strategies maximize ENB at any given threshold value. Third, in summarizing evidence for somewhat risk-averse societal decision making, trade-offs between the strategy maximizing ENB and other potentially optimal strategies with higher probability of maximizing NB should be presented over discrete threshold values where they arise. However, the probabilities informing these trade-offs and associated discrete threshold value regions should be derived from bilateral CEA curves to prevent confounding by other strategies inherent in multiple strategy CEA curves. Based on these findings, a series of recommendations is made for best presenting and summarizing cost-effectiveness evidence for reimbursement decisions when comparing multiple strategies, which are contrasted with advice for comparing two strategies. Implications for joint research and reimbursement decisions are also discussed.
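
    A small Monte Carlo sketch, with hypothetical cost-effect replicates for three strategies and an assumed threshold value, of why ENB/ENL summaries and CEAC-style probabilities need not rank strategies the same way:

        import numpy as np
        rng = np.random.default_rng(1)

        # Hypothetical bootstrap replicates of (cost, effect) for 3 strategies.
        n = 5000
        costs   = np.stack([rng.normal(m, 150, n) for m in (1000, 1400, 2000)])
        effects = np.stack([rng.normal(m, 0.4, n) for m in (1.0, 1.3, 1.5)])

        lam = 1500.0                  # assumed threshold value per unit of effect
        nb = lam * effects - costs    # net benefit: replicate-by-strategy

        enb = nb.mean(axis=1)                                      # expected NB
        p_max = np.bincount(nb.argmax(axis=0), minlength=3) / n    # CEAC-style
        enl = nb.max(axis=0).mean() - enb                          # expected net loss

        print(enb.round(1), p_max.round(3), enl.round(1))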

  12. Ancestral sequence reconstruction in primate mitochondrial DNA: compositional bias and effect on functional inference.

    PubMed

    Krishnan, Neeraja M; Seligmann, Hervé; Stewart, Caro-Beth; De Koning, A P Jason; Pollock, David D

    2004-10-01

    Reconstruction of ancestral DNA and amino acid sequences is an important means of inferring information about past evolutionary events. Such reconstructions suggest changes in molecular function and evolutionary processes over the course of evolution and are used to infer adaptation and convergence. Maximum likelihood (ML) is generally thought to provide relatively accurate reconstructed sequences compared to parsimony, but both methods lead to the inference of multiple directional changes in nucleotide frequencies in primate mitochondrial DNA (mtDNA). To better understand this surprising result, as well as to better understand how parsimony and ML differ, we constructed a series of computationally simple "conditional pathway" methods that differed in the number of substitutions allowed per site along each branch, and we also evaluated the entire Bayesian posterior frequency distribution of reconstructed ancestral states. We analyzed primate mitochondrial cytochrome b (Cyt-b) and cytochrome oxidase subunit I (COI) genes and found that ML reconstructs ancestral frequencies that are often more different from tip sequences than are parsimony reconstructions. In contrast, frequency reconstructions based on the posterior ensemble more closely resemble extant nucleotide frequencies. Simulations indicate that these differences in ancestral sequence inference are probably due to deterministic bias caused by high uncertainty in the optimization-based ancestral reconstruction methods (parsimony, ML, Bayesian maximum a posteriori). In contrast, ancestral nucleotide frequencies based on an average of the Bayesian set of credible ancestral sequences are much less biased. The methods involving simpler conditional pathway calculations have slightly reduced likelihood values compared to full likelihood calculations, but they can provide fairly unbiased nucleotide reconstructions and may be useful in more complex phylogenetic analyses than considered here due to their speed and flexibility. To determine whether biased reconstructions using optimization methods might affect inferences of functional properties, ancestral primate mitochondrial tRNA sequences were inferred and helix-forming propensities for conserved pairs were evaluated in silico. For ambiguously reconstructed nucleotides at sites with high base composition variability, ancestral tRNA sequences from Bayesian analyses were more compatible with canonical base pairing than were those inferred by other methods. Thus, nucleotide bias in reconstructed sequences apparently can lead to serious bias and inaccuracies in functional predictions.
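
    A toy numerical sketch of the bias mechanism, using hypothetical per-site posterior probabilities: keeping only the single most probable base at each site skews the reconstructed composition, whereas averaging over the posterior recovers the expected frequencies.

        import numpy as np
        rng = np.random.default_rng(0)

        # Hypothetical per-site posteriors over A, C, G, T at an ancestral node.
        post = rng.dirichlet(alpha=[2.0, 1.5, 1.0, 0.5], size=1000)

        # MAP-style reconstruction: keep only the most probable base per site.
        map_freq = np.bincount(post.argmax(axis=1), minlength=4) / len(post)

        # Average over the credible set of reconstructions instead.
        avg_freq = post.mean(axis=0)

        print("MAP:", map_freq.round(3))   # composition pulled toward favored bases
        print("avg:", avg_freq.round(3))   # matches the expected composition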

  13. Pairing field methods to improve inference in wildlife surveys while accommodating detection covariance

    USGS Publications Warehouse

    Clare, John; McKinney, Shawn T.; DePue, John E.; Loftin, Cynthia S.

    2017-01-01

    It is common to use multiple field sampling methods when implementing wildlife surveys to compare method efficacy or cost efficiency, integrate distinct pieces of information provided by separate methods, or evaluate method-specific biases and misclassification error. Existing models that combine information from multiple field methods or sampling devices permit rigorous comparison of method-specific detection parameters, enable estimation of additional parameters such as false-positive detection probability, and improve occurrence or abundance estimates, but with the assumption that the separate sampling methods produce detections independently of one another. This assumption is tenuous if methods are paired or deployed in close proximity simultaneously, a common practice that reduces the additional effort required to implement multiple methods and reduces the risk that differences between method-specific detection parameters are confounded by other environmental factors. We develop occupancy and spatial capture–recapture models that permit covariance between the detections produced by different methods, use simulation to compare estimator performance of the new models to models assuming independence, and provide an empirical application based on American marten (Martes americana) surveys using paired remote cameras, hair catches, and snow tracking. Simulation results indicate existing models that assume that methods independently detect organisms produce biased parameter estimates and substantially understate estimate uncertainty when this assumption is violated, while our reformulated models are robust to either methodological independence or covariance. Empirical results suggested that remote cameras and snow tracking had comparable probability of detecting present martens, but that snow tracking also produced false-positive marten detections that could potentially substantially bias distribution estimates if not corrected for. Remote cameras detected marten individuals more readily than passive hair catches. Inability to photographically distinguish individual sex did not appear to induce negative bias in camera density estimates; instead, hair catches appeared to produce detection competition between individuals that may have been a source of negative bias. Our model reformulations broaden the range of circumstances in which analyses incorporating multiple sources of information can be robustly used, and our empirical results demonstrate that using multiple field-methods can enhance inferences regarding ecological parameters of interest and improve understanding of how reliably survey methods sample these parameters.

  14. Road Traffic Anomaly Detection via Collaborative Path Inference from GPS Snippets

    PubMed Central

    Wang, Hongtao; Wen, Hui; Yi, Feng; Zhu, Hongsong; Sun, Limin

    2017-01-01

    Road traffic anomaly denotes a road segment that is anomalous in terms of traffic flow of vehicles. Detecting road traffic anomalies from GPS (Global Positioning System) snippet data is becoming critical in urban computing since they often suggest underlying events. However, the noisy and sparse nature of GPS snippet data introduces multiple problems that make the detection of road traffic anomalies very challenging. To address these issues, we propose a two-stage solution which consists of two components: a Collaborative Path Inference (CPI) model and a Road Anomaly Test (RAT) model. The CPI model performs path inference incorporating both static and dynamic features into a Conditional Random Field (CRF). Dynamic context features are learned collaboratively from large GPS snippets via a tensor decomposition technique. Then the RAT model calculates the anomalous degree for each road segment from the inferred fine-grained trajectories in given time intervals. We evaluated our method using a large-scale real-world dataset, which includes one month of GPS location data from more than eight thousand taxicabs in Beijing. The evaluation results show the advantages of our method over other baseline techniques. PMID:28282948

  15. Integrating Information in Biological Ontologies and Molecular Networks to Infer Novel Terms

    PubMed Central

    Li, Le; Yip, Kevin Y.

    2016-01-01

    Currently most terms and term-term relationships in Gene Ontology (GO) are defined manually, which creates cost, consistency and completeness issues. Recent studies have demonstrated the feasibility of inferring GO automatically from biological networks, which represents an important complementary approach to GO construction. These methods (NeXO and CliXO) are unsupervised, which means 1) they cannot use the information contained in existing GO, 2) the way they integrate biological networks may not optimize the accuracy, and 3) they are not customized to infer the three different sub-ontologies of GO. Here we present a semi-supervised method called Unicorn that extends these previous methods to tackle the three problems. Unicorn uses a sub-tree of an existing GO sub-ontology as a training set to learn parameters for integrating multiple networks. Cross-validation results show that Unicorn reliably inferred the left-out parts of each specific GO sub-ontology. In addition, by training Unicorn with an old version of GO together with biological networks, it successfully re-discovered some terms and term-term relationships present only in a new version of GO. Unicorn also successfully inferred some novel terms that were not contained in GO but have biological meanings well-supported by the literature. Availability: Source code of Unicorn is available at http://yiplab.cse.cuhk.edu.hk/unicorn/. PMID:27976738

  16. Binary variable multiple-model multiple imputation to address missing data mechanism uncertainty: Application to a smoking cessation trial

    PubMed Central

    Siddique, Juned; Harel, Ofer; Crespi, Catherine M.; Hedeker, Donald

    2014-01-01

    The true missing data mechanism is never known in practice. We present a method for generating multiple imputations for binary variables that formally incorporates missing data mechanism uncertainty. Imputations are generated from a distribution of imputation models rather than a single model, with the distribution reflecting subjective notions of missing data mechanism uncertainty. Parameter estimates and standard errors are obtained using rules for nested multiple imputation. Using simulation, we investigate the impact of missing data mechanism uncertainty on post-imputation inferences and show that incorporating this uncertainty can increase the coverage of parameter estimates. We apply our method to a longitudinal smoking cessation trial where nonignorably missing data were a concern. Our method provides a simple approach for formalizing subjective notions regarding nonresponse and can be implemented using existing imputation software. PMID:24634315
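
    A minimal sketch of the two-stage draw, assuming a normal prior on a log-odds offset (delta) that encodes mechanism uncertainty, with delta = 0 corresponding to ignorable nonresponse; a real analysis would condition on covariates and pool results using nested multiple imputation rules.

        import numpy as np
        rng = np.random.default_rng(7)

        x_obs = rng.binomial(1, 0.3, size=200)   # observed binary outcomes (hypothetical)
        n_missing = 50
        log_odds = np.log(x_obs.mean() / (1 - x_obs.mean()))

        nested_imputations = []
        for m in range(5):                 # draw an imputation model...
            delta = rng.normal(0.0, 0.5)   # ...whose offset encodes mechanism uncertainty
            p_m = 1 / (1 + np.exp(-(log_odds + delta)))
            for j in range(2):             # nested imputations within each model
                nested_imputations.append(rng.binomial(1, p_m, n_missing))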

  17. A Kernel Embedding-Based Approach for Nonstationary Causal Model Inference.

    PubMed

    Hu, Shoubo; Chen, Zhitang; Chan, Laiwan

    2018-05-01

    Although nonstationary data are more common in the real world, most existing causal discovery methods do not take nonstationarity into consideration. In this letter, we propose a kernel embedding-based approach, ENCI, for nonstationary causal model inference where data are collected from multiple domains with varying distributions. In ENCI, we transform the complicated relation of a cause-effect pair into a linear model of variables whose observations correspond to the kernel embeddings of the cause and effect distributions in different domains. In this way, we are able to estimate the causal direction by exploiting the causal asymmetry of the transformed linear model. Furthermore, we extend ENCI to causal graph discovery for multiple variables by transforming the relations among them into a linear non-Gaussian acyclic model. We show that by exploiting the nonstationarity of distributions, both cause-effect pairs and two kinds of causal graphs are identifiable under mild conditions. Experiments on synthetic and real-world data are conducted to justify the efficacy of ENCI over major existing methods.

  18. MEGA-CC: computing core of molecular evolutionary genetics analysis program for automated and iterative data analysis.

    PubMed

    Kumar, Sudhir; Stecher, Glen; Peterson, Daniel; Tamura, Koichiro

    2012-10-15

    There is a growing need in the research community to apply the molecular evolutionary genetics analysis (MEGA) software tool for batch processing a large number of datasets and to integrate it into analysis workflows. Therefore, we now make available the computing core of the MEGA software as a stand-alone executable (MEGA-CC), along with an analysis prototyper (MEGA-Proto). MEGA-CC provides users with access to all the computational analyses available through MEGA's graphical user interface version. This includes methods for multiple sequence alignment, substitution model selection, evolutionary distance estimation, phylogeny inference, substitution rate and pattern estimation, tests of natural selection and ancestral sequence inference. Additionally, we have upgraded the source code for phylogenetic analysis using the maximum likelihood methods for parallel execution on multiple processors and cores. Here, we describe MEGA-CC and outline the steps for using MEGA-CC in tandem with MEGA-Proto for iterative and automated data analysis. http://www.megasoftware.net/.

  19. Inferring gene and protein interactions using PubMed citations and consensus Bayesian networks.

    PubMed

    Deeter, Anthony; Dalman, Mark; Haddad, Joseph; Duan, Zhong-Hui

    2017-01-01

    The PubMed database offers an extensive set of publication data that can be useful, yet inherently complex to use without automated computational techniques. Data repositories such as the Genomic Data Commons (GDC) and the Gene Expression Omnibus (GEO) offer experimental data storage and retrieval as well as curated gene expression profiles. Genetic interaction databases, including Reactome and Ingenuity Pathway Analysis, offer pathway and experiment data analysis using data curated from these publications and data repositories. We have created a method to generate and analyze consensus networks, inferring potential gene interactions, using large numbers of Bayesian networks generated by data mining publications in the PubMed database. Through the concept of network resolution, these consensus networks can be tailored to represent possible genetic interactions. We designed a set of experiments to confirm that our method is stable across variation in both sample and topological input sizes. Using gene product interactions from the KEGG pathway database and data mining PubMed publication abstracts, we verify that regardless of the network resolution or the inferred consensus network, our method is capable of inferring meaningful gene interactions through consensus Bayesian network generation with multiple, randomized topological orderings. Our method can not only confirm the existence of currently accepted interactions, but has the potential to hypothesize new ones as well. We show our method confirms the existence of known gene interactions such as JAK-STAT-PI3K-AKT-mTOR, infers novel gene interactions such as RAS-Bcl-2 and RAS-AKT, and finds significant pathway-pathway interactions between the JAK-STAT signaling and Cardiac Muscle Contraction KEGG pathways.
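
    A minimal sketch of one plausible consensus step, treating network resolution as an edge-frequency threshold (an interpretation for illustration; the paper's definition may differ) over hypothetical edge sets:

        from collections import Counter

        # Hypothetical directed edge sets from Bayesian networks learned under
        # different randomized topological orderings.
        networks = [
            {("JAK", "STAT"), ("STAT", "PI3K"), ("PI3K", "AKT")},
            {("JAK", "STAT"), ("PI3K", "AKT"), ("AKT", "mTOR")},
            {("JAK", "STAT"), ("STAT", "PI3K"), ("AKT", "mTOR")},
        ]

        edge_counts = Counter(e for net in networks for e in net)
        resolution = 2 / 3   # keep edges present in at least this fraction of networks
        consensus = {e for e, c in edge_counts.items()
                     if c / len(networks) >= resolution}
        print(sorted(consensus))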

  20. Multiple player tracking in sports video: a dual-mode two-way bayesian inference approach with progressive observation modeling.

    PubMed

    Xing, Junliang; Ai, Haizhou; Liu, Liwei; Lao, Shihong

    2011-06-01

    Multiple object tracking (MOT) is a very challenging task yet of fundamental importance for many practical applications. In this paper, we focus on the problem of tracking multiple players in sports video which is even more difficult due to the abrupt movements of players and their complex interactions. To handle the difficulties in this problem, we present a new MOT algorithm which contributes both in the observation modeling level and in the tracking strategy level. For the observation modeling, we develop a progressive observation modeling process that is able to provide strong tracking observations and greatly facilitate the tracking task. For the tracking strategy, we propose a dual-mode two-way Bayesian inference approach which dynamically switches between an offline general model and an online dedicated model to deal with single isolated object tracking and multiple occluded object tracking integrally by forward filtering and backward smoothing. Extensive experiments on different kinds of sports videos, including football, basketball, as well as hockey, demonstrate the effectiveness and efficiency of the proposed method.

  1. Mediation Analysis with Multiple Mediators

    PubMed Central

    VanderWeele, T.J.; Vansteelandt, S.

    2014-01-01

    Recent advances in the causal inference literature on mediation have extended traditional approaches to direct and indirect effects to settings that allow for interactions and non-linearities. In this paper, these approaches from causal inference are further extended to settings in which multiple mediators may be of interest. Two analytic approaches, one based on regression and one based on weighting are proposed to estimate the effect mediated through multiple mediators and the effects through other pathways. The approaches proposed here accommodate exposure-mediator interactions and, to a certain extent, mediator-mediator interactions as well. The methods handle binary or continuous mediators and binary, continuous or count outcomes. When the mediators affect one another, the strategy of trying to assess direct and indirect effects one mediator at a time will in general fail; the approach given in this paper can still be used. A characterization is moreover given as to when the sum of the mediated effects for multiple mediators considered separately will be equal to the mediated effect of all of the mediators considered jointly. The approach proposed in this paper is robust to unmeasured common causes of two or more mediators. PMID:25580377

  2. Mediation Analysis with Multiple Mediators.

    PubMed

    VanderWeele, T J; Vansteelandt, S

    2014-01-01

    Recent advances in the causal inference literature on mediation have extended traditional approaches to direct and indirect effects to settings that allow for interactions and non-linearities. In this paper, these approaches from causal inference are further extended to settings in which multiple mediators may be of interest. Two analytic approaches, one based on regression and one based on weighting are proposed to estimate the effect mediated through multiple mediators and the effects through other pathways. The approaches proposed here accommodate exposure-mediator interactions and, to a certain extent, mediator-mediator interactions as well. The methods handle binary or continuous mediators and binary, continuous or count outcomes. When the mediators affect one another, the strategy of trying to assess direct and indirect effects one mediator at a time will in general fail; the approach given in this paper can still be used. A characterization is moreover given as to when the sum of the mediated effects for multiple mediators considered separately will be equal to the mediated effect of all of the mediators considered jointly. The approach proposed in this paper is robust to unmeasured common causes of two or more mediators.
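
    For reference, in the simplest linear, no-interaction special case the regression approach reduces to product-of-coefficients formulas for K mediators (a sketch only; the estimators in the paper also cover interactions, non-linearities, and weighting):

        % Linear, no-interaction special case with mediators M_1, ..., M_K:
        E[Y \mid a, m, c] = \theta_0 + \theta_1 a + \sum_{j=1}^{K} \theta_{2j} m_j + \theta_4' c
        E[M_j \mid a, c] = \beta_{0j} + \beta_{1j} a + \beta_{2j}' c
        % Natural direct and joint natural indirect effects for a change a^* \to a:
        \mathrm{NDE} = \theta_1 (a - a^*), \qquad
        \mathrm{NIE} = \Big( \sum_{j=1}^{K} \theta_{2j} \beta_{1j} \Big) (a - a^*)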

  3. Gene regulatory network inference using fused LASSO on multiple data sets

    PubMed Central

    Omranian, Nooshin; Eloundou-Mbebi, Jeanne M. O.; Mueller-Roeber, Bernd; Nikoloski, Zoran

    2016-01-01

    Devising computational methods to accurately reconstruct gene regulatory networks given gene expression data is key to systems biology applications. Here we propose a method for reconstructing gene regulatory networks by simultaneous consideration of data sets from different perturbation experiments and corresponding controls. The method imposes three biologically meaningful constraints: (1) expression levels of each gene should be explained by the expression levels of a small number of transcription factor coding genes, (2) networks inferred from different data sets should be similar with respect to the type and number of regulatory interactions, and (3) relationships between genes which exhibit similar differential behavior over the considered perturbations should be favored. We demonstrate that these constraints can be transformed in a fused LASSO formulation for the proposed method. The comparative analysis on transcriptomics time-series data from prokaryotic species, Escherichia coli and Mycobacterium tuberculosis, as well as a eukaryotic species, mouse, demonstrated that the proposed method has the advantages of the most recent approaches for regulatory network inference, while obtaining better performance and assigning higher scores to the true regulatory links. The study indicates that the combination of sparse regression techniques with other biologically meaningful constraints is a promising framework for gene regulatory network reconstructions. PMID:26864687
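
    Schematically, the first two constraints correspond to an objective of roughly the following form, where beta^(d) collects the transcription-factor-to-target coefficients for dataset d (an illustrative sketch; the paper's exact formulation, including the third constraint on similarly behaving genes, may differ):

        \min_{\beta^{(1)}, \ldots, \beta^{(D)}} \;
        \sum_{d=1}^{D} \big\| y^{(d)} - X^{(d)} \beta^{(d)} \big\|_2^2
        \; + \; \lambda_1 \sum_{d=1}^{D} \big\| \beta^{(d)} \big\|_1
        \; + \; \lambda_2 \sum_{d < d'} \big\| \beta^{(d)} - \beta^{(d')} \big\|_1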

  4. Clustering Genes of Common Evolutionary History

    PubMed Central

    Gori, Kevin; Suchan, Tomasz; Alvarez, Nadir; Goldman, Nick; Dessimoz, Christophe

    2016-01-01

    Phylogenetic inference can potentially result in a more accurate tree using data from multiple loci. However, if the loci are incongruent—due to events such as incomplete lineage sorting or horizontal gene transfer—it can be misleading to infer a single tree. To address this, many previous contributions have taken a mechanistic approach, by modeling specific processes. Alternatively, one can cluster loci without assuming how these incongruencies might arise. Such “process-agnostic” approaches typically infer a tree for each locus and cluster these. There are, however, many possible combinations of tree distance and clustering methods; their comparative performance in the context of tree incongruence is largely unknown. Furthermore, because standard model selection criteria such as AIC cannot be applied to problems with a variable number of topologies, the issue of inferring the optimal number of clusters is poorly understood. Here, we perform a large-scale simulation study of phylogenetic distances and clustering methods to infer loci of common evolutionary history. We observe that the best-performing combinations are distances accounting for branch lengths followed by spectral clustering or Ward’s method. We also introduce two statistical tests to infer the optimal number of clusters and show that they strongly outperform the silhouette criterion, a general-purpose heuristic. We illustrate the usefulness of the approach by 1) identifying errors in a previous phylogenetic analysis of yeast species and 2) identifying topological incongruence among newly sequenced loci of the globeflower fly genus Chiastocheta. We release treeCl, a new program to cluster genes of common evolutionary history (http://git.io/treeCl). PMID:26893301
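
    A minimal sketch of the second stage of such a pipeline, spectral clustering applied to a (here randomly generated) matrix of pairwise tree distances; treeCl itself first computes branch-length-aware distances between per-locus trees.

        import numpy as np
        from sklearn.cluster import SpectralClustering

        # Hypothetical symmetric matrix of pairwise distances between gene trees.
        rng = np.random.default_rng(3)
        n = 12
        d = rng.uniform(0.2, 1.0, (n, n))
        d = (d + d.T) / 2
        np.fill_diagonal(d, 0.0)

        # Convert distances to affinities with a Gaussian kernel, then cluster.
        sigma = d[np.triu_indices(n, 1)].mean()
        affinity = np.exp(-(d ** 2) / (2 * sigma ** 2))
        labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                                    random_state=0).fit_predict(affinity)
        print(labels)   # cluster assignment per locus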

  5. DEVELOPMENTAL PALEOBIOLOGY OF THE VERTEBRATE SKELETON.

    PubMed

    Rücklin, Martin; Donoghue, Philip C J; Cunningham, John A; Marone, Federica; Stampanoni, Marco

    2014-07-01

    Studies of the development of organisms can reveal crucial information on homology of structures. Developmental data are not peculiar to living organisms, and they are routinely preserved in the mineralized tissues that comprise the vertebrate skeleton, allowing us to obtain direct insight into the developmental evolution of this most formative of vertebrate innovations. The pattern of developmental processes is recorded in fossils as successive stages inferred from the gross morphology of multiple specimens and, more reliably and routinely, through the ontogenetic stages of development seen in the skeletal histology of individuals. Traditional techniques are destructive and restricted to a 2-D plane with the third dimension inferred. Effective non-invasive methods of visualizing paleohistology to reconstruct developmental stages of the skeleton are necessary. In a brief survey of paleohistological techniques we discuss the pros and cons of these methods. The use of tomographic methods to reconstruct development of organs is exemplified by the study of the placoderm dentition. Testing evidence for the presence of teeth in placoderms, the first jawed vertebrates, we compare the methods that have been used. These include inferring the development from morphology, and using serial sectioning, microCT or synchrotron X-ray tomographic microscopy (SRXTM) to reconstruct growth stages and directions of growth. The ensuing developmental interpretations are biased by the methods and degree of inference. The most direct and reliable method is using SRXTM data to trace sclerochronology. The resulting developmental data can be used to resolve homology and test hypotheses on the origin of evolutionary novelties.

  6. Attenuation and source properties at the Coso Geothermal area, California

    USGS Publications Warehouse

    Hough, S.E.; Lees, J.M.; Monastero, F.

    1999-01-01

    We use a multiple-empirical Green's function method to determine source properties of small (M -0.4 to 1.3) earthquakes and P- and S-wave attenuation at the Coso Geothermal Field, California. Source properties of a previously identified set of clustered events from the Coso geothermal region are first analyzed using an empirical Green's function (EGF) method. Stress-drop values of at least 0.5-1 MPa are inferred for all of the events; in many cases, the corner frequency is outside the usable bandwidth, and the stress drop can only be constrained as being higher than 3 MPa. P- and S-wave stress-drop estimates are identical to within the resolution limits of the data. These results are indistinguishable from numerous EGF studies of M 2-5 earthquakes, suggesting a similarity in rupture processes that extends to events that are both tiny and induced, providing further support for Byerlee's Law. Whole-path Q estimates for P and S waves are determined using the multiple-empirical Green's function (MEGF) method of Hough (1997), whereby spectra from clusters of colocated events at a given station are inverted for a single attenuation parameter, t*, with source parameters constrained from EGF analysis. The t* estimates, which we infer to be resolved to within 0.01 sec or better, exhibit almost as much scatter as a function of hypocentral distance as do values from previous single-spectrum studies for which much higher uncertainties in individual t* estimates are expected. The variability in t* estimates determined here therefore suggests real lateral variability in Q structure. Although the ray-path coverage is too sparse to yield a complete three-dimensional attenuation tomographic image, we invert the inferred t* values for three-dimensional structure using a damped least-squares method, and the results do reveal significant lateral variability in Q structure. The inferred attenuation variability corresponds to the heat-flow variations within the geothermal region. A central low-Q region corresponds well with the central high heat-flow region; additional detailed structure is also suggested.
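
    A toy sketch of the spectral-ratio logic behind EGF-type attenuation estimates: with a whole-path attenuation factor exp(-pi f t*), t* falls out of a line fit to the log spectral ratio (the spectra below are synthetic; the actual MEGF inversion jointly fits clusters of events with source terms).

        import numpy as np

        # Synthetic amplitude spectra: 'egf' stands in for the source spectrum,
        # 'target' is the same source seen through whole-path attenuation.
        f = np.linspace(2.0, 30.0, 60)            # frequency, Hz
        t_star_true = 0.04                        # attenuation parameter, s
        egf = 1.0 / (1.0 + (f / 10.0) ** 2)       # toy source shape
        target = egf * np.exp(-np.pi * f * t_star_true)

        # ln(target/egf) = -pi * t* * f, so t* comes from a line fit.
        slope = np.polyfit(f, np.log(target / egf), 1)[0]
        print(-slope / np.pi)                     # recovers ~0.04 s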

  7. Fundamental limits on dynamic inference from single-cell snapshots

    PubMed Central

    Weinreb, Caleb; Tusi, Betsabeh K.; Socolovsky, Merav

    2018-01-01

    Single-cell expression profiling reveals the molecular states of individual cells with unprecedented detail. Because these methods destroy cells in the process of analysis, they cannot measure how gene expression changes over time. However, some information on dynamics is present in the data: the continuum of molecular states in the population can reflect the trajectory of a typical cell. Many methods for extracting single-cell dynamics from population data have been proposed. However, all such attempts face a common limitation: for any measured distribution of cell states, there are multiple dynamics that could give rise to it, and by extension, multiple possibilities for underlying mechanisms of gene regulation. Here, we describe the aspects of gene expression dynamics that cannot be inferred from a static snapshot alone and identify assumptions necessary to constrain a unique solution for cell dynamics from static snapshots. We translate these constraints into a practical algorithmic approach, population balance analysis (PBA), which makes use of a method from spectral graph theory to solve a class of high-dimensional differential equations. We use simulations to show the strengths and limitations of PBA, and then apply it to single-cell profiles of hematopoietic progenitor cells (HPCs). Cell state predictions from this analysis agree with HPC fate assays reported in several papers over the past two decades. By highlighting the fundamental limits on dynamic inference faced by any method, our framework provides a rigorous basis for dynamic interpretation of a gene expression continuum and clarifies best experimental designs for trajectory reconstruction from static snapshot measurements. PMID:29463712
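
    A rough sketch of the spectral-graph step under strong simplifying assumptions: build a kNN graph over cell states and solve a discrete balance equation L V = R for a potential V, given assumed net source/sink rates R; actual PBA derives its operator from a diffusion-drift model of the underlying dynamics.

        import numpy as np
        from sklearn.neighbors import kneighbors_graph
        from scipy.sparse.csgraph import laplacian

        rng = np.random.default_rng(0)
        X = rng.normal(size=(300, 20))            # hypothetical cell-state snapshot

        # kNN graph over cell states, symmetrized, and its graph Laplacian.
        A = kneighbors_graph(X, n_neighbors=10, mode="connectivity")
        A = ((A + A.T) > 0).astype(float).toarray()
        L = laplacian(A)

        # Assumed net source/sink rates per cell (proliferation minus loss);
        # they must sum to zero for the balance equation to be solvable.
        R = rng.normal(size=300)
        R -= R.mean()
        V = np.linalg.pinv(L) @ R                 # descending V indicates inferred flow
        print(V[:5].round(3))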

  8. Inferences about landbird abundance from count data: recent advances and future directions

    USGS Publications Warehouse

    Nichols, J.D.; Thomas, L.; Conn, P.B.; Thomson, David L.; Cooch, Evan G.; Conroy, Michael J.

    2009-01-01

    We summarize results of a November 2006 workshop dealing with recent research on the estimation of landbird abundance from count data. Our conceptual framework includes a decomposition of the probability of detecting a bird potentially exposed to sampling efforts into four separate probabilities. Primary inference methods are described and include distance sampling, multiple observers, time of detection, and repeated counts. The detection parameters estimated by these different approaches differ, leading to different interpretations of resulting estimates of density and abundance. Simultaneous use of combinations of these different inference approaches can not only lead to increased precision but also provide the ability to decompose components of the detection process. Recent efforts to test the efficacy of these different approaches using natural systems and a new bird radio test system provide sobering conclusions about the ability of observers to detect and localize birds in auditory surveys. Recent research is reported on efforts to deal with such potential sources of error as bird misclassification, measurement error, and density gradients. Methods for inference about spatial and temporal variation in avian abundance are outlined. Discussion topics include opinions about the need to estimate detection probability when drawing inference about avian abundance, methodological recommendations based on the current state of knowledge, and suggestions for future research.

  9. Bayesian Inference for Time Trends in Parameter Values using Weighted Evidence Sets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    D. L. Kelly; A. Malkhasyan

    2010-09-01

    There is a nearly ubiquitous assumption in PSA that parameter values are at least piecewise-constant in time. As a result, Bayesian inference tends to incorporate many years of plant operation, over which there have been significant changes in plant operational and maintenance practices, plant management, etc. These changes can cause significant changes in parameter values over time; however, failure to perform Bayesian inference in the proper time-dependent framework can mask these changes. Failure to question the assumption of constant parameter values, and failure to perform Bayesian inference in the proper time-dependent framework were noted as important issues in NUREG/CR-6813, performed for the U. S. Nuclear Regulatory Commission’s Advisory Committee on Reactor Safeguards in 2003. That report noted that “industry lacks tools to perform time-trend analysis with Bayesian updating.” This paper describes an application of time-dependent Bayesian inference methods developed for the European Commission Ageing PSA Network. These methods utilize open-source software, implementing Markov chain Monte Carlo sampling. The paper also illustrates an approach to incorporating multiple sources of data via applicability weighting factors that address differences in key influences, such as vendor, component boundaries, conditions of the operating environment, etc.
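
    A minimal sketch of time-trend inference by MCMC, assuming a loglinear Poisson rate with flat priors and hypothetical counts; the work described here uses dedicated open-source MCMC software and real plant data.

        import numpy as np
        rng = np.random.default_rng(42)

        # Hypothetical yearly event counts and exposure (e.g., reactor-years).
        years = np.arange(10)
        counts = np.array([5, 4, 6, 3, 4, 2, 3, 1, 2, 1])
        exposure = np.full(10, 100.0)

        def log_post(a, b):
            lam = np.exp(a + b * years) * exposure      # loglinear trend in the rate
            return np.sum(counts * np.log(lam) - lam)   # Poisson log likelihood, flat priors

        # Random-walk Metropolis over (a, b).
        theta = np.array([-3.0, 0.0])
        samples = []
        for _ in range(20000):
            prop = theta + rng.normal(0, [0.05, 0.01])
            if np.log(rng.uniform()) < log_post(*prop) - log_post(*theta):
                theta = prop
            samples.append(theta.copy())

        b_draws = np.array(samples)[5000:, 1]           # discard burn-in
        print(f"P(decreasing trend) ~ {np.mean(b_draws < 0):.2f}")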

  10. Bayesian Inference for Time Trends in Parameter Values: Case Study for the Ageing PSA Network of the European Commission

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Dana L. Kelly; Albert Malkhasyan

    2010-06-01

    There is a nearly ubiquitous assumption in PSA that parameter values are at least piecewise-constant in time. As a result, Bayesian inference tends to incorporate many years of plant operation, over which there have been significant changes in plant operational and maintenance practices, plant management, etc. These changes can cause significant changes in parameter values over time; however, failure to perform Bayesian inference in the proper time-dependent framework can mask these changes. Failure to question the assumption of constant parameter values, and failure to perform Bayesian inference in the proper time-dependent framework were noted as important issues in NUREG/CR-6813, performed for the U. S. Nuclear Regulatory Commission’s Advisory Committee on Reactor Safeguards in 2003. That report noted that “industry lacks tools to perform time-trend analysis with Bayesian updating.” This paper describes an application of time-dependent Bayesian inference methods developed for the European Commission Ageing PSA Network. These methods utilize open-source software, implementing Markov chain Monte Carlo sampling. The paper also illustrates the development of a generic prior distribution, which incorporates multiple sources of generic data via weighting factors that address differences in key influences, such as vendor, component boundaries, conditions of the operating environment, etc.

  11. Simultaneous Inference Procedures for Means.

    ERIC Educational Resources Information Center

    Krishnaiah, P. R.

    Some aspects of simultaneous tests for means are reviewed. Specifically, the comparison of univariate or multivariate normal populations based on the values of the means or mean vectors when the variances or covariance matrices are equal is discussed. Tukey's and Dunnett's tests for multiple comparisons of means, Scheffe's method of examining…
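
    For instance, Tukey's procedure for all pairwise comparisons of means, one of the methods reviewed here, is available in SciPy (the group data below are hypothetical):

        from scipy.stats import tukey_hsd

        # Hypothetical samples from three groups.
        g1 = [24.5, 23.9, 26.1, 25.0]
        g2 = [28.4, 27.2, 29.0, 28.1]
        g3 = [24.9, 25.4, 23.8, 24.2]

        # All pairwise mean comparisons with familywise error control.
        res = tukey_hsd(g1, g2, g3)
        print(res)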

  12. Constructing an integrated gene similarity network for the identification of disease genes.

    PubMed

    Tian, Zhen; Guo, Maozu; Wang, Chunyu; Xing, LinLin; Wang, Lei; Zhang, Yin

    2017-09-20

    Discovering novel genes that are involved in human diseases is a challenging task in biomedical research. In recent years, several computational approaches have been proposed to prioritize candidate disease genes. Most of these methods are mainly based on protein-protein interaction (PPI) networks. However, since these PPI networks contain false positives and cover less than half of known human genes, their reliability and coverage are very low. Therefore, it is highly necessary to fuse multiple genomic data to construct a credible gene similarity network and then infer disease genes on the whole genomic scale. We proposed a novel method, named RWRB, to infer causal genes of diseases of interest. First, we construct five individual gene (protein) similarity networks based on multiple genomic data of human genes. Then, an integrated gene similarity network (IGSN) is reconstructed based on the similarity network fusion (SNF) method. Finally, we employ the random walk with restart algorithm on the phenotype-gene bilayer network, which combines the phenotype similarity network, the IGSN, and the phenotype-gene association network, to prioritize candidate disease genes. We investigate the effectiveness of RWRB through leave-one-out cross-validation in inferring phenotype-gene relationships. Results show that RWRB is more accurate than state-of-the-art methods on most evaluation metrics. Further analysis shows that the success of RWRB benefits from the IGSN, which has wider coverage and higher reliability compared with current PPI networks. Moreover, we conduct a comprehensive case study for Alzheimer's disease and predict some novel disease genes supported by literature. RWRB is an effective and reliable algorithm for prioritizing candidate disease genes on the genomic scale. Software and supplementary information are available at http://nclab.hit.edu.cn/~tianzhen/RWRB/ .
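
    A minimal sketch of the random-walk-with-restart core on a tiny hypothetical gene network; RWRB itself runs the walk on the full phenotype-gene bilayer network.

        import numpy as np

        def random_walk_with_restart(W, seeds, r=0.7, tol=1e-8):
            """W: column-normalized adjacency of the gene similarity network.
            seeds: indicator vector of known disease genes."""
            p0 = seeds / seeds.sum()
            p = p0.copy()
            while True:
                p_next = (1 - r) * W @ p + r * p0
                if np.abs(p_next - p).sum() < tol:
                    return p_next
                p = p_next

        # Tiny hypothetical network of five genes.
        A = np.array([[0, 1, 1, 0, 0],
                      [1, 0, 1, 1, 0],
                      [1, 1, 0, 0, 1],
                      [0, 1, 0, 0, 1],
                      [0, 0, 1, 1, 0]], dtype=float)
        W = A / A.sum(axis=0)                        # column-normalize
        scores = random_walk_with_restart(W, seeds=np.array([1., 0, 0, 0, 0]))
        print(scores.round(3))                       # proximity to the seed gene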

  13. An empirical evaluation of two-stage species tree inference strategies using a multilocus dataset from North American pines

    PubMed Central

    2014-01-01

    Background: As it becomes increasingly possible to obtain DNA sequences of orthologous genes from diverse sets of taxa, species trees are frequently being inferred from multilocus data. However, the behavior of many methods for performing this inference has remained largely unexplored. Some methods have been proven to be consistent given certain evolutionary models, whereas others rely on criteria that, although appropriate for many parameter values, have peculiar zones of the parameter space in which they fail to converge on the correct estimate as data sets increase in size. Results: Here, using North American pines, we empirically evaluate the behavior of 24 strategies for species tree inference using three alternative outgroups (72 strategies total). The data consist of 120 individuals sampled in eight ingroup species from subsection Strobus and three outgroup species from subsection Gerardianae, spanning ∼47 kilobases of sequence at 121 loci. Each “strategy” for inferring species trees consists of three features: a species tree construction method, a gene tree inference method, and a choice of outgroup. We use multivariate analysis techniques such as principal components analysis and hierarchical clustering to identify tree characteristics that are robustly observed across strategies, as well as to identify groups of strategies that produce trees with similar features. We find that strategies that construct species trees using only topological information cluster together and that strategies that use additional non-topological information (e.g., branch lengths) also cluster together. Strategies that utilize more than one individual within a species to infer gene trees tend to produce estimates of species trees that contain clades present in trees estimated by other strategies. Strategies that use the minimize-deep-coalescences criterion to construct species trees tend to produce species tree estimates that contain clades that are not present in trees estimated by the Concatenation, RTC, SMRT, STAR, and STEAC methods, and that in general are more balanced than those inferred by these other strategies. Conclusions: When constructing a species tree from a multilocus set of sequences, our observations provide a basis for interpreting differences in species tree estimates obtained via different approaches that have a two-stage structure in common, one step for gene tree estimation and a second step for species tree estimation. The methods explored here employ a number of distinct features of the data, and our analysis suggests that recovery of the same results from multiple methods that tend to differ in their patterns of inference can be a valuable tool for obtaining reliable estimates. PMID:24678701

  14. An empirical evaluation of two-stage species tree inference strategies using a multilocus dataset from North American pines.

    PubMed

    DeGiorgio, Michael; Syring, John; Eckert, Andrew J; Liston, Aaron; Cronn, Richard; Neale, David B; Rosenberg, Noah A

    2014-03-29

    As it becomes increasingly possible to obtain DNA sequences of orthologous genes from diverse sets of taxa, species trees are frequently being inferred from multilocus data. However, the behavior of many methods for performing this inference has remained largely unexplored. Some methods have been proven to be consistent given certain evolutionary models, whereas others rely on criteria that, although appropriate for many parameter values, have peculiar zones of the parameter space in which they fail to converge on the correct estimate as data sets increase in size. Here, using North American pines, we empirically evaluate the behavior of 24 strategies for species tree inference using three alternative outgroups (72 strategies total). The data consist of 120 individuals sampled in eight ingroup species from subsection Strobus and three outgroup species from subsection Gerardianae, spanning ∼47 kilobases of sequence at 121 loci. Each "strategy" for inferring species trees consists of three features: a species tree construction method, a gene tree inference method, and a choice of outgroup. We use multivariate analysis techniques such as principal components analysis and hierarchical clustering to identify tree characteristics that are robustly observed across strategies, as well as to identify groups of strategies that produce trees with similar features. We find that strategies that construct species trees using only topological information cluster together and that strategies that use additional non-topological information (e.g., branch lengths) also cluster together. Strategies that utilize more than one individual within a species to infer gene trees tend to produce estimates of species trees that contain clades present in trees estimated by other strategies. Strategies that use the minimize-deep-coalescences criterion to construct species trees tend to produce species tree estimates that contain clades that are not present in trees estimated by the Concatenation, RTC, SMRT, STAR, and STEAC methods, and that in general are more balanced than those inferred by these other strategies. When constructing a species tree from a multilocus set of sequences, our observations provide a basis for interpreting differences in species tree estimates obtained via different approaches that have a two-stage structure in common, one step for gene tree estimation and a second step for species tree estimation. The methods explored here employ a number of distinct features of the data, and our analysis suggests that recovery of the same results from multiple methods that tend to differ in their patterns of inference can be a valuable tool for obtaining reliable estimates.

  15. Inferring gene and protein interactions using PubMed citations and consensus Bayesian networks

    PubMed Central

    Dalman, Mark; Haddad, Joseph; Duan, Zhong-Hui

    2017-01-01

    The PubMed database offers an extensive set of publication data that can be useful, yet inherently complex to use without automated computational techniques. Data repositories such as the Genomic Data Commons (GDC) and the Gene Expression Omnibus (GEO) offer experimental data storage and retrieval as well as curated gene expression profiles. Genetic interaction databases, including Reactome and Ingenuity Pathway Analysis, offer pathway and experiment data analysis using data curated from these publications and data repositories. We have created a method to generate and analyze consensus networks, inferring potential gene interactions, using large numbers of Bayesian networks generated by data mining publications in the PubMed database. Through the concept of network resolution, these consensus networks can be tailored to represent possible genetic interactions. We designed a set of experiments to confirm that our method is stable across variation in both sample and topological input sizes. Using gene product interactions from the KEGG pathway database and data mining PubMed publication abstracts, we verify that regardless of the network resolution or the inferred consensus network, our method is capable of inferring meaningful gene interactions through consensus Bayesian network generation with multiple, randomized topological orderings. Our method can not only confirm the existence of currently accepted interactions, but has the potential to hypothesize new ones as well. We show our method confirms the existence of known gene interactions such as JAK-STAT-PI3K-AKT-mTOR, infers novel gene interactions such as RAS-Bcl-2 and RAS-AKT, and finds significant pathway-pathway interactions between the JAK-STAT signaling and Cardiac Muscle Contraction KEGG pathways. PMID:29049295

  16. Use of genetic data to infer population-specific ecological and phenotypic traits from mixed aggregations

    USGS Publications Warehouse

    Moran, Paul; Bromaghin, Jeffrey F.; Masuda, Michele

    2014-01-01

    Many applications in ecological genetics involve sampling individuals from a mixture of multiple biological populations and subsequently associating those individuals with the populations from which they arose. Analytical methods that assign individuals to their putative population of origin have utility in both basic and applied research, providing information about population-specific life history and habitat use, ecotoxins, pathogen and parasite loads, and many other non-genetic ecological or phenotypic traits. Although the question is initially directed at the origin of individuals, in most cases the ultimate desire is to investigate the distribution of some trait among populations. Current practice is to assign individuals to a population of origin and study properties of the trait among individuals within population strata as if they constituted independent samples. It seemed that this approach might bias population-specific trait inference. In this study we made trait inferences directly through modeling, bypassing individual assignment. We extended a Bayesian model for population mixture analysis to incorporate parameters for the phenotypic trait and compared its performance to that of individual assignment with a minimum probability threshold for assignment. The Bayesian mixture model outperformed individual assignment under some trait inference conditions. However, by discarding individuals whose origins are most uncertain, the individual assignment method provided a less complex analytical technique whose performance may be adequate for some common trait inference problems. Our results provide specific guidance for method selection under various genetic relationships among populations with different trait distributions.

  17. Use of Genetic Data to Infer Population-Specific Ecological and Phenotypic Traits from Mixed Aggregations

    PubMed Central

    Moran, Paul; Bromaghin, Jeffrey F.; Masuda, Michele

    2014-01-01

    Many applications in ecological genetics involve sampling individuals from a mixture of multiple biological populations and subsequently associating those individuals with the populations from which they arose. Analytical methods that assign individuals to their putative population of origin have utility in both basic and applied research, providing information about population-specific life history and habitat use, ecotoxins, pathogen and parasite loads, and many other non-genetic ecological or phenotypic traits. Although the question is initially directed at the origin of individuals, in most cases the ultimate desire is to investigate the distribution of some trait among populations. Current practice is to assign individuals to a population of origin and study properties of the trait among individuals within population strata as if they constituted independent samples. It seemed that this approach might bias population-specific trait inference. In this study we made trait inferences directly through modeling, bypassing individual assignment. We extended a Bayesian model for population mixture analysis to incorporate parameters for the phenotypic trait and compared its performance to that of individual assignment with a minimum probability threshold for assignment. The Bayesian mixture model outperformed individual assignment under some trait inference conditions. However, by discarding individuals whose origins are most uncertain, the individual assignment method provided a less complex analytical technique whose performance may be adequate for some common trait inference problems. Our results provide specific guidance for method selection under various genetic relationships among populations with different trait distributions. PMID:24905464

  18. Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion.

    PubMed

    Gebru, Israel D; Ba, Sileye; Li, Xiaofei; Horaud, Radu

    2018-05-01

    Speaker diarization consists of assigning speech signals to people engaged in a dialogue. An audio-visual spatiotemporal diarization model is proposed. The model is well suited for challenging scenarios that consist of several participants engaged in multi-party interaction while they move around and turn their heads towards the other participants rather than facing the cameras and the microphones. Multiple-person visual tracking is combined with multiple speech-source localization in order to tackle the speech-to-person association problem. The latter is solved within a novel audio-visual fusion method on the following grounds: binaural spectral features are first extracted from a microphone pair, then a supervised audio-visual alignment technique maps these features onto an image, and finally a semi-supervised clustering method assigns binaural spectral features to visible persons. The main advantage of this method over previous work is that it processes in a principled way speech signals uttered simultaneously by multiple persons. The diarization itself is cast into a latent-variable temporal graphical model that infers speaker identities and speech turns, based on the output of an audio-visual association process, executed at each time slice, and on the dynamics of the diarization variable itself. The proposed formulation yields an efficient exact inference procedure. A novel dataset is introduced that contains audio-visual training data as well as a number of scenarios involving several participants engaged in formal and informal dialogue. The proposed method is thoroughly tested and benchmarked with respect to several state-of-the-art diarization algorithms.

  19. Occupancy Estimation and Modeling : Inferring Patterns and Dynamics of Species Occurrence

    USGS Publications Warehouse

    MacKenzie, D.I.; Nichols, J.D.; Royle, J. Andrew; Pollock, K.H.; Bailey, L.L.; Hines, J.E.

    2006-01-01

    This is the first book to examine the latest methods in analyzing presence/absence data surveys. Using four classes of models (single-species, single-season; single-species, multiple-season; multiple-species, single-season; and multiple-species, multiple-season), the authors discuss the practical sampling situation, present a likelihood-based model enabling direct estimation of the occupancy-related parameters while allowing for imperfect detectability, and make recommendations for designing studies using these models. It provides authoritative insights into the latest in estimation modeling; discusses multiple models which lay the groundwork for future study designs; addresses critical issues of imperfect detectability and its effects on estimation; and explores in detail the role of probability in estimation.

  20. LDAP: a web server for lncRNA-disease association prediction.

    PubMed

    Lan, Wei; Li, Min; Zhao, Kaijie; Liu, Jin; Wu, Fang-Xiang; Pan, Yi; Wang, Jianxin

    2017-02-01

    Increasing evidence has demonstrated that long noncoding RNAs (lncRNAs) play important roles in many human diseases. Therefore, predicting novel lncRNA-disease associations would contribute to dissecting the complex mechanisms of disease pathogenesis. Some computational methods have been developed to infer lncRNA-disease associations. However, most of these methods infer lncRNA-disease associations based on only a single data resource. In this paper, we propose a new computational method to predict lncRNA-disease associations by integrating multiple biological data resources. Then, we implement this method as a web server for lncRNA-disease association prediction (LDAP). The input of the LDAP server is the lncRNA sequence. LDAP predicts potential lncRNA-disease associations by using a bagging SVM classifier based on lncRNA similarity and disease similarity. The web server is available at http://bioinformatics.csu.edu.cn/ldap. Contact: jxwang@mail.csu.edu.cn. Supplementary data are available at Bioinformatics online.

  1. Inferred Eccentricity and Period Distributions of Kepler Eclipsing Binaries

    NASA Astrophysics Data System (ADS)

    Prsa, Andrej; Matijevic, G.

    2014-01-01

    Determining the underlying eccentricity and orbital period distributions from an observed sample of eclipsing binary stars is not a trivial task. Shen and Turner (2008) have shown that the commonly used maximum likelihood estimators are biased toward larger eccentricities and do not describe the underlying distribution correctly; orbital periods suffer from a similar bias. Hogg, Myers and Bovy (2010) proposed a hierarchical probabilistic method for inferring the true eccentricity distribution of exoplanet orbits that uses the likelihood functions for individual star eccentricities. The authors show that proper inference outperforms the simple histogramming of the best-fit eccentricity values. We apply this method to the complete sample of eclipsing binary stars observed by the Kepler mission (Prsa et al. 2011) to derive the unbiased underlying eccentricity and orbital period distributions. These distributions can be used for studies of multiple star formation and dynamical evolution, and they can serve as a drop-in replacement for the ad hoc prior distributions used in the exoplanet field for determining false positive occurrence rates.
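
    A minimal sketch of the hierarchical trick of Hogg, Myers and Bovy: approximate the population-level marginal likelihood by averaging an assumed population density over each system's posterior samples (valid when the individual fits used a uniform prior on eccentricity); all numbers below are synthetic.

        import numpy as np
        from scipy.stats import beta
        from scipy.optimize import minimize

        rng = np.random.default_rng(5)

        # Synthetic per-binary eccentricity posterior samples, shape
        # (n_binaries, n_samples), as if drawn from individual light-curve fits.
        e_samples = rng.beta(1.2, 4.0, size=(200, 300)) + rng.normal(0, 0.02, (200, 300))
        e_samples = np.clip(e_samples, 1e-4, 1 - 1e-4)

        def neg_log_marglik(params):
            a, b = np.exp(params)   # population distribution Beta(a, b)
            # Average the population density over each system's samples.
            per_star = beta.pdf(e_samples, a, b).mean(axis=1)
            return -np.log(per_star).sum()

        fit = minimize(neg_log_marglik, x0=[0.0, 1.0], method="Nelder-Mead")
        print(np.exp(fit.x))   # inferred population (a, b), not a histogram of best fits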

  2. Edge-directed inference for microaneurysms detection in digital fundus images

    NASA Astrophysics Data System (ADS)

    Huang, Ke; Yan, Michelle; Aviyente, Selin

    2007-03-01

    Microaneurysm (MA) detection is a critical step in diabetic retinopathy screening, since MAs are the earliest visible warning of potential future problems. A variety of algorithms have been proposed for MA detection in mass screening. The core technology in most existing methods is a directional mathematical morphological operation called the "Top-Hat" filter, which requires multiple filtering operations at each pixel. Background structure, uneven illumination and noise often cause confusion between MAs and some non-MA structures and limit the applicability of the filter. In this paper, a novel detection framework based on edge-directed inference is proposed for MA detection. Candidate MA regions are first delineated from the edge map of a fundus image. Features measuring shape, brightness and contrast are extracted for each candidate MA region to better separate false detections from true MAs. Algorithmic analysis and empirical evaluation reveal that the proposed edge-directed inference outperforms the "Top-Hat" based algorithm in both detection accuracy and computational speed.
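
    For context, the "Top-Hat" baseline can be sketched with scikit-image. MAs appear as small dark blobs in the green channel, so a black top-hat with a footprint slightly larger than an MA highlights them. The isotropic disk here is a simplification (the filter discussed in the paper is directional, using linear structuring elements at multiple orientations), and the file name, footprint size and threshold are illustrative:

      import numpy as np
      from skimage import io, morphology

      fundus = io.imread("fundus.png")                 # hypothetical input image
      green = fundus[:, :, 1].astype(float)            # MAs contrast best in the green channel

      # Black top-hat: closing(image) - image, which highlights dark structures
      # smaller than the footprint (a disk a bit larger than a typical MA).
      tophat = morphology.black_tophat(green, morphology.disk(5))
      candidates = tophat > np.percentile(tophat, 99)  # crude threshold for candidate MA pixels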

  3. A Systematic Bayesian Integration of Epidemiological and Genetic Data

    PubMed Central

    Lau, Max S. Y.; Marion, Glenn; Streftaris, George; Gibson, Gavin

    2015-01-01

    Genetic sequence data on pathogens have great potential to inform inference of their transmission dynamics, ultimately leading to better disease control. Where genetic change and disease transmission occur on comparable timescales, additional information can be inferred via the joint analysis of such genetic sequence data and epidemiological observations based on clinical symptoms and diagnostic tests. Although recently introduced approaches represent substantial progress, for computational reasons they approximate genuine joint inference of disease dynamics and genetic change in the pathogen population, capturing the joint epidemiological-evolutionary dynamics only partially. Improved methods are needed to fully integrate such genetic data with epidemiological observations, to achieve more robust inference of the transmission tree and other key epidemiological parameters such as latent periods. Here, building on current literature, a novel Bayesian framework is proposed that infers simultaneously and explicitly the transmission tree and unobserved transmitted pathogen sequences. Our framework facilitates the use of realistic likelihood functions and enables systematic and genuine joint inference of the epidemiological-evolutionary process from partially observed outbreaks. Using simulated data, it is shown that this approach is able to accurately infer joint epidemiological-evolutionary dynamics, even when pathogen sequences and epidemiological data are incomplete, and when sequences are available for only a fraction of exposures. These results also characterise and quantify the value of incomplete and partial sequence data, which has important implications for sampling design, and demonstrate the ability of the introduced method to identify multiple clusters within an outbreak. The framework is used to analyse an outbreak of foot-and-mouth disease in the UK, enhancing current understanding of its transmission dynamics and evolutionary process. PMID:26599399

  4. The effects of inference method, population sampling, and gene sampling on species tree inferences: an empirical study in slender salamanders (Plethodontidae: Batrachoseps).

    PubMed

    Jockusch, Elizabeth L; Martínez-Solano, Iñigo; Timpe, Elizabeth K

    2015-01-01

    Species tree methods are now widely used to infer the relationships among species from multilocus data sets. Many methods have been developed, which differ in whether gene and species trees are estimated simultaneously or sequentially, and in how gene trees are used to infer the species tree. While these methods perform well on simulated data, less is known about what impacts their performance on empirical data. We used a data set including five nuclear genes and one mitochondrial gene for 22 species of Batrachoseps to compare the effects of method of analysis, within-species sampling and gene sampling on species tree inferences. For this data set, the choice of inference method had the largest effect on the species tree topology. Exclusion of individual loci had large effects in *BEAST and STEM, but not in MP-EST. Different loci carried the greatest leverage in these different methods, showing that the causes of their disproportionate effects differ. Even though substantial information was present in the nuclear loci, the mitochondrial gene dominated the *BEAST species tree. This leverage is inherent to the mtDNA locus and results from its high variation and lower assumed ploidy. This mtDNA leverage may be problematic when mtDNA has undergone introgression, as is likely in this data set. By contrast, the leverage of RAG1 in STEM analyses does not reflect properties inherent to the locus, but rather results from a gene tree that is strongly discordant with all others, and is best explained by introgression between distantly related species. Within-species sampling was also important, especially in *BEAST analyses, as shown by differences in tree topology across 100 subsampled data sets. Despite the sensitivity of the species tree methods to multiple factors, five species groups, the relationships among these, and some relationships within them, are generally consistently resolved for Batrachoseps. © The Author(s) 2014. Published by Oxford University Press, on behalf of the Society of Systematic Biologists. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

  5. Haplotype Reconstruction in Large Pedigrees with Many Untyped Individuals

    NASA Astrophysics Data System (ADS)

    Li, Xin; Li, Jing

    Haplotypes, as they specify the linkage patterns between dispersed genetic variations, provide important information for understanding the genetics of human traits. However, haplotypes are not directly available from current genotyping platforms, and hence there are extensive investigations of computational methods to recover such information. Two major computational challenges arising in current family-based disease studies are large family sizes and many ungenotyped family members. Traditional haplotyping methods can handle neither large families nor families with missing members. In this paper, we propose a method which addresses these issues by integrating multiple novel techniques. The method consists of three major components: pairwise identical-by-descent (IBD) inference, global IBD reconstruction and haplotype restoring. By reconstructing the global IBD of a family from pairwise IBD and then restoring the haplotypes based on the inferred IBD, this method can scale to large pedigrees and, more importantly, it can handle families with missing members. Compared with existing methods, this method demonstrates much higher power to recover haplotype information, especially in families with many untyped individuals.

  6. Multivariate inference of pathway activity in host immunity and response to therapeutics

    PubMed Central

    Goel, Gautam; Conway, Kara L.; Jaeger, Martin; Netea, Mihai G.; Xavier, Ramnik J.

    2014-01-01

    Developing a quantitative view of how biological pathways are regulated in response to environmental factors is central to understanding disease phenotypes. We present a computational framework, named Multivariate Inference of Pathway Activity (MIPA), which quantifies the degree of activity induced in a biological pathway by computing five distinct measures from transcriptomic profiles of its member genes. The statistical significance of inferred activity is examined using multiple independent self-contained tests followed by a competitive analysis. The method incorporates a new algorithm to identify a subset of genes that may regulate the extent of activity induced in a pathway. We present an in-depth evaluation of the specificity, robustness, and reproducibility of our method. We benchmarked MIPA's false positive rate at less than 1%. Using transcriptomic profiles representing distinct physiological and disease states, we illustrate the applicability of our method in (i) identifying gene–gene interactions in autophagy-dependent response to Salmonella infection, (ii) uncovering gene–environment interactions in host response to bacterial and viral pathogens and (iii) identifying driver genes and processes that contribute to wound healing and response to anti-TNFα therapy. We provide relevant experimental validation that corroborates the accuracy and advantage of our method. PMID:25147207

  7. Road Traffic Anomaly Detection via Collaborative Path Inference from GPS Snippets.

    PubMed

    Wang, Hongtao; Wen, Hui; Yi, Feng; Zhu, Hongsong; Sun, Limin

    2017-03-09

    A road traffic anomaly denotes a road segment that is anomalous in terms of the traffic flow of vehicles. Detecting road traffic anomalies from GPS (Global Positioning System) snippet data is becoming critical in urban computing, since such anomalies often suggest underlying events. However, the noisy and sparse nature of GPS snippet data introduces multiple problems that make the detection of road traffic anomalies very challenging. To address these issues, we propose a two-stage solution which consists of two components: a Collaborative Path Inference (CPI) model and a Road Anomaly Test (RAT) model. The CPI model performs path inference incorporating both static and dynamic features into a Conditional Random Field (CRF). Dynamic context features are learned collaboratively from large GPS snippets via a tensor decomposition technique. The RAT model then calculates the anomaly degree of each road segment from the inferred fine-grained trajectories in given time intervals. We evaluated our method using a large-scale real-world dataset, which includes one month of GPS location data from more than eight thousand taxi cabs in Beijing. The evaluation results show the advantages of our method over other baseline techniques.

  8. Deep Learning for Population Genetic Inference.

    PubMed

    Sheehan, Sara; Song, Yun S

    2016-03-01

    Given genomic variation data from multiple individuals, computing the likelihood of complex population genetic models is often infeasible. To circumvent this problem, we introduce a novel likelihood-free inference framework by applying deep learning, a powerful modern technique in machine learning. Deep learning makes use of multilayer neural networks to learn a feature-based function from the input (e.g., hundreds of correlated summary statistics of data) to the output (e.g., population genetic parameters of interest). We demonstrate that deep learning can be effectively employed for population genetic inference and learning informative features of data. As a concrete application, we focus on the challenging problem of jointly inferring natural selection and demography (in the form of a population size change history). Our method is able to separate the global nature of demography from the local nature of selection, without sequential steps for these two factors. Studying demography and selection jointly is motivated by Drosophila, where pervasive selection confounds demographic analysis. We apply our method to 197 African Drosophila melanogaster genomes from Zambia to infer both their overall demography, and regions of their genome under selection. We find many regions of the genome that have experienced hard sweeps, and fewer under selection on standing variation (soft sweep) or balancing selection. Interestingly, we find that soft sweeps and balancing selection occur more frequently closer to the centromere of each chromosome. In addition, our demographic inference suggests that previously estimated bottlenecks for African Drosophila melanogaster are too extreme.
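
    A minimal sketch of the core idea, mapping summary statistics to parameters with a feed-forward network; scikit-learn's MLPRegressor stands in for the paper's deep network, and the simulator output below is a random placeholder:

      import numpy as np
      from sklearn.neural_network import MLPRegressor
      from sklearn.preprocessing import StandardScaler

      rng = np.random.default_rng(3)
      n_sims, n_stats = 5000, 100
      theta = rng.uniform(0, 1, size=(n_sims, 2))     # e.g., (selection strength, size-change ratio)
      # Placeholder for simulator output: correlated summary statistics per simulated dataset.
      stats = theta @ rng.standard_normal((2, n_stats)) + 0.1 * rng.standard_normal((n_sims, n_stats))

      scaler = StandardScaler().fit(stats)
      net = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500)
      net.fit(scaler.transform(stats), theta)               # learn stats -> parameters
      theta_hat = net.predict(scaler.transform(stats[:5]))  # infer parameters for new data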

  9. Deep Learning for Population Genetic Inference

    PubMed Central

    Sheehan, Sara; Song, Yun S.

    2016-01-01

    Given genomic variation data from multiple individuals, computing the likelihood of complex population genetic models is often infeasible. To circumvent this problem, we introduce a novel likelihood-free inference framework by applying deep learning, a powerful modern technique in machine learning. Deep learning makes use of multilayer neural networks to learn a feature-based function from the input (e.g., hundreds of correlated summary statistics of data) to the output (e.g., population genetic parameters of interest). We demonstrate that deep learning can be effectively employed for population genetic inference and learning informative features of data. As a concrete application, we focus on the challenging problem of jointly inferring natural selection and demography (in the form of a population size change history). Our method is able to separate the global nature of demography from the local nature of selection, without sequential steps for these two factors. Studying demography and selection jointly is motivated by Drosophila, where pervasive selection confounds demographic analysis. We apply our method to 197 African Drosophila melanogaster genomes from Zambia to infer both their overall demography, and regions of their genome under selection. We find many regions of the genome that have experienced hard sweeps, and fewer under selection on standing variation (soft sweep) or balancing selection. Interestingly, we find that soft sweeps and balancing selection occur more frequently closer to the centromere of each chromosome. In addition, our demographic inference suggests that previously estimated bottlenecks for African Drosophila melanogaster are too extreme. PMID:27018908

  10. Pairing field methods to improve inference in wildlife surveys while accommodating detection covariance.

    PubMed

    Clare, John; McKinney, Shawn T; DePue, John E; Loftin, Cynthia S

    2017-10-01

    It is common to use multiple field sampling methods when implementing wildlife surveys to compare method efficacy or cost efficiency, integrate distinct pieces of information provided by separate methods, or evaluate method-specific biases and misclassification error. Existing models that combine information from multiple field methods or sampling devices permit rigorous comparison of method-specific detection parameters, enable estimation of additional parameters such as false-positive detection probability, and improve occurrence or abundance estimates, but with the assumption that the separate sampling methods produce detections independently of one another. This assumption is tenuous if methods are paired or deployed in close proximity simultaneously, a common practice that reduces the additional effort required to implement multiple methods and reduces the risk that differences between method-specific detection parameters are confounded by other environmental factors. We develop occupancy and spatial capture-recapture models that permit covariance between the detections produced by different methods, use simulation to compare estimator performance of the new models to models assuming independence, and provide an empirical application based on American marten (Martes americana) surveys using paired remote cameras, hair catches, and snow tracking. Simulation results indicate that existing models assuming methods independently detect organisms produce biased parameter estimates and substantially understate estimate uncertainty when this assumption is violated, while our reformulated models are robust to either methodological independence or covariance. Empirical results suggested that remote cameras and snow tracking had comparable probabilities of detecting martens that were present, but that snow tracking also produced false-positive marten detections that could substantially bias distribution estimates if not corrected for. Remote cameras detected marten individuals more readily than passive hair catches. Inability to photographically distinguish individual sex did not appear to induce negative bias in camera density estimates; instead, hair catches appeared to produce detection competition between individuals that may have been a source of negative bias. Our model reformulations broaden the range of circumstances in which analyses incorporating multiple sources of information can be robustly used, and our empirical results demonstrate that using multiple field methods can enhance inferences regarding ecological parameters of interest and improve understanding of how reliably survey methods sample these parameters. © 2017 by the Ecological Society of America.

  11. Prediction system of hydroponic plant growth and development using algorithm Fuzzy Mamdani method

    NASA Astrophysics Data System (ADS)

    Sudana, I. Made; Purnawirawan, Okta; Arief, Ulfa Mediaty

    2017-03-01

    Hydroponics is a method of farming without soil. One hydroponic plant is watercress (Nasturtium officinale). The growth and development of hydroponic watercress are influenced by nutrient levels, acidity and temperature. These independent variables can be used as system inputs to predict the level of plant growth and development. The prediction system uses the Mamdani fuzzy inference method. The system was built with the Fuzzy Inference System (FIS) functionality of the Fuzzy Logic Toolbox (FLT) in MATLAB R2007b. An FIS is a computing system that works on the principle of fuzzy reasoning, which is similar to human reasoning. An FIS consists of four units: a fuzzification unit, a fuzzy reasoning unit, a knowledge base unit and a defuzzification unit. The effect of the independent variables on plant growth and development can be visualized with the three-dimensional FIS output surface diagram, and the prediction system's output was checked with statistical tests based on multiple linear regression, including regression analysis, t-tests, F-tests, the coefficient of determination and predictor contributions, calculated with the SPSS (Statistical Product and Service Solutions) software.
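
    The four FIS units can be sketched in plain Python for a single-input, single-output rule base; the membership functions, rules and values below are invented for illustration, not the paper's actual rule base:

      import numpy as np

      def tri(x, a, b, c):
          """Triangular membership function rising from a, peaking at b, falling to c."""
          return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

      def mamdani_predict(nutrient):
          """Mamdani FIS with one input (nutrient level, ppm) and one output (growth, %)."""
          y = np.linspace(0, 100, 501)                      # output universe
          # 1) Fuzzification: membership of the crisp input in each input set.
          low, high = tri(nutrient, 0, 200, 800), tri(nutrient, 400, 1000, 1600)
          # 2-3) Reasoning over the knowledge base (min implication, max aggregation):
          #      IF nutrient low THEN growth poor; IF nutrient high THEN growth good.
          poor, good = tri(y, 0, 20, 60), tri(y, 40, 80, 120)
          aggregated = np.maximum(np.minimum(low, poor), np.minimum(high, good))
          # 4) Defuzzification by centroid.
          return np.sum(y * aggregated) / np.sum(aggregated)

      print(mamdani_predict(1000.0))                        # crisp predicted growth level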

  12. Using directed information for influence discovery in interconnected dynamical systems

    NASA Astrophysics Data System (ADS)

    Rao, Arvind; Hero, Alfred O.; States, David J.; Engel, James Douglas

    2008-08-01

    Structure discovery in non-linear dynamical systems is an important and challenging problem that arises in various applications such as computational neuroscience, econometrics, and biological network discovery. Each of these systems has multiple interacting variables, and the key problem is the inference of the underlying structure of the system (which variables are connected to which others) based on the output observations (such as multiple time trajectories of the variables). Since such applications demand the inference of directed relationships among variables in these non-linear systems, current methods that assume linear structure or yield undirected variable dependencies are insufficient. Hence, in this work, we present a methodology for structure discovery using an information-theoretic metric called directed time information (DTI). Using both synthetic dynamical systems as well as true biological datasets (kidney development and T-cell data), we demonstrate the utility of DTI in such problems.
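
    A drastically simplified, linear-Gaussian surrogate for a directed influence score (not the paper's DTI estimator) can be computed from regression residual variances: the score is positive when the past of x improves prediction of y beyond y's own past:

      import numpy as np

      def directed_score(x, y):
          """0.5 * log(var(y_t | y_{t-1}) / var(y_t | y_{t-1}, x_{t-1})) via least squares;
          a transfer-entropy-like surrogate for directed influence x -> y."""
          yt, y1, x1 = y[1:], y[:-1], x[:-1]
          r_own = yt - np.polyval(np.polyfit(y1, yt, 1), y1)          # y_t ~ y_{t-1}
          A = np.column_stack([y1, x1, np.ones_like(y1)])
          r_joint = yt - A @ np.linalg.lstsq(A, yt, rcond=None)[0]    # y_t ~ y_{t-1} + x_{t-1}
          return 0.5 * np.log(np.var(r_own) / np.var(r_joint))

      # Toy system: x drives y with a one-step lag, but not vice versa.
      rng = np.random.default_rng(4)
      x = rng.standard_normal(2000)
      y = np.zeros(2000)
      for t in range(1, 2000):
          y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.standard_normal()
      print(directed_score(x, y), directed_score(y, x))               # first score >> second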

  13. Causal inference and the data-fusion problem

    PubMed Central

    Bareinboim, Elias; Pearl, Judea

    2016-01-01

    We review concepts, principles, and tools that unify current approaches to causal analysis and attend to new challenges presented by big data. In particular, we address the problem of data fusion—piecing together multiple datasets collected under heterogeneous conditions (i.e., different populations, regimes, and sampling methods) to obtain valid answers to queries of interest. The availability of multiple heterogeneous datasets presents new opportunities to big data analysts, because the knowledge that can be acquired from combined data would not be possible from any individual source alone. However, the biases that emerge in heterogeneous environments require new analytical tools. Some of these biases, including confounding, sampling selection, and cross-population biases, have been addressed in isolation, largely in restricted parametric models. We here present a general, nonparametric framework for handling these biases and, ultimately, a theoretical solution to the problem of data fusion in causal inference tasks. PMID:27382148

  14. A Fiducial Approach to Extremes and Multiple Comparisons

    ERIC Educational Resources Information Center

    Wandler, Damian V.

    2010-01-01

    Generalized fiducial inference is a powerful tool for many difficult problems. Based on an extension of R. A. Fisher's work, we used generalized fiducial inference for two extreme value problems and a multiple comparison procedure. The first extreme value problem is dealing with the generalized Pareto distribution. The generalized Pareto…

  15. HLA Type Inference via Haplotypes Identical by Descent

    NASA Astrophysics Data System (ADS)

    Setty, Manu N.; Gusev, Alexander; Pe'Er, Itsik

    The Human Leukocyte Antigen (HLA) genes play a major role in adaptive immune response and are used to differentiate self antigens from non-self ones. HLA genes are hypervariable, with nearly every locus harboring over a dozen alleles. This variation plays an important role in susceptibility to multiple autoimmune diseases and needs to be matched on for organ transplantation. Unfortunately, HLA typing by serological methods is time consuming and expensive compared to high-throughput Single Nucleotide Polymorphism (SNP) data. We present a new computational method to infer per-locus HLA types using shared segments Identical By Descent (IBD), inferred from SNP genotype data. IBD information is modeled as a graph in which shared haplotypes are explored among clusters of individuals with known and unknown HLA types to identify the latter. We analyze the performance of the method on a previously typed subset of the HapMap population, achieving accuracies of 96% in HLA-A, 94% in HLA-B, 95% in HLA-C, 77% in HLA-DR1, 93% in HLA-DQA1 and 90% in HLA-DQB1 genes. We compare our method to a tag SNP based approach and demonstrate higher sensitivity and specificity. Our method demonstrates the power of using shared haplotype segments for large-scale imputation at the HLA locus.

  16. Trait inferences in goal-directed behavior: ERP timing and localization under spontaneous and intentional processing

    PubMed Central

    Van den Eede, Sofie; Baetens, Kris; Vandekerckhove, Marie

    2009-01-01

    This study measured event-related potentials (ERPs) during multiple goal and trait inferences, under spontaneous or intentional instructions. Participants read sentences describing several goal-implying behaviors of a target person from which a strong trait could also be inferred or not. The last word of each sentence determined the consistency with the inference induced during preceding sentences. In comparison with behaviors that implied only a goal, stronger waveforms beginning at ∼150 ms were obtained when the behaviors additionally implied a trait. These ERPs showed considerable parallels between spontaneous and intentional inferences. This suggests that traits embedded in a stream of goal-directed behaviors were detected more rapidly and automatically than mere goals, irrespective of the spontaneous or intentional instructions given to participants. In line with this, source localization (LORETA) of the ERPs shows activation predominantly in the temporo-parietal junction (TPJ) during 150–200 ms, suggesting that goals were detected in that time interval. During 200–300 ms, activation was stronger at the medial prefrontal cortex (mPFC) for multiple goals and traits as opposed to goals only, suggesting that traits were inferred during this time window. A cued recall measure taken after the presentation of the stimulus material supports the occurrence of goal and trait inferences and shows significant correlations with the neural components, indicating that these components are valid neural indices of spontaneous and intentional social inferences. The early detection of multiple goal and trait inferences is explained in terms of their greater social relevance, leading to privileged attention allocation and processing in the brain. PMID:19270041

  17. Genealogies of rapidly adapting populations

    PubMed Central

    Neher, Richard A.; Hallatschek, Oskar

    2013-01-01

    The genetic diversity of a species is shaped by its recent evolutionary history and can be used to infer demographic events or selective sweeps. Most inference methods are based on the null hypothesis that natural selection is a weak or infrequent evolutionary force. However, many species, particularly pathogens, are under continuous pressure to adapt in response to changing environments. A statistical framework for inference from diversity data of such populations is currently lacking. Towards this goal, we explore the properties of genealogies in a model of continual adaptation in asexual populations. We show that lineages trace back to a small pool of highly fit ancestors, in which almost simultaneous coalescence of more than two lineages frequently occurs. Whereas such multiple mergers are unlikely under the neutral coalescent, they create a unique genetic footprint in adapting populations. The site frequency spectrum of derived neutral alleles, for example, is nonmonotonic and has a peak at high frequencies, whereas Tajima’s D becomes more and more negative with increasing sample size. Because multiple merger coalescents emerge in many models of rapid adaptation, we argue that they should be considered as a null model for adapting populations. PMID:23269838

  18. Thinking too positive? Revisiting current methods of population genetic selection inference.

    PubMed

    Bank, Claudia; Ewing, Gregory B; Ferrer-Admettla, Anna; Foll, Matthieu; Jensen, Jeffrey D

    2014-12-01

    In the age of next-generation sequencing, the availability of increasing amounts and improved quality of data at decreasing cost ought to allow for a better understanding of how natural selection is shaping the genome than ever before. However, alternative forces, such as demography and background selection (BGS), obscure the footprints of positive selection that we would like to identify. In this review, we illustrate recent developments in this area, and outline a roadmap for improved selection inference. We argue (i) that the development and obligatory use of advanced simulation tools is necessary for improved identification of selected loci, (ii) that genomic information from multiple time points will enhance the power of inference, and (iii) that results from experimental evolution should be utilized to better inform population genomic studies. Copyright © 2014 Elsevier Ltd. All rights reserved.

  19. INfORM: Inference of NetwOrk Response Modules.

    PubMed

    Marwah, Veer Singh; Kinaret, Pia Anneli Sofia; Serra, Angela; Scala, Giovanni; Lauerma, Antti; Fortino, Vittorio; Greco, Dario

    2018-06-15

    Detecting and interpreting responsive modules from gene expression data by using network-based approaches is a common but laborious task. It often requires the application of several computational methods implemented in different software packages, forcing biologists to compile complex analytical pipelines. Here we introduce INfORM (Inference of NetwOrk Response Modules), an R shiny application that enables non-expert users to detect, evaluate and select gene modules with high statistical and biological significance. INfORM is a comprehensive tool for the identification of biologically meaningful response modules from consensus gene networks inferred by using multiple algorithms. It is accessible through an intuitive graphical user interface allowing for a level of abstraction from the computational steps. INfORM is freely available for academic use at https://github.com/Greco-Lab/INfORM. Supplementary data are available at Bioinformatics online.

  20. Imputation approaches for animal movement modeling

    USGS Publications Warehouse

    Scharf, Henry; Hooten, Mevin B.; Johnson, Devin S.

    2017-01-01

    The analysis of telemetry data is common in animal ecological studies. While the collection of telemetry data for individual animals has improved dramatically, the methods to properly account for inherent uncertainties (e.g., measurement error, dependence, barriers to movement) have lagged behind. Still, many new statistical approaches have been developed to infer unknown quantities affecting animal movement or predict movement based on telemetry data. Hierarchical statistical models are useful to account for some of the aforementioned uncertainties, as well as provide population-level inference, but they often come with an increased computational burden. For certain types of statistical models, it is straightforward to provide inference if the latent true animal trajectory is known, but challenging otherwise. In these cases, approaches related to multiple imputation have been employed to account for the uncertainty associated with our knowledge of the latent trajectory. Despite the increasing use of imputation approaches for modeling animal movement, the general sensitivity and accuracy of these methods have not been explored in detail. We provide an introduction to animal movement modeling and describe how imputation approaches may be helpful for certain types of models. We also assess the performance of imputation approaches in two simulation studies. Our simulation studies suggest that inference for model parameters directly related to the location of an individual may be more accurate than inference for parameters associated with higher-order processes such as velocity or acceleration. Finally, we apply these methods to analyze a telemetry data set involving northern fur seals (Callorhinus ursinus) in the Bering Sea. Supplementary materials accompanying this paper appear online.

  1. QuASAR: quantitative allele-specific analysis of reads

    PubMed Central

    Harvey, Chris T.; Moyerbrailean, Gregory A.; Davis, Gordon O.; Wen, Xiaoquan; Luca, Francesca; Pique-Regi, Roger

    2015-01-01

    Motivation: Expression quantitative trait loci (eQTL) studies have discovered thousands of genetic variants that regulate gene expression, enabling a better understanding of the functional role of non-coding sequences. However, eQTL studies are costly, requiring large sample sizes and genome-wide genotyping of each sample. In contrast, analysis of allele-specific expression (ASE) is becoming a popular approach to detect the effect of genetic variation on gene expression, even within a single individual. This is typically achieved by counting the number of RNA-seq reads matching each allele at heterozygous sites and testing the null hypothesis of a 1:1 allelic ratio. In principle, when genotype information is not readily available, it could be inferred from the RNA-seq reads directly. However, there are currently no methods that jointly infer genotypes and conduct ASE inference, while considering uncertainty in the genotype calls. Results: We present QuASAR, quantitative allele-specific analysis of reads, a novel statistical learning method for jointly detecting heterozygous genotypes and inferring ASE. The proposed ASE inference step takes into consideration the uncertainty in the genotype calls, while including parameters that model base-call errors in sequencing and allelic over-dispersion. We validated our method with experimental data for which high-quality genotypes are available. Results for an additional dataset with multiple replicates at different sequencing depths demonstrate that QuASAR is a powerful tool for ASE analysis when genotypes are not available. Availability and implementation: http://github.com/piquelab/QuASAR. Contact: fluca@wayne.edu or rpique@wayne.edu Supplementary information: Supplementary Material is available at Bioinformatics online. PMID:25480375
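
    The core ASE test, before QuASAR's genotype-uncertainty and over-dispersion modelling, is a binomial test of the 1:1 allelic ratio at a heterozygous site; the read counts below are invented:

      from scipy.stats import binomtest

      ref_reads, alt_reads = 72, 41                      # RNA-seq reads supporting each allele
      n = ref_reads + alt_reads
      result = binomtest(ref_reads, n, p=0.5)            # H0: balanced 1:1 allelic expression
      print(result.pvalue)                               # a small p-value suggests ASE at this site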

  2. Using DNA to track the origin of the largest ivory seizure since the 1989 trade ban.

    PubMed

    Wasser, Samuel K; Mailand, Celia; Booth, Rebecca; Mutayoba, Benezeth; Kisamo, Emily; Clark, Bill; Stephens, Matthew

    2007-03-06

    The illegal ivory trade recently intensified to the highest levels ever reported. Policing this trafficking has been hampered by the inability to reliably determine geographic origin of contraband ivory. Ivory can be smuggled across multiple international borders and along numerous trade routes, making poaching hotspots and potential trade routes difficult to identify. This fluidity also makes it difficult to refute a country's denial of poaching problems. We extend an innovative DNA assignment method to determine the geographic origin(s) of large elephant ivory seizures. A Voronoi tessellation method is used that utilizes genetic similarities across tusks to simultaneously infer the origin of multiple samples that could have one or more common origin(s). We show that this joint analysis performs better than sample-by-sample methods in assigning sample clusters of known origin. The joint method is then used to infer the geographic origin of the largest ivory seizure since the 1989 ivory trade ban. Wildlife authorities initially suspected that this ivory came from multiple locations across forest and savanna Africa. However, we show that the ivory was entirely from savanna elephants, most probably originating from a narrow east-to-west band of southern Africa, centered on Zambia. These findings enabled law enforcement to focus their investigation to a smaller area and fewer trade routes and led to changes within the Zambian government to improve antipoaching efforts. Such outcomes demonstrate the potential of genetic analyses to help combat the expanding wildlife trade by identifying origin(s) of large seizures of contraband ivory. Broader applications to wildlife trade are discussed.

  3. Using DNA to track the origin of the largest ivory seizure since the 1989 trade ban

    PubMed Central

    Wasser, Samuel K.; Mailand, Celia; Booth, Rebecca; Mutayoba, Benezeth; Kisamo, Emily; Clark, Bill; Stephens, Matthew

    2007-01-01

    The illegal ivory trade recently intensified to the highest levels ever reported. Policing this trafficking has been hampered by the inability to reliably determine geographic origin of contraband ivory. Ivory can be smuggled across multiple international borders and along numerous trade routes, making poaching hotspots and potential trade routes difficult to identify. This fluidity also makes it difficult to refute a country's denial of poaching problems. We extend an innovative DNA assignment method to determine the geographic origin(s) of large elephant ivory seizures. A Voronoi tessellation method is used that utilizes genetic similarities across tusks to simultaneously infer the origin of multiple samples that could have one or more common origin(s). We show that this joint analysis performs better than sample-by-sample methods in assigning sample clusters of known origin. The joint method is then used to infer the geographic origin of the largest ivory seizure since the 1989 ivory trade ban. Wildlife authorities initially suspected that this ivory came from multiple locations across forest and savanna Africa. However, we show that the ivory was entirely from savanna elephants, most probably originating from a narrow east-to-west band of southern Africa, centered on Zambia. These findings enabled law enforcement to focus their investigation to a smaller area and fewer trade routes and led to changes within the Zambian government to improve antipoaching efforts. Such outcomes demonstrate the potential of genetic analyses to help combat the expanding wildlife trade by identifying origin(s) of large seizures of contraband ivory. Broader applications to wildlife trade are discussed. PMID:17360505

  4. Inference-Based Similarity Search in Randomized Montgomery Domains for Privacy-Preserving Biometric Identification.

    PubMed

    Wang, Yi; Wan, Jianwu; Guo, Jun; Cheung, Yiu-Ming; Yuen, Pong C

    2018-07-01

    Similarity search is essential to many important applications and often involves searching at scale on high-dimensional data based on their similarity to a query. In biometric applications, recent vulnerability studies have shown that adversarial machine learning can compromise biometric recognition systems by exploiting the biometric similarity information. Existing methods for biometric privacy protection are in general based on pairwise matching of secured biometric templates and have inherent limitations in search efficiency and scalability. In this paper, we propose an inference-based framework for privacy-preserving similarity search in Hamming space. Our approach builds on an obfuscated distance measure that can conceal Hamming distance in a dynamic interval. Such a mechanism enables us to systematically design statistically reliable methods for retrieving most likely candidates without knowing the exact distance values. We further propose to apply Montgomery multiplication for generating search indexes that can withstand adversarial similarity analysis, and show that information leakage in randomized Montgomery domains can be made negligibly small. Our experiments on public biometric datasets demonstrate that the inference-based approach can achieve a search accuracy close to the best performance possible with secure computation methods, but the associated cost is reduced by orders of magnitude compared to cryptographic primitives.

  5. Phylogeny of the cycads based on multiple single copy nuclear genes: congruence of concatenation and species tree inference methods

    USDA-ARS?s Scientific Manuscript database

    Despite a recent new classification, a stable tree of life for the cycads has been elusive, particularly regarding resolution of Bowenia, Stangeria and Dioon. In this study we apply five single copy nuclear genes (SCNGs) to the phylogeny of the order Cycadales. We specifically aim to evaluate seve...

  6. Statistical Methods for Generalized Linear Models with Covariates Subject to Detection Limits.

    PubMed

    Bernhardt, Paul W; Wang, Huixia J; Zhang, Daowen

    2015-05-01

    Censored observations are a common occurrence in biomedical data sets. Although a large amount of research has been devoted to estimation and inference for data with censored responses, very little research has focused on proper statistical procedures when predictors are censored. In this paper, we consider statistical methods for dealing with multiple predictors subject to detection limits within the context of generalized linear models. We investigate and adapt several conventional methods and develop a new multiple imputation approach for analyzing data sets with predictors censored due to detection limits. We establish the consistency and asymptotic normality of the proposed multiple imputation estimator and suggest a computationally simple and consistent variance estimator. We also demonstrate that the conditional mean imputation method often leads to inconsistent estimates in generalized linear models, while several other methods are either computationally intensive or lead to parameter estimates that are biased or more variable compared to the proposed multiple imputation estimator. In an extensive simulation study, we assess the bias and variability of different approaches within the context of a logistic regression model and compare variance estimation methods for the proposed multiple imputation estimator. Lastly, we apply several methods to analyze the data set from a recently-conducted GenIMS study.
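
    A minimal sketch of the multiple-imputation mechanics for one left-censored predictor with a known detection limit, pooled with Rubin's rules. The truncated-normal imputation model below is a deliberately crude stand-in (it ignores the outcome, which the paper shows can matter), so treat it as illustrating the workflow rather than the authors' estimator:

      import numpy as np
      from scipy.stats import truncnorm
      import statsmodels.api as sm

      rng = np.random.default_rng(5)
      n, dl, M = 500, -0.5, 20                           # sample size, detection limit, imputations
      x = rng.standard_normal(n)
      y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.0 * x))))
      censored = x < dl                                  # values below the detection limit are unobserved

      betas, variances = [], []
      for _ in range(M):
          x_imp = x.copy()
          # Impute censored values from N(0, 1) truncated to (-inf, dl];
          # in practice the predictor's distribution would itself be estimated.
          x_imp[censored] = truncnorm.rvs(-np.inf, dl, size=censored.sum(), random_state=rng)
          fit = sm.Logit(y, sm.add_constant(x_imp)).fit(disp=0)
          betas.append(fit.params[1])
          variances.append(fit.bse[1] ** 2)

      # Rubin's rules: pooled estimate and total variance.
      beta_bar = np.mean(betas)
      total_var = np.mean(variances) + (1 + 1 / M) * np.var(betas, ddof=1)
      print(beta_bar, np.sqrt(total_var))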

  7. The role of multiple-scale modelling of epilepsy in seizure forecasting

    PubMed Central

    Kuhlmann, Levin; Grayden, David B.; Wendling, Fabrice; Schiff, Steven J.

    2014-01-01

    Over the past three decades, a number of seizure prediction, or forecasting, methods have been developed. Although major achievements were accomplished regarding the statistical evaluation of proposed algorithms, it is recognized that further progress is still necessary for clinical application in patients. The lack of physiological motivation can partly explain this limitation. Therefore, a natural question is raised: can computational models of epilepsy be used to improve these methods? Here we review the literature on the multiple-scale neural modelling of epilepsy and the use of such models to infer physiological changes underlying epilepsy and epileptic seizures. We argue how these methods can be applied to advance the state-of-the-art in seizure forecasting. PMID:26035674

  8. GLOBAL HELIOSEISMIC EVIDENCE FOR A DEEPLY PENETRATING SOLAR MERIDIONAL FLOW CONSISTING OF MULTIPLE FLOW CELLS

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Schad, A.; Roth, M.; Timmer, J., E-mail: ariane.schad@kis.uni-freiburg.de

    2013-12-01

    We use a novel global helioseismic analysis method to infer the meridional flow in the deep solar interior. The method is based on the perturbation of eigenfunctions of solar p modes due to meridional flow. We apply this method to time series obtained from Dopplergrams measured by the Michelson Doppler Imager aboard the Solar and Heliospheric Observatory covering the observation period 2004-2010. Our results show evidence that the meridional flow reaches down to the base of the convection zone. The flow profile has a complex spatial structure consisting of multiple flow cells distributed in depth and latitude. Toward the solar surface, our results are in good agreement with flow measurements from local helioseismology.

  9. Using a derivative-free optimization method for multiple solutions of inverse transport problems

    DOE PAGES

    Armstrong, Jerawan C.; Favorite, Jeffrey A.

    2016-01-14

    Identifying unknown components of an object that emits radiation is an important problem for national and global security. Radiation signatures measured from an object of interest can be used to infer object parameter values that are not known. This problem is called an inverse transport problem. An inverse transport problem may have multiple solutions and the most widely used approach for its solution is an iterative optimization method. This paper proposes a stochastic derivative-free global optimization algorithm to find multiple solutions of inverse transport problems. The algorithm is an extension of a multilevel single linkage (MLSL) method in which a mesh adaptive direct search (MADS) algorithm is incorporated into the local phase. Furthermore, numerical test cases using uncollided fluxes of discrete gamma-ray lines are presented to show the performance of this new algorithm.
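
    The global-plus-local structure can be sketched as a simple multistart loop: sample candidate starting points, run a local derivative-free search from each, and keep the distinct minima found. Nelder-Mead stands in for MADS here, MLSL's clustering/linkage rule is reduced to a distance filter, and the objective is a toy surrogate for the transport misfit:

      import numpy as np
      from scipy.optimize import minimize

      def objective(p):
          """Toy stand-in for the misfit between measured and computed
          gamma-ray line fluxes; it has multiple global minima."""
          x, y = p
          return (x**2 - 4)**2 + (y**2 - 1)**2            # minima at (+-2, +-1)

      rng = np.random.default_rng(6)
      solutions = []
      for start in rng.uniform(-4, 4, size=(50, 2)):      # global phase: random starts
          res = minimize(objective, start, method="Nelder-Mead")  # local phase (MADS stand-in)
          if res.fun < 1e-6 and not any(np.linalg.norm(res.x - s) < 0.1 for s in solutions):
              solutions.append(res.x)                     # keep each distinct solution once

      print(np.round(solutions, 3))                       # the four distinct parameter sets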

  10. Identifying predictors of time-inhomogeneous viral evolutionary processes.

    PubMed

    Bielejec, Filip; Baele, Guy; Rodrigo, Allen G; Suchard, Marc A; Lemey, Philippe

    2016-07-01

    Various factors determine the rate at which mutations are generated and fixed in viral genomes. Viral evolutionary rates may vary over the course of a single persistent infection and can reflect changes in replication rates and selective dynamics. Dedicated statistical inference approaches are required to understand how the complex interplay of these processes shapes the genetic diversity and divergence in viral populations. Although evolutionary models accommodating a high degree of complexity can now be formalized, adequately informing these models by potentially sparse data, and assessing the association of the resulting estimates with external predictors, remains a major challenge. In this article, we present a novel Bayesian evolutionary inference method, which integrates multiple potential predictors and tests their association with variation in the absolute rates of synonymous and non-synonymous substitutions along the evolutionary history. We consider clinical and virological measures as predictors, but also changes in population size trajectories that are simultaneously inferred using coalescent modelling. We demonstrate the potential of our method in an application to within-host HIV-1 sequence data sampled throughout the infection of multiple patients. While analyses of individual patient populations lack statistical power, we detect significant evidence for an abrupt drop in non-synonymous rates in late stage infection and a more gradual increase in synonymous rates over the course of infection in a joint analysis across all patients. The former is predicted by the immune relaxation hypothesis while the latter may be in line with increasing replicative fitness during the asymptomatic stage.

  11. Statistical detection of EEG synchrony using empirical bayesian inference.

    PubMed

    Singh, Archana K; Asoh, Hideki; Takeda, Yuji; Phillips, Steven

    2015-01-01

    There is growing interest in understanding how the brain utilizes synchronized oscillatory activity to integrate information across functionally connected regions. Computing phase-locking values (PLV) between EEG signals is a popular method for quantifying such synchronizations and elucidating their role in cognitive tasks. However, high dimensionality in PLV data incurs a serious multiple testing problem. Standard multiple testing methods in neuroimaging research (e.g., false discovery rate, FDR) suffer severe loss of power, because they fail to exploit the complex dependence structure between hypotheses that vary in spectral, temporal and spatial dimensions. Previously, we showed that hierarchical FDR and optimal discovery procedures could be effectively applied for PLV analysis to provide better power than FDR. In this article, we revisit the multiple comparison problem from a new Empirical Bayes perspective and propose the application of the local FDR method (locFDR; Efron, 2001) for PLV synchrony analysis, computing FDR as a posterior probability that an observed statistic belongs to the null hypothesis. We demonstrate the application of Efron's Empirical Bayes approach to PLV synchrony analysis for the first time. We use simulations to validate the specificity and sensitivity of locFDR, and a real EEG dataset from a visual search study for experimental validation. We also compare locFDR with hierarchical FDR and optimal discovery procedures in both simulation and experimental analyses. Our simulation results show that locFDR can effectively control false positives without compromising the power of PLV synchrony inference. Applying locFDR to the experimental data detected more significant discoveries than our previously proposed methods, whereas the standard FDR method failed to detect any significant discoveries.
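
    Efron's two-groups local FDR can be sketched directly: given z-scores for all PLV hypotheses, estimate the mixture density f(z), take the null density f0 as standard normal, and report locfdr(z) = pi0 * f0(z) / f(z). The density estimate and pi0 below are deliberately crude, and the data are simulated:

      import numpy as np
      from scipy.stats import gaussian_kde, norm

      rng = np.random.default_rng(7)
      # Toy z-scores: 95% nulls ~ N(0, 1), 5% true synchrony effects ~ N(3, 1).
      z = np.concatenate([rng.standard_normal(1900), rng.normal(3, 1, 100)])

      f = gaussian_kde(z)                 # empirical mixture density f(z)
      pi0 = 0.95                          # null proportion; in practice estimated from the data
      locfdr = np.clip(pi0 * norm.pdf(z) / f(z), 0, 1)   # posterior P(null | z)
      discoveries = np.where(locfdr < 0.2)[0]            # Efron's conventional reporting threshold
      print(len(discoveries))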

  12. Inverse MDS: Inferring Dissimilarity Structure from Multiple Item Arrangements

    PubMed Central

    Kriegeskorte, Nikolaus; Mur, Marieke

    2012-01-01

    The pairwise dissimilarities of a set of items can be intuitively visualized by a 2D arrangement of the items, in which the distances reflect the dissimilarities. Such an arrangement can be obtained by multidimensional scaling (MDS). We propose a method for the inverse process: inferring the pairwise dissimilarities from multiple 2D arrangements of items. Perceptual dissimilarities are classically measured using pairwise dissimilarity judgments. However, alternative methods including free sorting and 2D arrangements have previously been proposed. The present proposal is novel (a) in that the dissimilarity matrix is estimated by “inverse MDS” based on multiple arrangements of item subsets, and (b) in that the subsets are designed by an adaptive algorithm that aims to provide optimal evidence for the dissimilarity estimates. The subject arranges the items (represented as icons on a computer screen) by means of mouse drag-and-drop operations. The multi-arrangement method can be construed as a generalization of simpler methods: It reduces to pairwise dissimilarity judgments if each arrangement contains only two items, and to free sorting if the items are categorically arranged into discrete piles. Multi-arrangement combines the advantages of these methods. It is efficient (because the subject communicates many dissimilarity judgments with each mouse drag), psychologically attractive (because dissimilarities are judged in context), and can characterize continuous high-dimensional dissimilarity structures. We present two procedures for estimating the dissimilarity matrix: a simple weighted-aligned-average of the partial dissimilarity matrices and a computationally intensive algorithm, which estimates the dissimilarity matrix by iteratively minimizing the error of MDS-predictions of the subject’s arrangements. The Matlab code for interactive arrangement and dissimilarity estimation is available from the authors upon request. PMID:22848204
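
    The simpler of the two estimators, the weighted-aligned-average, can be sketched as follows: compute each arrangement's pairwise distances, rescale each partial matrix to a common scale, and average entries over the arrangements in which both items co-occurred. The scale alignment here is reduced to unit-RMS normalization, and the arrangements are invented:

      import numpy as np
      from scipy.spatial.distance import pdist, squareform

      def estimate_dissimilarities(arrangements, n_items):
          """arrangements: list of (item_indices, coords) pairs, where coords is the
          subject's 2D layout (len(item_indices) x 2) of that subset of items."""
          total = np.zeros((n_items, n_items))
          counts = np.zeros((n_items, n_items))
          for idx, coords in arrangements:
              d = squareform(pdist(coords))
              d /= np.sqrt(np.mean(d[d > 0] ** 2))        # align scale: unit RMS distance
              ij = np.ix_(idx, idx)
              total[ij] += d
              counts[ij] += 1
          with np.errstate(invalid="ignore"):
              return total / counts                       # NaN where a pair never co-occurred

      # Two toy arrangements of overlapping item subsets.
      arr1 = ([0, 1, 2], np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]))
      arr2 = ([1, 2, 3], np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 1.0]]))
      print(estimate_dissimilarities([arr1, arr2], n_items=4))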

  13. Multiple co-clustering based on nonparametric mixture models with heterogeneous marginal distributions

    PubMed Central

    Yoshimoto, Junichiro; Shimizu, Yu; Okada, Go; Takamura, Masahiro; Okamoto, Yasumasa; Yamawaki, Shigeto; Doya, Kenji

    2017-01-01

    We propose a novel method for multiple clustering, which is useful for analysis of high-dimensional data containing heterogeneous types of features. Our method is based on nonparametric Bayesian mixture models in which features are automatically partitioned (into views) for each clustering solution. This feature partition works as feature selection for a particular clustering solution, which screens out irrelevant features. To make our method applicable to high-dimensional data, a co-clustering structure is newly introduced for each view. Further, the outstanding novelty of our method is that we simultaneously model different distribution families, such as Gaussian, Poisson, and multinomial distributions in each cluster block, which widens areas of application to real data. We apply the proposed method to synthetic and real data, and show that our method outperforms other multiple clustering methods both in recovering true cluster structures and in computation time. Finally, we apply our method to a depression dataset with no true cluster structure available, from which useful inferences are drawn about possible clustering structures of the data. PMID:29049392

  14. A Bayesian method for detecting pairwise associations in compositional data

    PubMed Central

    Ventz, Steffen; Huttenhower, Curtis

    2017-01-01

    Compositional data consist of vectors of proportions normalized to a constant sum from a basis of unobserved counts. The sum constraint makes inference on correlations between unconstrained features challenging due to the information loss from normalization. However, such correlations are of long-standing interest in fields including ecology. We propose a novel Bayesian framework (BAnOCC: Bayesian Analysis of Compositional Covariance) to estimate a sparse precision matrix through a LASSO prior. The resulting posterior, generated by MCMC sampling, allows uncertainty quantification of any function of the precision matrix, including the correlation matrix. We also use a first-order Taylor expansion to approximate the transformation from the unobserved counts to the composition in order to investigate what characteristics of the unobserved counts can make the correlations more or less difficult to infer. On simulated datasets, we show that BAnOCC infers the true network as well as previous methods while offering the advantage of posterior inference. Larger and more realistic simulated datasets further showed that BAnOCC performs well as measured by type I and type II error rates. Finally, we apply BAnOCC to a microbial ecology dataset from the Human Microbiome Project, which in addition to reproducing established ecological results revealed unique, competition-based roles for Proteobacteria in multiple distinct habitats. PMID:29140991

  15. SDG and qualitative trend based model multiple scale validation

    NASA Astrophysics Data System (ADS)

    Gao, Dong; Xu, Xin; Yin, Jianjin; Zhang, Hongyu; Zhang, Beike

    2017-09-01

    Verification, Validation and Accreditation (VV&A) is a key technology of simulation and modelling. Traditional model validation methods are weak in completeness, operate at a single scale, and depend on human experience. We therefore propose a multiple-scale validation method based on the SDG (Signed Directed Graph) and qualitative trends. First, the SDG model is built and qualitative trends are added to it. Complete testing scenarios are then produced by positive inference. Multiple-scale validation is carried out by comparing the testing scenarios with the outputs of the simulation model at different scales. Finally, the method's effectiveness is demonstrated by validating a reactor model.

  16. Learning to Observe "and" Infer

    ERIC Educational Resources Information Center

    Hanuscin, Deborah L.; Park Rogers, Meredith A.

    2008-01-01

    Researchers describe the need for students to have multiple opportunities and social interaction to learn about the differences between observation and inference and their role in developing scientific explanations (Harlen 2001; Simpson 2000). Helping children develop their skills of observation and inference in science while emphasizing the…

  17. Computational Precision of Mental Inference as Critical Source of Human Choice Suboptimality.

    PubMed

    Drugowitsch, Jan; Wyart, Valentin; Devauchelle, Anne-Dominique; Koechlin, Etienne

    2016-12-21

    Making decisions in uncertain environments often requires combining multiple pieces of ambiguous information from external cues. In such conditions, human choices resemble optimal Bayesian inference, but typically show a large suboptimal variability whose origin remains poorly understood. In particular, this choice suboptimality might arise from imperfections in mental inference rather than in peripheral stages, such as sensory processing and response selection. Here, we dissociate these three sources of suboptimality in human choices based on combining multiple ambiguous cues. Using a novel quantitative approach for identifying the origin and structure of choice variability, we show that imperfections in inference alone cause a dominant fraction of suboptimal choices. Furthermore, two-thirds of this suboptimality appear to derive from the limited precision of neural computations implementing inference rather than from systematic deviations from Bayes-optimal inference. These findings set an upper bound on the accuracy and ultimate predictability of human choices in uncertain environments. Copyright © 2016 Elsevier Inc. All rights reserved.

  18. Data fusion and classification using a hybrid intrinsic cellular inference network

    NASA Astrophysics Data System (ADS)

    Woodley, Robert; Walenz, Brett; Seiffertt, John; Robinette, Paul; Wunsch, Donald

    2010-04-01

    The Hybrid Intrinsic Cellular Inference Network (HICIN) is designed for battlespace decision support applications. We developed an automatic method of generating hypotheses for an entity-attribute classifier. A domain-specific ontology was used to automatically generate categories for data classification. Heterogeneous data are clustered using an Adaptive Resonance Theory (ART) inference engine on a sample (unclassified) data set. The data set is the Lahman baseball database. The actual data are immaterial to the architecture; parallels in the data can be easily drawn (i.e., "Team" maps to organization, "Runs scored/allowed" to measures of organization performance (positive/negative), "Payroll" to organization resources, etc.). Results show that HICIN classifiers create known inferences from the heterogeneous data. These inferences are not explicitly stated in the ontological description of the domain and are strictly data driven. HICIN uses data uncertainty handling to reduce errors in the classification. The uncertainty handling is based on subjective logic. The belief mass allows evidence from multiple sources to be mathematically combined to increase or discount an assertion. In military operations, the ability to reduce uncertainty will be vital in the data fusion operation.

  19. Kernel learning at the first level of inference.

    PubMed

    Cawley, Gavin C; Talbot, Nicola L C

    2014-05-01

    Kernel learning methods, whether Bayesian or frequentist, typically involve multiple levels of inference, with the coefficients of the kernel expansion being determined at the first level and the kernel and regularisation parameters carefully tuned at the second level, a process known as model selection. Model selection for kernel machines is commonly performed via optimisation of a suitable model selection criterion, often based on cross-validation or theoretical performance bounds. However, if there are a large number of kernel parameters, as for instance in the case of automatic relevance determination (ARD), there is a substantial risk of over-fitting the model selection criterion, resulting in poor generalisation performance. In this paper we investigate the possibility of learning the kernel, for the Least-Squares Support Vector Machine (LS-SVM) classifier, at the first level of inference, i.e. parameter optimisation. The kernel parameters and the coefficients of the kernel expansion are jointly optimised at the first level of inference, minimising a training criterion with an additional regularisation term acting on the kernel parameters. The key advantage of this approach is that the values of only two regularisation parameters need be determined in model selection, substantially alleviating the problem of over-fitting the model selection criterion. The benefits of this approach are demonstrated using a suite of synthetic and real-world binary classification benchmark problems, where kernel learning at the first level of inference is shown to be statistically superior to the conventional approach, improves on our previous work (Cawley and Talbot, 2007) and is competitive with Multiple Kernel Learning approaches, but with reduced computational expense. Copyright © 2014 Elsevier Ltd. All rights reserved.

  20. Penalized regression procedures for variable selection in the potential outcomes framework

    PubMed Central

    Ghosh, Debashis; Zhu, Yeying; Coffman, Donna L.

    2015-01-01

    A recent topic of much interest in causal inference is model selection. In this article, we describe a framework in which to consider penalized regression approaches to variable selection for causal effects. The framework leads to a simple ‘impute, then select’ class of procedures that is agnostic to the type of imputation algorithm as well as penalized regression used. It also clarifies how model selection involves a multivariate regression model for causal inference problems, and that these methods can be applied for identifying subgroups in which treatment effects are homogeneous. Analogies and links with the literature on machine learning methods, missing data and imputation are drawn. A difference LASSO algorithm is defined, along with its multiple imputation analogues. The procedures are illustrated using a well-known right heart catheterization dataset. PMID:25628185
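
    The 'impute, then select' recipe can be sketched in a few lines. The snippet below is a minimal illustration under strong simplifying assumptions (linear outcome models for imputation, scikit-learn's Lasso for selection, simulated data); it follows the spirit of the difference LASSO rather than the authors' exact procedure.

      import numpy as np
      from sklearn.linear_model import LinearRegression, Lasso

      rng = np.random.default_rng(0)
      n, p = 500, 10
      X = rng.normal(size=(n, p))
      T = rng.integers(0, 2, size=n)               # binary treatment
      y = X[:, 0] + T * (1.0 + 2.0 * X[:, 1]) + rng.normal(size=n)

      # Step 1: impute the missing potential outcome for each unit, using
      # outcome models fitted separately within each treatment arm.
      m1 = LinearRegression().fit(X[T == 1], y[T == 1])
      m0 = LinearRegression().fit(X[T == 0], y[T == 0])
      y1 = np.where(T == 1, y, m1.predict(X))      # Y(1), observed or imputed
      y0 = np.where(T == 0, y, m0.predict(X))      # Y(0), observed or imputed

      # Step 2: penalized regression of the imputed individual differences
      # on covariates to select treatment-effect modifiers.
      fit = Lasso(alpha=0.1).fit(X, y1 - y0)
      print(fit.intercept_, fit.coef_)             # intercept ~ average effect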

  1. Identifying Loci Under Selection Against Gene Flow in Isolation-with-Migration Models

    PubMed Central

    Sousa, Vitor C.; Carneiro, Miguel; Ferrand, Nuno; Hey, Jody

    2013-01-01

    When divergence occurs in the presence of gene flow, there can arise an interesting dynamic in which selection against gene flow, at sites associated with population-specific adaptations or genetic incompatibilities, can cause net gene flow to vary across the genome. Loci linked to sites under selection may experience reduced gene flow and may experience genetic bottlenecks by the action of nearby selective sweeps. Data from histories such as these may be poorly fitted by conventional neutral model approaches to demographic inference, which treat all loci as equally subject to forces of genetic drift and gene flow. To allow for demographic inference in the face of such histories, as well as the identification of loci affected by selection, we developed an isolation-with-migration model that explicitly provides for variation among genomic regions in migration rates and/or rates of genetic drift. The method allows for loci to fall into any of multiple groups, each characterized by a different set of parameters, thus relaxing the assumption that all loci share the same demography. By grouping loci, the method can be applied to data with multiple loci and still have tractable dimensionality and statistical power. We studied the performance of the method using simulated data, and we applied the method to study the divergence of two subspecies of European rabbits (Oryctolagus cuniculus). PMID:23457232

  2. A model-based approach to wildland fire reconstruction using sediment charcoal records

    USGS Publications Warehouse

    Itter, Malcolm S.; Finley, Andrew O.; Hooten, Mevin B.; Higuera, Philip E.; Marlon, Jennifer R.; Kelly, Ryan; McLachlan, Jason S.

    2017-01-01

    Lake sediment charcoal records are used in paleoecological analyses to reconstruct fire history, including the identification of past wildland fires. One challenge of applying sediment charcoal records to infer fire history is the separation of charcoal associated with local fire occurrence and charcoal originating from regional fire activity. Despite a variety of methods to identify local fires from sediment charcoal records, an integrated statistical framework for fire reconstruction is lacking. We develop a Bayesian point process model to estimate the probability of fire associated with charcoal counts from individual-lake sediments and estimate mean fire return intervals. A multivariate extension of the model combines records from multiple lakes to reduce uncertainty in local fire identification and estimate a regional mean fire return interval. The univariate and multivariate models are applied to 13 lakes in the Yukon Flats region of Alaska. Both models resulted in similar mean fire return intervals (100–350 years) with reduced uncertainty under the multivariate model due to improved estimation of regional charcoal deposition. The point process model offers an integrated statistical framework for paleofire reconstruction and extends existing methods to infer regional fire history from multiple lake records with uncertainty following directly from posterior distributions.

  3. Analysis of Sequence Data Under Multivariate Trait-Dependent Sampling.

    PubMed

    Tao, Ran; Zeng, Donglin; Franceschini, Nora; North, Kari E; Boerwinkle, Eric; Lin, Dan-Yu

    2015-06-01

    High-throughput DNA sequencing allows for the genotyping of common and rare variants for genetic association studies. At the present time and for the foreseeable future, it is not economically feasible to sequence all individuals in a large cohort. A cost-effective strategy is to sequence those individuals with extreme values of a quantitative trait. We consider the design under which the sampling depends on multiple quantitative traits. Under such trait-dependent sampling, standard linear regression analysis can result in bias of parameter estimation, inflation of type I error, and loss of power. We construct a likelihood function that properly reflects the sampling mechanism and utilizes all available data. We implement a computationally efficient EM algorithm and establish the theoretical properties of the resulting maximum likelihood estimators. Our methods can be used to perform separate inference on each trait or simultaneous inference on multiple traits. We pay special attention to gene-level association tests for rare variants. We demonstrate the superiority of the proposed methods over standard linear regression through extensive simulation studies. We provide applications to the Cohorts for Heart and Aging Research in Genomic Epidemiology Targeted Sequencing Study and the National Heart, Lung, and Blood Institute Exome Sequencing Project.

  4. Dynamic motif occupancy (DynaMO) analysis identifies transcription factors and their binding sites driving dynamic biological processes

    PubMed Central

    Kuang, Zheng; Ji, Zhicheng

    2018-01-01

    Abstract Biological processes are usually associated with genome-wide remodeling of transcription driven by transcription factors (TFs). Identifying key TFs and their spatiotemporal binding patterns is indispensable to understanding how dynamic processes are programmed. However, most methods are designed to predict TF binding sites only. We present a computational method, dynamic motif occupancy analysis (DynaMO), to infer important TFs and their spatiotemporal binding activities in dynamic biological processes using chromatin profiling data from multiple biological conditions, such as time-course histone modification ChIP-seq data. In the first step, DynaMO predicts TF binding sites with a random forests approach. Next and uniquely, DynaMO infers dynamic TF binding activities at predicted binding sites using their local chromatin profiles from multiple biological conditions. A third distinctive feature of DynaMO is the identification of key TFs in a dynamic process through clustering and enrichment analysis of dynamic TF binding patterns. Application of DynaMO to the yeast ultradian cycle, mouse circadian clock and human neural differentiation exhibits its accuracy and versatility. We anticipate DynaMO will be generally useful for elucidating transcriptional programs in dynamic processes. PMID:29325176

  5. QuASAR: quantitative allele-specific analysis of reads.

    PubMed

    Harvey, Chris T; Moyerbrailean, Gregory A; Davis, Gordon O; Wen, Xiaoquan; Luca, Francesca; Pique-Regi, Roger

    2015-04-15

    Expression quantitative trait loci (eQTL) studies have discovered thousands of genetic variants that regulate gene expression, enabling a better understanding of the functional role of non-coding sequences. However, eQTL studies are costly, requiring large sample sizes and genome-wide genotyping of each sample. In contrast, analysis of allele-specific expression (ASE) is becoming a popular approach to detect the effect of genetic variation on gene expression, even within a single individual. This is typically achieved by counting the number of RNA-seq reads matching each allele at heterozygous sites and testing the null hypothesis of a 1:1 allelic ratio. In principle, when genotype information is not readily available, it could be inferred from the RNA-seq reads directly. However, there are currently no existing methods that jointly infer genotypes and conduct ASE inference, while considering uncertainty in the genotype calls. We present QuASAR, quantitative allele-specific analysis of reads, a novel statistical learning method for jointly detecting heterozygous genotypes and inferring ASE. The proposed ASE inference step takes into consideration the uncertainty in the genotype calls, while including parameters that model base-call errors in sequencing and allelic over-dispersion. We validated our method with experimental data for which high-quality genotypes are available. Results for an additional dataset with multiple replicates at different sequencing depths demonstrate that QuASAR is a powerful tool for ASE analysis when genotypes are not available. http://github.com/piquelab/QuASAR. fluca@wayne.edu or rpique@wayne.edu Supplementary Material is available at Bioinformatics online. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
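
    At its core, the ASE test asks whether read counts at a heterozygous site depart from a 1:1 allelic ratio; a toy version is a plain binomial test. QuASAR itself additionally models base-call errors, allelic over-dispersion and genotype uncertainty, all of which this sketch (with made-up counts) omits.

      from scipy.stats import binomtest

      ref_reads, alt_reads = 63, 37        # read counts at one heterozygous site
      n = ref_reads + alt_reads

      # Under H0 (no allele-specific expression) each read is equally likely
      # to originate from either allele.
      result = binomtest(ref_reads, n, p=0.5)
      print(result.pvalue)                 # a small p-value suggests ASE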

  6. Statistical inference for Hardy-Weinberg proportions in the presence of missing genotype information.

    PubMed

    Graffelman, Jan; Sánchez, Milagros; Cook, Samantha; Moreno, Victor

    2013-01-01

    In genetic association studies, tests for Hardy-Weinberg proportions are often employed as a quality control checking procedure. Missing genotypes are typically discarded prior to testing. In this paper we show that inference for Hardy-Weinberg proportions can be biased when missing values are discarded. We propose to use multiple imputation of missing values in order to improve inference for Hardy-Weinberg proportions. For imputation we employ a multinomial logit model that uses information from allele intensities and/or neighbouring markers. Analysis of an empirical data set of single nucleotide polymorphisms possibly related to colon cancer reveals that missing genotypes are not missing completely at random. Deviation from Hardy-Weinberg proportions is mostly due to a lack of heterozygotes. Inbreeding coefficients estimated by multiple imputation of the missings are typically lowered with respect to inbreeding coefficients estimated by discarding the missings. Accounting for missings by multiple imputation qualitatively changed the results of 10 to 17% of the statistical tests performed. Estimates of inbreeding coefficients obtained by multiple imputation showed high correlation with estimates obtained by single imputation using an external reference panel. Our conclusion is that imputation of missing data leads to improved statistical inference for Hardy-Weinberg proportions.
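
    The underlying test of Hardy-Weinberg proportions is standard; a minimal chi-square version for one biallelic marker with complete genotype counts is sketched below. The counts are illustrative, and the paper's multiple-imputation step for missing genotypes is not shown.

      import numpy as np
      from scipy.stats import chi2

      def hwe_chisq(n_AA, n_Aa, n_aa):
          """Chi-square test of Hardy-Weinberg proportions, one biallelic marker."""
          n = n_AA + n_Aa + n_aa
          p = (2 * n_AA + n_Aa) / (2 * n)      # estimated frequency of allele A
          expected = np.array([p * p, 2 * p * (1 - p), (1 - p) ** 2]) * n
          observed = np.array([n_AA, n_Aa, n_aa])
          stat = ((observed - expected) ** 2 / expected).sum()
          return stat, chi2.sf(stat, df=1)     # df = 3 classes - 1 - 1 estimated

      print(hwe_chisq(298, 489, 213))          # counts roughly consistent with HWE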

  7. Novel application of multi-stimuli network inference to synovial fibroblasts of rheumatoid arthritis patients

    PubMed Central

    2014-01-01

    Background Network inference of gene expression data is an important challenge in systems biology. Novel algorithms may provide more detailed gene regulatory networks (GRN) for complex, chronic inflammatory diseases such as rheumatoid arthritis (RA), in which activated synovial fibroblasts (SFBs) play a major role. Since the detailed mechanisms underlying this activation are still unclear, simultaneous investigation of multi-stimuli activation of SFBs offers the possibility to elucidate the regulatory effects of multiple mediators and to gain new insights into disease pathogenesis. Methods A GRN was therefore inferred from RA-SFBs treated with 4 different stimuli (IL-1 β, TNF- α, TGF- β, and PDGF-D). Data from time series microarray experiments (0, 1, 2, 4, 12 h; Affymetrix HG-U133 Plus 2.0) were batch-corrected applying ‘ComBat’, analyzed for differentially expressed genes over time with ‘Limma’, and used for the inference of a robust GRN with NetGenerator V2.0, a heuristic ordinary differential equation-based method with soft integration of prior knowledge. Results Using all genes differentially expressed over time in RA-SFBs for any stimulus, and selecting the genes belonging to the most significant gene ontology (GO) term, i.e., ‘cartilage development’, a dynamic, robust, moderately complex multi-stimuli GRN was generated with 24 genes and 57 edges in total, 31 of which were gene-to-gene edges. Prior literature-based knowledge derived from Pathway Studio or manual searches was reflected in the final network by 25/57 confirmed edges (44%). The model contained known network motifs crucial for dynamic cellular behavior, e.g., cross-talk among pathways, positive feed-back loops, and positive feed-forward motifs (including suppression of the transcriptional repressor OSR2 by all 4 stimuli. Conclusion A multi-stimuli GRN highly concordant with literature data was successfully generated by network inference from the gene expression of stimulated RA-SFBs. The GRN showed high reliability, since 10 predicted edges were independently validated by literature findings post network inference. The selected GO term ‘cartilage development’ contained a number of differentiation markers, growth factors, and transcription factors with potential relevance for RA. Finally, the model provided new insight into the response of RA-SFBs to multiple stimuli implicated in the pathogenesis of RA, in particular to the ‘novel’ potent growth factor PDGF-D. PMID:24989895

  8. STRIDE: Species Tree Root Inference from Gene Duplication Events.

    PubMed

    Emms, David M; Kelly, Steven

    2017-12-01

    The correct interpretation of any phylogenetic tree is dependent on that tree being correctly rooted. We present STRIDE, a fast, effective, and outgroup-free method for identification of gene duplication events and species tree root inference in large-scale molecular phylogenetic analyses. STRIDE identifies sets of well-supported in-group gene duplication events from a set of unrooted gene trees, and analyses these events to infer a probability distribution over an unrooted species tree for the location of its root. We show that STRIDE correctly identifies the root of the species tree in multiple large-scale molecular phylogenetic data sets spanning a wide range of timescales and taxonomic groups. We demonstrate that the novel probability model implemented in STRIDE can accurately represent the ambiguity in species tree root assignment for data sets where information is limited. Furthermore, application of STRIDE to outgroup-free inference of the origin of the eukaryotic tree resulted in a root probability distribution that provides additional support for leading hypotheses for the origin of the eukaryotes. © The Author 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

  9. Robust inference from multiple test statistics via permutations: a better alternative to the single test statistic approach for randomized trials.

    PubMed

    Ganju, Jitendra; Yu, Xinxin; Ma, Guoguang Julie

    2013-01-01

    Formal inference in randomized clinical trials is based on controlling the type I error rate associated with a single pre-specified statistic. The deficiency of using just one method of analysis is that it depends on assumptions that may not be met. For robust inference, we propose pre-specifying multiple test statistics and relying on the minimum p-value for testing the null hypothesis of no treatment effect. The null hypothesis associated with the various test statistics is that the treatment groups are indistinguishable. The critical value for hypothesis testing comes from permutation distributions. Rejection of the null hypothesis when the smallest p-value is less than the critical value controls the type I error rate at its designated value. Even if one of the candidate test statistics has low power, the adverse effect on the power of the minimum p-value statistic is modest. Its use is illustrated with examples. We conclude that it is better to rely on the minimum p-value rather than a single statistic, particularly when that single statistic is the logrank test, because of the cost and complexity of many survival trials. Copyright © 2013 John Wiley & Sons, Ltd.
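
    The construction is straightforward to sketch: compute each candidate statistic, convert each to a permutation p-value, and calibrate the minimum of those p-values against its own permutation distribution. A toy two-sample version with two illustrative statistics (the paper's trial setting and statistics differ):

      import numpy as np

      rng = np.random.default_rng(1)

      def stats(x, y):
          # Two illustrative candidate statistics.
          return np.array([abs(x.mean() - y.mean()),
                           abs(np.median(x) - np.median(y))])

      def min_p_test(x, y, n_perm=1000):
          pooled = np.concatenate([x, y])
          obs = stats(x, y)
          perm = np.array([stats(*np.split(rng.permutation(pooled), [x.size]))
                           for _ in range(n_perm)])
          # Per-statistic permutation p-values, then calibrate the minimum
          # p-value against its own permutation distribution.
          p_obs = (perm >= obs).mean(axis=0)
          p_perm = np.array([(perm >= row).mean(axis=0) for row in perm])
          return (p_perm.min(axis=1) <= p_obs.min()).mean()

      x = rng.normal(0.0, 1.0, 40)
      y = rng.normal(0.6, 1.0, 40)
      print(min_p_test(x, y))   # overall p-value; type I error controlled by construction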

  10. Compression-based distance (CBD): a simple, rapid, and accurate method for microbiota composition comparison

    PubMed Central

    2013-01-01

    Background Perturbations in intestinal microbiota composition have been associated with a variety of gastrointestinal tract-related diseases. The alleviation of symptoms has been achieved using treatments that alter the gastrointestinal tract microbiota toward that of healthy individuals. Identifying differences in microbiota composition through the use of 16S rRNA gene hypervariable tag sequencing has profound health implications. Current computational methods for comparing microbial communities are usually based on multiple alignments and phylogenetic inference, making them time consuming and requiring exceptional expertise and computational resources. As sequencing data rapidly grows in size, simpler analysis methods are needed to meet the growing computational burdens of microbiota comparisons. Thus, we have developed a simple, rapid, and accurate method, independent of multiple alignments and phylogenetic inference, to support microbiota comparisons. Results We create a metric, called compression-based distance (CBD) for quantifying the degree of similarity between microbial communities. CBD uses the repetitive nature of hypervariable tag datasets and well-established compression algorithms to approximate the total information shared between two datasets. Three published microbiota datasets were used as test cases for CBD as an applicable tool. Our study revealed that CBD recaptured 100% of the statistically significant conclusions reported in the previous studies, while achieving a decrease in computational time required when compared to similar tools without expert user intervention. Conclusion CBD provides a simple, rapid, and accurate method for assessing distances between gastrointestinal tract microbiota 16S hypervariable tag datasets. PMID:23617892
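
    The metric belongs to the family of normalized compression distances. The sketch below uses zlib with an NCD-style normalisation as an illustration; the published CBD formula may differ in detail.

      import zlib

      def c(data: bytes) -> int:
          return len(zlib.compress(data, 9))      # compressed size, max effort

      def cbd(x: bytes, y: bytes) -> float:
          """Compression-based dissimilarity of two sequence datasets (NCD-style)."""
          cx, cy, cxy = c(x), c(y), c(x + y)
          return (cxy - min(cx, cy)) / max(cx, cy)

      s1 = b"ACGTACGTACGT" * 200
      s2 = b"ACGTACGTACGA" * 200
      print(cbd(s1, s1), cbd(s1, s2))   # near 0 for identical data, larger otherwise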

  11. Compression-based distance (CBD): a simple, rapid, and accurate method for microbiota composition comparison.

    PubMed

    Yang, Fang; Chia, Nicholas; White, Bryan A; Schook, Lawrence B

    2013-04-23

    Perturbations in intestinal microbiota composition have been associated with a variety of gastrointestinal tract-related diseases. The alleviation of symptoms has been achieved using treatments that alter the gastrointestinal tract microbiota toward that of healthy individuals. Identifying differences in microbiota composition through the use of 16S rRNA gene hypervariable tag sequencing has profound health implications. Current computational methods for comparing microbial communities are usually based on multiple alignments and phylogenetic inference, making them time consuming and requiring exceptional expertise and computational resources. As sequencing data rapidly grows in size, simpler analysis methods are needed to meet the growing computational burdens of microbiota comparisons. Thus, we have developed a simple, rapid, and accurate method, independent of multiple alignments and phylogenetic inference, to support microbiota comparisons. We create a metric, called compression-based distance (CBD) for quantifying the degree of similarity between microbial communities. CBD uses the repetitive nature of hypervariable tag datasets and well-established compression algorithms to approximate the total information shared between two datasets. Three published microbiota datasets were used as test cases for CBD as an applicable tool. Our study revealed that CBD recaptured 100% of the statistically significant conclusions reported in the previous studies, while achieving a decrease in computational time required when compared to similar tools without expert user intervention. CBD provides a simple, rapid, and accurate method for assessing distances between gastrointestinal tract microbiota 16S hypervariable tag datasets.

  12. Statistical inference of the generation probability of T-cell receptors from sequence repertoires.

    PubMed

    Murugan, Anand; Mora, Thierry; Walczak, Aleksandra M; Callan, Curtis G

    2012-10-02

    Stochastic rearrangement of germline V-, D-, and J-genes to create variable coding sequence for certain cell surface receptors is at the origin of immune system diversity. This process, known as "VDJ recombination", is implemented via a series of stochastic molecular events involving gene choices and random nucleotide insertions between, and deletions from, genes. We use large sequence repertoires of the variable CDR3 region of human CD4+ T-cell receptor beta chains to infer the statistical properties of these basic biochemical events. Because any given CDR3 sequence can be produced in multiple ways, the probability distribution of hidden recombination events cannot be inferred directly from the observed sequences; we therefore develop a maximum likelihood inference method to achieve this end. To separate the properties of the molecular rearrangement mechanism from the effects of selection, we focus on nonproductive CDR3 sequences in T-cell DNA. We infer the joint distribution of the various generative events that occur when a new T-cell receptor gene is created. We find a rich picture of correlation (and absence thereof), providing insight into the molecular mechanisms involved. The generative event statistics are consistent between individuals, suggesting a universal biochemical process. Our probabilistic model predicts the generation probability of any specific CDR3 sequence by the primitive recombination process, allowing us to quantify the potential diversity of the T-cell repertoire and to understand why some sequences are shared between individuals. We argue that the use of formal statistical inference methods, of the kind presented in this paper, will be essential for quantitative understanding of the generation and evolution of diversity in the adaptive immune system.

  13. cit: hypothesis testing software for mediation analysis in genomic applications.

    PubMed

    Millstein, Joshua; Chen, Gary K; Breton, Carrie V

    2016-08-01

    The challenges of successfully applying causal inference methods include: (i) satisfying underlying assumptions, (ii) limitations in data/models accommodated by the software and (iii) low power of common multiple testing approaches. The causal inference test (CIT) is based on hypothesis testing rather than estimation, allowing the testable assumptions to be evaluated in the determination of statistical significance. A user-friendly software package provides P-values and optionally permutation-based FDR estimates (q-values) for potential mediators. It can handle single and multiple binary and continuous instrumental variables, binary or continuous outcome variables and adjustment covariates. Also, the permutation-based FDR option provides a non-parametric implementation. Simulation studies demonstrate the validity of the cit package and show a substantial advantage of permutation-based FDR over other common multiple testing strategies. The cit open-source R package is freely available from the CRAN website (https://cran.r-project.org/web/packages/cit/index.html) with embedded C++ code that utilizes the GNU Scientific Library, also freely available (http://www.gnu.org/software/gsl/). joshua.millstein@usc.edu Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  14. Probability genotype imputation method and integrated weighted lasso for QTL identification.

    PubMed

    Demetrashvili, Nino; Van den Heuvel, Edwin R; Wit, Ernst C

    2013-12-30

    Many QTL studies have two common features: (1) often there is missing marker information, (2) among many markers involved in the biological process only a few are causal. In statistics, the second issue falls under the headings "sparsity" and "causal inference". The goal of this work is to develop a two-step statistical methodology for QTL mapping for markers with binary genotypes. The first step introduces a novel imputation method for missing genotypes. Outcomes of the proposed imputation method are probabilities, which serve as weights in the second step, namely weighted lasso. The sparse phenotype inference is employed to select a set of predictive markers for the trait of interest. Simulation studies validate the proposed methodology under a wide range of realistic settings, and the methodology outperforms alternative imputation and variable selection methods in such studies. The methodology was applied to an Arabidopsis experiment containing 69 markers for 165 recombinant inbred lines of an F8 generation. The results confirm previously identified regions; however, several new markers are also found. On the basis of the inferred ROC behavior, these markers show good potential for being real, especially for the germination trait Gmax. Our imputation method shows higher accuracy in terms of sensitivity and specificity compared to an alternative imputation method. Also, the proposed weighted lasso outperforms commonly practiced multiple regression as well as the traditional lasso and adaptive lasso with three weighting schemes. This means that under realistic missing data settings this methodology can be used for QTL identification.
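
    A weighted lasso is easy to emulate with standard software: scale each column by its weight, fit an ordinary lasso, and rescale the coefficients, so that markers with larger weights are penalized less. A sketch with scikit-learn and simulated binary genotypes; in the paper the weights would come from the imputation probabilities, whereas here they are invented.

      import numpy as np
      from sklearn.linear_model import Lasso

      rng = np.random.default_rng(2)
      n, p = 200, 30
      X = rng.integers(0, 2, size=(n, p)).astype(float)   # binary marker genotypes
      beta = np.zeros(p)
      beta[[3, 17]] = 1.5                                  # two causal markers
      y = X @ beta + rng.normal(size=n)

      w = rng.uniform(0.5, 1.0, size=p)   # e.g. imputation confidence per marker

      # Weighted lasso: penalise coefficient j by alpha / w_j, implemented by
      # scaling column j by w_j, fitting an ordinary lasso, and rescaling back.
      fit = Lasso(alpha=0.05).fit(X * w, y)
      beta_hat = fit.coef_ * w
      print(np.nonzero(beta_hat)[0])       # selected markers, ideally {3, 17}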

  15. Inference of population splits and mixtures from genome-wide allele frequency data.

    PubMed

    Pickrell, Joseph K; Pritchard, Jonathan K

    2012-01-01

    Many aspects of the historical relationships between populations in a species are reflected in genetic data. Inferring these relationships from genetic data, however, remains a challenging task. In this paper, we present a statistical model for inferring the patterns of population splits and mixtures in multiple populations. In our model, the sampled populations in a species are related to their common ancestor through a graph of ancestral populations. Using genome-wide allele frequency data and a Gaussian approximation to genetic drift, we infer the structure of this graph. We applied this method to a set of 55 human populations and a set of 82 dog breeds and wild canids. In both species, we show that a simple bifurcating tree does not fully describe the data; in contrast, we infer many migration events. While some of the migration events that we find have been detected previously, many have not. For example, in the human data, we infer that Cambodians trace approximately 16% of their ancestry to a population ancestral to other extant East Asian populations. In the dog data, we infer that both the boxer and basenji trace a considerable fraction of their ancestry (9% and 25%, respectively) to wolves subsequent to domestication and that East Asian toy breeds (the Shih Tzu and the Pekingese) result from admixture between modern toy breeds and "ancient" Asian breeds. Software implementing the model described here, called TreeMix, is available at http://treemix.googlecode.com.

  16. Cognitive neuropsychology and its vicissitudes: The fate of Caramazza's axioms.

    PubMed

    Shallice, Tim

    2015-01-01

    Cognitive neuropsychology is characterized as the discipline in which one draws conclusions about the organization of the normal cognitive systems from the behaviour of brain-damaged individuals. In a series of papers, Caramazza, later in collaboration with McCloskey, put forward four assumptions as the bridge principles for making such inferences. Four potential pitfalls, one for each axiom, are discussed with respect to the use of single-case methods. Two of the pitfalls also apply to case series and group study procedures, and the other two are held to be indirectly testable or avoidable. Moreover, four other pitfalls are held to apply to case series or group study methods. It is held that inferences from single-case procedures may profitably be supported or rejected using case series/group study methods, but also that analogous support needs to be given in the other direction for functionally based case series or group studies. It is argued that at least six types of neuropsychological method are valuable for extrapolation to theories of the normal cognitive system but that the single- or multiple-case study remains a critical part of cognitive neuropsychology's methods.

  17. On the multiple imputation variance estimator for control-based and delta-adjusted pattern mixture models.

    PubMed

    Tang, Yongqiang

    2017-12-01

    Control-based pattern mixture models (PMM) and delta-adjusted PMMs are commonly used as sensitivity analyses in clinical trials with non-ignorable dropout. These PMMs assume that the statistical behavior of outcomes varies by pattern in the experimental arm in the imputation procedure, but the imputed data are typically analyzed by a standard method such as the primary analysis model. In the multiple imputation (MI) inference, Rubin's variance estimator is generally biased when the imputation and analysis models are uncongenial. One objective of the article is to quantify the bias of Rubin's variance estimator in the control-based and delta-adjusted PMMs for longitudinal continuous outcomes. These PMMs assume the same observed data distribution as the mixed effects model for repeated measures (MMRM). We derive analytic expressions for the MI treatment effect estimator and the associated Rubin's variance in these PMMs and MMRM as functions of the maximum likelihood estimator from the MMRM analysis and the observed proportion of subjects in each dropout pattern when the number of imputations is infinite. The asymptotic bias is generally small or negligible in the delta-adjusted PMM, but can be sizable in the control-based PMM. This indicates that the inference based on Rubin's rule is approximately valid in the delta-adjusted PMM. A simple variance estimator is proposed to ensure asymptotically valid MI inferences in these PMMs, and compared with the bootstrap variance. The proposed method is illustrated by the analysis of an antidepressant trial, and its performance is further evaluated via a simulation study. © 2017, The International Biometric Society.
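
    For reference, Rubin's rules themselves take only a few lines; the article's point is that the between/within decomposition below can misstate the true variance when the imputation and analysis models are uncongenial. A sketch with invented per-imputation results:

      import numpy as np

      def rubin_combine(estimates, variances):
          """Combine m multiply-imputed analyses by Rubin's rules."""
          estimates = np.asarray(estimates)
          variances = np.asarray(variances)
          m = len(estimates)
          q_bar = estimates.mean()                  # pooled point estimate
          w = variances.mean()                      # within-imputation variance
          b = estimates.var(ddof=1)                 # between-imputation variance
          t = w + (1 + 1 / m) * b                   # Rubin's total variance
          return q_bar, t

      est = [1.02, 0.95, 1.10, 0.99, 1.05]          # treatment effect per imputation
      var = [0.04, 0.05, 0.04, 0.05, 0.04]
      print(rubin_combine(est, var))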

  18. Prior robust empirical Bayes inference for large-scale data by conditioning on rank with application to microarray data

    PubMed Central

    Liao, J. G.; Mcmurry, Timothy; Berg, Arthur

    2014-01-01

    Empirical Bayes methods have been extensively used for microarray data analysis by modeling the large number of unknown parameters as random effects. Empirical Bayes allows borrowing information across genes and can automatically adjust for multiple testing and selection bias. However, the standard empirical Bayes model can perform poorly if the assumed working prior deviates from the true prior. This paper proposes a new rank-conditioned inference in which the shrinkage and confidence intervals are based on the distribution of the error conditioned on rank of the data. Our approach is in contrast to a Bayesian posterior, which conditions on the data themselves. The new method is almost as efficient as standard Bayesian methods when the working prior is close to the true prior, and it is much more robust when the working prior is not close. In addition, it allows a more accurate (but also more complex) non-parametric estimate of the prior to be easily incorporated, resulting in improved inference. The new method’s prior robustness is demonstrated via simulation experiments. Application to a breast cancer gene expression microarray dataset is presented. Our R package rank.Shrinkage provides a ready-to-use implementation of the proposed methodology. PMID:23934072

  19. Predicting Adverse Drug Effects from Literature- and Database-Mined Assertions.

    PubMed

    La, Mary K; Sedykh, Alexander; Fourches, Denis; Muratov, Eugene; Tropsha, Alexander

    2018-06-06

    Given that adverse drug effects (ADEs) have led to post-market patient harm and subsequent drug withdrawal, failure of candidate agents in the drug development process, and other negative outcomes, it is essential to attempt to forecast ADEs and other relevant drug-target-effect relationships as early as possible. Current pharmacologic data sources, providing multiple complementary perspectives on the drug-target-effect paradigm, can be integrated to facilitate the inference of relationships between these entities. This study aims to identify both existing and unknown relationships between chemicals (C), protein targets (T), and ADEs (E) based on evidence in the literature. Cheminformatics and data mining approaches were employed to integrate and analyze publicly available clinical pharmacology data and literature assertions interrelating drugs, targets, and ADEs. Based on these assertions, a C-T-E relationship knowledge base was developed. Known pairwise relationships between chemicals, targets, and ADEs were collected from several pharmacological and biomedical data sources. These relationships were curated and integrated according to Swanson's paradigm to form C-T-E triangles. Missing C-E edges were then inferred as C-E relationships. Unreported associations between drugs, targets, and ADEs were inferred, and inferences were prioritized as testable hypotheses. Several C-E inferences, including testosterone → myocardial infarction, were identified using inferences based on the literature sources published prior to confirmatory case reports. Timestamping approaches confirmed the predictive ability of this inference strategy on a larger scale. The presented workflow, based on free-access databases and an association-based inference scheme, provided novel C-E relationships that have been validated post hoc in case reports. With refinement of prioritization schemes for the generated C-E inferences, this workflow may provide an effective computational method for the early detection of potential drug candidate ADEs that can be followed by targeted experimental investigations.
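
    The Swanson-style inference step is essentially a relational join: a chemical inherits a candidate effect whenever it shares a target with that effect. A toy sketch with entirely hypothetical drug, target and effect names:

      # Known chemical-target (C-T) and target-effect (T-E) assertions;
      # all names below are purely illustrative.
      c_t = {("drugA", "T1"), ("drugA", "T2"), ("drugB", "T2")}
      t_e = {("T1", "hepatotoxicity"), ("T2", "QT prolongation")}
      known_c_e = {("drugA", "QT prolongation")}

      # Close the C-T-E triangle: infer unreported C-E edges via shared targets.
      inferred = {(c, e)
                  for c, t1 in c_t
                  for t2, e in t_e
                  if t1 == t2 and (c, e) not in known_c_e}
      print(sorted(inferred))
      # [('drugA', 'hepatotoxicity'), ('drugB', 'QT prolongation')]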

  20. A comparison of algorithms for inference and learning in probabilistic graphical models.

    PubMed

    Frey, Brendan J; Jojic, Nebojsa

    2005-09-01

    Research into methods for reasoning under uncertainty is currently one of the most exciting areas of artificial intelligence, largely because it has recently become possible to record, store, and process large amounts of data. While impressive achievements have been made in pattern classification problems such as handwritten character recognition, face detection, speaker identification, and prediction of gene function, it is even more exciting that researchers are on the verge of introducing systems that can perform large-scale combinatorial analyses of data, decomposing the data into interacting components. For example, computational methods for automatic scene analysis are now emerging in the computer vision community. These methods decompose an input image into its constituent objects, lighting conditions, motion patterns, etc. Two of the main challenges are finding effective representations and models in specific applications and finding efficient algorithms for inference and learning in these models. In this paper, we advocate the use of graph-based probability models and their associated inference and learning algorithms. We review exact techniques and various approximate, computationally efficient techniques, including iterated conditional modes, the expectation maximization (EM) algorithm, Gibbs sampling, the mean field method, variational techniques, structured variational techniques and the sum-product algorithm ("loopy" belief propagation). We describe how each technique can be applied in a vision model of multiple, occluding objects and contrast the behaviors and performances of the techniques using a unifying cost function, free energy.

  1. The potential for increased power from combining P-values testing the same hypothesis.

    PubMed

    Ganju, Jitendra; Julie Ma, Guoguang

    2017-02-01

    The conventional approach to hypothesis testing for formal inference is to prespecify a single test statistic thought to be optimal. However, we usually have more than one test statistic in mind for testing the null hypothesis of no treatment effect but we do not know which one is the most powerful. Rather than relying on a single p-value, combining p-values from prespecified multiple test statistics can be used for inference. Combining functions include Fisher's combination test and the minimum p-value. Using randomization-based tests, the increase in power can be remarkable when compared with a single test and Simes's method. The versatility of the method is that it also applies when the number of covariates exceeds the number of observations. The increase in power is large enough to prefer combined p-values over a single p-value. The limitation is that the method does not provide an unbiased estimator of the treatment effect and does not apply to situations when the model includes treatment by covariate interaction.
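
    Fisher's combination statistic is -2 Σ log p_i, referred to a chi-square distribution with 2k degrees of freedom; scipy exposes it directly. Note that this reference distribution assumes independent p-values, which is why correlated statistics computed on the same trial data are calibrated by randomization in the paper. The p-values below are invented.

      from scipy.stats import combine_pvalues

      p_values = [0.08, 0.12, 0.04]   # p-values from three prespecified statistics

      stat, p = combine_pvalues(p_values, method='fisher')
      print(stat, p)                  # statistic = -2 * sum(log p_i), df = 6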

  2. Petri Nets with Fuzzy Logic (PNFL): Reverse Engineering and Parametrization

    PubMed Central

    Küffner, Robert; Petri, Tobias; Windhager, Lukas; Zimmer, Ralf

    2010-01-01

    Background The recent DREAM4 blind assessment provided a particularly realistic and challenging setting for network reverse engineering methods. The in silico part of DREAM4 solicited the inference of cycle-rich gene regulatory networks from heterogeneous, noisy expression data including time courses as well as knockout, knockdown and multifactorial perturbations. Methodology and Principal Findings We inferred and parametrized simulation models based on Petri Nets with Fuzzy Logic (PNFL). This completely automated approach correctly reconstructed networks with cycles as well as oscillating network motifs. PNFL was evaluated as the best performer on DREAM4 in silico networks of size 10 with an area under the precision-recall curve (AUPR) of 81%. Besides topology, we inferred a range of additional mechanistic details with good reliability, e.g. distinguishing activation from inhibition as well as dependent from independent regulation. Our models also performed well on new experimental conditions such as double knockout mutations that were not included in the provided datasets. Conclusions The inference of biological networks substantially benefits from methods that are expressive enough to deal with diverse datasets in a unified way. At the same time, overly complex approaches could generate multiple different models that explain the data equally well. PNFL appears to strike the balance between expressive power and complexity. This also applies to the intuitive representation of PNFL models combining a straightforward graphical notation with colloquial fuzzy parameters. PMID:20862218

  3. Boosting probabilistic graphical model inference by incorporating prior knowledge from multiple sources.

    PubMed

    Praveen, Paurush; Fröhlich, Holger

    2013-01-01

    Inferring regulatory networks from experimental data via probabilistic graphical models is a popular framework to gain insights into biological systems. However, the inherent noise in experimental data coupled with a limited sample size reduces the performance of network reverse engineering. Prior knowledge from existing sources of biological information can address this low signal to noise problem by biasing the network inference towards biologically plausible network structures. Although integrating various sources of information is desirable, their heterogeneous nature makes this task challenging. We propose two computational methods to incorporate various information sources into a probabilistic consensus structure prior to be used in graphical model inference. Our first model, called Latent Factor Model (LFM), assumes a high degree of correlation among external information sources and reconstructs a hidden variable as a common source in a Bayesian manner. The second model, a Noisy-OR, picks up the strongest support for an interaction among information sources in a probabilistic fashion. Our extensive computational studies on KEGG signaling pathways as well as on gene expression data from breast cancer and yeast heat shock response reveal that both approaches can significantly enhance the reconstruction accuracy of Bayesian Networks compared to other competing methods as well as to the situation without any prior. Our framework allows for using diverse information sources, like pathway databases, GO terms and protein domain data, etc. and is flexible enough to integrate new sources, if available.
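
    The Noisy-OR combination is compact enough to state directly: if k sources independently support an interaction with probabilities q_1, ..., q_k, the combined prior confidence is 1 - Π(1 - q_i). A sketch with invented support values:

      import numpy as np

      def noisy_or(support):
          """Probability that at least one independent source supports an edge."""
          support = np.asarray(support)
          return 1.0 - np.prod(1.0 - support)

      # Support for one candidate edge from three sources (e.g. pathway DB,
      # GO co-annotation, shared protein domains); numbers are illustrative.
      print(noisy_or([0.6, 0.3, 0.2]))   # 1 - 0.4*0.7*0.8 = 0.776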

  4. Driver fatigue detection through multiple entropy fusion analysis in an EEG-based system.

    PubMed

    Min, Jianliang; Wang, Ping; Hu, Jianfeng

    2017-01-01

    Driver fatigue is an important contributor to road accidents, and fatigue detection has major implications for transportation safety. The aim of this research is to analyze the multiple entropy fusion method and evaluate several channel regions to effectively detect a driver's fatigue state based on electroencephalogram (EEG) records. First, we fused multiple entropies, i.e., spectral entropy, approximate entropy, sample entropy and fuzzy entropy, as features compared with autoregressive (AR) modeling by four classifiers. Second, we captured four significant channel regions according to weight-based electrodes via a simplified channel selection method. Finally, the evaluation model for detecting driver fatigue was established with four classifiers based on the EEG data from four channel regions. Twelve healthy subjects performed continuous simulated driving for 1-2 hours with EEG monitoring on a static simulator. The leave-one-out cross-validation approach obtained an accuracy of 98.3%, a sensitivity of 98.3% and a specificity of 98.2%. The experimental results verified the effectiveness of the proposed method, indicating that the multiple entropy fusion features are significant factors for inferring the fatigue state of a driver.
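
    Of the fused features, spectral entropy is the simplest to sketch: normalise the power spectral density to a probability distribution and take its Shannon entropy, so narrow-band signals score low and broadband signals score high. A minimal version with scipy and synthetic signals; the sampling rate and signal choices are illustrative, not the study's recording setup.

      import numpy as np
      from scipy.signal import welch

      def spectral_entropy(x, fs):
          """Shannon entropy (bits) of the normalised power spectral density."""
          f, psd = welch(x, fs=fs)
          p = psd / psd.sum()                  # PSD as a probability distribution
          p = p[p > 0]
          return -(p * np.log2(p)).sum()

      fs = 250                                  # typical EEG sampling rate, Hz
      t = np.arange(0, 10, 1 / fs)
      rng = np.random.default_rng(3)
      alpha_wave = np.sin(2 * np.pi * 10 * t)   # narrow-band signal: low entropy
      noise = rng.normal(size=t.size)           # broadband signal: high entropy
      print(spectral_entropy(alpha_wave, fs), spectral_entropy(noise, fs))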

  5. Inverse Problems in Complex Models and Applications to Earth Sciences

    NASA Astrophysics Data System (ADS)

    Bosch, M. E.

    2015-12-01

    The inference of the subsurface earth structure and properties requires the integration of different types of data, information and knowledge, by combined processes of analysis and synthesis. To support the process of integrating information, the regular concept of data inversion is evolving to expand its application to models with multiple inner components (properties, scales, structural parameters) that explain multiple data (geophysical survey data, well-logs, core data). The probabilistic inference methods provide the natural framework for the formulation of these problems, considering a posterior probability density function (PDF) that combines the information from a prior information PDF and the new sets of observations. To formulate the posterior PDF in the context of multiple datasets, the data likelihood functions are factorized assuming independence of uncertainties for data originating across different surveys. A realistic description of the earth medium requires modeling several properties and structural parameters, which relate to each other according to dependency and independency notions. Thus, conditional probabilities across model components also factorize. A common setting proceeds by structuring the model parameter space in hierarchical layers. A primary layer (e.g. lithology) conditions a secondary layer (e.g. physical medium properties), which conditions a third layer (e.g. geophysical data). In general, less structured relations within model components and data emerge from the analysis of other inverse problems. They can be described with flexibility via direct acyclic graphs, which are graphs that map dependency relations between the model components. Examples of inverse problems in complex models can be shown at various scales. At local scale, for example, the distribution of gas saturation is inferred from pre-stack seismic data and a calibrated rock-physics model. At regional scale, joint inversion of gravity and magnetic data is applied for the estimation of lithological structure of the crust, with the lithotype body regions conditioning the mass density and magnetic susceptibility fields. At planetary scale, the Earth mantle temperature and element composition is inferred from seismic travel-time and geodetic data.

  6. Is using multiple imputation better than complete case analysis for estimating a prevalence (risk) difference in randomized controlled trials when binary outcome observations are missing?

    PubMed

    Mukaka, Mavuto; White, Sarah A; Terlouw, Dianne J; Mwapasa, Victor; Kalilani-Phiri, Linda; Faragher, E Brian

    2016-07-22

    Missing outcomes can seriously impair the ability to make correct inferences from randomized controlled trials (RCTs). Complete case (CC) analysis is commonly used, but it reduces sample size and is perceived to lead to reduced statistical efficiency of estimates while increasing the potential for bias. As multiple imputation (MI) methods preserve sample size, they are generally viewed as the preferred analytical approach. We examined this assumption, comparing the performance of CC and MI methods to determine risk difference (RD) estimates in the presence of missing binary outcomes. We conducted simulation studies of 5000 simulated data sets with 50 imputations of RCTs with one primary follow-up endpoint at different underlying levels of RD (3-25 %) and missing outcomes (5-30 %). For missing at random (MAR) or missing completely at random (MCAR) outcomes, CC method estimates generally remained unbiased and achieved precision similar to or better than MI methods, and high statistical coverage. Missing not at random (MNAR) scenarios yielded invalid inferences with both methods. Effect size estimate bias was reduced in MI methods by always including group membership even if this was unrelated to missingness. Surprisingly, under MAR and MCAR conditions in the assessed scenarios, MI offered no statistical advantage over CC methods. While MI must inherently accompany CC methods for intention-to-treat analyses, these findings endorse CC methods for per protocol risk difference analyses in these conditions. These findings provide an argument for the use of the CC approach to always complement MI analyses, with the usual caveat that the validity of the mechanism for missingness be thoroughly discussed. More importantly, researchers should strive to collect as much data as possible.

  7. Efficient inference for genetic association studies with multiple outcomes.

    PubMed

    Ruffieux, Helene; Davison, Anthony C; Hager, Jorg; Irincheeva, Irina

    2017-10-01

    Combined inference for heterogeneous high-dimensional data is critical in modern biology, where clinical and various kinds of molecular data may be available from a single study. Classical genetic association studies regress a single clinical outcome on many genetic variants one by one, but there is an increasing demand for joint analysis of many molecular outcomes and genetic variants in order to unravel functional interactions. Unfortunately, most existing approaches to joint modeling are either too simplistic to be powerful or are impracticable for computational reasons. Inspired by Richardson and others (2010, Bayesian Statistics 9), we consider a sparse multivariate regression model that allows simultaneous selection of predictors and associated responses. As Markov chain Monte Carlo (MCMC) inference on such models can be prohibitively slow when the number of genetic variants exceeds a few thousand, we propose a variational inference approach which produces posterior information very close to that of MCMC inference, at a much reduced computational cost. Extensive numerical experiments show that our approach outperforms popular variable selection methods and tailored Bayesian procedures, dealing within hours with problems involving hundreds of thousands of genetic variants and tens to hundreds of clinical or molecular outcomes. © The Author 2017. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

  8. The probabilistic convolution tree: efficient exact Bayesian inference for faster LC-MS/MS protein inference.

    PubMed

    Serang, Oliver

    2014-01-01

    Exact Bayesian inference can sometimes be performed efficiently for special cases where a function has commutative and associative symmetry of its inputs (called "causal independence"). For this reason, it is desirable to exploit such symmetry on big data sets. Here we present a method to exploit a general form of this symmetry on probabilistic adder nodes by transforming those probabilistic adder nodes into a probabilistic convolution tree with which dynamic programming computes exact probabilities. A substantial speedup is demonstrated using an illustrative example that can arise when identifying splice forms with bottom-up mass spectrometry-based proteomics. On this example, even state-of-the-art exact inference algorithms require a runtime more than exponential in the number of splice forms considered. By using the probabilistic convolution tree, we reduce the runtime to O(k log(k)²) and the space to O(k log(k)), where k is the number of variables joined by an additive or cardinal operator. This approach, which can also be used with junction tree inference, is applicable to graphs with arbitrary dependency on counting variables or cardinalities and can be used on diverse problems and fields like forward error correcting codes, elemental decomposition, and spectral demixing. The approach also trivially generalizes to multiple dimensions.
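
    The adder-node idea is easy to caricature: the PMF of a sum of independent counts is the convolution of their PMFs, and merging pairwise in a balanced tree keeps intermediate vectors short. The sketch below shows only this forward pass with plain convolutions; the paper's O(k log(k)²) bound comes from FFT-based convolutions, and the full method also propagates messages back down the tree.

      import numpy as np

      def pmf_of_sum(pmfs):
          """PMF of a sum of independent counts via a balanced convolution tree."""
          layer = [np.asarray(p, dtype=float) for p in pmfs]
          while len(layer) > 1:
              nxt = []
              for i in range(0, len(layer) - 1, 2):
                  nxt.append(np.convolve(layer[i], layer[i + 1]))  # pairwise merge
              if len(layer) % 2:
                  nxt.append(layer[-1])        # odd leaf rises a level unchanged
              layer = nxt
          return layer[0]

      # Eight Bernoulli indicators (e.g. peptide-present variables), p = 0.3:
      bern = [np.array([0.7, 0.3])] * 8
      print(pmf_of_sum(bern))                  # matches Binomial(8, 0.3)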

  9. The Probabilistic Convolution Tree: Efficient Exact Bayesian Inference for Faster LC-MS/MS Protein Inference

    PubMed Central

    Serang, Oliver

    2014-01-01

    Exact Bayesian inference can sometimes be performed efficiently for special cases where a function has commutative and associative symmetry of its inputs (called “causal independence”). For this reason, it is desirable to exploit such symmetry on big data sets. Here we present a method to exploit a general form of this symmetry on probabilistic adder nodes by transforming those probabilistic adder nodes into a probabilistic convolution tree with which dynamic programming computes exact probabilities. A substantial speedup is demonstrated using an illustrative example that can arise when identifying splice forms with bottom-up mass spectrometry-based proteomics. On this example, even state-of-the-art exact inference algorithms require a runtime more than exponential in the number of splice forms considered. By using the probabilistic convolution tree, we reduce the runtime to O(k log(k)²) and the space to O(k log(k)), where k is the number of variables joined by an additive or cardinal operator. This approach, which can also be used with junction tree inference, is applicable to graphs with arbitrary dependency on counting variables or cardinalities and can be used on diverse problems and fields like forward error correcting codes, elemental decomposition, and spectral demixing. The approach also trivially generalizes to multiple dimensions. PMID:24626234

  10. SWPhylo - A Novel Tool for Phylogenomic Inferences by Comparison of Oligonucleotide Patterns and Integration of Genome-Based and Gene-Based Phylogenetic Trees.

    PubMed

    Yu, Xiaoyu; Reva, Oleg N

    2018-01-01

    Modern phylogenetic studies may benefit from the analysis of complete genome sequences of various microorganisms. Evolutionary inferences based on genome-scale analysis are believed to be more accurate than the gene-based alternative. However, the computational complexity of current phylogenomic procedures, the inappropriateness of standard phylogenetic tools for processing genome-wide data, and the lack of reliable substitution models suited to alignment-free phylogenomic approaches deter microbiologists from using these opportunities. For example, the super-matrix and super-tree approaches of phylogenomics use multiple integrated genomic loci or individual gene-based trees to infer an overall consensus tree. However, these approaches potentially multiply errors of gene annotation and sequence alignment, not to mention the computational complexity and laboriousness of the methods. In this article, we demonstrate that the annotation- and alignment-free comparison of genome-wide tetranucleotide frequencies, termed oligonucleotide usage patterns (OUPs), allowed a fast and reliable inference of phylogenetic trees. These were congruent to the corresponding whole genome super-matrix trees in terms of tree topology when compared with other known approaches including 16S ribosomal RNA and GyrA protein sequence comparison, complete genome-based MAUVE, and CVTree methods. A Web-based program to perform the alignment-free OUP-based phylogenomic inferences was implemented at http://swphylo.bi.up.ac.za/. Applicability of the tool was tested on different taxa from subspecies to intergeneric levels. Distinguishing between closely related taxonomic units may be enforced by providing the program with alignments of marker protein sequences, e.g., GyrA.
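
    A toy version of the OUP comparison: count overlapping tetranucleotides, normalise to frequencies, and compare sequences by a distance between the 256-dimensional frequency vectors. The Euclidean distance and random test sequence below are illustrative; the published method's normalisation and distance may differ.

      from itertools import product
      import numpy as np

      KMERS = ["".join(k) for k in product("ACGT", repeat=4)]   # all 256 4-mers
      INDEX = {k: i for i, k in enumerate(KMERS)}

      def oup(genome: str) -> np.ndarray:
          """Normalised tetranucleotide frequency vector of a sequence."""
          counts = np.zeros(len(KMERS))
          for i in range(len(genome) - 3):
              kmer = genome[i:i + 4]
              if kmer in INDEX:                 # skip windows with ambiguous bases
                  counts[INDEX[kmer]] += 1
          return counts / counts.sum()

      def oup_distance(g1: str, g2: str) -> float:
          return float(np.linalg.norm(oup(g1) - oup(g2)))   # Euclidean, for the toy

      rng = np.random.default_rng(4)
      g = "".join(rng.choice(list("ACGT"), size=5000))
      print(oup_distance(g, g), oup_distance(g, g[::-1]))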

  11. SWPhylo – A Novel Tool for Phylogenomic Inferences by Comparison of Oligonucleotide Patterns and Integration of Genome-Based and Gene-Based Phylogenetic Trees

    PubMed Central

    Yu, Xiaoyu; Reva, Oleg N

    2018-01-01

    Modern phylogenetic studies may benefit from the analysis of complete genome sequences of various microorganisms. Evolutionary inferences based on genome-scale analysis are believed to be more accurate than the gene-based alternative. However, the computational complexity of current phylogenomic procedures, the inappropriateness of standard phylogenetic tools for processing genome-wide data, and the lack of reliable substitution models suited to alignment-free phylogenomic approaches deter microbiologists from using these opportunities. For example, the super-matrix and super-tree approaches of phylogenomics use multiple integrated genomic loci or individual gene-based trees to infer an overall consensus tree. However, these approaches potentially multiply errors of gene annotation and sequence alignment, not to mention the computational complexity and laboriousness of the methods. In this article, we demonstrate that the annotation- and alignment-free comparison of genome-wide tetranucleotide frequencies, termed oligonucleotide usage patterns (OUPs), allowed a fast and reliable inference of phylogenetic trees. These were congruent to the corresponding whole genome super-matrix trees in terms of tree topology when compared with other known approaches including 16S ribosomal RNA and GyrA protein sequence comparison, complete genome-based MAUVE, and CVTree methods. A Web-based program to perform the alignment-free OUP-based phylogenomic inferences was implemented at http://swphylo.bi.up.ac.za/. Applicability of the tool was tested on different taxa from subspecies to intergeneric levels. Distinguishing between closely related taxonomic units may be enforced by providing the program with alignments of marker protein sequences, e.g., GyrA. PMID:29511354

  12. Adaptive inference for distinguishing credible from incredible patterns in nature

    USGS Publications Warehouse

    Holling, Crawford S.; Allen, Craig R.

    2002-01-01

    Strong inference is a powerful and rapid tool that can be used to identify and explain patterns in molecular biology, cell biology, and physiology. It is effective where causes are single and separable and where discrimination between pairwise alternative hypotheses can be determined experimentally by a simple yes or no answer. But causes in ecological systems are multiple and overlapping and are not entirely separable. Frequently, competing hypotheses cannot be distinguished by a single unambiguous test, but only by a suite of tests of different kinds that together produce a body of evidence to support one line of argument and not others. We call this process "adaptive inference". Instead of pitting the members of each pair of hypotheses against each other, adaptive inference relies on the exuberant invention of multiple, competing hypotheses, after which carefully structured comparative data are used to explore the logical consequences of each. Herein we present an example, developed out of a 30-year study of the resilience of ecosystems, that demonstrates the attributes of adaptive inference.

  13. Inferring gene ontologies from pairwise similarity data

    PubMed Central

    Kramer, Michael; Dutkowski, Janusz; Yu, Michael; Bafna, Vineet; Ideker, Trey

    2014-01-01

    Motivation: While the manually curated Gene Ontology (GO) is widely used, inferring a GO directly from -omics data is a compelling new problem. Recognizing that ontologies are a directed acyclic graph (DAG) of terms and hierarchical relations, algorithms are needed that: (i) analyze a full matrix of gene–gene pairwise similarities from -omics data; (ii) infer true hierarchical structure in these data rather than enforcing hierarchy as a computational artifact; and (iii) respect biological pleiotropy, by which a term in the hierarchy can relate to multiple higher-level terms. Methods addressing these requirements are just beginning to emerge—none has been evaluated for GO inference. Methods: We consider two algorithms [Clique Extracted Ontology (CliXO), LocalFitness] that uniquely satisfy these requirements, compared with methods including standard clustering. CliXO is a new approach that finds maximal cliques in a network induced by progressive thresholding of a similarity matrix. We evaluate each method’s ability to reconstruct the GO biological process ontology from a similarity matrix based on (a) semantic similarities for GO itself or (b) three -omics datasets for yeast. Results: For task (a) using semantic similarity, CliXO accurately reconstructs GO (>99% precision, recall) and outperforms other approaches (<20% precision, <20% recall). For task (b) using -omics data, CliXO outperforms other methods using two -omics datasets and achieves ∼30% precision and recall using YeastNet v3, similar to an earlier approach (Network Extracted Ontology) and better than LocalFitness or standard clustering (20–25% precision, recall). Conclusion: This study provides an algorithmic foundation for building gene ontologies by capturing hierarchical and pleiotropic structure embedded in biomolecular data. Contact: tideker@ucsd.edu PMID:24932003
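
    The core CliXO idea, progressively thresholding a similarity matrix and collecting maximal cliques as candidate terms, can be sketched in a few lines of Python. This is a simplified illustration, not the published implementation; the toy matrix and threshold grid are invented, and CliXO's noise-handling and ontology-assembly steps are omitted.

        import itertools

        import networkx as nx
        import numpy as np

        def clique_terms(S, thresholds):
            # Scan thresholds from strict to lenient; at each level, keep
            # gene pairs at least that similar and record maximal cliques
            # as candidate ontology terms (nested cliques emerge naturally).
            n = S.shape[0]
            terms = set()
            for t in sorted(thresholds, reverse=True):
                G = nx.Graph()
                G.add_nodes_from(range(n))
                for i, j in itertools.combinations(range(n), 2):
                    if S[i, j] >= t:
                        G.add_edge(i, j)
                terms.update(frozenset(c) for c in nx.find_cliques(G) if len(c) > 1)
            return terms

        # Toy symmetric gene-gene similarity matrix (values are hypothetical).
        S = np.array([[1.0, 0.9, 0.8, 0.1],
                      [0.9, 1.0, 0.7, 0.2],
                      [0.8, 0.7, 1.0, 0.1],
                      [0.1, 0.2, 0.1, 1.0]])
        print(clique_terms(S, thresholds=[0.25, 0.5, 0.75]))
        # -> {0,1} and {0,2} at the strict level, nested inside {0,1,2} lower down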

  14. Automating approximate Bayesian computation by local linear regression.

    PubMed

    Thornton, Kevin R

    2009-07-07

    In several biological contexts, parameter inference often relies on computationally intensive techniques. "Approximate Bayesian Computation", or ABC, methods based on summary statistics have become increasingly popular. A particular flavor of ABC, based on using a linear regression to approximate the posterior distribution of the parameters conditional on the summary statistics, is computationally appealing, yet no standalone tool exists to automate the procedure. Here, I describe a program to implement the method. The software package ABCreg implements the local linear-regression approach to ABC. The advantages are: 1. The code is standalone and fully documented. 2. The program will automatically process multiple data sets and create unique output files for each (which may be processed immediately in R), facilitating the testing of inference procedures on simulated data, or the analysis of multiple data sets. 3. The program implements two different transformation methods for the regression step. 4. Analysis options are controlled on the command line by the user, and the program is designed to output warnings for cases where the regression fails. 5. The program does not depend on any particular simulation machinery (coalescent, forward-time, etc.), and therefore is a general tool for processing the results from any simulation. 6. The code is open-source and modular. Examples of applying the software to empirical data from Drosophila melanogaster, and of testing the procedure on simulated data, are shown. In practice, ABCreg simplifies implementing ABC based on local linear regression.
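
    The regression step that ABCreg automates follows the Beaumont-style adjustment: accept the simulations whose summaries fall closest to the observed summaries, regress parameters on summaries locally, and shift the accepted draws toward the observed point. The Python sketch below is a generic illustration under that reading, not ABCreg itself; the Epanechnikov weights, acceptance fraction, and toy model are assumptions, and ABCreg's transformation options are omitted.

        import numpy as np

        def abc_loclinear(theta, stats, s_obs, accept_frac=0.1):
            # Rejection step: keep simulations whose summaries are closest
            # to the observed summaries.
            d = np.linalg.norm(stats - s_obs, axis=1)
            eps = np.quantile(d, accept_frac)
            keep = d <= eps
            # Epanechnikov kernel weights within the tolerance.
            w = 1.0 - (d[keep] / eps) ** 2
            # Weighted linear regression of parameters on centred summaries.
            X = np.column_stack([np.ones(keep.sum()), stats[keep] - s_obs])
            beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * theta[keep]))
            # Adjust accepted draws as if their summaries equalled s_obs.
            return theta[keep] - (stats[keep] - s_obs) @ beta[1:], w

        # Toy model: theta ~ N(0,1), summary = theta + N(0, 0.5^2) noise.
        rng = np.random.default_rng(0)
        theta = rng.normal(size=50_000)
        stats = (theta + rng.normal(scale=0.5, size=theta.size)).reshape(-1, 1)
        post, w = abc_loclinear(theta, stats, s_obs=np.array([1.0]))
        print(post.mean())  # near the analytic posterior mean, (1/0.25)/(1/0.25 + 1) = 0.8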

  15. An expert system shell for inferring vegetation characteristics: Implementation of additional techniques (task E)

    NASA Technical Reports Server (NTRS)

    Harrison, P. Ann

    1992-01-01

    The NASA VEGetation Workbench (VEG) is a knowledge based system that infers vegetation characteristics from reflectance data. The VEG subgoal PROPORTION.GROUND.COVER has been completed and a number of additional techniques that infer the proportion ground cover of a sample have been implemented. Some techniques operate on sample data at a single wavelength. The techniques previously incorporated in VEG for other subgoals operated on data at a single wavelength so implementing the additional single wavelength techniques required no changes to the structure of VEG. Two techniques which use data at multiple wavelengths to infer proportion ground cover were also implemented. This work involved modifying the structure of VEG so that multiple wavelength techniques could be incorporated. All the new techniques were tested using both the VEG 'Research Mode' and the 'Automatic Mode.'

  16. D-score: a search engine independent MD-score.

    PubMed

    Vaudel, Marc; Breiter, Daniela; Beck, Florian; Rahnenführer, Jörg; Martens, Lennart; Zahedi, René P

    2013-03-01

    While peptides carrying PTMs are routinely identified in gel-free MS, the localization of the PTMs onto the peptide sequences remains challenging. Search engine scores of secondary peptide matches have been used in different approaches in order to infer the quality of site inference, by penalizing the localization whenever the search engine similarly scored two candidate peptides with different site assignments. In the present work, we show how the estimation of posterior error probabilities for peptide candidates allows the estimation of a PTM score called the D-score, for multiple search engine studies. We demonstrate the applicability of this score to three popular search engines: Mascot, OMSSA, and X!Tandem, and evaluate its performance using an already published high resolution data set of synthetic phosphopeptides. For those peptides with phosphorylation site inference uncertainty, the number of spectrum matches with correctly localized phosphorylation increased by up to 25.7% when compared to using Mascot alone, although the actual increase depended on the fragmentation method used. Since this method relies only on search engine scores, it can be readily applied to the scoring of the localization of virtually any modification at no additional experimental or in silico cost. © 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
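
    Conceptually, the D-score contrasts the two best-scoring site assignments for the same peptide after converting search-engine scores into posterior error probabilities (PEPs). The Python sketch below assumes a plain -10*log10(PEP) transformation and a simple difference; the exact transformation in the published method may differ, so treat this only as the intuition.

        import math

        def d_score(pep_best, pep_second):
            # Put the PEPs of the two best site assignments on a log scale
            # and take the gap: a large gap means a confidently localized
            # site, a small one means the localization is ambiguous.
            s_best = -10.0 * math.log10(max(pep_best, 1e-100))
            s_second = -10.0 * math.log10(max(pep_second, 1e-100))
            return s_best - s_second

        print(d_score(1e-6, 1e-2))   # 40.0 -> confidently localized
        print(d_score(9e-4, 1e-3))   # ~0.5 -> near-tie, ambiguous site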

  17. Adult Age Differences in Categorization and Multiple-Cue Judgment

    ERIC Educational Resources Information Center

    Mata, Rui; von Helversen, Bettina; Karlsson, Linnea; Cupper, Lutz

    2012-01-01

    We often need to infer unknown properties of objects from observable ones, just like detectives must infer guilt from observable clues and behavior. But how do inferential processes change with age? We examined young and older adults' reliance on rule-based and similarity-based processes in an inference task that can be considered either a…

  18. Inference Generation during Text Comprehension by Adults with Right Hemisphere Brain Damage: Activation Failure Versus Multiple Activation.

    ERIC Educational Resources Information Center

    Tompkins, Connie A.; Fassbinder, Wiltrud; Blake, Margaret Lehman; Baumgaertner, Annette; Jayaram, Nandini

    2004-01-01

    Evidence conflicts as to whether adults with right hemisphere brain damage (RHD) generate inferences during text comprehension. M. Beeman (1993) reported that adults with RHD fail to activate the lexical-semantic bases of routine bridging inferences, which are necessary for comprehension. But other evidence indicates that adults…

  19. AlexSys: a knowledge-based expert system for multiple sequence alignment construction and analysis

    PubMed Central

    Aniba, Mohamed Radhouene; Poch, Olivier; Marchler-Bauer, Aron; Thompson, Julie Dawn

    2010-01-01

    Multiple sequence alignment (MSA) is a cornerstone of modern molecular biology and represents a unique means of investigating the patterns of conservation and diversity in complex biological systems. Many different algorithms have been developed to construct MSAs, but previous studies have shown that no single aligner consistently outperforms the rest. This has led to the development of a number of ‘meta-methods’ that systematically run several aligners and merge the output into one single solution. Although these methods generally produce more accurate alignments, they are inefficient because all the aligners need to be run first and the choice of the best solution is made a posteriori. Here, we describe the development of a new expert system, AlexSys, for the multiple alignment of protein sequences. AlexSys incorporates an intelligent inference engine to automatically select an appropriate aligner a priori, depending only on the nature of the input sequences. The inference engine was trained on a large set of reference multiple alignments, using a novel machine learning approach. Applying AlexSys to a test set of 178 alignments, we show that the expert system represents a good compromise between alignment quality and running time, making it suitable for high throughput projects. AlexSys is freely available from http://alnitak.u-strasbg.fr/∼aniba/alexsys. PMID:20530533

  20. Reliable inference of light curve parameters in the presence of systematics

    NASA Astrophysics Data System (ADS)

    Gibson, Neale P.

    2016-10-01

    Time-series photometry and spectroscopy of transiting exoplanets allow us to study their atmospheres. Unfortunately, the required precision to extract atmospheric information surpasses the design specifications of most general purpose instrumentation. This results in instrumental systematics in the light curves that are typically larger than the target precision. Systematics must therefore be modelled, leaving the inference of light-curve parameters conditioned on the subjective choice of systematics models and model-selection criteria. Here, I briefly review the use of systematics models commonly used for transmission and emission spectroscopy, including model selection, marginalisation over models, and stochastic processes. These form a hierarchy of models with increasing degree of objectivity. I argue that marginalisation over many systematics models is a minimal requirement for robust inference. Stochastic models provide even more flexibility and objectivity, and therefore produce the most reliable results. However, no systematics models are perfect, and the best strategy is to compare multiple methods and repeat observations where possible.

  1. A MAD-Bayes Algorithm for State-Space Inference and Clustering with Application to Querying Large Collections of ChIP-Seq Data Sets.

    PubMed

    Zuo, Chandler; Chen, Kailei; Keleş, Sündüz

    2017-06-01

    Current analytic approaches for querying large collections of chromatin immunoprecipitation followed by sequencing (ChIP-seq) data from multiple cell types rely on analyzing each data set (i.e., peak calling) independently. This approach discards the fact that functional elements are frequently shared among related cell types and leads to overestimation of the extent of divergence between different ChIP-seq samples. Methods geared toward multisample investigations have limited applicability in settings that aim to integrate 100s to 1000s of ChIP-seq data sets for query loci (e.g., thousands of genomic loci with a specific binding site). Recently, Zuo et al. developed a hierarchical framework for state-space matrix inference and clustering, named MBASIC, to enable joint analysis of user-specified loci across multiple ChIP-seq data sets. Although this versatile framework both estimates the underlying state-space (e.g., bound vs. unbound) and groups loci with similar patterns together, its Expectation-Maximization-based estimation structure hinders its applicability with large numbers of loci and samples. We address this limitation by developing a MAP-based asymptotic derivations from Bayes (MAD-Bayes) framework for MBASIC. This results in a K-means-like optimization algorithm that converges rapidly and hence enables exploring multiple initialization schemes and flexibility in tuning. Comparison with MBASIC indicates that this speed comes at a relatively insignificant loss in estimation accuracy. Although MAD-Bayes MBASIC is specifically designed for the analysis of user-specified loci, it is able to capture overall patterns of histone marks from multiple ChIP-seq data sets similar to those identified by genome-wide segmentation methods such as ChromHMM and Spectacle.

  2. Identification of water quality management policy of watershed system with multiple uncertain interactions using a multi-level-factorial risk-inference-based possibilistic-probabilistic programming approach.

    PubMed

    Liu, Jing; Li, Yongping; Huang, Guohe; Fu, Haiyan; Zhang, Junlong; Cheng, Guanhui

    2017-06-01

    In this study, a multi-level-factorial risk-inference-based possibilistic-probabilistic programming (MRPP) method is proposed for supporting water quality management under multiple uncertainties. The MRPP method can handle uncertainties expressed as fuzzy-random-boundary intervals, probability distributions, and interval numbers, and analyze the effects of uncertainties as well as their interactions on modeling outputs. It is applied to plan water quality management in the Xiangxihe watershed. Results reveal that a lower probability of satisfying the objective function (θ) as well as a higher probability of violating environmental constraints (q_i) would correspond to a higher system benefit with an increased risk of violating system feasibility. Chemical plants are the major contributors to biological oxygen demand (BOD) and total phosphorus (TP) discharges; total nitrogen (TN) would be mainly discharged by crop farming. It is also discovered that optimistic decision makers should pay more attention to the interactions between chemical plants and water supply, while decision makers who possess a risk-averse attitude would focus on the interactive effect of q_i and the benefit of water supply. The findings can help enhance the model's applicability and identify a suitable water quality management policy for environmental sustainability according to practical situations.

  3. Conflicting Epistemologies and Inference in Coupled Human and Natural Systems

    NASA Astrophysics Data System (ADS)

    Garcia, M. E.

    2017-12-01

    Last year, I presented a model that projects per capita water consumption based on changes in price, population, building codes, and water stress salience. This model applied methods from hydrological science and engineering to relationships both within and beyond their traditional scope. Epistemologically, the development of mathematical models of natural or engineered systems is objectivist, while research examining relationships between observations, perceptions and action is commonly constructivist or subjectivist. Drawing on multiple epistemologies is common in, and perhaps central to, the growing fields of coupled human and natural systems, and socio-hydrology. Critically, these philosophical perspectives vary in their view of the nature of the system as mechanistic, adaptive or constructed, and in the split between aleatory and epistemic uncertainty. Interdisciplinary research is commonly cited as a way to address the critical and domain-crossing challenge of sustainability, as synthesis across perspectives can offer a more comprehensive view of system dynamics. However, combining methods and concepts from multiple ontologies and epistemologies can introduce contradictions into the logic of inference. These contradictions challenge the evaluation of research products, and the implications for practical application of research findings are not fully understood. Reflections on the evaluation, application, and generalization of the water consumption model described above are used to ground these broader questions and offer thoughts on the way forward.

  4. Amino acid transporter expansions associated with the evolution of obligate endosymbiosis in sap-feeding insects (Hemiptera: sternorrhyncha).

    PubMed

    Dahan, Romain A; Duncan, Rebecca P; Wilson, Alex C C; Dávalos, Liliana M

    2015-03-25

    Mutualistic obligate endosymbioses shape the evolution of endosymbiont genomes, but their impact on host genomes remains unclear. Insects of the sub-order Sternorrhyncha (Hemiptera) depend on bacterial endosymbionts for essential amino acids present at low abundances in their phloem-based diet. This obligate dependency has been proposed to explain why multiple amino acid transporter genes are maintained in the genomes of the insect hosts. We implemented phylogenetic comparative methods to test whether amino acid transporters have proliferated in sternorrhynchan genomes at rates greater than expected by chance. By applying a series of methods to reconcile gene and species trees, inferring the size of gene families in ancestral lineages, and simulating the null process of birth and death in multi-gene families, we uncovered a 10-fold increase in duplication rate in the AAAP family of amino acid transporters within Sternorrhyncha. This gene family expansion was unmatched in other closely related clades lacking endosymbionts that provide essential amino acids. Our findings support the influence of obligate endosymbioses on host genome evolution by both inferring significant expansions of gene families involved in symbiotic interactions, and discovering increases in the rate of duplication associated with multiple emergences of obligate symbiosis in Sternorrhyncha.

  5. Dynamic motif occupancy (DynaMO) analysis identifies transcription factors and their binding sites driving dynamic biological processes.

    PubMed

    Kuang, Zheng; Ji, Zhicheng; Boeke, Jef D; Ji, Hongkai

    2018-01-09

    Biological processes are usually associated with genome-wide remodeling of transcription driven by transcription factors (TFs). Identifying key TFs and their spatiotemporal binding patterns are indispensable to understanding how dynamic processes are programmed. However, most methods are designed to predict TF binding sites only. We present a computational method, dynamic motif occupancy analysis (DynaMO), to infer important TFs and their spatiotemporal binding activities in dynamic biological processes using chromatin profiling data from multiple biological conditions such as time-course histone modification ChIP-seq data. In the first step, DynaMO predicts TF binding sites with a random forests approach. Next and uniquely, DynaMO infers dynamic TF binding activities at predicted binding sites using their local chromatin profiles from multiple biological conditions. Another landmark of DynaMO is to identify key TFs in a dynamic process using a clustering and enrichment analysis of dynamic TF binding patterns. Application of DynaMO to the yeast ultradian cycle, mouse circadian clock and human neural differentiation exhibits its accuracy and versatility. We anticipate DynaMO will be generally useful for elucidating transcriptional programs in dynamic processes. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  6. Multiple Linear Regression for Reconstruction of Gene Regulatory Networks in Solving Cascade Error Problems

    PubMed Central

    Zainudin, Suhaila; Arif, Shereena M.

    2017-01-01

    Gene regulatory network (GRN) reconstruction is the process of identifying regulatory gene interactions from experimental data through computational analysis. One of the main reasons for the reduced performance of previous GRN methods has been inaccurate prediction of cascade motifs. Cascade error is defined as the wrong prediction of cascade motifs, where an indirect interaction is misinterpreted as a direct interaction. Despite the active research on various GRN prediction methods, the discussion of specific methods to solve problems related to cascade errors is still lacking. In fact, the experiments conducted in past studies were not specifically geared towards proving the ability of GRN prediction methods to avoid the occurrence of cascade errors. Hence, this research proposes Multiple Linear Regression (MLR) to infer GRNs from gene expression data and to avoid wrongly inferring an indirect interaction (A → B → C) as a direct interaction (A → C). Since the number of observations in the real experimental datasets was far less than the number of predictors, some predictors were eliminated by extracting random subnetworks from global interaction networks via an established extraction method. In addition, the experiment was extended to assess the effectiveness of MLR in dealing with cascade error, using a novel experimental procedure proposed in this work. The experiment revealed that the number of cascade errors was minimal. Apart from that, the Belsley collinearity test showed that multicollinearity greatly affected the datasets used in this experiment. All the tested subnetworks obtained satisfactory results, with AUROC values above 0.5. PMID:28250767
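
    The intuition behind using multiple linear regression against cascade errors is that regressing a target gene on all candidate regulators jointly assigns the indirect ancestor a near-zero coefficient once the direct regulator is in the model. A minimal Python sketch under that assumption (toy data, ordinary least squares, no subnetwork extraction or collinearity testing):

        import numpy as np

        def mlr_regulators(expr, target):
            # Regress the target gene on all other genes jointly; with the
            # direct regulator included, the indirect ancestor's coefficient
            # shrinks toward zero, which is how the cascade error is avoided.
            y = expr[:, target]
            X = np.delete(expr, target, axis=1)
            X = np.column_stack([np.ones(len(y)), X])
            coef, *_ = np.linalg.lstsq(X, y, rcond=None)
            return coef[1:]  # drop the intercept

        # Toy cascade A -> B -> C.
        rng = np.random.default_rng(0)
        a = rng.normal(size=200)
        b = 2.0 * a + rng.normal(scale=0.1, size=200)
        c = 1.5 * b + rng.normal(scale=0.1, size=200)
        expr = np.column_stack([a, b, c])
        print(mlr_regulators(expr, target=2))  # weight on B ~1.5; direct A -> C weight ~0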

  7. Algorithmic methods to infer the evolutionary trajectories in cancer progression

    PubMed Central

    Graudenzi, Alex; Ramazzotti, Daniele; Sanz-Pamplona, Rebeca; De Sano, Luca; Mauri, Giancarlo; Moreno, Victor; Antoniotti, Marco; Mishra, Bud

    2016-01-01

    The genomic evolution inherent to cancer relates directly to a renewed focus on the voluminous next-generation sequencing data and machine learning for the inference of explanatory models of how the (epi)genomic events are choreographed in cancer initiation and development. However, despite the increasing availability of multiple additional -omics data, this quest has been frustrated by various theoretical and technical hurdles, mostly stemming from the dramatic heterogeneity of the disease. In this paper, we build on our recent work on the “selective advantage” relation among driver mutations in cancer progression and investigate its applicability to the modeling problem at the population level. Here, we introduce PiCnIc (Pipeline for Cancer Inference), a versatile, modular, and customizable pipeline to extract ensemble-level progression models from cross-sectional sequenced cancer genomes. The pipeline has many translational implications because it combines state-of-the-art techniques for sample stratification, driver selection, identification of fitness-equivalent exclusive alterations, and progression model inference. We demonstrate PiCnIc’s ability to reproduce much of the current knowledge on colorectal cancer progression as well as to suggest novel experimentally verifiable hypotheses. PMID:27357673

  8. Gene network inference by fusing data from diverse distributions

    PubMed Central

    Žitnik, Marinka; Zupan, Blaž

    2015-01-01

    Motivation: Markov networks are undirected graphical models that are widely used to infer relations between genes from experimental data. Their state-of-the-art inference procedures assume the data arise from a Gaussian distribution. High-throughput omics data, such as that from next generation sequencing, often violates this assumption. Furthermore, when collected data arise from multiple related but otherwise nonidentical distributions, their underlying networks are likely to have common features. New principled statistical approaches are needed that can deal with different data distributions and jointly consider collections of datasets. Results: We present FuseNet, a Markov network formulation that infers networks from a collection of nonidentically distributed datasets. Our approach is computationally efficient and general: given any number of distributions from an exponential family, FuseNet represents model parameters through shared latent factors that define neighborhoods of network nodes. In a simulation study, we demonstrate good predictive performance of FuseNet in comparison to several popular graphical models. We show its effectiveness in an application to breast cancer RNA-sequencing and somatic mutation data, a novel application of graphical models. Fusion of datasets offers substantial gains relative to inference of separate networks for each dataset. Our results demonstrate that network inference methods for non-Gaussian data can help in accurate modeling of the data generated by emergent high-throughput technologies. Availability and implementation: Source code is at https://github.com/marinkaz/fusenet. Contact: blaz.zupan@fri.uni-lj.si Supplementary information: Supplementary information is available at Bioinformatics online. PMID:26072487

  9. Terrain classification in navigation of an autonomous mobile robot

    NASA Astrophysics Data System (ADS)

    Dodds, David R.

    1991-03-01

    In this paper we describe a method of path planning that integrates terrain classification (by means of fractals), the certainty-grid method of spatial representation, Kehtarnavaz–Griswold collision zones, Dubois–Prade fuzzy temporal and spatial knowledge, and non-point-sized qualitative navigational planning. An initially planned ("end-to-end") path is piece-wise modified to accommodate known and inferred moving obstacles, and includes attention to time-varying multiple subgoals which may influence a section of path at a time after the robot has begun traversing that planned path.

  10. A Bayesian joint probability modeling approach for seasonal forecasting of streamflows at multiple sites

    NASA Astrophysics Data System (ADS)

    Wang, Q. J.; Robertson, D. E.; Chiew, F. H. S.

    2009-05-01

    Seasonal forecasting of streamflows can be highly valuable for water resources management. In this paper, a Bayesian joint probability (BJP) modeling approach for seasonal forecasting of streamflows at multiple sites is presented. A Box-Cox transformed multivariate normal distribution is proposed to model the joint distribution of future streamflows and their predictors, such as antecedent streamflows, El Niño-Southern Oscillation indices, and other climate indicators. Bayesian inference of model parameters and uncertainties is implemented using Markov chain Monte Carlo sampling, leading to joint probabilistic forecasts of streamflows at multiple sites. The model provides a parametric structure for quantifying relationships between variables, including intersite correlations. The Box-Cox transformed multivariate normal distribution has considerable flexibility for modeling a wide range of predictors and predictands. The Bayesian inference formulated allows the use of data that contain nonconcurrent and missing records. The model flexibility and data-handling ability mean that the BJP modeling approach is potentially of wide practical application. The paper also presents a number of statistical measures and graphical methods for verification of probabilistic forecasts of continuous variables. Results for streamflows at three river gauges in the Murrumbidgee River catchment in southeast Australia show that the BJP modeling approach has good forecast quality and that the fitted model is consistent with observed data.
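
    As a deliberately reduced caricature of the BJP idea, the Python sketch below fits a Box-Cox transformed joint normal to one predictor and one streamflow series, then forecasts by conditioning the bivariate normal. It is a single-site, point-estimate stand-in: the actual approach is multivariate, handles missing records, and infers all parameters by MCMC. The data and function name are invented.

        import numpy as np
        from scipy import stats
        from scipy.special import inv_boxcox

        def bjp_like_forecast(x_hist, y_hist, x_new, quantiles=(0.1, 0.5, 0.9)):
            # Box-Cox transform predictor and predictand toward normality.
            xt, lam_x = stats.boxcox(x_hist)
            yt, lam_y = stats.boxcox(y_hist)
            Z = np.column_stack([xt, yt])
            mu, cov = Z.mean(axis=0), np.cov(Z, rowvar=False)
            # Condition the bivariate normal on the new predictor value.
            x_t = stats.boxcox(np.array([x_new]), lmbda=lam_x)[0]
            m = mu[1] + cov[1, 0] / cov[0, 0] * (x_t - mu[0])
            v = cov[1, 1] - cov[1, 0] ** 2 / cov[0, 0]
            # Back-transform forecast quantiles to flow units.
            q = stats.norm.ppf(quantiles, loc=m, scale=np.sqrt(v))
            return inv_boxcox(q, lam_y)

        # Hypothetical antecedent flows and forecast-season flows.
        x = np.array([120., 80., 200., 150., 90., 60., 170., 140.])
        y = np.array([300., 150., 420., 330., 180., 100., 390., 310.])
        print(bjp_like_forecast(x, y, x_new=130.0))  # 10/50/90% forecast quantiles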

  11. Bayesian Reconstruction of Disease Outbreaks by Combining Epidemiologic and Genomic Data

    PubMed Central

    Jombart, Thibaut; Cori, Anne; Didelot, Xavier; Cauchemez, Simon; Fraser, Christophe; Ferguson, Neil

    2014-01-01

    Recent years have seen progress in the development of statistically rigorous frameworks to infer outbreak transmission trees (“who infected whom”) from epidemiological and genetic data. Making use of pathogen genome sequences in such analyses remains a challenge, however, with a variety of heuristic approaches having been explored to date. We introduce a statistical method exploiting both pathogen sequences and collection dates to unravel the dynamics of densely sampled outbreaks. Our approach identifies likely transmission events and infers dates of infections, unobserved cases and separate introductions of the disease. It also proves useful for inferring numbers of secondary infections and identifying heterogeneous infectivity and super-spreaders. After testing our approach using simulations, we illustrate the method with the analysis of the beginning of the 2003 Singaporean outbreak of Severe Acute Respiratory Syndrome (SARS), providing new insights into the early stage of this epidemic. Our approach is the first tool for disease outbreak reconstruction from genetic data widely available as free software, the R package outbreaker. It is applicable to various densely sampled epidemics, and improves previous approaches by detecting unobserved and imported cases, as well as allowing multiple introductions of the pathogen. Because of its generality, we believe this method will become a tool of choice for the analysis of densely sampled disease outbreaks, and will form a rigorous framework for subsequent methodological developments. PMID:24465202

  12. The effect of reading assignments in guided inquiry learning on students’ critical thinking skills

    NASA Astrophysics Data System (ADS)

    Syarkowi, A.

    2018-05-01

    The purpose of this study was to determine the effect of reading assignments in guided inquiry learning on senior high school students’ critical thinking skills. The study used a quasi-experimental design with the reading task as the treatment. The topic of the inquiry process was Kirchhoff’s law. The instrument used was 25 multiple-choice interpretive exercises with justification. The multiple-choice test was divided into three categories: basic clarification, the bases for a decision, and inference skills. A significance test showed that the improvement in the critical thinking skills of students in the experiment class was significantly higher than in the control class, so it can be concluded that reading assignments can improve students’ critical thinking skills.

  13. Probabilistic Low-Rank Multitask Learning.

    PubMed

    Kong, Yu; Shao, Ming; Li, Kang; Fu, Yun

    2018-03-01

    In this paper, we consider the problem of learning multiple related tasks simultaneously with the goal of improving the generalization performance of individual tasks. The key challenge is to effectively exploit the shared information across multiple tasks as well as preserve the discriminative information for each individual task. To address this, we propose a novel probabilistic model for multitask learning (MTL) that can automatically balance between low-rank and sparsity constraints. The former assumes a low-rank structure of the underlying predictive hypothesis space to explicitly capture the relationship of different tasks and the latter learns the incoherent sparse patterns private to each task. We derive and perform inference via variational Bayesian methods. Experimental results on both regression and classification tasks on real-world applications demonstrate the effectiveness of the proposed method in dealing with the MTL problems.

  14. Automatic face naming by learning discriminative affinity matrices from weakly labeled images.

    PubMed

    Xiao, Shijie; Xu, Dong; Wu, Jianxin

    2015-10-01

    Given a collection of images, where each image contains several faces and is associated with a few names in the corresponding caption, the goal of face naming is to infer the correct name for each face. In this paper, we propose two new methods to effectively solve this problem by learning two discriminative affinity matrices from these weakly labeled images. We first propose a new method called regularized low-rank representation by effectively utilizing weakly supervised information to learn a low-rank reconstruction coefficient matrix while exploring multiple subspace structures of the data. Specifically, by introducing a specially designed regularizer to the low-rank representation method, we penalize the corresponding reconstruction coefficients related to the situations where a face is reconstructed by using face images from other subjects or by using itself. With the inferred reconstruction coefficient matrix, a discriminative affinity matrix can be obtained. Moreover, we also develop a new distance metric learning method called ambiguously supervised structural metric learning by using weakly supervised information to seek a discriminative distance metric. Hence, another discriminative affinity matrix can be obtained using the similarity matrix (i.e., the kernel matrix) based on the Mahalanobis distances of the data. Observing that these two affinity matrices contain complementary information, we further combine them to obtain a fused affinity matrix, based on which we develop a new iterative scheme to infer the name of each face. Comprehensive experiments demonstrate the effectiveness of our approach.

  15. What impact do assumptions about missing data have on conclusions? A practical sensitivity analysis for a cancer survival registry.

    PubMed

    Smuk, M; Carpenter, J R; Morris, T P

    2017-02-06

    Within epidemiological and clinical research, missing data are a common issue and are often overlooked in publications. When the issue of missing observations is addressed, it is usually assumed that the missing data are 'missing at random' (MAR). This assumption should be checked for plausibility; however, it is untestable, so inferences should be assessed for robustness to departures from missing at random. We highlight the method of pattern-mixture sensitivity analysis after multiple imputation, using colorectal cancer data as an example. We focus on the Dukes' stage variable, which has the highest proportion of missing observations. First, we find the probability of being in each Dukes' stage given the MAR imputed dataset. We use these probabilities in a questionnaire to elicit prior beliefs from experts on what they believe the probability would be in the missing data. The questionnaire responses are then used in a Dirichlet draw to create a Bayesian 'missing not at random' (MNAR) prior to impute the missing observations. The model of interest is applied and inferences are compared to those from the MAR imputed data. The inferences were largely insensitive to departures from MAR. Inferences under MNAR suggested a smaller association between Dukes' stage and death, though the association remained positive and with similarly low p values. We conclude by discussing the positives and negatives of our method and highlight the importance of making people aware of the need to test the MAR assumption.
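
    The Dirichlet-based MNAR imputation step can be sketched directly: elicited expert beliefs become Dirichlet pseudo-counts, each imputed dataset draws its own stage distribution from that prior, and missing stages are sampled from the draw. The pseudo-counts below are invented for illustration, not the paper's elicited values.

        import numpy as np

        rng = np.random.default_rng(42)

        # Elicited beliefs about Dukes' stage (A-D) among the *missing*
        # records, expressed as Dirichlet pseudo-counts (illustrative only).
        prior_counts = np.array([5.0, 15.0, 20.0, 10.0])

        def impute_mnar(n_missing, prior_counts, n_imputations=5):
            datasets = []
            for _ in range(n_imputations):
                p = rng.dirichlet(prior_counts)                       # one draw of stage probabilities
                datasets.append(rng.choice(4, size=n_missing, p=p))   # imputed stages, 0=A .. 3=D
            return datasets

        for stages in impute_mnar(10, prior_counts):
            print(stages)  # fit the analysis model to each imputed set, then pool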

  16. Boosting Probabilistic Graphical Model Inference by Incorporating Prior Knowledge from Multiple Sources

    PubMed Central

    Praveen, Paurush; Fröhlich, Holger

    2013-01-01

    Inferring regulatory networks from experimental data via probabilistic graphical models is a popular framework to gain insights into biological systems. However, the inherent noise in experimental data coupled with a limited sample size reduces the performance of network reverse engineering. Prior knowledge from existing sources of biological information can address this low signal to noise problem by biasing the network inference towards biologically plausible network structures. Although integrating various sources of information is desirable, their heterogeneous nature makes this task challenging. We propose two computational methods to incorporate various information sources into a probabilistic consensus structure prior to be used in graphical model inference. Our first model, called Latent Factor Model (LFM), assumes a high degree of correlation among external information sources and reconstructs a hidden variable as a common source in a Bayesian manner. The second model, a Noisy-OR, picks up the strongest support for an interaction among information sources in a probabilistic fashion. Our extensive computational studies on KEGG signaling pathways as well as on gene expression data from breast cancer and yeast heat shock response reveal that both approaches can significantly enhance the reconstruction accuracy of Bayesian Networks compared to other competing methods as well as to the situation without any prior. Our framework allows for using diverse information sources, like pathway databases, GO terms and protein domain data, etc. and is flexible enough to integrate new sources, if available. PMID:23826291
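
    The Noisy-OR combination has a compact closed form: an edge receives high prior probability unless every supporting source independently fails, i.e. P(edge) = 1 - prod_s (1 - r_s x_s). A minimal Python sketch, with invented source reliabilities standing in for the paper's learned quantities:

        import numpy as np

        def noisy_or_prior(reported, reliability):
            # reported:    did source s report the interaction? (0/1)
            # reliability: probability that a report from source s is real
            # The edge is plausible unless every supporting source fails.
            reported = np.asarray(reported, dtype=float)
            reliability = np.asarray(reliability, dtype=float)
            return 1.0 - np.prod(1.0 - reliability * reported)

        # Hypothetical: a pathway database and GO report the interaction,
        # protein-domain data do not.
        print(noisy_or_prior([1, 1, 0], [0.9, 0.6, 0.7]))  # -> 0.96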

  17. DES Y1 Results: Validating Cosmological Parameter Estimation Using Simulated Dark Energy Surveys

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    MacCrann, N.; et al.

    We use mock galaxy survey simulations designed to resemble the Dark Energy Survey Year 1 (DES Y1) data to validate and inform cosmological parameter estimation. When similar analysis tools are applied to both simulations and real survey data, they provide powerful validation tests of the DES Y1 cosmological analyses presented in companion papers. We use two suites of galaxy simulations produced using different methods, which therefore provide independent tests of our cosmological parameter inference. The cosmological analysis we aim to validate is presented in DES Collaboration et al. (2017) and uses angular two-point correlation functions of galaxy number counts and weak lensing shear, as well as their cross-correlation, in multiple redshift bins. While our constraints depend on the specific set of simulated realizations available, for both suites of simulations we find that the input cosmology is consistent with the combined constraints from multiple simulated DES Y1 realizations in the $\Omega_m$–$\sigma_8$ plane. For one of the suites, we are able to show with high confidence that any biases in the inferred $S_8 = \sigma_8(\Omega_m/0.3)^{0.5}$ and $\Omega_m$ are smaller than the DES Y1 $1\sigma$ uncertainties. For the other suite, for which we have fewer realizations, we are unable to be this conclusive; we infer a roughly 70% probability that systematic biases in the recovered $\Omega_m$ and $S_8$ are sub-dominant to the DES Y1 uncertainty. As cosmological analyses of this kind become increasingly more precise, validation of parameter inference using survey simulations will be essential to demonstrate robustness.

  18. Recapitulating phylogenies using k-mers: from trees to networks.

    PubMed

    Bernard, Guillaume; Ragan, Mark A; Chan, Cheong Xin

    2016-01-01

    Ernst Haeckel based his landmark Tree of Life on the supposed ontogenic recapitulation of phylogeny, i.e. that successive embryonic stages during the development of an organism re-trace the morphological forms of its ancestors over the course of evolution. Much of this idea has since been discredited. Today, phylogenies are often based on families of molecular sequences. The standard approach starts with a multiple sequence alignment, in which the sequences are arranged relative to each other in a way that maximises a measure of similarity position-by-position along their entire length. A tree (or sometimes a network) is then inferred. Rigorous multiple sequence alignment is computationally demanding, and evolutionary processes that shape the genomes of many microbes (bacteria, archaea and some morphologically simple eukaryotes) can add further complications. In particular, recombination, genome rearrangement and lateral genetic transfer undermine the assumptions that underlie multiple sequence alignment, and imply that a tree-like structure may be too simplistic. Here, using 143 bacterial and archaeal genome sequences, we construct a network of phylogenetic relatedness based on the number of shared k-mers (subsequences of fixed length k). Our findings suggest that the network captures not only key aspects of microbial genome evolution as inferred from a tree, but also features that are not treelike. The method is highly scalable, allowing for investigation of genome evolution across a large number of genomes. Instead of using specific regions or sequences from genome sequences, or indeed Haeckel's idea of ontogeny, we argue that genome phylogenies can be inferred using k-mers from whole-genome sequences. Representing these networks dynamically allows biological questions of interest to be formulated and addressed quickly and in a visually intuitive manner.
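
    A toy version of the shared k-mer network is easy to write down in Python: extract the distinct k-mer set of each genome, weight each pair by the size of the intersection, and keep sufficiently heavy pairs as edges. The short strings and k=5 below are purely for illustration; the study works with whole genomes, larger k, and normalised distances rather than raw counts.

        from itertools import combinations

        def kmer_set(seq, k):
            # All distinct k-mers of a sequence (whole genomes in practice).
            return {seq[i:i + k] for i in range(len(seq) - k + 1)}

        def shared_kmer_edges(genomes, k=5, min_shared=2):
            # Weight each genome pair by its number of shared k-mers and
            # keep sufficiently heavy pairs as network edges.
            sets = {name: kmer_set(seq, k) for name, seq in genomes.items()}
            return {(a, b): len(sets[a] & sets[b])
                    for a, b in combinations(sets, 2)
                    if len(sets[a] & sets[b]) >= min_shared}

        genomes = {"g1": "ACGTACGTGGAA", "g2": "ACGTACGTCCTT", "g3": "TTTTGGGGCCCC"}
        print(shared_kmer_edges(genomes))  # g1-g2 share 4 k-mers; g3 stays isolated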

  19. A ranking method for the concurrent learning of compounds with various activity profiles.

    PubMed

    Dörr, Alexander; Rosenbaum, Lars; Zell, Andreas

    2015-01-01

    In this study, we present an SVM-based ranking algorithm for the concurrent learning of compounds with different activity profiles and their varying prioritization. To this end, a specific labeling of each compound was elaborated in order to infer virtual screening models against multiple targets. We compared the method with several state-of-the-art SVM classification techniques that are capable of inferring multi-target screening models on three chemical data sets (cytochrome P450s, dehydrogenases, and a trypsin-like protease data set) containing three different biological targets each. The experiments show that ranking-based algorithms yield increased performance for single- and multi-target virtual screening. Moreover, compounds that do not completely fulfill the desired activity profile are still ranked higher than decoys or compounds with an entirely undesired profile, compared to other multi-target SVM methods. SVM-based ranking methods constitute a valuable approach for virtual screening in multi-target drug design. The utilization of such methods is most helpful when dealing with compounds with various activity profiles and when finding many ligands with an already perfectly matching activity profile is not to be expected.

  20. Methodology for the inference of gene function from phenotype data.

    PubMed

    Ascensao, Joao A; Dolan, Mary E; Hill, David P; Blake, Judith A

    2014-12-12

    Biomedical ontologies are increasingly instrumental in the advancement of biological research primarily through their use to efficiently consolidate large amounts of data into structured, accessible sets. However, ontology development and usage can be hampered by the segregation of knowledge by domain that occurs due to independent development and use of the ontologies. The ability to infer data associated with one ontology to data associated with another ontology would prove useful in expanding information content and scope. We here focus on relating two ontologies: the Gene Ontology (GO), which encodes canonical gene function, and the Mammalian Phenotype Ontology (MP), which describes non-canonical phenotypes, using statistical methods to suggest GO functional annotations from existing MP phenotype annotations. This work is in contrast to previous studies that have focused on inferring gene function from phenotype primarily through lexical or semantic similarity measures. We have designed and tested a set of algorithms that represents a novel methodology to define rules for predicting gene function by examining the emergent structure and relationships between the gene functions and phenotypes rather than inspecting the terms semantically. The algorithms inspect relationships among multiple phenotype terms to deduce if there are cases where they all arise from a single gene function. We apply this methodology to data about genes in the laboratory mouse that are formally represented in the Mouse Genome Informatics (MGI) resource. From the data, 7444 rule instances were generated from five generalized rules, resulting in 4818 unique GO functional predictions for 1796 genes. We show that our method is capable of inferring high-quality functional annotations from curated phenotype data. As well as creating inferred annotations, our method has the potential to allow for the elucidation of unforeseen, biologically significant associations between gene function and phenotypes that would be overlooked by a semantics-based approach. Future work will include the implementation of the described algorithms for a variety of other model organism databases, taking full advantage of the abundance of available high quality curated data.

  1. An integrated data model to estimate spatiotemporal occupancy, abundance, and colonization dynamics

    USGS Publications Warehouse

    Williams, Perry J.; Hooten, Mevin B.; Womble, Jamie N.; Esslinger, George G.; Bower, Michael R.; Hefley, Trevor J.

    2017-01-01

    Ecological invasions and colonizations occur dynamically through space and time. Estimating the distribution and abundance of colonizing species is critical for efficient management or conservation. We describe a statistical framework for simultaneously estimating spatiotemporal occupancy and abundance dynamics of a colonizing species. Our method accounts for several issues that are common when modeling spatiotemporal ecological data including multiple levels of detection probability, multiple data sources, and computational limitations that occur when making fine-scale inference over a large spatiotemporal domain. We apply the model to estimate the colonization dynamics of sea otters (Enhydra lutris) in Glacier Bay, in southeastern Alaska.

  2. Inferring HIV Escape Rates from Multi-Locus Genotype Data

    DOE PAGES

    Kessinger, Taylor A.; Perelson, Alan S.; Neher, Richard A.

    2013-09-03

    Cytotoxic T-lymphocytes (CTLs) recognize viral protein fragments displayed by major histocompatibility complex molecules on the surface of virally infected cells and generate an anti-viral response that can kill the infected cells. Virus variants whose protein fragments are not efficiently presented on infected cells or whose fragments are presented but not recognized by CTLs therefore have a competitive advantage and spread rapidly through the population. We present a method that allows a more robust estimation of these escape rates from serially sampled sequence data. The proposed method accounts for competition between multiple escapes by explicitly modeling the accumulation of escape mutations and the stochastic effects of rare multiple mutants. Applying our method to serially sampled HIV sequence data, we estimate rates of HIV escape that are substantially larger than those previously reported. The method can be extended to complex escapes that require compensatory mutations. We expect our method to be applicable in other contexts such as cancer evolution where time series data is also available.

  3. A Statistical Method for Synthesizing Mediation Analyses Using the Product of Coefficient Approach Across Multiple Trials

    PubMed Central

    Huang, Shi; MacKinnon, David P.; Perrino, Tatiana; Gallo, Carlos; Cruden, Gracelyn; Brown, C Hendricks

    2016-01-01

    Mediation analysis often requires larger sample sizes than main effect analysis to achieve the same statistical power. Combining results across similar trials may be the only practical option for increasing statistical power for mediation analysis in some situations. In this paper, we propose a method to estimate: 1) marginal means for mediation path a, the relation of the independent variable to the mediator; 2) marginal means for path b, the relation of the mediator to the outcome, across multiple trials; and 3) the between-trial level variance-covariance matrix based on a bivariate normal distribution. We present the statistical theory and an R computer program to combine regression coefficients from multiple trials to estimate a combined mediated effect and confidence interval under a random effects model. Values of coefficients a and b, along with their standard errors from each trial are the input for the method. This marginal likelihood based approach with Monte Carlo confidence intervals provides more accurate inference than the standard meta-analytic approach. We discuss computational issues, apply the method to two real-data examples and make recommendations for the use of the method in different settings. PMID:28239330
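
    The mechanics of the combination step can be illustrated with a simplified fixed-effect version in Python: pool the per-trial a and b coefficients by inverse-variance weighting, then form a Monte Carlo confidence interval for the product. The numbers below are invented, and the paper's actual estimator is a random-effects model with a between-trial variance-covariance matrix, so this is only the skeleton of the idea.

        import numpy as np

        rng = np.random.default_rng(1)

        # Per-trial path coefficients and standard errors (invented numbers).
        a, se_a = np.array([0.30, 0.42, 0.25]), np.array([0.08, 0.10, 0.07])
        b, se_b = np.array([0.50, 0.38, 0.55]), np.array([0.12, 0.09, 0.11])

        # Inverse-variance pooled marginal means for paths a and b.
        wa, wb = 1 / se_a**2, 1 / se_b**2
        a_bar, b_bar = (wa * a).sum() / wa.sum(), (wb * b).sum() / wb.sum()
        se_ab, se_bb = np.sqrt(1 / wa.sum()), np.sqrt(1 / wb.sum())

        # Monte Carlo confidence interval for the mediated effect a*b.
        draws = rng.normal(a_bar, se_ab, 100_000) * rng.normal(b_bar, se_bb, 100_000)
        print(a_bar * b_bar, np.percentile(draws, [2.5, 97.5]))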

  4. Driver fatigue detection through multiple entropy fusion analysis in an EEG-based system

    PubMed Central

    Min, Jianliang; Wang, Ping

    2017-01-01

    Driver fatigue is an important contributor to road accidents, and fatigue detection has major implications for transportation safety. The aim of this research is to analyze the multiple entropy fusion method and evaluate several channel regions to effectively detect a driver's fatigue state based on electroencephalogram (EEG) records. First, we fused multiple entropies, i.e., spectral entropy, approximate entropy, sample entropy and fuzzy entropy, as features compared with autoregressive (AR) modeling by four classifiers. Second, we captured four significant channel regions according to weight-based electrodes via a simplified channel selection method. Finally, the evaluation model for detecting driver fatigue was established with four classifiers based on the EEG data from four channel regions. Twelve healthy subjects performed continuous simulated driving for 1–2 hours with EEG monitoring on a static simulator. The leave-one-out cross-validation approach obtained an accuracy of 98.3%, a sensitivity of 98.3% and a specificity of 98.2%. The experimental results verified the effectiveness of the proposed method, indicating that the multiple entropy fusion features are significant factors for inferring the fatigue state of a driver. PMID:29220351

  5. The Probability of a Gene Tree Topology within a Phylogenetic Network with Applications to Hybridization Detection

    PubMed Central

    Yu, Yun; Degnan, James H.; Nakhleh, Luay

    2012-01-01

    Gene tree topologies have proven a powerful data source for various tasks, including species tree inference and species delimitation. Consequently, methods for computing probabilities of gene trees within species trees have been developed and widely used in probabilistic inference frameworks. All these methods assume an underlying multispecies coalescent model. However, when reticulate evolutionary events such as hybridization occur, these methods are inadequate, as they do not account for such events. Methods that account for both hybridization and deep coalescence in computing the probability of a gene tree topology currently exist for very limited cases. However, no such methods exist for general cases, owing primarily to the fact that it is currently unknown how to compute the probability of a gene tree topology within the branches of a phylogenetic network. Here we present a novel method for computing the probability of gene tree topologies on phylogenetic networks and demonstrate its application to the inference of hybridization in the presence of incomplete lineage sorting. We reanalyze a Saccharomyces species data set for which multiple analyses had converged on a species tree candidate. Using our method, though, we show that an evolutionary hypothesis involving hybridization in this group has better support than one of strict divergence. A similar reanalysis on a group of three Drosophila species shows that the data is consistent with hybridization. Further, using extensive simulation studies, we demonstrate the power of gene tree topologies at obtaining accurate estimates of branch lengths and hybridization probabilities of a given phylogenetic network. Finally, we discuss identifiability issues with detecting hybridization, particularly in cases that involve extinction or incomplete sampling of taxa. PMID:22536161

  6. Modeling the Effect of Grain Size Mixing on Thermal Inertia Values Derived from Diurnal and Seasonal THEMIS Measurements

    NASA Astrophysics Data System (ADS)

    McCarty, C.; Moersch, J.

    2017-12-01

    Sedimentary processes have slowed over Mars' geologic history. Analysis of the surface today can provide insight into the processes that may have affected it over its history. Sub-resolved checkerboard mixtures of materials with different thermal inertias (and therefore different grain sizes) can lead to differences in thermal inertia values inferred from night and day radiance observations. Information about the grain size distribution of a surface can help determine the degree of sorting it has experienced, or its geologic maturity. Standard methods for deriving thermal inertia from measurements made with THEMIS can give values for the same location that vary by as much as 20% between scenes. Such methods make the assumption that each THEMIS pixel contains material that has uniform thermophysical properties. Here we propose that if a mixture of small and large particles is present within a pixel, the inferred thermal inertia will be strongly dominated by whichever particle is warmer at the time of the measurement because the power radiated by a surface is proportional (by the Stefan-Boltzmann law) to the fourth power of its temperature. This effect will result in a change in thermal inertia values inferred from measurements taken at different times of day and night. Therefore, we expect to see correlation between the magnitude of diurnal variations in inferred thermal inertia values and the degree of grain size mixing for a given pixel location. Preliminary work has shown that the magnitude of such diurnal variation in inferred thermal inertias is sufficient to detect geologically useful differences in grain size distributions. We hypothesize that at least some of the 20% variability in thermal inertias inferred from multiple scenes for a given location could be attributed to sub-pixel grain size mixing rather than uncertainty inherent to the experiment, as previously thought. Mapping the difference in inferred thermal inertias from day and night THEMIS observations may prove to be a new way of distinguishing surfaces that have relatively uniform grain sizes from those that have mixed grain sizes. Assessing the effects of different geologic processes can be aided by noting variations in grain size distributions, so this method may be useful as a new way to extract geologic interpretations from the THEMIS thermal data set.
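
    The fourth-power dominance argument is easy to verify numerically: for a checkerboard pixel, the radiance-weighted effective temperature is the quartic mean of the component temperatures, so it sits above the arithmetic mean by an amount that differs between day and night. The temperatures and mixing fraction in this Python sketch are illustrative, not derived from THEMIS data.

        SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W m^-2 K^-4

        def mixed_pixel_temperature(f_fine, t_fine, t_coarse):
            # Pixel-integrated radiance is the area-weighted sum of sigma*T^4
            # terms, so the warmer component dominates the inferred temperature.
            radiance = f_fine * SIGMA * t_fine**4 + (1 - f_fine) * SIGMA * t_coarse**4
            return (radiance / SIGMA) ** 0.25

        # Illustrative 50/50 mixture: fine grains run warm by day, cold by night.
        print(mixed_pixel_temperature(0.5, 280.0, 250.0))  # ~266.3 K (arithmetic mean: 265)
        print(mixed_pixel_temperature(0.5, 180.0, 210.0))  # ~196.7 K (arithmetic mean: 195)
        # The day/night asymmetry of this bias is what shifts thermal inertia
        # inferred from daytime versus nighttime observations.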

  7. Travel Time Estimation Using Freeway Point Detector Data Based on Evolving Fuzzy Neural Inference System.

    PubMed

    Tang, Jinjun; Zou, Yajie; Ash, John; Zhang, Shen; Liu, Fang; Wang, Yinhai

    2016-01-01

    Travel time is an important measurement used to evaluate the extent of congestion within road networks. This paper presents a new method to estimate the travel time based on an evolving fuzzy neural inference system. The input variables in the system are traffic flow data (volume, occupancy, and speed) collected from loop detectors located at points both upstream and downstream of a given link, and the output variable is the link travel time. A first order Takagi-Sugeno fuzzy rule set is used to complete the inference. For training the evolving fuzzy neural network (EFNN), two learning processes are proposed: (1) a K-means method is employed to partition input samples into different clusters, and a Gaussian fuzzy membership function is designed for each cluster to measure the membership degree of samples to the cluster centers. As the number of input samples increases, the cluster centers are modified and membership functions are also updated; (2) a weighted recursive least squares estimator is used to optimize the parameters of the linear functions in the Takagi-Sugeno type fuzzy rules. Testing datasets consisting of actual and simulated data are used to test the proposed method. Three common criteria including mean absolute error (MAE), root mean square error (RMSE), and mean absolute relative error (MARE) are utilized to evaluate the estimation performance. Estimation results demonstrate the accuracy and effectiveness of the EFNN method through comparison with existing methods including: multiple linear regression (MLR), instantaneous model (IM), linear model (LM), neural network (NN), and cumulative plots (CP).

  8. Travel Time Estimation Using Freeway Point Detector Data Based on Evolving Fuzzy Neural Inference System

    PubMed Central

    Tang, Jinjun; Zou, Yajie; Ash, John; Zhang, Shen; Liu, Fang; Wang, Yinhai

    2016-01-01

    Travel time is an important measurement used to evaluate the extent of congestion within road networks. This paper presents a new method to estimate the travel time based on an evolving fuzzy neural inference system. The input variables in the system are traffic flow data (volume, occupancy, and speed) collected from loop detectors located at points both upstream and downstream of a given link, and the output variable is the link travel time. A first order Takagi-Sugeno fuzzy rule set is used to complete the inference. For training the evolving fuzzy neural network (EFNN), two learning processes are proposed: (1) a K-means method is employed to partition input samples into different clusters, and a Gaussian fuzzy membership function is designed for each cluster to measure the membership degree of samples to the cluster centers. As the number of input samples increases, the cluster centers are modified and membership functions are also updated; (2) a weighted recursive least squares estimator is used to optimize the parameters of the linear functions in the Takagi-Sugeno type fuzzy rules. Testing datasets consisting of actual and simulated data are used to test the proposed method. Three common criteria including mean absolute error (MAE), root mean square error (RMSE), and mean absolute relative error (MARE) are utilized to evaluate the estimation performance. Estimation results demonstrate the accuracy and effectiveness of the EFNN method through comparison with existing methods including: multiple linear regression (MLR), instantaneous model (IM), linear model (LM), neural network (NN), and cumulative plots (CP). PMID:26829639

  9. Metainference: A Bayesian inference method for heterogeneous systems.

    PubMed

    Bonomi, Massimiliano; Camilloni, Carlo; Cavalli, Andrea; Vendruscolo, Michele

    2016-01-01

    Modeling a complex system is almost invariably a challenging task. The incorporation of experimental observations can be used to improve the quality of a model and thus to obtain better predictions about the behavior of the corresponding system. This approach, however, is affected by a variety of different errors, especially when a system simultaneously populates an ensemble of different states and experimental data are measured as averages over such states. To address this problem, we present a Bayesian inference method, called "metainference," that is able to deal with errors in experimental measurements and with experimental measurements averaged over multiple states. To achieve this goal, metainference models a finite sample of the distribution of models using a replica approach, in the spirit of the replica-averaging modeling based on the maximum entropy principle. To illustrate the method, we present its application to a heterogeneous model system and to the determination of an ensemble of structures corresponding to the thermal fluctuations of a protein molecule. Metainference thus provides an approach to modeling complex systems with heterogeneous components and interconverting between different states by taking into account all possible sources of errors.
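
    A minimal sketch of the replica-averaging idea (not the authors' implementation; the forward model, prior, and error parameter below are assumed placeholders) is to let a finite set of replicas represent the heterogeneous ensemble, average the forward model over the replicas, and compare that average with the experimental value through a Gaussian error model:

        import numpy as np

        rng = np.random.default_rng(0)
        N_REPLICAS, D_EXP, SIGMA = 8, 1.3, 0.2  # sigma absorbs all error sources

        def forward(x):
            return x**2  # placeholder forward model: state -> observable

        def log_posterior(x):
            prior = -0.5 * np.sum(x**2)                  # placeholder prior
            avg = np.mean(forward(x))                    # replica average
            return prior - 0.5 * ((avg - D_EXP) / SIGMA) ** 2

        x = rng.normal(size=N_REPLICAS)                  # replica coordinates
        for _ in range(5000):                            # Metropolis sampling
            prop = x + 0.1 * rng.normal(size=N_REPLICAS)
            if np.log(rng.uniform()) < log_posterior(prop) - log_posterior(x):
                x = prop
        print("replica-averaged observable:", np.mean(forward(x)))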

  10. Squeezing observational data for better causal inference: Methods and examples for prevention research.

    PubMed

    Garcia-Huidobro, Diego; Michael Oakes, J

    2017-04-01

    Randomised controlled trials (RCTs) are typically viewed as the gold standard for causal inference because the effects of interest can be identified under the fewest assumptions, particularly about imbalance in background characteristics. Yet because conducting RCTs is expensive, time consuming and sometimes unethical, observational studies are frequently used to study causal associations. In these studies, imbalance, or confounding, is usually controlled with multiple regression, which entails strong assumptions. The purpose of this manuscript is to describe the strengths and weaknesses of several methods to control for confounding in observational studies, and to demonstrate their use in a cross-sectional dataset of patient registration data from the Juan Pablo II Primary Care Clinic in La Pintana, Chile. The dataset contains responses from 5855 families who provided complete information on family socio-demographics, family functioning and health problems among their family members. We employ regression adjustment, stratification, restriction, matching, propensity score matching, standardisation and inverse probability weighting to illustrate the approaches to better causal inference in non-experimental data and compare results. By applying study design and data analysis techniques that control for confounding in ways other than regression adjustment, researchers may strengthen the scientific relevance of observational studies. © 2016 International Union of Psychological Science.
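
    As a minimal sketch of one of the approaches named above, inverse probability weighting, the snippet below works on synthetic data; the single confounder and the data-generating process are assumptions for illustration, not the clinic dataset:

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(1)
        n = 5000
        confounder = rng.normal(size=n)            # e.g., family socio-economics
        treated = rng.binomial(1, 1 / (1 + np.exp(-confounder)))  # confounded
        outcome = 2.0 * treated + 1.5 * confounder + rng.normal(size=n)

        # Propensity scores, then weight each unit by 1 / P(exposure received).
        ps = LogisticRegression().fit(confounder[:, None], treated)\
            .predict_proba(confounder[:, None])[:, 1]
        w = np.where(treated == 1, 1 / ps, 1 / (1 - ps))
        ate = (np.average(outcome[treated == 1], weights=w[treated == 1])
               - np.average(outcome[treated == 0], weights=w[treated == 0]))
        print(f"IPW effect estimate: {ate:.2f} (true value 2.0)")

    A naive unweighted difference in means would be biased upward here, because the confounder raises both the probability of exposure and the outcome.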

  11. Model selection and model averaging in phylogenetics: advantages of Akaike information criterion and Bayesian approaches over likelihood ratio tests.

    PubMed

    Posada, David; Buckley, Thomas R

    2004-10-01

    Model selection is a topic of special relevance in molecular phylogenetics that affects many, if not all, stages of phylogenetic inference. Here we discuss some fundamental concepts and techniques of model selection in the context of phylogenetics. We start by reviewing different aspects of the selection of substitution models in phylogenetics from a theoretical, philosophical and practical point of view, and summarize this comparison in table format. We argue that the most commonly implemented model selection approach, the hierarchical likelihood ratio test, is not the optimal strategy for model selection in phylogenetics, and that approaches like the Akaike Information Criterion (AIC) and Bayesian methods offer important advantages. In particular, the latter two methods are able to simultaneously compare multiple nested or nonnested models, assess model selection uncertainty, and allow for the estimation of phylogenies and model parameters using all available models (model-averaged inference or multimodel inference). We also describe how the relative importance of the different parameters included in substitution models can be depicted. To illustrate some of these points, we have applied AIC-based model averaging to 37 mitochondrial DNA sequences from the subgenus Ohomopterus (genus Carabus) ground beetles described by Sota and Vogler (2001).
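
    A worked sketch of the Akaike weights that drive such model averaging (the log-likelihoods and parameter counts below are invented placeholders, not values from the beetle analysis):

        import numpy as np

        # model name -> (maximized log-likelihood, number of free parameters)
        models = {"JC69": (-3400.2, 1), "HKY85": (-3310.7, 5), "GTR+G": (-3295.1, 10)}

        aic = {m: 2 * k - 2 * lnL for m, (lnL, k) in models.items()}
        best = min(aic.values())
        rel = {m: np.exp(-0.5 * (a - best)) for m, a in aic.items()}
        weights = {m: r / sum(rel.values()) for m, r in rel.items()}
        for m in models:
            print(f"{m}: AIC={aic[m]:.1f}  weight={weights[m]:.3f}")
        # A model-averaged estimate of a shared parameter is the weighted sum
        # of the per-model estimates, using these Akaike weights.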

  12. Lifetime Prediction for Degradation of Solar Mirrors using Step-Stress Accelerated Testing (Presentation)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lee, J.; Elmore, R.; Kennedy, C.

    This research illustrates the use of statistical inference techniques to quantify the uncertainty surrounding reliability estimates in a step-stress accelerated degradation testing (SSADT) scenario. SSADT can be used when a researcher is faced with a resource-constrained environment, e.g., limits on chamber time or on the number of units to test. We apply the SSADT methodology to a degradation experiment involving concentrated solar power (CSP) mirrors and compare the results to a more traditional multiple accelerated testing paradigm. Specifically, our work includes: (1) designing a durability testing plan for solar mirrors (3M's new improved silvered acrylic "Solar Reflector Film (SFM) 1100") through the ultra-accelerated weathering system (UAWS), (2) defining degradation paths of optical performance based on the SSADT model, which is accelerated by high UV-radiant exposure, and (3) developing service lifetime prediction models for solar mirrors using advanced statistical inference. We use the method of least squares to estimate the model parameters, and this serves as the basis for the statistical inference in SSADT. Several quantities of interest can be estimated from this procedure, e.g., mean-time-to-failure (MTTF) and warranty time. The methods allow for the estimation of quantities that may be of interest to the domain scientists.
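
    A minimal sketch of the least-squares step (dose levels, reflectance values, and the failure threshold below are illustrative placeholders, not UAWS measurements): fit a linear degradation path of optical performance against cumulative UV dose, then extrapolate to the threshold:

        import numpy as np

        dose = np.array([0, 50, 100, 200, 400, 800.0])        # cumulative UV dose
        refl = np.array([94.0, 93.6, 93.1, 92.3, 90.2, 86.5]) # % reflectance

        slope, intercept = np.polyfit(dose, refl, 1)  # least-squares line
        threshold = intercept - 10.0                  # failure: 10% absolute loss
        dose_at_failure = (threshold - intercept) / slope
        print(f"degradation rate: {slope:.4f} %/dose unit; "
              f"dose to failure: {dose_at_failure:.0f}")
        # Converting dose to calendar time at the service-condition UV flux
        # then yields an MTTF-style lifetime estimate.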

  13. PyDREAM: high-dimensional parameter inference for biological models in python.

    PubMed

    Shockley, Erin M; Vrugt, Jasper A; Lopez, Carlos F; Valencia, Alfonso

    2018-02-15

    Biological models contain many parameters whose values are difficult to measure directly via experimentation and therefore require calibration against experimental data. Markov chain Monte Carlo (MCMC) methods are suitable to estimate multivariate posterior model parameter distributions, but these methods may exhibit slow or premature convergence in high-dimensional search spaces. Here, we present PyDREAM, a Python implementation of the (Multiple-Try) Differential Evolution Adaptive Metropolis [DREAM(ZS)] algorithm developed by Vrugt and ter Braak (2008) and Laloy and Vrugt (2012). PyDREAM achieves excellent performance for complex, parameter-rich models and takes full advantage of distributed computing resources, facilitating parameter inference and uncertainty estimation of CPU-intensive biological models. PyDREAM is freely available under the GNU GPLv3 license from the Lopez lab GitHub repository at http://github.com/LoLab-VU/PyDREAM. c.lopez@vanderbilt.edu. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.
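
    The differential-evolution proposal at the heart of DREAM-family samplers can be sketched as follows; this is a toy DE-MC illustration of the idea, not PyDREAM's actual API, and the target posterior is a placeholder standard normal:

        import numpy as np

        rng = np.random.default_rng(2)

        def log_post(theta):
            return -0.5 * np.sum(theta**2)  # placeholder posterior

        n_chains, n_dim = 6, 4
        gamma = 2.38 / np.sqrt(2 * n_dim)   # standard DE-MC jump scale
        chains = rng.normal(size=(n_chains, n_dim))
        for _ in range(2000):
            for i in range(n_chains):
                # Propose along the difference of two other random chains.
                r1, r2 = rng.choice([j for j in range(n_chains) if j != i],
                                    size=2, replace=False)
                prop = (chains[i] + gamma * (chains[r1] - chains[r2])
                        + 1e-6 * rng.normal(size=n_dim))
                if np.log(rng.uniform()) < log_post(prop) - log_post(chains[i]):
                    chains[i] = prop
        print("posterior mean estimate:", chains.mean(axis=0))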

  14. Reliability of dose volume constraint inference from clinical data.

    PubMed

    Lutz, C M; Møller, D S; Hoffmann, L; Knap, M M; Alber, M

    2017-04-21

    Dose volume histogram points (DVHPs) frequently serve as dose constraints in radiotherapy treatment planning. An experiment was designed to investigate the reliability of DVHP inference from clinical data for multiple cohort sizes and complication incidence rates. The experimental background was radiation pneumonitis in non-small cell lung cancer, and the DVHP inference method was based on logistic regression. From 102 NSCLC real-life dose distributions and a postulated DVHP model, an 'ideal' cohort was generated where the most predictive model was equal to the postulated model. A bootstrap and a Cohort Replication Monte Carlo (CoRepMC) approach were applied to create 1000 equally sized populations each. The cohorts were then analyzed to establish inference frequency distributions. This was applied to nine scenarios for cohort sizes of 102 (1), 500 (2) to 2000 (3) patients (by sampling with replacement) and three postulated DVHP models. The Bootstrap was repeated for a 'non-ideal' cohort, where the most predictive model did not coincide with the postulated model. The Bootstrap produced chaotic results for all models of cohort size 1 for both the ideal and non-ideal cohorts. For cohort sizes 2 and 3, the distributions for all populations were more concentrated around the postulated DVHP. For the CoRepMC, the inference frequency increased with cohort size and incidence rate. Correct inference rates >85% were only achieved by cohorts with more than 500 patients. Both Bootstrap and CoRepMC indicate that inference of the correct or approximate DVHP for typical cohort sizes is highly uncertain. CoRepMC results were less spurious than Bootstrap results, demonstrating the large influence that randomness in dose-response has on the statistical analysis.
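
    A minimal sketch of the bootstrap arm of such an experiment (all dose data are synthetic; the five columns are assumed stand-ins for candidate DVH points): resample the cohort, refit a logistic dose-response model at each candidate point, and record which point is most predictive on each replicate:

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(3)
        n, n_dvhp = 102, 5
        dvh = rng.uniform(0, 60, size=(n, n_dvhp))   # candidate DVHP values
        true_col = 2                                 # postulated 'true' DVHP
        p = 1 / (1 + np.exp(-(dvh[:, true_col] - 30) / 8))
        pneumonitis = rng.binomial(1, p)

        wins = np.zeros(n_dvhp, dtype=int)
        for _ in range(200):                         # bootstrap replicates
            idx = rng.choice(n, n, replace=True)
            scores = [LogisticRegression()
                      .fit(dvh[idx, j:j+1], pneumonitis[idx])
                      .score(dvh[idx, j:j+1], pneumonitis[idx])
                      for j in range(n_dvhp)]
            wins[np.argmax(scores)] += 1
        print("inference frequency per candidate DVHP:", wins / wins.sum())
        # With only 102 patients, the 'wrong' columns win a sizable share,
        # echoing the instability reported above.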

  15. Reliability of dose volume constraint inference from clinical data

    NASA Astrophysics Data System (ADS)

    Lutz, C. M.; Møller, D. S.; Hoffmann, L.; Knap, M. M.; Alber, M.

    2017-04-01

    Dose volume histogram points (DVHPs) frequently serve as dose constraints in radiotherapy treatment planning. An experiment was designed to investigate the reliability of DVHP inference from clinical data for multiple cohort sizes and complication incidence rates. The experimental background was radiation pneumonitis in non-small cell lung cancer and the DVHP inference method was based on logistic regression. From 102 NSCLC real-life dose distributions and a postulated DVHP model, an ‘ideal’ cohort was generated where the most predictive model was equal to the postulated model. A bootstrap and a Cohort Replication Monte Carlo (CoRepMC) approach were applied to create 1000 equally sized populations each. The cohorts were then analyzed to establish inference frequency distributions. This was applied to nine scenarios for cohort sizes of 102 (1), 500 (2) to 2000 (3) patients (by sampling with replacement) and three postulated DVHP models. The Bootstrap was repeated for a ‘non-ideal’ cohort, where the most predictive model did not coincide with the postulated model. The Bootstrap produced chaotic results for all models of cohort size 1 for both the ideal and non-ideal cohorts. For cohort sizes 2 and 3, the distributions for all populations were more concentrated around the postulated DVHP. For the CoRepMC, the inference frequency increased with cohort size and incidence rate. Correct inference rates >85% were only achieved by cohorts with more than 500 patients. Both Bootstrap and CoRepMC indicate that inference of the correct or approximate DVHP for typical cohort sizes is highly uncertain. CoRepMC results were less spurious than Bootstrap results, demonstrating the large influence that randomness in dose-response has on the statistical analysis.

  16. Multivariate meta-analysis using individual participant data.

    PubMed

    Riley, R D; Price, M J; Jackson, D; Wardle, M; Gueyffier, F; Wang, J; Staessen, J A; White, I R

    2015-06-01

    When combining results across related studies, a multivariate meta-analysis allows the joint synthesis of correlated effect estimates from multiple outcomes. Joint synthesis can improve efficiency over separate univariate syntheses, may reduce selective outcome reporting biases, and enables joint inferences across the outcomes. A common issue is that within-study correlations needed to fit the multivariate model are unknown from published reports. However, provision of individual participant data (IPD) allows them to be calculated directly. Here, we illustrate how to use IPD to estimate within-study correlations, using a joint linear regression for multiple continuous outcomes and bootstrapping methods for binary, survival and mixed outcomes. In a meta-analysis of 10 hypertension trials, we then show how these methods enable multivariate meta-analysis to address novel clinical questions about continuous, survival and binary outcomes; treatment-covariate interactions; adjusted risk/prognostic factor effects; longitudinal data; prognostic and multiparameter models; and multiple treatment comparisons. Both frequentist and Bayesian approaches are applied, with example software code provided to derive within-study correlations and to fit the models. © 2014 The Authors. Research Synthesis Methods published by John Wiley & Sons, Ltd.
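
    A minimal sketch of the bootstrap route to a within-study correlation (the IPD below are synthetic; the two outcomes stand in for, say, systolic and diastolic blood pressure): resample one trial's participants, re-estimate both treatment effects on each resample, and correlate the estimates:

        import numpy as np

        rng = np.random.default_rng(6)
        n = 400
        treat = rng.binomial(1, 0.5, n)
        sbp = -5.0 * treat + rng.normal(0, 12, n)              # outcome 1
        dbp = -3.0 * treat + 0.5 * sbp + rng.normal(0, 6, n)   # outcome 2

        effects = np.empty((1000, 2))
        for b in range(1000):
            idx = rng.choice(n, n, replace=True)
            t, s, d = treat[idx], sbp[idx], dbp[idx]
            effects[b] = [s[t == 1].mean() - s[t == 0].mean(),
                          d[t == 1].mean() - d[t == 0].mean()]
        print("within-study correlation:",
              round(float(np.corrcoef(effects.T)[0, 1]), 2))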

  17. Inference of emission rates from multiple sources using Bayesian probability theory.

    PubMed

    Yee, Eugene; Flesch, Thomas K

    2010-03-01

    The determination of atmospheric emission rates from multiple sources using inversion (regularized least-squares or best-fit technique) is known to be very susceptible to measurement and model errors in the problem, rendering the solution unusable. In this paper, a new perspective is offered for this problem: namely, it is argued that the problem should be addressed as one of inference rather than inversion. Towards this objective, Bayesian probability theory is used to estimate the emission rates from multiple sources. The posterior probability distribution for the emission rates is derived, accounting fully for the measurement errors in the concentration data and the model errors in the dispersion model used to interpret the data. The Bayesian inferential methodology for emission rate recovery is validated against real dispersion data, obtained from a field experiment involving various source-sensor geometries (scenarios) consisting of four synthetic area sources and eight concentration sensors. The recovery of discrete emission rates in three different scenarios, obtained using Bayesian inference and singular value decomposition inversion, is compared and contrasted.
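
    Under simplifying assumptions (a linear source-receptor relation c = Aq + e, Gaussian noise, and a Gaussian prior), the posterior for the emission-rate vector q is available in closed form; the sketch below uses a synthetic dispersion matrix and data, not the field-experiment configuration:

        import numpy as np

        rng = np.random.default_rng(4)
        n_sensors, n_sources = 8, 4
        A = rng.uniform(0.1, 1.0, size=(n_sensors, n_sources))  # dispersion kernel
        q_true = np.array([2.0, 0.5, 1.0, 3.0])
        c = A @ q_true + 0.05 * rng.normal(size=n_sensors)      # noisy data

        sigma2, tau2 = 0.05**2, 10.0**2   # noise variance, prior variance
        post_cov = np.linalg.inv(A.T @ A / sigma2 + np.eye(n_sources) / tau2)
        post_mean = post_cov @ A.T @ c / sigma2
        print("posterior mean emission rates:", np.round(post_mean, 2))
        print("posterior sd:", np.round(np.sqrt(np.diag(post_cov)), 3))
        # Unlike a bare least-squares inversion, the posterior covariance
        # reports how well each source's rate is actually constrained.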

  18. GeneNetFinder2: Improved Inference of Dynamic Gene Regulatory Relations with Multiple Regulators.

    PubMed

    Han, Kyungsook; Lee, Jeonghoon

    2016-01-01

    A gene involved in complex regulatory interactions may have multiple regulators since gene expression in such interactions is often controlled by more than one gene. Another thing that makes gene regulatory interactions complicated is that regulatory interactions are not static, but change over time during the cell cycle. Most research so far has focused on identifying gene regulatory relations between individual genes in a particular stage of the cell cycle. In this study we developed a method for identifying dynamic gene regulations of several types from the time-series gene expression data. The method can find gene regulations with multiple regulators that work in combination or individually as well as those with single regulators. The method has been implemented as the second version of GeneNetFinder (hereafter called GeneNetFinder2) and tested on several gene expression datasets. Experimental results with gene expression data revealed the existence of genes that are not regulated by individual genes but rather by a combination of several genes. Such gene regulatory relations cannot be found by conventional methods. Our method finds such regulatory relations as well as those with multiple, independent regulators or single regulators, and represents gene regulatory relations as a dynamic network in which different gene regulatory relations are shown in different stages of the cell cycle. GeneNetFinder2 is available at http://bclab.inha.ac.kr/GeneNetFinder and will be useful for modeling dynamic gene regulations with multiple regulators.

  19. Genetic network inference as a series of discrimination tasks.

    PubMed

    Kimura, Shuhei; Nakayama, Satoshi; Hatakeyama, Mariko

    2009-04-01

    Genetic network inference methods based on sets of differential equations generally require a great deal of time, as the equations must be solved many times. To reduce the computational cost, researchers have proposed other methods for inferring genetic networks by solving sets of differential equations only a few times, or even without solving them at all. When we try to obtain reasonable network models using these methods, however, we must estimate the time derivatives of the gene expression levels with great precision. In this study, we propose a new method to overcome the drawbacks of inference methods based on sets of differential equations. Our method infers genetic networks by obtaining classifiers capable of predicting the signs of the derivatives of the gene expression levels. For this purpose, we defined a genetic network inference problem as a series of discrimination tasks, then solved the defined series of discrimination tasks with a linear programming machine. Our experimental results demonstrated that the proposed method is capable of correctly inferring genetic networks, and doing so more than 500 times faster than the other inference methods based on sets of differential equations. Next, we applied our method to actual expression data of the bacterial SOS DNA repair system. And finally, we demonstrated that our approach relates to the inference method based on the S-system model. Though our method provides no estimation of the kinetic parameters, it should be useful for researchers interested only in the network structure of a target system. Supplementary data are available at Bioinformatics online.
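
    A minimal sketch of the discrimination view (synthetic data; a linear SVM stands in for the linear programming machine used in the paper): for one target gene, train a linear classifier to predict the sign of its expression derivative from all expression levels, and read candidate regulators off the weights:

        import numpy as np
        from sklearn.svm import LinearSVC

        rng = np.random.default_rng(5)
        n_obs, n_genes = 200, 6
        X = rng.uniform(0, 1, size=(n_obs, n_genes))   # expression levels
        # Planted truth: gene 0 is activated by gene 2, repressed by gene 5.
        dxdt = 1.5 * X[:, 2] - 2.0 * X[:, 5] + 0.05 * rng.normal(size=n_obs)
        signs = (dxdt > 0).astype(int)

        clf = LinearSVC(C=1.0).fit(X, signs)
        print("classifier weights:", np.round(clf.coef_[0], 2))
        # A large positive weight on gene 2 and a negative weight on gene 5
        # recover the planted regulatory edges without solving any ODEs.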

  20. Developing multiple-choice test items as tools for measuring scientific generic skills on the solar system

    NASA Astrophysics Data System (ADS)

    Bhakti, Satria Seto; Samsudin, Achmad; Chandra, Didi Teguh; Siahaan, Parsaoran

    2017-05-01

    The aim of this research is to develop multiple-choice test items as tools for measuring scientific generic skills on the solar system. To achieve this aim, the researchers used the ADDIE model, consisting of Analysis, Design, Development, Implementation, and Evaluation, as the research method. The scientific generic skills were limited in this research to five indicators: (1) indirect observation, (2) awareness of scale, (3) logical inference, (4) causal relations, and (5) mathematical modeling. The participants were 32 students at a junior high school in Bandung. The results show that the constructed multiple-choice test items were declared valid by expert validators and, after testing, were able to measure scientific generic skills on the solar system.

  1. Metabarcoding of marine nematodes – evaluation of reference datasets used in tree-based taxonomy assignment approach

    PubMed Central

    2016-01-01

    Background: Metabarcoding is becoming a common tool used to assess and compare the diversity of organisms in environmental samples. Identification of OTUs is one of the critical steps in the process, and several taxonomy assignment methods have been proposed to accomplish this task. This publication evaluates the quality of reference datasets, alongside several alignment and phylogeny inference methods used in one of the taxonomy assignment methods, called the tree-based approach. This approach assigns anonymous OTUs to taxonomic categories based on the relative placements of OTUs and reference sequences on the cladogram and the support that these placements receive. New information: In the tree-based taxonomy assignment approach, reliable identification of anonymous OTUs is based on their placement in monophyletic and highly supported clades together with identified reference taxa. Therefore, it requires a high quality reference dataset to be used. Resolution of phylogenetic trees is strongly affected by the presence of erroneous sequences as well as by the alignment and phylogeny inference methods used in the process. Two preparation steps are essential for the successful application of the tree-based taxonomy assignment approach. Curated collections of genetic information do include erroneous sequences. These sequences have a detrimental effect on the resolution of cladograms used in the tree-based approach. They must be identified and excluded from the reference dataset beforehand. Various combinations of multiple sequence alignment and phylogeny inference methods provide cladograms with different topology and bootstrap support. These combinations of methods need to be tested in order to determine the one that gives the highest resolution for the particular reference dataset. Completing the above-mentioned preparation steps is expected to decrease the number of unassigned OTUs and thus improve the results of the tree-based taxonomy assignment approach. PMID:27932919

  2. Robust inference for group sequential trials.

    PubMed

    Ganju, Jitendra; Lin, Yunzhi; Zhou, Kefei

    2017-03-01

    For ethical reasons, group sequential trials were introduced to allow trials to stop early in the event of extreme results. Endpoints in such trials are usually mortality or irreversible morbidity. For a given endpoint, the norm is to use a single test statistic and to use that same statistic for each analysis. This approach is risky because the test statistic has to be specified before the study is unblinded, and there is a loss in power if the assumptions that ensure optimality for each analysis are not met. To minimize the risk of moderate to substantial loss in power due to a suboptimal choice of a statistic, a robust method was developed for nonsequential trials. The concept is analogous to diversification of financial investments to minimize risk. The method is based on combining P values from multiple test statistics for formal inference while controlling the type I error rate at its designated value. This article evaluates the performance of two P value combining methods for group sequential trials. The emphasis is on time-to-event trials, although results from less complex trials are also included. The gain or loss in power with the combination method relative to a single statistic is asymmetric in its favor. Depending on the power of each individual test, the combination method can give more power than any single test or give power that is closer to the test with the most power. The versatility of the method is that it can combine P values from different test statistics for analysis at different times. The robustness of results suggests that inference from group sequential trials can be strengthened with the use of combined tests. Copyright © 2017 John Wiley & Sons, Ltd.
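
    As a generic illustration of combining P values from several test statistics (Fisher's method; the paper's own combination rule and its group sequential adjustments may differ, and the P values below are invented):

        import numpy as np
        from scipy import stats

        # e.g., log-rank, Wilcoxon, and a parametric test on the same endpoint
        p_values = np.array([0.030, 0.012, 0.180])

        chi2 = -2.0 * np.sum(np.log(p_values))
        p_combined = stats.chi2.sf(chi2, df=2 * len(p_values))
        print(f"combined P = {p_combined:.4f}")
        # Caveat: Fisher's method assumes independent P values; statistics
        # computed on the same data are correlated, so a dependence-aware
        # combination rule is needed to hold the type I error rate.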

  3. Metabarcoding of marine nematodes - evaluation of reference datasets used in tree-based taxonomy assignment approach.

    PubMed

    Holovachov, Oleksandr

    2016-01-01

    Metabarcoding is becoming a common tool used to assess and compare the diversity of organisms in environmental samples. Identification of OTUs is one of the critical steps in the process, and several taxonomy assignment methods have been proposed to accomplish this task. This publication evaluates the quality of reference datasets, alongside several alignment and phylogeny inference methods used in one of the taxonomy assignment methods, called the tree-based approach. This approach assigns anonymous OTUs to taxonomic categories based on the relative placements of OTUs and reference sequences on the cladogram and the support that these placements receive. In the tree-based taxonomy assignment approach, reliable identification of anonymous OTUs is based on their placement in monophyletic and highly supported clades together with identified reference taxa. Therefore, it requires a high quality reference dataset to be used. Resolution of phylogenetic trees is strongly affected by the presence of erroneous sequences as well as by the alignment and phylogeny inference methods used in the process. Two preparation steps are essential for the successful application of the tree-based taxonomy assignment approach. Curated collections of genetic information do include erroneous sequences. These sequences have a detrimental effect on the resolution of cladograms used in the tree-based approach. They must be identified and excluded from the reference dataset beforehand. Various combinations of multiple sequence alignment and phylogeny inference methods provide cladograms with different topology and bootstrap support. These combinations of methods need to be tested in order to determine the one that gives the highest resolution for the particular reference dataset. Completing the above-mentioned preparation steps is expected to decrease the number of unassigned OTUs and thus improve the results of the tree-based taxonomy assignment approach.

  4. Mixture-based gatekeeping procedures in adaptive clinical trials.

    PubMed

    Kordzakhia, George; Dmitrienko, Alex; Ishida, Eiji

    2018-01-01

    Clinical trials with data-driven decision rules often pursue multiple clinical objectives such as the evaluation of several endpoints or several doses of an experimental treatment. These complex analysis strategies give rise to "multivariate" multiplicity problems with several components or sources of multiplicity. A general framework for defining gatekeeping procedures in clinical trials with adaptive multistage designs is proposed in this paper. The mixture method is applied to build a gatekeeping procedure at each stage and inferences at each decision point (interim or final analysis) are performed using the combination function approach. An advantage of utilizing the mixture method is that it enables powerful gatekeeping procedures applicable to a broad class of settings with complex logical relationships among the hypotheses of interest. Further, the combination function approach supports flexible data-driven decisions such as a decision to increase the sample size or remove a treatment arm. The paper concludes with a clinical trial example that illustrates the methodology by applying it to develop an adaptive two-stage design with a mixture-based gatekeeping procedure.

  5. Multivariate meta-analysis using individual participant data

    PubMed Central

    Riley, R. D.; Price, M. J.; Jackson, D.; Wardle, M.; Gueyffier, F.; Wang, J.; Staessen, J. A.; White, I. R.

    2016-01-01

    When combining results across related studies, a multivariate meta-analysis allows the joint synthesis of correlated effect estimates from multiple outcomes. Joint synthesis can improve efficiency over separate univariate syntheses, may reduce selective outcome reporting biases, and enables joint inferences across the outcomes. A common issue is that within-study correlations needed to fit the multivariate model are unknown from published reports. However, provision of individual participant data (IPD) allows them to be calculated directly. Here, we illustrate how to use IPD to estimate within-study correlations, using a joint linear regression for multiple continuous outcomes and bootstrapping methods for binary, survival and mixed outcomes. In a meta-analysis of 10 hypertension trials, we then show how these methods enable multivariate meta-analysis to address novel clinical questions about continuous, survival and binary outcomes; treatment–covariate interactions; adjusted risk/prognostic factor effects; longitudinal data; prognostic and multiparameter models; and multiple treatment comparisons. Both frequentist and Bayesian approaches are applied, with example software code provided to derive within-study correlations and to fit the models. PMID:26099484

  6. A method based on multi-sensor data fusion for fault detection of planetary gearboxes.

    PubMed

    Lei, Yaguo; Lin, Jing; He, Zhengjia; Kong, Detong

    2012-01-01

    Studies on fault detection and diagnosis of planetary gearboxes are quite limited compared with those of fixed-axis gearboxes. Different from fixed-axis gearboxes, planetary gearboxes exhibit unique behaviors, which invalidate fault diagnosis methods that work well for fixed-axis gearboxes. For systems as complex as planetary gearboxes, multiple sensors mounted at different locations provide complementary information on the health condition of the system. On this basis, a fault detection method based on multi-sensor data fusion is introduced in this paper. In this method, two features developed for planetary gearboxes are used to characterize the gear health conditions, and an adaptive neuro-fuzzy inference system (ANFIS) is utilized to fuse all features from different sensors. In order to demonstrate the effectiveness of the proposed method, experiments are carried out on a planetary gearbox test rig, on which multiple accelerometers are mounted for data collection. The comparisons between the proposed method and the methods based on individual sensors show that the former achieves much higher accuracy in detecting planetary gearbox faults.

  7. Nonparametric Bayesian inference of the microcanonical stochastic block model

    NASA Astrophysics Data System (ADS)

    Peixoto, Tiago P.

    2017-01-01

    A principled approach to characterize the hidden modular structure of networks is to formulate generative models and then infer their parameters from data. When the desired structure is composed of modules or "communities," a suitable choice for this task is the stochastic block model (SBM), where nodes are divided into groups, and the placement of edges is conditioned on the group memberships. Here, we present a nonparametric Bayesian method to infer the modular structure of empirical networks, including the number of modules and their hierarchical organization. We focus on a microcanonical variant of the SBM, where the structure is imposed via hard constraints, i.e., the generated networks are not allowed to violate the patterns imposed by the model. We show how this simple model variation allows simultaneously for two important improvements over more traditional inference approaches: (1) deeper Bayesian hierarchies, with noninformative priors replaced by sequences of priors and hyperpriors, which not only remove limitations that seriously degrade the inference on large networks but also reveal structures at multiple scales; (2) a very efficient inference algorithm that scales well not only for networks with a large number of nodes and edges but also with an unlimited number of modules. We show also how this approach can be used to sample modular hierarchies from the posterior distribution, as well as to perform model selection. We discuss and analyze the differences between sampling from the posterior and simply finding the single parameter estimate that maximizes it. Furthermore, we expose a direct equivalence between our microcanonical approach and alternative derivations based on the canonical SBM.
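
    A minimal sketch of the block-level bookkeeping behind SBM inference (this scores a partition under a simple Poisson SBM likelihood, not the paper's microcanonical model with its nested priors; the planted two-group network is synthetic):

        import numpy as np

        def sbm_log_likelihood(adj, labels):
            """Profile Poisson-SBM log-likelihood (up to constants)."""
            ll = 0.0
            for r in np.unique(labels):
                for s in np.unique(labels):
                    e_rs = adj[np.ix_(labels == r, labels == s)].sum()
                    n_pairs = (labels == r).sum() * (labels == s).sum()
                    if e_rs > 0:
                        ll += e_rs * np.log(e_rs / n_pairs)
            return ll

        rng = np.random.default_rng(7)
        labels = np.repeat([0, 1], 20)              # planted communities
        p = np.where(labels[:, None] == labels[None, :], 0.4, 0.05)
        adj = rng.binomial(1, p)                    # dense within, sparse between
        print("planted partition:", round(sbm_log_likelihood(adj, labels), 1))
        print("shuffled labels:",
              round(sbm_log_likelihood(adj, rng.permutation(labels)), 1))
        # The planted partition scores higher; full inference searches over
        # partitions (and, in the paper, over their number and hierarchy).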

  8. The time and place of European admixture in Ashkenazi Jewish history.

    PubMed

    Xue, James; Lencz, Todd; Darvasi, Ariel; Pe'er, Itsik; Carmi, Shai

    2017-04-01

    The Ashkenazi Jewish (AJ) population is important in genetics due to its high rate of Mendelian disorders. AJ appeared in Europe in the 10th century, and their ancestry is thought to comprise European (EU) and Middle-Eastern (ME) components. However, both the time and place of admixture are subject to debate. Here, we attempt to characterize the AJ admixture history using a careful application of new and existing methods on a large AJ sample. Our main approach was based on local ancestry inference, in which we first classified each AJ genomic segment as EU or ME, and then compared allele frequencies along the EU segments to those of different EU populations. The contribution of each EU source was also estimated using GLOBETROTTER and haplotype sharing. The time of admixture was inferred based on multiple statistics, including ME segment lengths, the total EU ancestry per chromosome, and the correlation of ancestries along the chromosome. The major source of EU ancestry in AJ was found to be Southern Europe (≈60-80% of EU ancestry), with the rest being likely Eastern European. The inferred admixture time was ≈30 generations ago, but multiple lines of evidence suggest that it represents an average over two or more events, pre- and post-dating the founder event experienced by AJ in late medieval times. The time of the pre-bottleneck admixture event, which was likely Southern European, was estimated to ≈25-50 generations ago.

  9. The time and place of European admixture in Ashkenazi Jewish history

    PubMed Central

    Xue, James; Lencz, Todd; Darvasi, Ariel; Pe’er, Itsik

    2017-01-01

    The Ashkenazi Jewish (AJ) population is important in genetics due to its high rate of Mendelian disorders. AJ appeared in Europe in the 10th century, and their ancestry is thought to comprise European (EU) and Middle-Eastern (ME) components. However, both the time and place of admixture are subject to debate. Here, we attempt to characterize the AJ admixture history using a careful application of new and existing methods on a large AJ sample. Our main approach was based on local ancestry inference, in which we first classified each AJ genomic segment as EU or ME, and then compared allele frequencies along the EU segments to those of different EU populations. The contribution of each EU source was also estimated using GLOBETROTTER and haplotype sharing. The time of admixture was inferred based on multiple statistics, including ME segment lengths, the total EU ancestry per chromosome, and the correlation of ancestries along the chromosome. The major source of EU ancestry in AJ was found to be Southern Europe (≈60–80% of EU ancestry), with the rest being likely Eastern European. The inferred admixture time was ≈30 generations ago, but multiple lines of evidence suggest that it represents an average over two or more events, pre- and post-dating the founder event experienced by AJ in late medieval times. The time of the pre-bottleneck admixture event, which was likely Southern European, was estimated to ≈25–50 generations ago. PMID:28376121

  10. Simultaneous phylogeny reconstruction and multiple sequence alignment

    PubMed Central

    Yue, Feng; Shi, Jian; Tang, Jijun

    2009-01-01

    Background: A phylogeny is the evolutionary history of a group of organisms. To date, sequence data is still the most used data type for phylogenetic reconstruction. Before any sequences can be used for phylogeny reconstruction, they must be aligned, and the quality of the multiple sequence alignment has been shown to affect the quality of the inferred phylogeny. At the same time, all current multiple sequence alignment programs use a guide tree to produce the alignment, and experiments have shown that good guide trees can significantly improve the multiple alignment quality. Results: We devise a new algorithm to simultaneously align multiple sequences and search for the phylogenetic tree that leads to the best alignment. We also implemented the algorithm as a C program package, which can handle both DNA and protein data and can take a simple cost model as well as complex substitution matrices, such as PAM250 or BLOSUM62. The performance of the new method is compared with that of other popular multiple sequence alignment tools, including widely used programs such as ClustalW and T-Coffee. Experimental results suggest that this method has good performance in terms of both phylogeny accuracy and alignment quality. Conclusion: We present an algorithm to align multiple sequences and reconstruct the phylogenies that minimize the alignment score, which is based on an efficient algorithm to solve the median problem for three sequences. Our extensive experiments suggest that this method is very promising and can produce high quality phylogenies and alignments. PMID:19208110

  11. Causal inference in biology networks with integrated belief propagation.

    PubMed

    Chang, Rui; Karr, Jonathan R; Schadt, Eric E

    2015-01-01

    Inferring causal relationships among molecular and higher order phenotypes is a critical step in elucidating the complexity of living systems. Here we propose a novel method for inferring causality that is no longer constrained by the conditional dependency arguments that limit the ability of statistical causal inference methods to resolve causal relationships within sets of graphical models that are Markov equivalent. Our method utilizes Bayesian belief propagation to infer the responses of perturbation events on molecular traits given a hypothesized graph structure. A distance measure between the inferred response distribution and the observed data is defined to assess the 'fitness' of the hypothesized causal relationships. To test our algorithm, we infer causal relationships within equivalence classes of gene networks, in which the possible functional interactions are assumed to be nonlinear, given synthetic microarray and RNA sequencing data. We also apply our method to infer causality in a real metabolic network with a v-structure and a feedback loop. We show that our method can recapitulate the causal structure and recover the feedback loop using only steady-state data, which conventional methods cannot.

  12. GeneSilico protein structure prediction meta-server.

    PubMed

    Kurowski, Michal A; Bujnicki, Janusz M

    2003-07-01

    Rigorous assessments of protein structure prediction have demonstrated that fold recognition methods can identify remote similarities between proteins when standard sequence search methods fail. It has been shown that the accuracy of predictions is improved when refined multiple sequence alignments are used instead of single sequences and if different methods are combined to generate a consensus model. There are several meta-servers available that integrate protein structure predictions performed by various methods, but they do not allow for submission of user-defined multiple sequence alignments and they seldom offer confidentiality of the results. We developed a novel WWW gateway for protein structure prediction, which combines the useful features of other meta-servers available, but with much greater flexibility of the input. The user may submit an amino acid sequence or a multiple sequence alignment to a set of methods for primary, secondary and tertiary structure prediction. Fold-recognition results (target-template alignments) are converted into full-atom 3D models and the quality of these models is uniformly assessed. A consensus between different FR methods is also inferred. The results are conveniently presented on-line on a single web page over a secure, password-protected connection. The GeneSilico protein structure prediction meta-server is freely available for academic users at http://genesilico.pl/meta.

  13. GeneSilico protein structure prediction meta-server

    PubMed Central

    Kurowski, Michal A.; Bujnicki, Janusz M.

    2003-01-01

    Rigorous assessments of protein structure prediction have demonstrated that fold recognition methods can identify remote similarities between proteins when standard sequence search methods fail. It has been shown that the accuracy of predictions is improved when refined multiple sequence alignments are used instead of single sequences and if different methods are combined to generate a consensus model. There are several meta-servers available that integrate protein structure predictions performed by various methods, but they do not allow for submission of user-defined multiple sequence alignments and they seldom offer confidentiality of the results. We developed a novel WWW gateway for protein structure prediction, which combines the useful features of other meta-servers available, but with much greater flexibility of the input. The user may submit an amino acid sequence or a multiple sequence alignment to a set of methods for primary, secondary and tertiary structure prediction. Fold-recognition results (target-template alignments) are converted into full-atom 3D models and the quality of these models is uniformly assessed. A consensus between different FR methods is also inferred. The results are conveniently presented on-line on a single web page over a secure, password-protected connection. The GeneSilico protein structure prediction meta-server is freely available for academic users at http://genesilico.pl/meta. PMID:12824313

  14. Activity recognition using Video Event Segmentation with Text (VEST)

    NASA Astrophysics Data System (ADS)

    Holloway, Hillary; Jones, Eric K.; Kaluzniacki, Andrew; Blasch, Erik; Tierno, Jorge

    2014-06-01

    Multi-Intelligence (multi-INT) data includes video, text, and signals that require analysis by operators. Analysis methods include information fusion approaches such as filtering, correlation, and association. In this paper, we discuss the Video Event Segmentation with Text (VEST) method, which provides event boundaries of an activity to compile related messages and video clips of future interest. VEST infers meaningful activities by clustering multiple streams of time-sequenced multi-INT intelligence data and derived fusion products. We discuss exemplar results that segment raw full-motion video (FMV) data by using extracted commentary message timestamps, FMV metadata, and user-defined queries.

  15. Multi-scale occupancy estimation and modelling using multiple detection methods

    USGS Publications Warehouse

    Nichols, James D.; Bailey, Larissa L.; O'Connell, Allan F.; Talancy, Neil W.; Grant, Evan H. Campbell; Gilbert, Andrew T.; Annand, Elizabeth M.; Husband, Thomas P.; Hines, James E.

    2008-01-01

    Occupancy estimation and modelling based on detection–nondetection data provide an effective way of exploring change in a species’ distribution across time and space in cases where the species is not always detected with certainty. Today, many monitoring programmes target multiple species, or life stages within a species, requiring the use of multiple detection methods. When multiple methods or devices are used at the same sample sites, animals can be detected by more than one method. We develop occupancy models for multiple detection methods that permit simultaneous use of data from all methods for inference about method-specific detection probabilities. Moreover, the approach permits estimation of occupancy at two spatial scales: the larger scale corresponds to species’ use of a sample unit, whereas the smaller scale corresponds to presence of the species at the local sample station or site. We apply the models to data collected on two different vertebrate species: striped skunks Mephitis mephitis and red salamanders Pseudotriton ruber. For striped skunks, large-scale occupancy estimates were consistent between two sampling seasons. Small-scale occupancy probabilities were slightly lower in the late winter/spring when skunks tend to conserve energy, and movements are limited to males in search of females for breeding. There was strong evidence of method-specific detection probabilities for skunks. As anticipated, large- and small-scale occupancy areas completely overlapped for red salamanders. The analyses provided weak evidence of method-specific detection probabilities for this species. Synthesis and applications. Increasingly, many studies are utilizing multiple detection methods at sampling locations. The modelling approach presented here makes efficient use of detections from multiple methods to estimate occupancy probabilities at two spatial scales and to compare detection probabilities associated with different detection methods. The models can be viewed as another variation of Pollock's robust design and may be applicable to a wide variety of scenarios where species occur in an area but are not always near the sampled locations. The estimation approach is likely to be especially useful in multispecies conservation programmes by providing efficient estimates using multiple detection devices and by providing device-specific detection probability estimates for use in survey design.
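
    The two-scale structure can be sketched as a likelihood for one station's detection history, where psi is large-scale use of the sample unit, theta is small-scale presence at the station, and p holds method-specific detection probabilities; all parameter values here are illustrative, not estimates from the skunk or salamander data:

        import numpy as np

        def station_likelihood(history, psi, theta, p):
            """history: 0/1 detection per method at one station."""
            h = np.asarray(history)
            detect_given_present = np.prod(np.where(h == 1, p, 1 - p))
            if h.any():
                return psi * theta * detect_given_present
            # All-zero history: unit unused, or used but station unoccupied,
            # or occupied but missed by every method.
            return ((1 - psi) + psi * (1 - theta)
                    + psi * theta * detect_given_present)

        p = np.array([0.3, 0.5])  # e.g., camera trap vs. track plate
        print(station_likelihood([1, 0], psi=0.8, theta=0.6, p=p))
        print(station_likelihood([0, 0], psi=0.8, theta=0.6, p=p))
        # Maximizing the product of such terms over stations and units yields
        # method-specific p estimates alongside psi and theta.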

  16. Alignment-free inference of hierarchical and reticulate phylogenomic relationships.

    PubMed

    Bernard, Guillaume; Chan, Cheong Xin; Chan, Yao-Ban; Chua, Xin-Yi; Cong, Yingnan; Hogan, James M; Maetschke, Stefan R; Ragan, Mark A

    2017-06-30

    We are amidst an ongoing flood of sequence data arising from the application of high-throughput technologies, and a concomitant fundamental revision in our understanding of how genomes evolve individually and within the biosphere. Workflows for phylogenomic inference must accommodate data that are not only much larger than before, but often more error prone and perhaps misassembled, or not assembled in the first place. Moreover, genomes of microbes, viruses and plasmids evolve not only by tree-like descent with modification but also by incorporating stretches of exogenous DNA. Thus, next-generation phylogenomics must address computational scalability while rethinking the nature of orthogroups, the alignment of multiple sequences and the inference and comparison of trees. New phylogenomic workflows have begun to take shape based on so-called alignment-free (AF) approaches. Here, we review the conceptual foundations of AF phylogenetics for the hierarchical (vertical) and reticulate (lateral) components of genome evolution, focusing on methods based on k-mers. We reflect on what seems to be successful, and on where further development is needed. © The Author 2017. Published by Oxford University Press.
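
    A minimal sketch of one alignment-free distance (a k-mer set Jaccard distance; the reviewed methods span many other k-mer statistics, and these short sequences are toy inputs):

        def kmers(seq, k=4):
            """Set of overlapping k-mers in a sequence."""
            return {seq[i:i + k] for i in range(len(seq) - k + 1)}

        def jaccard_distance(a, b, k=4):
            ka, kb = kmers(a, k), kmers(b, k)
            return 1.0 - len(ka & kb) / len(ka | kb)

        s1 = "ACGTACGTTAGCCATGGATC"
        s2 = "ACGTACGATAGCCATGCATC"
        print(f"alignment-free distance: {jaccard_distance(s1, s2):.3f}")
        # No alignment step is needed, so the cost scales with sequence
        # length rather than with the alignment search space.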

  17. Applying a multiobjective metaheuristic inspired by honey bees to phylogenetic inference.

    PubMed

    Santander-Jiménez, Sergio; Vega-Rodríguez, Miguel A

    2013-10-01

    The development of increasingly popular multiobjective metaheuristics has allowed bioinformaticians to deal with optimization problems in computational biology where multiple objective functions must be taken into account. One of the most relevant research topics that can benefit from these techniques is phylogenetic inference. Throughout the years, different researchers have proposed their own view about the reconstruction of ancestral evolutionary relationships among species. As a result, biologists often report different phylogenetic trees from a same dataset when considering distinct optimality principles. In this work, we detail a multiobjective swarm intelligence approach based on the novel Artificial Bee Colony algorithm for inferring phylogenies. The aim of this paper is to propose a complementary view of phylogenetics according to the maximum parsimony and maximum likelihood criteria, in order to generate a set of phylogenetic trees that represent a compromise between these principles. Experimental results on a variety of nucleotide data sets and statistical studies highlight the relevance of the proposal with regard to other multiobjective algorithms and state-of-the-art biological methods. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.

  18. Supernova Cosmology Inference with Probabilistic Photometric Redshifts (SCIPPR)

    NASA Astrophysics Data System (ADS)

    Peters, Christina; Malz, Alex; Hlozek, Renée

    2018-01-01

    The Bayesian Estimation Applied to Multiple Species (BEAMS) framework employs probabilistic supernova type classifications to do photometric SN cosmology. This work extends BEAMS to replace high-confidence spectroscopic redshifts with photometric redshift probability density functions, a capability that will be essential in the era of the Large Synoptic Survey Telescope and other next-generation photometric surveys, when it will not be possible to perform spectroscopic follow-up on every SN. We present the Supernova Cosmology Inference with Probabilistic Photometric Redshifts (SCIPPR) Bayesian hierarchical model for constraining the cosmological parameters from photometric lightcurves and host galaxy photometry, which includes selection effects and is extensible to uncertainty in the redshift-dependent supernova type proportions. We create a pair of realistic mock catalogs of joint posteriors over supernova type, redshift, and distance modulus informed by photometric supernova lightcurves, and over redshift from simulated host galaxy photometry. We perform inference under our model to obtain a joint posterior probability distribution over the cosmological parameters and compare our results with other methods, namely: a spectroscopic subset, a subset of high-probability photometrically classified supernovae, and reducing the photometric redshift probability to a single measurement and error bar.

  19. Transcriptome inference and systems approaches to polypharmacology and drug discovery in herbal medicine.

    PubMed

    Li, Peng; Chen, Jianxin; Zhang, Wuxia; Fu, Bangze; Wang, Wei

    2017-01-04

    Herbal medicine is a concoction of numerous chemical ingredients, and it exhibits polypharmacological effects: it acts on multiple pharmacological targets, regulating different biological mechanisms and treating a variety of diseases. This complexity is impossible to deconvolute by the reductionist method of extracting one active ingredient acting on one biological target. Our aim was to dissect the polypharmacological effects of herbal medicines and their underlying pharmacological targets, as well as their corresponding active ingredients. We propose a systems-biology strategy that combines omics and bioinformatics methodologies for exploring the polypharmacology of herbal mixtures. The myocardial ischemia model was induced by Ameroid constriction of the left anterior descending coronary artery in Ba-Ma miniature pigs. RNA-seq analysis was utilized to find the differential genes induced by myocardial ischemia in pigs treated with the formula QSKL. A transcriptome-based inference method was used to find landmark drugs with mechanisms similar to QSKL. Gene-level analysis of RNA-seq data in QSKL-treated cases versus control animals yielded 279 differential genes. Transcriptome-based inference identified 80 landmark drugs that covered nearly all drug classes. Then, based on the landmark drugs, 155 potential pharmacological targets and 57 indications were identified for QSKL. Our results demonstrate the power of a combined approach for exploring the pharmacological target and chemical space of herbal medicines. We hope that our method can enhance understanding of the molecular mechanisms of herbal systems and further accelerate the exploration of the value of traditional herbal medicine systems. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.

  20. An integrated data model to estimate spatiotemporal occupancy, abundance, and colonization dynamics.

    PubMed

    Williams, Perry J; Hooten, Mevin B; Womble, Jamie N; Esslinger, George G; Bower, Michael R; Hefley, Trevor J

    2017-02-01

    Ecological invasions and colonizations occur dynamically through space and time. Estimating the distribution and abundance of colonizing species is critical for efficient management or conservation. We describe a statistical framework for simultaneously estimating spatiotemporal occupancy and abundance dynamics of a colonizing species. Our method accounts for several issues that are common when modeling spatiotemporal ecological data including multiple levels of detection probability, multiple data sources, and computational limitations that occur when making fine-scale inference over a large spatiotemporal domain. We apply the model to estimate the colonization dynamics of sea otters (Enhydra lutris) in Glacier Bay, in southeastern Alaska. © 2016 by the Ecological Society of America.

  1. Challenges in Species Tree Estimation Under the Multispecies Coalescent Model

    PubMed Central

    Xu, Bo; Yang, Ziheng

    2016-01-01

    The multispecies coalescent (MSC) model has emerged as a powerful framework for inferring species phylogenies while accounting for ancestral polymorphism and gene tree-species tree conflict. A number of methods have been developed in the past few years to estimate the species tree under the MSC. The full likelihood methods (including maximum likelihood and Bayesian inference) average over the unknown gene trees and accommodate their uncertainties properly but involve intensive computation. The approximate or summary coalescent methods are computationally fast and are applicable to genomic datasets with thousands of loci, but do not make efficient use of the information in the multilocus data. Most of them take the two-step approach of reconstructing the gene trees for multiple loci by phylogenetic methods and then treating the estimated gene trees as observed data, without accounting for their uncertainties appropriately. In this article we review the statistical nature of the species tree estimation problem under the MSC, and explore the conceptual issues and challenges of species tree estimation by focusing mainly on simple cases of three or four closely related species. We use mathematical analysis and computer simulation to demonstrate that large differences in statistical performance may exist between the two classes of methods. We illustrate that several counterintuitive behaviors may occur with the summary methods, but these are due to their inefficient use of the information in the data and vanish when the data are analyzed using full-likelihood methods. These include (i) unidentifiability of parameters in the model, (ii) inconsistency in the so-called anomaly zone, (iii) singularity on the likelihood surface, and (iv) deterioration of performance upon addition of more data. We discuss the challenges and strategies of species tree inference for distantly related species when the molecular clock is violated, and highlight the need for improving the computational efficiency and model realism of the likelihood methods as well as the statistical efficiency of the summary methods. PMID:27927902

  2. DNA Multiple Sequence Alignment Guided by Protein Domains: The MSA-PAD 2.0 Method.

    PubMed

    Balech, Bachir; Monaco, Alfonso; Perniola, Michele; Santamaria, Monica; Donvito, Giacinto; Vicario, Saverio; Maggi, Giorgio; Pesole, Graziano

    2018-01-01

    Multiple sequence alignment (MSA) is a fundamental component of many DNA sequence analyses, including metagenomics studies and phylogeny inference. When guided by protein profiles, DNA multiple alignments attain higher precision and robustness. Here we present details of the use of the upgraded version of MSA-PAD (2.0), a DNA multiple sequence alignment framework able to align DNA sequences coding for single/multiple protein domains guided by PFAM or user-defined annotations. MSA-PAD has two alignment strategies, called "Gene" and "Genome," accounting for coding domain order and genomic rearrangements, respectively. Novel options were added in the present version, where the MSA can be guided by protein profiles provided by the user. This allows MSA-PAD 2.0 to run faster and to accept custom protein profiles, sometimes not present in the PFAM database, according to the user's interest. MSA-PAD 2.0 is currently freely available as a Web application at https://recasgateway.cloud.ba.infn.it/ .

  3. Spontaneous Trait Inferences on Social Media.

    PubMed

    Levordashka, Ana; Utz, Sonja

    2017-01-01

    The present research investigates whether spontaneous trait inferences occur under conditions characteristic of social media and networking sites: nonextreme, ostensibly self-generated content, simultaneous presentation of multiple cues, and self-paced browsing. We used an established measure of trait inferences (false recognition paradigm) and a direct assessment of impressions. Without being asked to do so, participants spontaneously formed impressions of people whose status updates they saw. Our results suggest that trait inferences occurred from nonextreme self-generated content, which is commonly found in social media updates (Experiment 1) and when nine status updates from different people were presented in parallel (Experiment 2). Although inferences did occur during free browsing, the results suggest that participants did not necessarily associate the traits with the corresponding status update authors (Experiment 3). Overall, the findings suggest that spontaneous trait inferences occur on social media. We discuss implications for online communication and research on spontaneous trait inferences.

  4. Separable Processes Before, During, and After the N400 Elicited by Previously Inferred and New Information: Evidence from Time-Frequency Decompositions

    PubMed Central

    Steele, Vaughn R.; Bernat, Edward M.; van den Broek, Paul; Collins, Paul F.; Patrick, Christopher J.; Marsolek, Chad J.

    2012-01-01

    Successful comprehension during reading often requires inferring information not explicitly presented. This information is readily accessible when subsequently encountered, and a neural correlate of this is an attenuation of the N400 event-related potential (ERP). We used ERPs and time-frequency (TF) analysis to investigate neural correlates of processing inferred information after a causal coherence inference had been generated during text comprehension. Participants read short texts, some of which promoted inference generation. After each text, they performed lexical decisions to target words that were unrelated or inference-related to the preceding text. Consistent with previous findings, inference-related words elicited an attenuated N400 relative to unrelated words. TF analyses revealed unique contributions to the N400 from activity occurring at 1–6 Hz (theta) and 0–2 Hz (delta), supporting the view that multiple, sequential processes underlie the N400. PMID:23165117

  5. Active Inference, homeostatic regulation and adaptive behavioural control

    PubMed Central

    Pezzulo, Giovanni; Rigoli, Francesco; Friston, Karl

    2015-01-01

    We review a theory of homeostatic regulation and adaptive behavioural control within the Active Inference framework. Our aim is to connect two research streams that are usually considered independently; namely, Active Inference and associative learning theories of animal behaviour. The former uses a probabilistic (Bayesian) formulation of perception and action, while the latter calls on multiple (Pavlovian, habitual, goal-directed) processes for homeostatic and behavioural control. We offer a synthesis of these classical processes and cast them as successive hierarchical contextualisations of sensorimotor constructs, using the generative models that underpin Active Inference. This dissolves any apparent mechanistic distinction between the optimization processes that mediate classical control or learning. Furthermore, we generalize the scope of Active Inference by emphasizing interoceptive inference and homeostatic regulation. The ensuing homeostatic (or allostatic) perspective provides an intuitive explanation for how priors act as drives or goals to enslave action, and emphasises the embodied nature of inference. PMID:26365173

  6. Multiple hot-deck imputation for network inference from RNA sequencing data.

    PubMed

    Imbert, Alyssa; Valsesia, Armand; Le Gall, Caroline; Armenise, Claudia; Lefebvre, Gregory; Gourraud, Pierre-Antoine; Viguerie, Nathalie; Villa-Vialaneix, Nathalie

    2018-05-15

    Network inference provides a global view of the relations existing between gene expression in a given transcriptomic experiment (often only for a restricted list of chosen genes). However, it is still a challenging problem: even though the cost of sequencing techniques has decreased over recent years, the number of samples in a given experiment is still (very) small compared to the number of genes. We propose a method to increase the reliability of the inference when RNA-seq expression data have been measured together with an auxiliary dataset that can provide external information on gene expression similarity between samples. Our statistical approach, hd-MI, is based on imputation for samples without available RNA-seq data that are considered as missing data but are observed on the secondary dataset. hd-MI can improve the reliability of the inference for missing rates up to 30% and provides more stable networks with a smaller number of false positive edges. From a biological point of view, hd-MI also proved relevant for inferring networks from RNA-seq data acquired in adipose tissue during a nutritional intervention in obese individuals. In these networks, novel links between genes were highlighted, as well as an improved comparability between the two steps of the nutritional intervention. Software and sample data are available as an R package, RNAseqNet, which can be downloaded from the Comprehensive R Archive Network (CRAN). alyssa.imbert@inra.fr or nathalie.villa-vialaneix@inra.fr. Supplementary data are available at Bioinformatics online.
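
    The hot-deck idea sketched below is a minimal reading of the approach: a sample whose RNA-seq profile is missing borrows a complete profile from one of its k nearest neighbours in the auxiliary dataset, and repeating the random draw yields multiple imputed datasets, each of which can feed a network inference run. Array names, the Euclidean metric, and the donor rule are our illustrative assumptions, not the RNAseqNet implementation:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def hot_deck_impute(rnaseq, aux, missing, k=5, n_imputations=10):
        """Multiple hot-deck imputation sketch: each sample with a missing
        RNA-seq profile copies the profile of a random donor among its k
        nearest neighbours in the auxiliary data (names are illustrative)."""
        observed = [i for i in range(aux.shape[0]) if i not in set(missing)]
        datasets = []
        for _ in range(n_imputations):
            filled = rnaseq.copy()
            for i in missing:
                dist = np.linalg.norm(aux[observed] - aux[i], axis=1)
                donors = np.array(observed)[np.argsort(dist)[:k]]
                filled[i] = rnaseq[rng.choice(donors)]
            datasets.append(filled)
        return datasets  # e.g. infer one network per imputed dataset, then aggregate
    ```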

  7. STATISTICAL METHODOLOGY FOR THE SIMULTANEOUS ANALYSIS OF MULTIPLE TYPES OF OUTCOMES IN NONLINEAR THRESHOLD MODELS.

    EPA Science Inventory

    Multiple outcomes are often measured on each experimental unit in toxicology experiments. These multiple observations typically imply the existence of correlation between endpoints, and a statistical analysis that incorporates it may result in improved inference. When both disc...

  8. Poisson-Based Inference for Perturbation Models in Adaptive Spelling Training

    ERIC Educational Resources Information Center

    Baschera, Gian-Marco; Gross, Markus

    2010-01-01

    We present an inference algorithm for perturbation models based on Poisson regression. The algorithm is designed to handle unclassified input with multiple errors described by independent mal-rules. This knowledge representation provides an intelligent tutoring system with local and global information about a student, such as error classification…

  9. FPGA Acceleration of the phylogenetic likelihood function for Bayesian MCMC inference methods.

    PubMed

    Zierke, Stephanie; Bakos, Jason D

    2010-04-12

    Maximum likelihood (ML)-based phylogenetic inference has become a popular method for estimating the evolutionary relationships among species based on genomic sequence data. This method is used in applications such as RAxML, GARLI, MrBayes, PAML, and PAUP. The Phylogenetic Likelihood Function (PLF) is an important kernel computation for this method. The PLF consists of a loop with no conditional behavior or dependencies between iterations. As such, it offers high potential for exploiting parallelism using micro-architectural techniques. In this paper, we describe a technique for mapping the PLF and supporting logic onto a Field Programmable Gate Array (FPGA)-based co-processor. By leveraging the FPGA's on-chip DSP modules and the high-bandwidth local memory attached to the FPGA, the resultant co-processor can accelerate ML-based methods and outperform state-of-the-art multi-core processors. We use the MrBayes 3 tool as a framework for designing our co-processor. For large datasets, we estimate that our accelerated MrBayes, if run on a current-generation FPGA, achieves a 10x speedup relative to software running on a state-of-the-art server-class microprocessor. The FPGA-based implementation achieves its performance by deeply pipelining the likelihood computations, performing multiple floating-point operations in parallel, and through a natural log approximation chosen specifically to leverage a deeply pipelined custom architecture. Heterogeneous computing, which combines general-purpose processors with special-purpose co-processors such as FPGAs and GPUs, is a promising approach for high-performance phylogeny inference, as shown by the growing body of literature in this field. FPGAs in particular are well-suited for this task because of their low power consumption compared to many-core processors and Graphics Processing Units (GPUs).
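
    To make the kernel concrete: in Felsenstein-style pruning, the per-site conditional likelihoods of an internal node are the elementwise product of each child's likelihoods propagated through that child's branch transition matrix, and sites are independent, which is exactly the dependency-free loop the abstract describes. A minimal NumPy sketch (names ours):

    ```python
    import numpy as np

    def plf_node(L_left, L_right, P_left, P_right):
        """Per-site conditional likelihoods of an internal node.
        L_*: (n_sites, 4) child conditional likelihoods; P_*: (4, 4) branch
        transition matrices. Sites are independent, so the loop (here a
        matrix product) pipelines and parallelizes naturally."""
        return (L_left @ P_left.T) * (L_right @ P_right.T)

    sites = 1000
    L1, L2 = np.random.rand(sites, 4), np.random.rand(sites, 4)
    P = 0.05 + 0.8 * np.eye(4)    # toy transition matrix, rows sum to 1
    L_parent = plf_node(L1, L2, P, P)
    ```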

  10. Mixed linear-non-linear inversion of crustal deformation data: Bayesian inference of model, weighting and regularization parameters

    NASA Astrophysics Data System (ADS)

    Fukuda, Jun'ichi; Johnson, Kaj M.

    2010-06-01

    We present a unified theoretical framework and solution method for probabilistic, Bayesian inversions of crustal deformation data. The inversions involve multiple data sets with unknown relative weights, model parameters that are related linearly or non-linearly to observations through theoretical models, prior information on model parameters and regularization priors to stabilize underdetermined problems. To efficiently handle non-linear inversions in which some of the model parameters are linearly related to the observations, this method combines analytical least-squares solutions with a Monte Carlo sampling technique. In this method, model parameters that are linearly and non-linearly related to observations, relative weights of multiple data sets and relative weights of prior information and regularization priors are determined in a unified Bayesian framework. In this paper, we define the mixed linear-non-linear inverse problem, outline the theoretical basis for the method, provide a step-by-step algorithm for the inversion, validate the inversion method using synthetic data and apply the method to two real data sets. We apply the method to inversions of multiple geodetic data sets with unknown relative data weights for interseismic fault slip and locking depth. We also apply the method to the problem of estimating the spatial distribution of coseismic slip on faults with unknown fault geometry, relative data weights and smoothing regularization weight.
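
    The core computational trick can be sketched compactly: for any fixed value of the non-linear parameters, the design matrix G is known, so the linearly-related parameters have an analytic weighted least-squares solution, and a Monte Carlo sampler then only needs to wander over the non-linear parameters. The sketch below (names and the Gaussian error model are our assumptions) shows the per-sample linear step:

    ```python
    import numpy as np

    def profile_linear(G, d, Cd_inv):
        """For fixed non-linear parameters (which determine the design
        matrix G), the linearly-related parameters have an analytic
        weighted least-squares solution; a sampler then explores only the
        non-linear parameters. Gaussian errors with inverse covariance
        Cd_inv are assumed."""
        A = G.T @ Cd_inv @ G
        m_lin = np.linalg.solve(A, G.T @ Cd_inv @ d)
        r = d - G @ m_lin
        misfit = 0.5 * r @ Cd_inv @ r   # feeds the Bayesian acceptance rule
        return m_lin, misfit
    ```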

  11. Monte Carlo Bayesian inference on a statistical model of sub-gridcolumn moisture variability using high-resolution cloud observations. Part 1: Method.

    PubMed

    Norris, Peter M; da Silva, Arlindo M

    2016-07-01

    A method is presented to constrain a statistical model of sub-gridcolumn moisture variability using high-resolution satellite cloud data. The method can be used for large-scale model parameter estimation or cloud data assimilation. The gridcolumn model includes assumed probability density function (PDF) intra-layer horizontal variability and a copula-based inter-layer correlation model. The observables used in the current study are Moderate Resolution Imaging Spectroradiometer (MODIS) cloud-top pressure, brightness temperature and cloud optical thickness, but the method should be extensible to direct cloudy radiance assimilation for a small number of channels. The algorithm is a form of Bayesian inference with a Markov chain Monte Carlo (MCMC) approach to characterizing the posterior distribution. This approach is especially useful in cases where the background state is clear but cloudy observations exist. In traditional linearized data assimilation methods, a subsaturated background cannot produce clouds via any infinitesimal equilibrium perturbation, but the Monte Carlo approach is not gradient-based and allows jumps into regions of non-zero cloud probability. The current study uses a skewed-triangle distribution for layer moisture. The article also includes a discussion of the Metropolis and multiple-try Metropolis versions of MCMC.
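
    The jump behavior the abstract emphasizes, reaching cloudy states from a subsaturated background without gradients, is already present in a plain random-walk Metropolis sampler, sketched below for an arbitrary log-posterior (all names ours; the multiple-try variant discussed in the article is not shown):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    def metropolis(log_post, x0, step=0.5, n=5000):
        """Random-walk Metropolis: proposals are accepted with probability
        min(1, posterior ratio), so the chain can jump into regions (e.g.
        non-zero cloud fraction) unreachable by infinitesimal gradient
        steps from a clear-sky background."""
        x = np.asarray(x0, dtype=float)
        lp = log_post(x)
        samples = []
        for _ in range(n):
            prop = x + step * rng.standard_normal(x.size)
            lp_prop = log_post(prop)
            if np.log(rng.random()) < lp_prop - lp:
                x, lp = prop, lp_prop
            samples.append(x.copy())
        return np.array(samples)

    chain = metropolis(lambda z: -0.5 * np.sum(z ** 2), x0=[3.0, -2.0])
    ```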

  12. Monte Carlo Bayesian Inference on a Statistical Model of Sub-Gridcolumn Moisture Variability Using High-Resolution Cloud Observations. Part 1: Method

    NASA Technical Reports Server (NTRS)

    Norris, Peter M.; Da Silva, Arlindo M.

    2016-01-01

    A method is presented to constrain a statistical model of sub-gridcolumn moisture variability using high-resolution satellite cloud data. The method can be used for large-scale model parameter estimation or cloud data assimilation. The gridcolumn model includes assumed probability density function (PDF) intra-layer horizontal variability and a copula-based inter-layer correlation model. The observables used in the current study are Moderate Resolution Imaging Spectroradiometer (MODIS) cloud-top pressure, brightness temperature and cloud optical thickness, but the method should be extensible to direct cloudy radiance assimilation for a small number of channels. The algorithm is a form of Bayesian inference with a Markov chain Monte Carlo (MCMC) approach to characterizing the posterior distribution. This approach is especially useful in cases where the background state is clear but cloudy observations exist. In traditional linearized data assimilation methods, a subsaturated background cannot produce clouds via any infinitesimal equilibrium perturbation, but the Monte Carlo approach is not gradient-based and allows jumps into regions of non-zero cloud probability. The current study uses a skewed-triangle distribution for layer moisture. The article also includes a discussion of the Metropolis and multiple-try Metropolis versions of MCMC.

  13. Monte Carlo Bayesian inference on a statistical model of sub-gridcolumn moisture variability using high-resolution cloud observations. Part 1: Method

    PubMed Central

    Norris, Peter M.; da Silva, Arlindo M.

    2018-01-01

    A method is presented to constrain a statistical model of sub-gridcolumn moisture variability using high-resolution satellite cloud data. The method can be used for large-scale model parameter estimation or cloud data assimilation. The gridcolumn model includes assumed probability density function (PDF) intra-layer horizontal variability and a copula-based inter-layer correlation model. The observables used in the current study are Moderate Resolution Imaging Spectroradiometer (MODIS) cloud-top pressure, brightness temperature and cloud optical thickness, but the method should be extensible to direct cloudy radiance assimilation for a small number of channels. The algorithm is a form of Bayesian inference with a Markov chain Monte Carlo (MCMC) approach to characterizing the posterior distribution. This approach is especially useful in cases where the background state is clear but cloudy observations exist. In traditional linearized data assimilation methods, a subsaturated background cannot produce clouds via any infinitesimal equilibrium perturbation, but the Monte Carlo approach is not gradient-based and allows jumps into regions of non-zero cloud probability. The current study uses a skewed-triangle distribution for layer moisture. The article also includes a discussion of the Metropolis and multiple-try Metropolis versions of MCMC. PMID:29618847

  14. Extensive Error in the Number of Genes Inferred from Draft Genome Assemblies

    PubMed Central

    Denton, James F.; Lugo-Martinez, Jose; Tucker, Abraham E.; Schrider, Daniel R.; Warren, Wesley C.; Hahn, Matthew W.

    2014-01-01

    Current sequencing methods produce large amounts of data, but genome assemblies based on these data are often woefully incomplete. These incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. In this paper we investigate the magnitude of the problem, both in terms of total gene number and the number of copies of genes in specific families. To do this, we compare multiple draft assemblies against higher-quality versions of the same genomes, using several new assemblies of the chicken genome based on both traditional and next-generation sequencing technologies, as well as published draft assemblies of chimpanzee. We find that upwards of 40% of all gene families are inferred to have the wrong number of genes in draft assemblies, and that these incorrect assemblies both add and subtract genes. Using simulated genome assemblies of Drosophila melanogaster, we find that the major cause of increased gene numbers in draft genomes is the fragmentation of genes onto multiple individual contigs. Finally, we demonstrate the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, largely by connecting genes that have been fragmented in the assembly process. PMID:25474019

  15. Extensive error in the number of genes inferred from draft genome assemblies.

    PubMed

    Denton, James F; Lugo-Martinez, Jose; Tucker, Abraham E; Schrider, Daniel R; Warren, Wesley C; Hahn, Matthew W

    2014-12-01

    Current sequencing methods produce large amounts of data, but genome assemblies based on these data are often woefully incomplete. These incomplete and error-filled assemblies result in many annotation errors, especially in the number of genes present in a genome. In this paper we investigate the magnitude of the problem, both in terms of total gene number and the number of copies of genes in specific families. To do this, we compare multiple draft assemblies against higher-quality versions of the same genomes, using several new assemblies of the chicken genome based on both traditional and next-generation sequencing technologies, as well as published draft assemblies of chimpanzee. We find that upwards of 40% of all gene families are inferred to have the wrong number of genes in draft assemblies, and that these incorrect assemblies both add and subtract genes. Using simulated genome assemblies of Drosophila melanogaster, we find that the major cause of increased gene numbers in draft genomes is the fragmentation of genes onto multiple individual contigs. Finally, we demonstrate the usefulness of RNA-Seq in improving the gene annotation of draft assemblies, largely by connecting genes that have been fragmented in the assembly process.

  16. [A complexity analysis of Chinese herbal property theory: the multiple formations of herbal property].

    PubMed

    Jin, Rui; Zhang, Bing

    2012-11-01

    Chinese herbal property theory (CHPT) is the fundamental characteristic of Chinese materia medica that distinguishes it from modern medicines. It reflects the herbal properties associated with efficacy and formed the early framework of four properties and five flavors in Shennong's Classic of Materia Medica. After supplementation and improvement over the past thousands of years, CHPT has developed into a theoretical system comprising four properties, five flavors, meridian entry, direction of medicinal action (ascending, descending, floating and sinking) and toxicity. However, because of the influence of the philosophy of yin-yang theory and five-phase theory, and the differences in cognitive approach and historical background at different times, CHPT became complex. One complexity feature is the multiplicity of methods for determining herbal property, which may include inference from herbal efficacy, the thought of the Chinese Taoist School and witchcraft, classification according to manifestations, etc. Another complexity feature is the many-to-many association between herbal property and efficacy, whereby the same property can be inferred from different kinds of efficacy. This paper analyzes these complexity features and highlights the importance of cognitive approaches and of the efficacy attributes corresponding to a given herbal property in the study of CHPT.

  17. Evaluation of an artificial intelligence guided inverse planning system: clinical case study.

    PubMed

    Yan, Hui; Yin, Fang-Fang; Willett, Christopher

    2007-04-01

    An artificial intelligence (AI)-guided method for parameter adjustment in inverse planning was implemented on a commercial inverse treatment planning system. For evaluation purposes, four typical clinical cases were tested, and the plans achieved by the automated and manual methods were compared. The procedure of parameter adjustment consists of three major loops. Each loop is in charge of modifying parameters of one category, which is carried out by a specially customized fuzzy inference system. Multiple physician-prescribed constraints for a selected volume were adopted to account for the tradeoff between the prescription dose to the planning target volume (PTV) and dose-volume constraints for critical organs. The search for an optimal parameter combination began with the first constraint and proceeded to the next until a plan with acceptable dose was achieved. The initial setup of the plan parameters was the same for each case and was adjusted independently by both the manual and automated methods. After the parameters of one category were updated, the intensity maps of all fields were re-optimized and the plan dose was subsequently re-calculated. When the final plan was reached, dose statistics were calculated for both plans and compared. For the PTV, the dose to 95% of the volume is up to 10% higher in plans using the automated method than in those using the manual method. For critical organs, an average decrease in plan dose was achieved. However, the automated method cannot improve the plan dose for some critical organs due to limitations of the inference rules currently employed. For normal tissue, there was no significant difference between plan doses achieved by the automated and manual methods. With the application of the AI-guided method, the basic parameter adjustment task can be accomplished automatically, and a plan dose comparable to that achieved by the manual method was obtained. Future improvements to incorporate case-specific inference rules are essential to fully automate the inverse planning process.
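
    A toy version of the kind of rule a fuzzy inference system applies during parameter adjustment: memberships for "small" and "large" constraint violation fire two rules whose outputs are blended into a multiplicative update of a penalty weight. Membership shapes and gains below are hypothetical, not the paper's tuned rules:

    ```python
    def adjust_weight(dose_excess, weight):
        """Two illustrative fuzzy rules: IF the constraint violation is
        small THEN nudge the penalty weight up slightly; IF it is large
        THEN increase it strongly. dose_excess is a normalized violation
        in [0, 1]; membership shapes and gains are hypothetical."""
        small = max(0.0, 1.0 - 2.0 * dose_excess)               # membership of "small"
        large = min(1.0, max(0.0, (dose_excess - 0.3) / 0.7))   # membership of "large"
        gain = (small * 1.05 + large * 1.5) / max(small + large, 1e-9)
        return weight * gain

    print(adjust_weight(0.8, weight=1.0))   # ~1.5, i.e. a strong increase
    ```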

  18. Training Inference Making Skills Using a Situation Model Approach Improves Reading Comprehension

    PubMed Central

    Bos, Lisanne T.; De Koning, Bjorn B.; Wassenburg, Stephanie I.; van der Schoot, Menno

    2016-01-01

    This study aimed to enhance third and fourth graders’ text comprehension at the situation model level. Therefore, we tested a reading strategy training developed to target inference making skills, which are widely considered to be pivotal to situation model construction. The training was grounded in contemporary literature on situation model-based inference making and addressed the source (text-based versus knowledge-based), type (necessary versus unnecessary for (re-)establishing coherence), and depth of an inference (making single lexical inferences versus combining multiple lexical inferences), as well as the type of searching strategy (forward versus backward). Results indicated that, compared to a control group (n = 51), children who followed the experimental training (n = 67) improved their inference making skills supportive to situation model construction. Importantly, our training also resulted in increased levels of general reading comprehension and motivation. In sum, this study showed that a ‘level of text representation’-approach can provide a useful framework to teach inference making skills to third and fourth graders. PMID:26913014

  19. Statistical framework for the utilization of simultaneous pupil plane and focal plane telemetry for exoplanet imaging. I. Accounting for aberrations in multiple planes.

    PubMed

    Frazin, Richard A

    2016-04-01

    A new generation of telescopes with mirror diameters of 20 m or more, called extremely large telescopes (ELTs), has the potential to provide unprecedented imaging and spectroscopy of exoplanetary systems, if the difficulties in achieving the extremely high dynamic range required to differentiate the planetary signal from the star can be overcome to a sufficient degree. Fully utilizing the potential of ELTs for exoplanet imaging will likely require simultaneous and self-consistent determination of both the planetary image and the unknown aberrations in multiple planes of the optical system, using statistical inference based on the wavefront sensor and science camera data streams. This approach promises to overcome the most important systematic errors inherent in the various schemes based on differential imaging, such as angular differential imaging and spectral differential imaging. This paper is the first in a series on this subject, in which a formalism is established for the exoplanet imaging problem, setting the stage for the statistical inference methods to follow in the future. Every effort has been made to be rigorous and complete, so that validity of approximations to be made later can be assessed. Here, the polarimetric image is expressed in terms of aberrations in the various planes of a polarizing telescope with an adaptive optics system. Further, it is shown that current methods that utilize focal plane sensing to correct the speckle field, e.g., electric field conjugation, rely on the tacit assumption that aberrations on multiple optical surfaces can be represented as aberration on a single optical surface, ultimately limiting their potential effectiveness for ground-based astronomy.

  20. A Simple Approach to Inference in Covariance Structure Modeling with Missing Data: Bayesian Analysis. Project 2.4, Quantitative Models To Monitor the Status and Progress of Learning and Performance and Their Antecedents.

    ERIC Educational Resources Information Center

    Muthen, Bengt

    This paper investigates methods that avoid using multiple groups to represent the missing data patterns in covariance structure modeling, attempting instead to do a single-group analysis where the only action the analyst has to take is to indicate that data is missing. A new covariance structure approach developed by B. Muthen and G. Arminger is…

  1. Competitive-Cooperative Automated Reasoning from Distributed and Multiple Source of Data

    NASA Astrophysics Data System (ADS)

    Fard, Amin Milani

    Knowledge extraction from distributed database systems has been investigated during the past decade in order to analyze billions of information records. In this work, a competitive deduction approach for a heterogeneous data grid environment is proposed using classic data mining and statistical methods. By applying a game theory concept in a multi-agent model, we design a policy for hierarchical knowledge discovery and inference fusion. To demonstrate the system, a sample multi-expert system has also been developed.

  2. Reasoning in explanation-based decision making.

    PubMed

    Pennington, N; Hastie, R

    1993-01-01

    A general theory of explanation-based decision making is outlined and the multiple roles of inference processes in the theory are indicated. A typology of formal and informal inference forms, originally proposed by Collins (1978a, 1978b), is introduced as an appropriate framework to represent inferences that occur in the overarching explanation-based process. Results from the analysis of verbal reports of decision processes are presented to demonstrate the centrality and systematic character of reasoning in a representative legal decision-making task.

  3. Nonplanar wing load-line and slender wing theory

    NASA Technical Reports Server (NTRS)

    Deyoung, J.

    1977-01-01

    Nonplanar load-line, slender wing, elliptic wing, and infinite-aspect-ratio limit loading theories are developed. These are quasi-two-dimensional theories, but they satisfy wing boundary conditions at all points along the nonplanar spanwise extent of the wing. These methods are applicable to generalized configurations such as the laterally nonplanar wing, multiple nonplanar wings, or a wing with multiple winglets of arbitrary shape. Two-dimensional theory implies a simplicity that is practical when analyzing complicated configurations. The lateral spanwise distribution of angle of attack can be that due to winglet or control surface deflection, wing twist, or induced angles due to multiwings, multiwinglets, ground, walls, jet or fuselage. In quasi-two-dimensional theory, the induced angles due to these extra conditions are likewise determined for two-dimensional flow. Equations are developed for the normal-to-surface induced velocity due to a nonplanar trailing vorticity distribution. Application examples are made using these methods.

  4. LASSIM-A network inference toolbox for genome-wide mechanistic modeling.

    PubMed

    Magnusson, Rasmus; Mariotti, Guido Pio; Köpsén, Mattias; Lövfors, William; Gawel, Danuta R; Jörnsten, Rebecka; Linde, Jörg; Nordling, Torbjörn E M; Nyman, Elin; Schulze, Sylvie; Nestor, Colm E; Zhang, Huan; Cedersund, Gunnar; Benson, Mikael; Tjärnberg, Andreas; Gustafsson, Mika

    2017-06-01

    Recent technological advancements have made time-resolved, quantitative, multi-omics data available for many model systems, which could be integrated for systems pharmacokinetic use. Here, we present large-scale simulation modeling (LASSIM), a novel mathematical tool for performing large-scale inference using mechanistically defined ordinary differential equations (ODEs) for gene regulatory networks (GRNs). LASSIM integrates structural knowledge about regulatory interactions and non-linear equations with multiple steady-state and dynamic response expression datasets. The rationale behind LASSIM is that biological GRNs can be simplified using a limited subset of core genes that are assumed to regulate all other gene transcription events in the network. The LASSIM method is implemented as a general-purpose toolbox using the PyGMO Python package to make the most of multicore computers and high-performance clusters, and is available at https://gitlab.com/Gustafsson-lab/lassim. As a method, LASSIM works in two steps: it first infers a non-linear ODE system for the pre-specified core gene expression; second, it optimizes, in parallel, the parameters that model the regulation of peripheral genes by the core system genes. We showed the usefulness of this method by applying LASSIM to infer a large-scale non-linear model of naïve Th2 cell differentiation, made possible by integrating Th2-specific TF binding data and time series together with six public and six novel siRNA-mediated knock-down experiments. ChIP-seq data showed significant binding overlap for all tested transcription factors. Next, we performed novel time-series measurements of total T-cells during differentiation towards Th2 and verified that our LASSIM model could match those data significantly better than comparable models that used the same Th2 bindings. In summary, the LASSIM toolbox opens the door to a new type of model-based data analysis that combines the strengths of reliable mechanistic models with truly systems-level data. We demonstrate the power of this approach by inferring a mechanistically motivated, genome-wide model of the Th2 transcription regulatory system, which plays an important role in several immune-related diseases.
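
    The first LASSIM step, fitting a mechanistic ODE for a small core of genes to expression time series, can be sketched on synthetic data. Everything below (network size, sigmoidal production term, noise level) is an illustrative assumption rather than the toolbox's actual model:

    ```python
    import numpy as np
    from scipy.integrate import odeint
    from scipy.optimize import least_squares

    rng = np.random.default_rng(2)
    n, t_obs = 3, np.linspace(0.0, 10.0, 8)   # 3 core genes, 8 time points

    def core_ode(x, t, W, deg):
        # sigmoidal production driven by the core genes, linear degradation
        return 1.0 / (1.0 + np.exp(-W @ x)) - deg * x

    def simulate(theta, y0):
        W, deg = theta[:n * n].reshape(n, n), np.abs(theta[n * n:])
        return odeint(core_ode, y0, t_obs, args=(W, deg))

    y0 = np.full(n, 0.5)
    theta_true = rng.normal(size=n * n + n)
    data = simulate(theta_true, y0) + 0.01 * rng.standard_normal((len(t_obs), n))

    fit = least_squares(lambda th: (simulate(th, y0) - data).ravel(),
                        rng.normal(size=n * n + n))
    print("residual norm after fit:", np.linalg.norm(fit.fun))
    ```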

  5. Why We (Usually) Don't Have to Worry about Multiple Comparisons

    ERIC Educational Resources Information Center

    Gelman, Andrew; Hill, Jennifer; Yajima, Masanao

    2012-01-01

    Applied researchers often find themselves making statistical inferences in settings that would seem to require multiple comparisons adjustments. We challenge the Type I error paradigm that underlies these corrections. Moreover we posit that the problem of multiple comparisons can disappear entirely when viewed from a hierarchical Bayesian…

  6. Contextual Richness and Word Learning: Context Enhances Comprehension but Retrieval Enhances Retention

    ERIC Educational Resources Information Center

    van den Broek, Gesa S. E.; Takashima, Atsuko; Segers, Eliane; Verhoeven, Ludo

    2018-01-01

    Learning new vocabulary from context typically requires multiple encounters during which word meaning can be retrieved from memory or inferred from context. We compared the effect of memory retrieval and context inferences on short- and long-term retention in three experiments. Participants studied novel words and then practiced the words either…

  7. Spatial scaling and multi-model inference in landscape genetics: Martes americana in northern Idaho

    Treesearch

    Tzeidle N. Wasserman; Samuel A. Cushman; Michael K. Schwartz; David O. Wallin

    2010-01-01

    Individual-based analyses relating landscape structure to genetic distances across complex landscapes enable rigorous evaluation of multiple alternative hypotheses linking landscape structure to gene flow. We utilize two extensions to increase the rigor of the individual-based causal modeling approach to inferring relationships between landscape patterns and gene flow...

  8. cDREM: inferring dynamic combinatorial gene regulation.

    PubMed

    Wise, Aaron; Bar-Joseph, Ziv

    2015-04-01

    Genes are often combinatorially regulated by multiple transcription factors (TFs). Such combinatorial regulation plays an important role in development and facilitates the ability of cells to respond to different stresses. While a number of approaches have utilized sequence and ChIP-based datasets to study combinatorial regulation, these have often ignored the combinatorial logic and the dynamics associated with such regulation. Here we present cDREM, a new method for reconstructing dynamic models of combinatorial regulation. cDREM integrates time series gene expression data with (static) protein interaction data. The method is based on a hidden Markov model and utilizes the sparse group Lasso to identify small subsets of combinatorially active TFs, their time of activation, and the logical function they implement. We tested cDREM on yeast and human data sets. Using yeast we show that the predicted combinatorial sets agree with other high-throughput genomic datasets and improve upon prior methods developed to infer combinatorial regulation. Applying cDREM to study the human response to flu, we were able to identify several combinatorial TF sets, some of which were known to regulate immune response while others represent novel combinations of important TFs.
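
    A toy rendition of the "logical function" idea: given binarized activity of two TFs and of a target gene across time points, score a handful of Boolean gates against the target and keep the best-fitting one. The data and candidate gates below are invented for illustration:

    ```python
    import numpy as np

    tf1    = np.array([0, 1, 1, 0, 1], dtype=bool)   # binarized TF activity over time
    tf2    = np.array([1, 1, 0, 0, 1], dtype=bool)
    target = np.array([0, 1, 0, 0, 1], dtype=bool)   # binarized target expression

    gates = {"AND": tf1 & tf2, "OR": tf1 | tf2, "XOR": tf1 ^ tf2}
    scores = {name: float((out == target).mean()) for name, out in gates.items()}
    print(max(scores, key=scores.get), scores)       # AND fits this toy target best
    ```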

  9. Identifying cooperative transcriptional regulations using protein–protein interactions

    PubMed Central

    Nagamine, Nobuyoshi; Kawada, Yuji; Sakakibara, Yasubumi

    2005-01-01

    Cooperative transcriptional activations among multiple transcription factors (TFs) are important to understand the mechanisms of complex transcriptional regulations in eukaryotes. Previous studies have attempted to find cooperative TFs based on gene expression data with gene expression profiles as a measure of similarity of gene regulations. In this paper, we use protein–protein interaction data to infer synergistic binding of cooperative TFs. Our fundamental idea is based on the assumption that genes contributing to a similar biological process are regulated under the same control mechanism. First, the protein–protein interaction networks are used to calculate the similarity of biological processes among genes. Second, we integrate this similarity and the chromatin immuno-precipitation data to identify cooperative TFs. Our computational experiments in yeast show that predictions made by our method have successfully identified eight pairs of cooperative TFs that have literature evidences but could not be identified by the previous method. Further, 12 new possible pairs have been inferred and we have examined the biological relevances for them. However, since a typical problem using protein–protein interaction data is that many false-positive data are contained, we propose a method combining various biological data to increase the prediction accuracy. PMID:16126847

  10. Metainference: A Bayesian inference method for heterogeneous systems

    PubMed Central

    Bonomi, Massimiliano; Camilloni, Carlo; Cavalli, Andrea; Vendruscolo, Michele

    2016-01-01

    Modeling a complex system is almost invariably a challenging task. The incorporation of experimental observations can be used to improve the quality of a model and thus to obtain better predictions about the behavior of the corresponding system. This approach, however, is affected by a variety of different errors, especially when a system simultaneously populates an ensemble of different states and experimental data are measured as averages over such states. To address this problem, we present a Bayesian inference method, called “metainference,” that is able to deal with errors in experimental measurements and with experimental measurements averaged over multiple states. To achieve this goal, metainference models a finite sample of the distribution of models using a replica approach, in the spirit of the replica-averaging modeling based on the maximum entropy principle. To illustrate the method, we present its application to a heterogeneous model system and to the determination of an ensemble of structures corresponding to the thermal fluctuations of a protein molecule. Metainference thus provides an approach to modeling complex systems with heterogeneous components and interconverting between different states by taking into account all possible sources of errors. PMID:26844300
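
    The heart of the approach can be written as a single score: the forward model is averaged over N replicas of the system before comparison with the experimental averages, with a Gaussian error model absorbing measurement error. A skeleton in that spirit (callables and the error model are our simplifying assumptions; the full method also infers the error magnitudes):

    ```python
    import numpy as np

    def neg_log_posterior(replicas, data, sigma, forward, log_prior):
        """Replica-averaged score in the spirit of metainference: the
        forward model is averaged over the replicas before comparison with
        the experimental averages; a Gaussian error model of width sigma
        absorbs measurement error. `forward` and `log_prior` are
        user-supplied callables."""
        pred = np.mean([forward(x) for x in replicas], axis=0)
        chi2 = np.sum((pred - data) ** 2) / (2.0 * sigma ** 2)
        return chi2 - sum(log_prior(x) for x in replicas)
    ```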

  11. Faster Mass Spectrometry-based Protein Inference: Junction Trees are More Efficient than Sampling and Marginalization by Enumeration

    PubMed Central

    Serang, Oliver; Noble, William Stafford

    2012-01-01

    The problem of identifying the proteins in a complex mixture using tandem mass spectrometry can be framed as an inference problem on a graph that connects peptides to proteins. Several existing protein identification methods make use of statistical inference methods for graphical models, including expectation maximization, Markov chain Monte Carlo, and full marginalization coupled with approximation heuristics. We show that, for this problem, the majority of the cost of inference usually comes from a few highly connected subgraphs. Furthermore, we evaluate three different statistical inference methods using a common graphical model, and we demonstrate that junction tree inference substantially improves rates of convergence compared to existing methods. The python code used for this paper is available at http://noble.gs.washington.edu/proj/fido. PMID:22331862
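
    The cost the authors highlight is easy to see in a brute-force baseline: marginalizing protein presence by enumeration visits all 2^n protein configurations, which is exactly what blows up on highly connected subgraphs. A tiny noisy-OR example (priors and emission probabilities invented):

    ```python
    from itertools import product
    import numpy as np

    # Tiny peptide-protein graph: peptide -> indices of candidate parent proteins.
    parents = {0: [0], 1: [0, 1], 2: [1, 2]}
    n_prot, prior, emit, noise = 3, 0.5, 0.9, 0.05   # all probabilities invented

    post, Z = np.zeros(n_prot), 0.0
    for present in product([0, 1], repeat=n_prot):   # 2**n_prot configurations
        p = 1.0
        for prot in range(n_prot):
            p *= prior if present[prot] else 1.0 - prior
        for pep, ps in parents.items():              # noisy-OR peptide emission
            p_missed = (1.0 - noise) * np.prod([1.0 - emit * present[q] for q in ps])
            p *= 1.0 - p_missed                      # every peptide was observed
        Z += p
        post += p * np.array(present)
    print("protein posteriors:", post / Z)
    ```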

  12. Inferring explicit weighted consensus networks to represent alternative evolutionary histories

    PubMed Central

    2013-01-01

    Background: The advent of molecular biology techniques and constant increase in availability of genetic material have triggered the development of many phylogenetic tree inference methods. However, several reticulate evolution processes, such as horizontal gene transfer and hybridization, have been shown to blur the species evolutionary history by causing discordance among phylogenies inferred from different genes. Methods: To tackle this problem, we hereby describe a new method for inferring and representing alternative (reticulate) evolutionary histories of species as an explicit weighted consensus network which can be constructed from a collection of gene trees with or without prior knowledge of the species phylogeny. Results: We provide a way of building a weighted phylogenetic network for each of the following reticulation mechanisms: diploid hybridization, intragenic recombination and complete or partial horizontal gene transfer. We successfully tested our method on some synthetic and real datasets to infer the above-mentioned evolutionary events which may have influenced the evolution of many species. Conclusions: Our weighted consensus network inference method allows one to infer, visualize and validate statistically major conflicting signals induced by the mechanisms of reticulate evolution. The results provided by the new method can be used to represent the inferred conflicting signals by means of explicit and easy-to-interpret phylogenetic networks. PMID:24359207

  13. Statistical inference from multiple iTRAQ experiments without using common reference standards.

    PubMed

    Herbrich, Shelley M; Cole, Robert N; West, Keith P; Schulze, Kerry; Yager, James D; Groopman, John D; Christian, Parul; Wu, Lee; O'Meally, Robert N; May, Damon H; McIntosh, Martin W; Ruczinski, Ingo

    2013-02-01

    Isobaric tags for relative and absolute quantitation (iTRAQ) is a prominent mass spectrometry technology for protein identification and quantification that is capable of analyzing multiple samples in a single experiment. Frequently, iTRAQ experiments are carried out using an aliquot from a pool of all samples, or "masterpool", in one of the channels as a reference sample standard to estimate protein relative abundances in the biological samples and to combine abundance estimates from multiple experiments. In this manuscript, we show that using a masterpool is counterproductive. We obtain more precise estimates of protein relative abundance by using the available biological data instead of the masterpool and do not need to occupy a channel that could otherwise be used for another biological sample. In addition, we introduce a simple statistical method to associate proteomic data from multiple iTRAQ experiments with a numeric response and show that this approach is more powerful than the conventionally employed masterpool-based approach. We illustrate our methods using data from four replicate iTRAQ experiments on aliquots of the same pool of plasma samples and from a 406-sample project designed to identify plasma proteins that covary with nutrient concentrations in chronically undernourished children from South Asia.

  14. Quantifying selection in evolving populations using time-resolved genetic data

    NASA Astrophysics Data System (ADS)

    Illingworth, Christopher J. R.; Mustonen, Ville

    2013-01-01

    Methods which uncover the molecular basis of the adaptive evolution of a population address some important biological questions. For example, the problem of identifying genetic variants which underlie drug resistance, a question of importance for the treatment of pathogens, and of cancer, can be understood as a matter of inferring selection. One difficulty in the inference of variants under positive selection is the potential complexity of the underlying evolutionary dynamics, which may involve an interplay between several contributing processes, including mutation, recombination and genetic drift. A source of progress may be found in modern sequencing technologies, which confer an increasing ability to gather information about evolving populations, granting a window into these complex processes. One particularly interesting development is the ability to follow evolution as it happens, by whole-genome sequencing of an evolving population at multiple time points. We here discuss how to use time-resolved sequence data to draw inferences about the evolutionary dynamics of a population under study. We begin by reviewing our earlier analysis of a yeast selection experiment, in which we used a deterministic evolutionary framework to identify alleles under selection for heat tolerance, and to quantify the selection acting upon them. Considering further the use of advanced intercross lines to measure selection, we here extend this framework to cover scenarios of simultaneous recombination and selection, and of two driver alleles with multiple linked neutral, or passenger, alleles, where the driver pair evolves under an epistatic fitness landscape. We conclude by discussing the limitations of the approach presented and outlining future challenges for such methodologies.
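
    The deterministic framework mentioned above admits a compact illustration: under haploid selection, allele frequency obeys logit(x_t) = logit(x_0) + s t, so the selection coefficient is the slope of the logit-transformed frequencies over time. A minimal sketch with invented frequencies:

    ```python
    import numpy as np

    def estimate_s(freqs, times):
        """Deterministic haploid selection: logit(x_t) = logit(x_0) + s*t,
        so s is the slope of the logit-transformed allele frequencies."""
        x = np.asarray(freqs, dtype=float)
        logit = np.log(x / (1.0 - x))
        s, _ = np.polyfit(times, logit, 1)
        return s

    print(estimate_s([0.10, 0.24, 0.47, 0.72], [0, 10, 20, 30]))  # ~0.10 per generation
    ```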

  15. Phylogenic inference using alignment-free methods for applications in microbial community surveys using 16s rRNA gene

    PubMed Central

    2017-01-01

    The diversity of microbiota is best explored by understanding the phylogenetic structure of the microbial communities. Traditionally, sequence alignment has been used for phylogenetic inference. However, alignment-based approaches come with significant challenges and limitations when massive amounts of data are analyzed. In the recent decade, alignment-free approaches have enabled genome-scale phylogenetic inference. Here we evaluate three alignment-free methods: ACS, CVTree, and Kr for phylogenetic inference with 16S rRNA gene data. We use a taxonomic gold standard to compare the accuracy of alignment-free phylogenetic inference with that of common microbiome-wide phylogenetic inference pipelines based on PyNAST and MUSCLE alignments with FastTree and RAxML. We re-simulate fecal communities from Human Microbiome Project data to evaluate the performance of the methods on datasets with properties of real data. Our comparisons show that alignment-free methods are not inferior to alignment-based methods in giving accurate and robust phylogenetic trees. Moreover, consensus ensembles of alignment-free phylogenies are superior to those built from alignment-based methods in their ability to highlight community differences in low-power settings. In addition, the overall running times of alignment-based and alignment-free phylogenetic inference are comparable. Taken together, our empirical results suggest that alignment-free methods provide a viable approach for microbiome-wide phylogenetic inference. PMID:29136663
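
    To illustrate the flavor of alignment-free comparison (simpler than ACS, CVTree, or Kr themselves): sequences are summarized by k-mer frequency profiles, and pairwise profile distances can then feed a standard distance-based tree builder. The k, metric, and toy sequences below are ours:

    ```python
    from collections import Counter
    from itertools import combinations
    import math

    def kmer_profile(seq, k=4):
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        total = sum(counts.values())
        return {kmer: c / total for kmer, c in counts.items()}

    def cosine_distance(p, q):
        dot = sum(p[w] * q.get(w, 0.0) for w in p)
        norm = math.sqrt(sum(v * v for v in p.values())) * \
               math.sqrt(sum(v * v for v in q.values()))
        return 1.0 - dot / norm

    seqs = {"a": "ACGTACGTGGCA", "b": "ACGTACGTGGGA", "c": "TTTTCCCCAAAA"}
    for x, y in combinations(seqs, 2):   # a distance matrix for tree building
        d = cosine_distance(kmer_profile(seqs[x]), kmer_profile(seqs[y]))
        print(x, y, round(d, 3))
    ```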

  16. A century of paraphyly: a molecular phylogeny of katydids (Orthoptera: Tettigoniidae) supports multiple origins of leaf-like wings.

    PubMed

    Mugleston, Joseph D; Song, Hojun; Whiting, Michael F

    2013-12-01

    The phylogenetic relationships of Tettigoniidae (katydids and bush-crickets) were inferred using molecular sequence data. Six genes (18S rDNA, 28S rDNA, Cytochrome Oxidase II, Histone 3, Tubulin Alpha I, and Wingless) were sequenced for 135 ingroup taxa representing 16 of the 19 extant katydid subfamilies. Five subfamilies (Tettigoniinae, Pseudophyllinae, Mecopodinae, Meconematinae, and Listroscelidinae) were found to be paraphyletic under various tree reconstruction methods (Maximum Likelihood, Bayesian Inference and Maximum Parsimony). Seven subfamilies - Conocephalinae, Hetrodinae, Hexacentrinae, Saginae, Phaneropterinae, Phyllophorinae, and Lipotactinae - were each recovered as well-supported monophyletic groups. We mapped the small and exposed thoracic auditory spiracle (a defining character of the subfamily Pseudophyllinae) and found it to be homoplasious. We also found the leaf-like wings of katydids have been derived independently in at least six lineages. Copyright © 2013 Elsevier Inc. All rights reserved.

  17. From information processing to decisions: Formalizing and comparing psychologically plausible choice models.

    PubMed

    Heck, Daniel W; Hilbig, Benjamin E; Moshagen, Morten

    2017-08-01

    Decision strategies explain how people integrate multiple sources of information to make probabilistic inferences. In the past decade, increasingly sophisticated methods have been developed to determine which strategy explains decision behavior best. We extend these efforts to test psychologically more plausible models (i.e., strategies), including a new, probabilistic version of the take-the-best (TTB) heuristic that implements a rank order of error probabilities based on sequential processing. Within a coherent statistical framework, deterministic and probabilistic versions of TTB and other strategies can be compared directly using model selection by minimum description length or the Bayes factor. In an experiment with inferences from given information, only three of 104 participants were best described by the psychologically plausible, probabilistic version of TTB. As in previous studies, most participants were classified as users of weighted-additive, a strategy that integrates all available information and approximates rational decisions. Copyright © 2017 Elsevier Inc. All rights reserved.

  18. Analysis prediction of Indonesian banks (BCA, BNI, MANDIRI) using adaptive neuro-fuzzy inference system (ANFIS) and investment strategies

    NASA Astrophysics Data System (ADS)

    Trianto, Andriantama Budi; Hadi, I. M.; Liong, The Houw; Purqon, Acep

    2015-09-01

    Indonesia's economic development is growing well, which affects investment in banks and the stock market. In this study, we perform prediction for three Indonesian blue-chip banks, i.e., BCA, BNI, and MANDIRI, using the method of Adaptive Neuro-Fuzzy Inference System (ANFIS) with Takagi-Sugeno rules and the generalized bell (Gbell) membership function. Our results show that ANFIS performs good prediction, with RMSE of 27 for BCA, 5.29 for BNI, and 13.41 for MANDIRI. Furthermore, we develop an active investment strategy to gain more benefit and compare it with a passive strategy. Our results show that the passive strategy gains 13 million rupiah in one year, while the active strategy gains 47 million rupiah, a multiple of the passive strategy's benefit.
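
    The membership function named above is easy to state: the generalized bell is mu(x) = 1 / (1 + |(x - c)/a|^(2b)), with width a, slope b, and center c. A short sketch with illustrative parameter values:

    ```python
    import numpy as np

    def gbell(x, a, b, c):
        """Generalized bell membership: mu(x) = 1 / (1 + |(x-c)/a|**(2b)),
        with width a, slope b, center c (values below are illustrative)."""
        return 1.0 / (1.0 + np.abs((x - c) / a) ** (2 * b))

    x = np.linspace(0.0, 10.0, 5)
    print(gbell(x, a=2.0, b=2.0, c=5.0))   # peaks at 1 when x == c
    ```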

  19. A nonparametric method to generate synthetic populations to adjust for complex sampling design features.

    PubMed

    Dong, Qi; Elliott, Michael R; Raghunathan, Trivellore E

    2014-06-01

    Outside of the survey sampling literature, samples are often assumed to be generated by a simple random sampling process that produces independent and identically distributed (IID) samples. Many statistical methods are developed largely in this IID world. Application of these methods to data from complex sample surveys without making allowance for the survey design features can lead to erroneous inferences. Hence, much time and effort have been devoted to developing statistical methods that analyze complex survey data and account for the sample design. This issue is particularly important when generating synthetic populations using finite population Bayesian inference, as is often done in missing data or disclosure risk settings, or when combining data from multiple surveys. By extending previous work in the finite population Bayesian bootstrap literature, we propose a method to generate synthetic populations from a posterior predictive distribution in a fashion that inverts the complex sampling design features and generates simple random samples from a superpopulation point of view, adjusting the complex data so that they can be analyzed as simple random samples. We consider a simulation study with a stratified, clustered unequal-probability of selection sample design, and use the proposed nonparametric method to generate synthetic populations for the 2006 National Health Interview Survey (NHIS) and the Medical Expenditure Panel Survey (MEPS), which are stratified, clustered unequal-probability of selection sample designs.
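
    A first-order sketch of the generation step: an observed sample is expanded into a synthetic population by resampling with probability proportional to the design weights, after which the synthetic population can be analyzed as a simple random sample. The full weighted finite population Bayesian bootstrap adds a Polya-urn step to propagate posterior uncertainty; the skeleton below omits it:

    ```python
    import numpy as np

    rng = np.random.default_rng(3)

    def synthetic_population(y, w, N):
        """Expand observed values y with design weights w into a synthetic
        population of size N by weight-proportional resampling. The full
        weighted finite population Bayesian bootstrap adds a Polya-urn
        step to propagate posterior uncertainty; this skeleton omits it."""
        p = np.asarray(w, dtype=float) / np.sum(w)
        return rng.choice(y, size=N, replace=True, p=p)

    pop = synthetic_population(y=np.array([1.2, 3.4, 2.2, 5.1]),
                               w=np.array([10, 40, 25, 25]), N=1000)
    print(pop.mean())   # the synthetic population is then analyzed as an SRS
    ```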

  20. A nonparametric method to generate synthetic populations to adjust for complex sampling design features

    PubMed Central

    Dong, Qi; Elliott, Michael R.; Raghunathan, Trivellore E.

    2017-01-01

    Outside of the survey sampling literature, samples are often assumed to be generated by a simple random sampling process that produces independent and identically distributed (IID) samples. Many statistical methods are developed largely in this IID world. Application of these methods to data from complex sample surveys without making allowance for the survey design features can lead to erroneous inferences. Hence, much time and effort have been devoted to developing statistical methods that analyze complex survey data and account for the sample design. This issue is particularly important when generating synthetic populations using finite population Bayesian inference, as is often done in missing data or disclosure risk settings, or when combining data from multiple surveys. By extending previous work in the finite population Bayesian bootstrap literature, we propose a method to generate synthetic populations from a posterior predictive distribution in a fashion that inverts the complex sampling design features and generates simple random samples from a superpopulation point of view, adjusting the complex data so that they can be analyzed as simple random samples. We consider a simulation study with a stratified, clustered unequal-probability of selection sample design, and use the proposed nonparametric method to generate synthetic populations for the 2006 National Health Interview Survey (NHIS) and the Medical Expenditure Panel Survey (MEPS), which are stratified, clustered unequal-probability of selection sample designs. PMID:29200608

  1. UniEnt: uniform entropy model for the dynamics of a neuronal population

    NASA Astrophysics Data System (ADS)

    Hernandez Lahme, Damian; Nemenman, Ilya

    Sensory information and motor responses are encoded in the brain in the collective spiking activity of a large number of neurons. Understanding the neural code requires inferring statistical properties of such collective dynamics from multicellular neurophysiological recordings. Questions of whether synchronous activity or silence of multiple neurons carries information about the stimuli or the motor responses are especially interesting. Unfortunately, detection of such high-order statistical interactions from data is especially challenging due to the exponentially large dimensionality of the state space of neural collectives. Here we present UniEnt, a method for the inference of strengths of multivariate neural interaction patterns. The method is based on a Bayesian prior that makes no assumptions (uniform a priori expectations) about the value of the entropy of the observed multivariate neural activity, in contrast to popular approaches that maximize this entropy. We then study previously published multi-electrode recording data from the salamander retina, exposing the relevance of higher-order neural interaction patterns for information encoding in this system. This work was supported in part by Grants JSMF/220020321 and NSF/IOS/1208126.

  2. Multiple-Line Inference of Selection on Quantitative Traits

    PubMed Central

    Riedel, Nico; Khatri, Bhavin S.; Lässig, Michael; Berg, Johannes

    2015-01-01

    Trait differences between species may be attributable to natural selection. However, quantifying the strength of evidence for selection acting on a particular trait is a difficult task. Here we develop a population genetics test for selection acting on a quantitative trait that is based on multiple-line crosses. We show that using multiple lines increases both the power and the scope of selection inferences. First, a test based on three or more lines detects selection with strongly increased statistical significance, and we show explicitly how the sensitivity of the test depends on the number of lines. Second, a multiple-line test can distinguish between different lineage-specific selection scenarios. Our analytical results are complemented by extensive numerical simulations. We then apply the multiple-line test to QTL data on floral character traits in plant species of the Mimulus genus and on photoperiodic traits in different maize strains, where we find a signature of lineage-specific selection not seen in two-line tests. PMID:26139839

  3. Learning and retention through predictive inference and classification.

    PubMed

    Sakamoto, Yasuaki; Love, Bradley C

    2010-12-01

    Work in category learning addresses how humans acquire knowledge and, thus, should inform classroom practices. In two experiments, we apply and evaluate intuitions garnered from laboratory-based research in category learning to learning tasks situated in an educational context. In Experiment 1, learning through predictive inference and classification were compared for fifth-grade students using class-related materials. Making inferences about properties of category members and receiving feedback led to the acquisition of both queried (i.e., tested) properties and nonqueried properties that were correlated with a queried property (e.g., even if not queried, students learned about a species' habitat because it correlated with a queried property, like the species' size). In contrast, classifying items according to their species and receiving feedback led to knowledge of only the property most diagnostic of category membership. After multiple-day delay, the fifth-graders who learned through inference selectively retained information about the queried properties, and the fifth-graders who learned through classification retained information about the diagnostic property, indicating a role for explicit evaluation in establishing memories. Overall, inference learning resulted in fewer errors, better retention, and more liking of the categories than did classification learning. Experiment 2 revealed that querying a property only a few times was enough to manifest the full benefits of inference learning in undergraduate students. These results suggest that classroom teaching should emphasize reasoning from the category to multiple properties rather than from a set of properties to the category. (PsycINFO Database Record (c) 2010 APA, all rights reserved).

  4. Statistical analysis of particle trajectories in living cells

    NASA Astrophysics Data System (ADS)

    Briane, Vincent; Kervrann, Charles; Vimond, Myriam

    2018-06-01

    Recent advances in molecular biology and fluorescence microscopy imaging have made possible the inference of the dynamics of molecules in living cells. Such inference allows us to understand and determine the organization and function of the cell. The trajectories of particles (e.g., biomolecules) in living cells, computed with the help of object tracking methods, can be modeled with diffusion processes. Three types of diffusion are considered: (i) free diffusion, (ii) subdiffusion, and (iii) superdiffusion. The mean-square displacement (MSD) is generally used to discriminate the three types of particle dynamics. We propose here a nonparametric three-decision test as an alternative to the MSD method. The rejection of the null hypothesis, i.e., free diffusion, is accompanied by claims of the direction of the alternative (subdiffusion or superdiffusion). We study the asymptotic behavior of the test statistic under the null hypothesis and under parametric alternatives which are currently considered in the biophysics literature. In addition, we adapt the multiple-testing procedure of Benjamini and Hochberg to fit with the three-decision-test setting, in order to apply the test procedure to a collection of independent trajectories. The performance of our procedure is much better than that of the MSD method, as confirmed by Monte Carlo experiments. The method is demonstrated on real data sets corresponding to protein dynamics observed in fluorescence microscopy.
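
    For context, here is a minimal sketch of the classical MSD baseline that the proposed three-decision test is compared against (this is not the authors' test; the threshold `tol` and the simulated trajectories are illustrative):

```python
import numpy as np

def msd(traj, max_lag):
    """Mean-square displacement of a single 2D trajectory (n_steps, 2)."""
    return np.array([np.mean(np.sum((traj[lag:] - traj[:-lag])**2, axis=1))
                     for lag in range(1, max_lag + 1)])

def classify(traj, max_lag=20, tol=0.3):
    """Crude MSD-based call: fit MSD ~ t^alpha and threshold alpha."""
    lags = np.arange(1, max_lag + 1)
    alpha = np.polyfit(np.log(lags), np.log(msd(traj, max_lag)), 1)[0]
    if alpha < 1 - tol:
        return "subdiffusion", alpha
    if alpha > 1 + tol:
        return "superdiffusion", alpha
    return "free diffusion", alpha

rng = np.random.default_rng(1)
brownian = np.cumsum(rng.normal(size=(1000, 2)), axis=0)
directed = brownian + 0.5 * np.arange(1000)[:, None]   # add drift
print(classify(brownian))   # expected: free diffusion, alpha near 1
print(classify(directed))   # expected: superdiffusion, alpha near 2
```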

  5. Using single cell sequencing data to model the evolutionary history of a tumor.

    PubMed

    Kim, Kyung In; Simon, Richard

    2014-01-24

    The introduction of next-generation sequencing (NGS) technology has made it possible to detect genomic alterations within tumor cells on a large scale. However, most applications of NGS show the genetic content of mixtures of cells. Recently developed single cell sequencing technology can identify variation within a single cell. Characterization of multiple samples from a tumor using single cell sequencing can potentially provide information on the evolutionary history of that tumor. This may facilitate understanding how key mutations accumulate and evolve in lineages to form a heterogeneous tumor. We provide a computational method to infer an evolutionary mutation tree based on single cell sequencing data. Our approach differs from traditional phylogenetic tree approaches in that our mutation tree directly describes temporal order relationships among mutation sites. Our method also accommodates sequencing errors. Furthermore, we provide a method for estimating the proportion of time from the earliest mutation event of the sample to the most recent common ancestor of the sample of cells. Finally, we discuss current limitations on modeling with single cell sequencing data and possible improvements under those limitations. Inferring the temporal ordering of mutational sites using current single cell sequencing data is a challenge. Our proposed method may help elucidate relationships among key mutations and their role in tumor progression.

  6. Inferring dynamic gene regulatory networks in cardiac differentiation through the integration of multi-dimensional data.

    PubMed

    Gong, Wuming; Koyano-Nakagawa, Naoko; Li, Tongbin; Garry, Daniel J

    2015-03-07

    Decoding the temporal control of gene expression patterns is key to the understanding of the complex mechanisms that govern developmental decisions during heart development. High-throughput methods have been employed to systematically study the dynamic and coordinated nature of cardiac differentiation at the global level with multiple dimensions. Therefore, there is a pressing need to develop a systems approach to integrate these data from individual studies and infer the dynamic regulatory networks in an unbiased fashion. We developed a two-step strategy to integrate data from (1) temporal RNA-seq, (2) temporal histone modification ChIP-seq, (3) transcription factor (TF) ChIP-seq and (4) gene perturbation experiments to reconstruct the dynamic network during heart development. First, we trained a logistic regression model to predict the probability (LR score) of any base being bound by 543 TFs with known positional weight matrices. Second, four dimensions of data were combined using a time-varying dynamic Bayesian network model to infer the dynamic networks at four developmental stages in the mouse [mouse embryonic stem cells (ESCs), mesoderm (MES), cardiac progenitors (CP) and cardiomyocytes (CM)]. Our method not only infers the time-varying networks between different stages of heart development, but it also identifies the TF binding sites associated with promoters or enhancers of downstream genes. The LR scores of experimentally verified ESC and heart enhancers were significantly higher than random regions (p < 10^-100), suggesting that a high LR score is a reliable indicator for functional TF binding sites. Our network inference model identified a region with an elevated LR score approximately -9400 bp upstream of the transcriptional start site of Nkx2-5, which overlapped with a previously reported enhancer region (-9435 to -8922 bp). TFs such as Tead1, Gata4, Msx2, and Tgif1 were predicted to bind to this region and participate in the regulation of Nkx2-5 gene expression. Our model also predicted the key regulatory networks for the ESC-MES, MES-CP and CP-CM transitions. We report a novel method to systematically integrate multi-dimensional -omics data and reconstruct the gene regulatory networks. This method will allow one to rapidly determine the cis-modules that regulate key genes during cardiac differentiation.
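
    A hedged sketch of the flavor of the first step: logistic regression mapping per-position features to a binding probability. The three features below are invented stand-ins, not the paper's 543-TF feature set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 2000
# Hypothetical features per genomic position: a PWM match score and
# two chromatin signals (names and effect sizes are illustrative).
X = rng.normal(size=(n, 3))
logit = 2.0 * X[:, 0] + 1.0 * X[:, 1] - 0.5 * X[:, 2]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)  # bound / unbound

model = LogisticRegression().fit(X, y)
lr_score = model.predict_proba(X)[:, 1]   # probability a base is bound
print(model.coef_, lr_score[:5])
```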

  7. Inferring microRNA regulation of mRNA with partially ordered samples of paired expression data and exogenous prediction algorithms.

    PubMed

    Godsey, Brian; Heiser, Diane; Civin, Curt

    2012-01-01

    MicroRNAs (miRs) are known to play an important role in mRNA regulation, often by binding to complementary sequences in "target" mRNAs. Recently, several methods have been developed by which existing sequence-based target predictions can be combined with miR and mRNA expression data to infer true miR-mRNA targeting relationships. It has been shown that the combination of these two approaches gives more reliable results than either by itself. While a few such algorithms give excellent results, none fully addresses expression data sets with a natural ordering of the samples. If the samples in an experiment can be ordered or partially ordered by their expected similarity to one another, such as for time-series or studies of development processes, stages, or types, (e.g. cell type, disease, growth, aging), there are unique opportunities to infer miR-mRNA interactions that may be specific to the underlying processes, and existing methods do not exploit this. We propose an algorithm which specifically addresses [partially] ordered expression data and takes advantage of sample similarities based on the ordering structure. This is done within a Bayesian framework which specifies posterior distributions and therefore statistical significance for each model parameter and latent variable. We apply our model to a previously published expression data set of paired miR and mRNA arrays in five partially ordered conditions, with biological replicates, related to multiple myeloma, and we show how considering potential orderings can improve the inference of miR-mRNA interactions, as measured by existing knowledge about the involved transcripts.

  8. Phylogeny of the cycads based on multiple single-copy nuclear genes: congruence of concatenated parsimony, likelihood and species tree inference methods.

    PubMed

    Salas-Leiva, Dayana E; Meerow, Alan W; Calonje, Michael; Griffith, M Patrick; Francisco-Ortega, Javier; Nakamura, Kyoko; Stevenson, Dennis W; Lewis, Carl E; Namoff, Sandra

    2013-11-01

    Despite a recent new classification, a stable phylogeny for the cycads has been elusive, particularly regarding resolution of Bowenia, Stangeria and Dioon. In this study, five single-copy nuclear genes (SCNGs) are applied to the phylogeny of the order Cycadales. The specific aim is to evaluate several gene tree-species tree reconciliation approaches for developing an accurate phylogeny of the order, to contrast them with concatenated parsimony analysis and to resolve the erstwhile problematic phylogenetic position of these three genera. DNA sequences of five SCNGs were obtained for 20 cycad species representing all ten genera of Cycadales. These were analysed with parsimony, maximum likelihood (ML) and three Bayesian methods of gene tree-species tree reconciliation, using Cycas as the outgroup. A calibrated date estimation was developed with Bayesian methods, and biogeographic analysis was also conducted. Concatenated parsimony, ML and three species tree inference methods resolve exactly the same tree topology with high support at most nodes. Dioon and Bowenia are the first and second branches of Cycadales after Cycas, respectively, followed by an encephalartoid clade (Macrozamia-Lepidozamia-Encephalartos), which is sister to a zamioid clade, of which Ceratozamia is the first branch, and in which Stangeria is sister to Microcycas and Zamia. A single, well-supported phylogenetic hypothesis of the generic relationships of the Cycadales is presented. However, massive extinction events inferred from the fossil record that eliminated broader ancestral distributions within Zamiaceae compromise accurate optimization of ancestral biogeographical areas for that hypothesis. While major lineages of Cycadales are ancient, crown ages of all modern genera are no older than 12 million years, supporting a recent hypothesis of mostly Miocene radiations. This phylogeny can contribute to an accurate infrafamilial classification of Zamiaceae.

  9. Gene order in rosid phylogeny, inferred from pairwise syntenies among extant genomes

    PubMed Central

    2012-01-01

    Background Ancestral gene order reconstruction for flowering plants has lagged behind developments in yeasts, insects and higher animals, because of the recency of widespread plant genome sequencing, sequencers' embargoes on public data use, paralogies due to whole genome duplication (WGD) and fractionation of undeleted duplicates, extensive paralogy from other sources, and the computational cost of existing methods. Results We address these problems, using the gene order of four core eudicot genomes (cacao, castor bean, papaya and grapevine) that have escaped any recent WGD events, and two others (poplar and cucumber) that descend from independent WGDs, in inferring the ancestral gene order of the rosid clade and those of its main subgroups, the fabids and malvids. We improve and adapt techniques including the OMG method for extracting large, paralogy-free, multiple orthologies from conflated pairwise synteny data among the six genomes and the PATHGROUPS approach for ancestral gene order reconstruction in a given phylogeny, where some genomes may be descendants of WGD events. We use the gene order evidence to evaluate the hypothesis that the order Malpighiales belongs to the malvids rather than as traditionally assigned to the fabids. Conclusions Gene orders of ancestral eudicot species, involving 10,000 or more genes can be reconstructed in an efficient, parsimonious and consistent way, despite paralogies due to WGD and other processes. Pairwise genomic syntenies provide appropriate input to a parameter-free procedure of multiple ortholog identification followed by gene-order reconstruction in solving instances of the "small phylogeny" problem. PMID:22759433

  10. Separable processes before, during, and after the N400 elicited by previously inferred and new information: evidence from time-frequency decompositions.

    PubMed

    Steele, Vaughn R; Bernat, Edward M; van den Broek, Paul; Collins, Paul F; Patrick, Christopher J; Marsolek, Chad J

    2013-01-25

    Successful comprehension during reading often requires inferring information not explicitly presented. This information is readily accessible when subsequently encountered, and a neural correlate of this is an attenuation of the N400 event-related potential (ERP). We used ERPs and time-frequency (TF) analysis to investigate neural correlates of processing inferred information after a causal coherence inference had been generated during text comprehension. Participants read short texts, some of which promoted inference generation. After each text, they performed lexical decisions to target words that were unrelated or inference-related to the preceding text. Consistent with previous findings, inference-related words elicited an attenuated N400 relative to unrelated words. TF analyses revealed unique contributions to the N400 from activity occurring at 1-6 Hz (theta) and 0-2 Hz (delta), supporting the view that multiple, sequential processes underlie the N400. Copyright © 2012 Elsevier B.V. All rights reserved.

  11. Active Inference, homeostatic regulation and adaptive behavioural control.

    PubMed

    Pezzulo, Giovanni; Rigoli, Francesco; Friston, Karl

    2015-11-01

    We review a theory of homeostatic regulation and adaptive behavioural control within the Active Inference framework. Our aim is to connect two research streams that are usually considered independently; namely, Active Inference and associative learning theories of animal behaviour. The former uses a probabilistic (Bayesian) formulation of perception and action, while the latter calls on multiple (Pavlovian, habitual, goal-directed) processes for homeostatic and behavioural control. We offer a synthesis of these classical processes and cast them as successive hierarchical contextualisations of sensorimotor constructs, using the generative models that underpin Active Inference. This dissolves any apparent mechanistic distinction between the optimization processes that mediate classical control or learning. Furthermore, we generalize the scope of Active Inference by emphasizing interoceptive inference and homeostatic regulation. The ensuing homeostatic (or allostatic) perspective provides an intuitive explanation for how priors act as drives or goals to enslave action, and emphasises the embodied nature of inference. Copyright © 2015 The Authors. Published by Elsevier Ltd. All rights reserved.

  12. GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods.

    PubMed

    Schaffter, Thomas; Marbach, Daniel; Floreano, Dario

    2011-08-15

    Over the last decade, numerous methods have been developed for inference of regulatory networks from gene expression data. However, accurate and systematic evaluation of these methods is hampered by the difficulty of constructing adequate benchmarks and the lack of tools for a differentiated analysis of network predictions on such benchmarks. Here, we describe a novel and comprehensive method for in silico benchmark generation and performance profiling of network inference methods available to the community as open-source software called GeneNetWeaver (GNW). In addition to the generation of detailed dynamical models of gene regulatory networks to be used as benchmarks, GNW provides a network motif analysis that reveals systematic prediction errors, thereby indicating potential ways of improving inference methods. The accuracy of network inference methods is evaluated using standard metrics such as precision-recall and receiver operating characteristic curves. We show how GNW can be used to assess the performance and identify the strengths and weaknesses of six inference methods. Furthermore, we used GNW to provide the international Dialogue for Reverse Engineering Assessments and Methods (DREAM) competition with three network inference challenges (DREAM3, DREAM4 and DREAM5). GNW is available at http://gnw.sourceforge.net along with its Java source code, user manual and supporting data. Supplementary data are available at Bioinformatics online. dario.floreano@epfl.ch.
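
    A small sketch of the kind of evaluation GNW supports, assuming a gold-standard adjacency matrix and a matrix of edge confidence scores (both synthetic here; GNW benchmarks supply real gold standards):

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score

rng = np.random.default_rng(3)
n_genes = 50
# Synthetic gold-standard network and noisy confidence scores from
# some hypothetical inference method.
truth = (rng.random((n_genes, n_genes)) < 0.05).astype(int)
scores = 0.5 * truth + rng.random((n_genes, n_genes))

mask = ~np.eye(n_genes, dtype=bool)      # ignore self-edges
y, s = truth[mask], scores[mask]
prec, rec, _ = precision_recall_curve(y, s)
print("AUPR :", auc(rec, prec))
print("AUROC:", roc_auc_score(y, s))
```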

  13. Mark-recapture with multiple, non-invasive marks.

    PubMed

    Bonner, Simon J; Holmberg, Jason

    2013-09-01

    Non-invasive marks, including pigmentation patterns, acquired scars, and genetic markers, are often used to identify individuals in mark-recapture experiments. If animals in a population can be identified from multiple, non-invasive marks then some individuals may be counted twice in the observed data. Analyzing the observed histories without accounting for these errors will provide incorrect inference about the population dynamics. Previous approaches to this problem include modeling data from only one mark and combining estimators obtained from each mark separately assuming that they are independent. Motivated by the analysis of data from the ECOCEAN online whale shark (Rhincodon typus) catalog, we describe a Bayesian method to analyze data from multiple, non-invasive marks that is based on the latent-multinomial model of Link et al. (2010, Biometrics 66, 178-185). Further to this, we describe a simplification of the Markov chain Monte Carlo algorithm of Link et al. (2010, Biometrics 66, 178-185) that leads to more efficient computation. We present results from the analysis of the ECOCEAN whale shark data and from simulation studies comparing our method with the previous approaches. © 2013, The International Biometric Society.
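
    The bias that the latent-multinomial approach corrects can be seen in a deliberately crude simulation: if animals carrying two usable marks are entered twice and their between-occasion matches are missed, a naive Lincoln-Petersen estimate inflates badly. All numbers below are hypothetical, and the double-counting mechanism is a caricature of the real matching problem.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, n_sim = 500, 0.3, 2000          # true abundance, capture probability
both_marks = 0.4                       # fraction identifiable by two marks

naive, correct = [], []
for _ in range(n_sim):
    c1 = rng.random(N) < p             # captured on occasion 1
    c2 = rng.random(N) < p             # captured on occasion 2
    dup = rng.random(N) < both_marks   # animals with two usable marks
    # Naive data treat each mark of a duplicated animal as a distinct
    # individual, so between-occasion matches are missed for them.
    n1 = c1.sum() + (c1 & dup).sum()
    n2 = c2.sum() + (c2 & dup).sum()
    m = (c1 & c2 & ~dup).sum()         # recognized recaptures only
    if m > 0:
        naive.append(n1 * n2 / m)
    correct.append(c1.sum() * c2.sum() / max((c1 & c2).sum(), 1))

print("true N:", N)
print("correct Lincoln-Petersen:", round(np.mean(correct)))
print("naive   Lincoln-Petersen:", round(np.mean(naive)))
```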

  14. Cluster mass inference via random field theory.

    PubMed

    Zhang, Hui; Nichols, Thomas E; Johnson, Timothy D

    2009-01-01

    Cluster extent and voxel intensity are two widely used statistics in neuroimaging inference. Cluster extent is sensitive to spatially extended signals while voxel intensity is better for intense but focal signals. In order to leverage strength from both statistics, several nonparametric permutation methods have been proposed to combine the two methods. Simulation studies have shown that of the different cluster permutation methods, the cluster mass statistic is generally the best. However, to date, there is no parametric cluster mass inference available. In this paper, we propose a cluster mass inference method based on random field theory (RFT). We develop this method for Gaussian images, evaluate it on Gaussian and Gaussianized t-statistic images and investigate its statistical properties via simulation studies and real data. Simulation results show that the method is valid under the null hypothesis and demonstrate that it can be more powerful than the cluster extent inference method. Further, analyses with a single subject and a group fMRI dataset demonstrate better power than traditional cluster size inference, and good accuracy relative to a gold-standard permutation test.
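
    For contrast with the parametric RFT result, a minimal sketch of the nonparametric cluster-mass permutation logic follows. The "null" here is simple noise re-simulation, a stand-in for the relabeling or sign-flipping one would use with real data; the threshold and image size are arbitrary.

```python
import numpy as np
from scipy import ndimage

def cluster_masses(stat_img, thresh):
    """Excess mass above the threshold, summed within each cluster."""
    labels, n = ndimage.label(stat_img > thresh)
    return np.asarray(ndimage.sum(stat_img - thresh, labels,
                                  index=np.arange(1, n + 1)))

rng = np.random.default_rng(5)
img = rng.normal(size=(64, 64))
img[20:28, 20:28] += 1.5                 # a spatially extended signal
thresh = 2.0

observed = cluster_masses(img, thresh).max()
null = np.array([cluster_masses(rng.normal(size=(64, 64)), thresh)
                 .max(initial=0) for _ in range(500)])
print("max cluster mass:", observed, " p =", (null >= observed).mean())
```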

  15. Reconstructing population histories from single nucleotide polymorphism data.

    PubMed

    Sirén, Jukka; Marttinen, Pekka; Corander, Jukka

    2011-01-01

    Population genetics encompasses a strong theoretical and applied research tradition on the multiple demographic processes that shape genetic variation present within a species. When several distinct populations exist in the current generation, it is often natural to consider the pattern of their divergence from a single ancestral population in terms of a binary tree structure. Inference about such population histories based on molecular data has been an intensive research topic in recent years. The most common approach uses coalescent theory to model genealogies of individuals sampled from the current populations. Such methods are able to compare several different evolutionary scenarios and to estimate demographic parameters. However, their major limitation is the enormous computational complexity associated with the indirect modeling of the demographies, which limits the application to small data sets. Here, we propose a novel Bayesian method for inferring population histories from unlinked single nucleotide polymorphisms, which is applicable also to data sets harboring large numbers of individuals from distinct populations. We use an approximation to the neutral Wright-Fisher diffusion to model random fluctuations in allele frequencies. The population histories are modeled as binary rooted trees that represent the historical order of divergence of the different populations. A combination of analytical, numerical, and Monte Carlo integration techniques is utilized for the inferences. A particularly important feature of our approach is that it provides intuitive measures of statistical uncertainty related to the computed estimates, which may be entirely lacking for the alternative methods in this context. The potential of our approach is illustrated by analyses of both simulated and real data sets.
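
    The discrete model that the diffusion approximates is easy to simulate. A minimal neutral Wright-Fisher sketch for unlinked SNPs, with hypothetical parameters:

```python
import numpy as np

rng = np.random.default_rng(12)
N, gens, n_loci = 200, 100, 5        # diploid pop size, generations, SNPs
freq = np.full(n_loci, 0.5)
hist = [freq.copy()]
for _ in range(gens):
    # Binomial resampling of 2N gene copies per unlinked locus.
    freq = rng.binomial(2 * N, freq) / (2 * N)
    hist.append(freq.copy())

# Allele-frequency trajectories sampled every 25 generations.
print(np.round(np.array(hist)[::25], 2))
```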

  16. Effect of distance-related heterogeneity on population size estimates from point counts

    USGS Publications Warehouse

    Efford, Murray G.; Dawson, Deanna K.

    2009-01-01

    Point counts are used widely to index bird populations. Variation in the proportion of birds counted is a known source of error, and for robust inference it has been advocated that counts be converted to estimates of absolute population size. We used simulation to assess nine methods for the conduct and analysis of point counts when the data included distance-related heterogeneity of individual detection probability. Distance from the observer is a ubiquitous source of heterogeneity, because nearby birds are more easily detected than distant ones. Several recent methods (dependent double-observer, time of first detection, time of detection, independent multiple-observer, and repeated counts) do not account for distance-related heterogeneity, at least in their simpler forms. We assessed bias in estimates of population size by simulating counts with fixed radius w over four time intervals (occasions). Detection probability per occasion was modeled as a half-normal function of distance with scale parameter sigma and intercept g(0) = 1.0. Bias varied with sigma/w; for values of sigma inferred from published studies, bias was often on the order of 50% for a 100-m fixed-radius count. More critically, the bias of adjusted counts sometimes varied more than that of unadjusted counts, and inference from adjusted counts would be less robust. The problem was not solved by using mixture models or including distance as a covariate. Conventional distance sampling performed well in simulations, but its assumptions are difficult to meet in the field. We conclude that no existing method allows effective estimation of population size from point counts.
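
    The distance-related heterogeneity at issue is easy to reproduce. A sketch of the simulation design described (half-normal detection with g(0) = 1.0; the bird density is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(6)
sigma, w, density, n_sims = 40.0, 100.0, 0.01, 1000   # metres, birds/m^2

counts, n_true = [], []
for _ in range(n_sims):
    # Poisson number of birds in the w-radius plot, uniform in the disc.
    n = rng.poisson(density * np.pi * w**2)
    r = w * np.sqrt(rng.random(n))          # distances from the observer
    p = np.exp(-r**2 / (2 * sigma**2))      # half-normal detection, g(0)=1
    counts.append((rng.random(n) < p).sum())
    n_true.append(n)

print("mean true N :", np.mean(n_true))
print("mean count  :", np.mean(counts))     # unadjusted counts biased low
print("prop. detected:", np.mean(counts) / np.mean(n_true))
```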

  17. Two-condition within-participant statistical mediation analysis: A path-analytic framework.

    PubMed

    Montoya, Amanda K; Hayes, Andrew F

    2017-03-01

    Researchers interested in testing mediation often use designs where participants are measured on a dependent variable Y and a mediator M in both of 2 different circumstances. The dominant approach to assessing mediation in such a design, proposed by Judd, Kenny, and McClelland (2001), relies on a series of hypothesis tests about components of the mediation model and is not based on an estimate of or formal inference about the indirect effect. In this article we recast Judd et al.'s approach in the path-analytic framework that is now commonly used in between-participant mediation analysis. By so doing, it is apparent how to estimate the indirect effect of a within-participant manipulation on some outcome through a mediator as the product of paths of influence. This path-analytic approach eliminates the need for discrete hypothesis tests about components of the model to support a claim of mediation, as Judd et al.'s method requires, because it relies only on an inference about the product of paths: the indirect effect. We generalize methods of inference for the indirect effect widely used in between-participant designs to this within-participant version of mediation analysis, including bootstrap confidence intervals and Monte Carlo confidence intervals. Using this path-analytic approach, we extend the method to models with multiple mediators operating in parallel and serially and discuss the comparison of indirect effects in these more complex models. We offer macros and code for SPSS, SAS, and Mplus that conduct these analyses. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
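
    A minimal Python sketch of the resampling logic (the authors provide SPSS, SAS, and Mplus code; this simplified version computes the indirect effect a*b from difference scores, with the centered mean of M as a covariate, and a percentile bootstrap CI on simulated data):

```python
import numpy as np

def indirect_effect(m1, m2, y1, y2):
    """a*b for a two-condition within-participant design (simplified)."""
    m_diff, y_diff = m2 - m1, y2 - y1
    m_mean_c = (m1 + m2) / 2 - np.mean((m1 + m2) / 2)
    a = m_diff.mean()                          # effect of condition on M
    X = np.column_stack([np.ones_like(m_diff), m_diff, m_mean_c])
    b = np.linalg.lstsq(X, y_diff, rcond=None)[0][1]   # M -> Y path
    return a * b

rng = np.random.default_rng(7)
n = 80
m1 = rng.normal(0, 1, n); m2 = m1 + 0.8 + rng.normal(0, 1, n)
y1 = 0.5 * m1 + rng.normal(0, 1, n); y2 = 0.5 * m2 + rng.normal(0, 1, n)

boot = np.array([indirect_effect(*(v[idx] for v in (m1, m2, y1, y2)))
                 for idx in (rng.integers(0, n, n) for _ in range(2000))])
print("indirect effect:", indirect_effect(m1, m2, y1, y2))
print("95% bootstrap CI:", np.percentile(boot, [2.5, 97.5]))
```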

  18. Comparing methods of analysing datasets with small clusters: case studies using four paediatric datasets.

    PubMed

    Marston, Louise; Peacock, Janet L; Yu, Keming; Brocklehurst, Peter; Calvert, Sandra A; Greenough, Anne; Marlow, Neil

    2009-07-01

    Studies of prematurely born infants contain a relatively large percentage of multiple births, so the resulting data have a hierarchical structure with small clusters of size 1, 2 or 3. Ignoring the clustering may lead to incorrect inferences. The aim of this study was to compare statistical methods which can be used to analyse such data: generalised estimating equations, multilevel models, multiple linear regression and logistic regression. Four datasets which differed in total size and in percentage of multiple births (n = 254, multiple 18%; n = 176, multiple 9%; n = 10 098, multiple 3%; n = 1585, multiple 8%) were analysed. With the continuous outcome, two-level models produced similar results in the larger dataset, while generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) produced divergent estimates using the smaller dataset. For the dichotomous outcome, most methods, except maximum likelihood multilevel modelling using Gauss-Hermite quadrature (ML GH 'xtlogit' in Stata), gave similar odds ratios and 95% confidence intervals within datasets. For the continuous outcome, our results suggest using multilevel modelling. We conclude that generalised least squares multilevel modelling (ML GLS 'xtreg' in Stata) and maximum likelihood multilevel modelling (ML MLE 'xtmixed' in Stata) should be used with caution when the dataset is small. Where the outcome is dichotomous and there is a relatively large percentage of non-independent data, it is recommended that these are accounted for in analyses using logistic regression with adjusted standard errors or multilevel modelling. If, however, the dataset has a small percentage of clusters greater than size 1 (e.g. a population dataset of children where there are few multiples) there appears to be less need to adjust for clustering.

  19. Rapid, High-Throughput Tracking of Bacterial Motility in 3D via Phase-Contrast Holographic Video Microscopy

    PubMed Central

    Cheong, Fook Chiong; Wong, Chui Ching; Gao, YunFeng; Nai, Mui Hoon; Cui, Yidan; Park, Sungsu; Kenney, Linda J.; Lim, Chwee Teck

    2015-01-01

    Tracking fast-swimming bacteria in three dimensions can be extremely challenging with current optical techniques and a microscopic approach that can rapidly acquire volumetric information is required. Here, we introduce phase-contrast holographic video microscopy as a solution for the simultaneous tracking of multiple fast moving cells in three dimensions. This technique uses interference patterns formed between the scattered and the incident field to infer the three-dimensional (3D) position and size of bacteria. Using this optical approach, motility dynamics of multiple bacteria in three dimensions, such as speed and turn angles, can be obtained within minutes. We demonstrated the feasibility of this method by effectively tracking multiple bacteria species, including Escherichia coli, Agrobacterium tumefaciens, and Pseudomonas aeruginosa. In addition, we combined our fast 3D imaging technique with a microfluidic device to present an example of a drug/chemical assay to study effects on bacterial motility. PMID:25762336

  20. Selection of latent variables for multiple mixed-outcome models

    PubMed Central

    ZHOU, LING; LIN, HUAZHEN; SONG, XINYUAN; LI, YI

    2014-01-01

    Latent variable models have been widely used for modeling the dependence structure of multiple outcomes data. However, the formulation of a latent variable model is often unknown a priori, and misspecification will distort the dependence structure and lead to unreliable model inference. Moreover, multiple outcomes with varying types present enormous analytical challenges. In this paper, we present a class of general latent variable models that can accommodate mixed types of outcomes. We propose a novel selection approach that simultaneously selects latent variables and estimates parameters. We show that the proposed estimator is consistent, asymptotically normal and has the oracle property. The practical utility of the methods is confirmed via simulations as well as an application to the analysis of the World Values Survey, a global research project that explores people's values and beliefs and the social and personal characteristics that might influence them. PMID:27642219

  1. A Practical Illustration of Multidimensional Diagnostic Skills Profiling: Comparing Results from Confirmatory Factor Analysis and Diagnostic Classification Models

    ERIC Educational Resources Information Center

    Kunina-Habenicht, Olga; Rupp, Andre A.; Wilhelm, Oliver

    2009-01-01

    In recent years there has been an increasing international interest in fine-grained diagnostic inferences on multiple skills for formative purposes. A successful provision of such inferences that support meaningful instructional decision-making requires (a) careful diagnostic assessment design coupled with (b) empirical support for the structure…

  2. A Patch to MCNP5 for Multiplication Inference: Description and User Guide

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Solomon, Jr., Clell J.

    2014-05-05

    A patch to MCNP5 has been written to allow generation of multiple neutrons from a spontaneous-fission event and generate list-mode output. This report documents the implementation and usage of this patch.

  3. Formalizing the Role of Agent-Based Modeling in Causal Inference and Epidemiology

    PubMed Central

    Marshall, Brandon D. L.; Galea, Sandro

    2015-01-01

    Calls for the adoption of complex systems approaches, including agent-based modeling, in the field of epidemiology have largely centered on the potential for such methods to examine complex disease etiologies, which are characterized by feedback behavior, interference, threshold dynamics, and multiple interacting causal effects. However, considerable theoretical and practical issues impede the capacity of agent-based methods to examine and evaluate causal effects and thus illuminate new areas for intervention. We build on this work by describing how agent-based models can be used to simulate counterfactual outcomes in the presence of complexity. We show that these models are of particular utility when the hypothesized causal mechanisms exhibit a high degree of interdependence between multiple causal effects and when interference (i.e., one person's exposure affects the outcome of others) is present and of intrinsic scientific interest. Although not without challenges, agent-based modeling (and complex systems methods broadly) represent a promising novel approach to identify and evaluate complex causal effects, and they are thus well suited to complement other modern epidemiologic methods of etiologic inquiry. PMID:25480821

  4. Anchoring quartet-based phylogenetic distances and applications to species tree reconstruction.

    PubMed

    Sayyari, Erfan; Mirarab, Siavash

    2016-11-11

    Inferring species trees from gene trees using the coalescent-based summary methods has been the subject of much attention, yet new scalable and accurate methods are needed. We introduce DISTIQUE, a new statistically consistent summary method for inferring species trees from gene trees under the coalescent model. We generalize our results to arbitrary phylogenetic inference problems; we show that two arbitrarily chosen leaves, called anchors, can be used to estimate relative distances between all other pairs of leaves by inferring relevant quartet trees. This results in a family of distance-based tree inference methods, with running times ranging from quadratic to quartic in the number of leaves. We show in simulated studies that DISTIQUE has comparable accuracy to leading coalescent-based summary methods and reduced running times.

  5. Statistical inference of seabed sound-speed structure in the Gulf of Oman Basin.

    PubMed

    Sagers, Jason D; Knobles, David P

    2014-06-01

    Addressed is the statistical inference of the sound-speed depth profile of a thick soft seabed from broadband sound propagation data recorded in the Gulf of Oman Basin in 1977. The acoustic data are in the form of time series signals recorded on a sparse vertical line array and generated by explosive sources deployed along a 280 km track. The acoustic data offer a unique opportunity to study a deep-water bottom-limited thickly sedimented environment because of the large number of time series measurements, very low seabed attenuation, and auxiliary measurements. A maximum entropy method is employed to obtain a conditional posterior probability distribution (PPD) for the sound-speed ratio and the near-surface sound-speed gradient. The multiple data samples allow for a determination of the average error constraint value required to uniquely specify the PPD for each data sample. Two complicating features of the statistical inference study are addressed: (1) the need to develop an error function that can both utilize the measured multipath arrival structure and mitigate the effects of data errors and (2) the effect of small bathymetric slopes on the structure of the bottom interacting arrivals.

  6. Inferring social ties from geographic coincidences.

    PubMed

    Crandall, David J; Backstrom, Lars; Cosley, Dan; Suri, Siddharth; Huttenlocher, Daniel; Kleinberg, Jon

    2010-12-28

    We investigate the extent to which social ties between people can be inferred from co-occurrence in time and space: Given that two people have been in approximately the same geographic locale at approximately the same time, on multiple occasions, how likely are they to know each other? Furthermore, how does this likelihood depend on the spatial and temporal proximity of the co-occurrences? Such issues arise in data originating in both online and offline domains as well as settings that capture interfaces between online and offline behavior. Here we develop a framework for quantifying the answers to such questions, and we apply this framework to publicly available data from a social media site, finding that even a very small number of co-occurrences can result in a high empirical likelihood of a social tie. We then present probabilistic models showing how such large probabilities can arise from a natural model of proximity and co-occurrence in the presence of social ties. In addition to providing a method for establishing some of the first quantifiable estimates of these measures, our findings have potential privacy implications, particularly for the ways in which social structures can be inferred from public online records that capture individuals' physical locations over time.

  7. A Quantitative Systems Pharmacology Approach to Infer Pathways Involved in Complex Disease Phenotypes.

    PubMed

    Schurdak, Mark E; Pei, Fen; Lezon, Timothy R; Carlisle, Diane; Friedlander, Robert; Taylor, D Lansing; Stern, Andrew M

    2018-01-01

    Designing effective therapeutic strategies for complex diseases such as cancer and neurodegeneration that involve tissue context-specific interactions among multiple gene products presents a major challenge for precision medicine. Safe and selective pharmacological modulation of individual molecular entities associated with a disease often fails to provide efficacy in the clinic. Thus, development of optimized therapeutic strategies for individual patients with complex diseases requires a more comprehensive, systems-level understanding of disease progression. Quantitative systems pharmacology (QSP) is an approach to drug discovery that integrates computational and experimental methods to understand the molecular pathogenesis of a disease at the systems level more completely. Described here is the chemogenomic component of QSP for the inference of biological pathways involved in the modulation of the disease phenotype. The approach involves testing sets of compounds of diverse mechanisms of action in a disease-relevant phenotypic assay, and using the mechanistic information known for the active compounds to infer pathways and networks associated with the phenotype. The example used here is for monogenic Huntington's disease (HD), which due to the pleiotropic nature of the mutant phenotype has a complex pathogenesis. The overall approach, however, is applicable to any complex disease.

  8. Algorithm of OMA for large-scale orthology inference

    PubMed Central

    Roth, Alexander CJ; Gonnet, Gaston H; Dessimoz, Christophe

    2008-01-01

    Background OMA is a project that aims to identify orthologs within publicly available, complete genomes. With 657 genomes analyzed to date, OMA is one of the largest projects of its kind. Results The algorithm of OMA improves upon the standard bidirectional best-hit approach in several respects: it uses evolutionary distances instead of scores, considers distance inference uncertainty, includes many-to-many orthologous relations, and accounts for differential gene losses. Herein, we describe in detail the algorithm for inference of orthology and provide the rationale for parameter selection through multiple tests. Conclusion OMA contains several novel improvement ideas for orthology inference and provides a unique dataset of large-scale orthology assignments. PMID:19055798

  9. Tree-Structured Infinite Sparse Factor Model

    PubMed Central

    Zhang, XianXing; Dunson, David B.; Carin, Lawrence

    2013-01-01

    A tree-structured multiplicative gamma process (TMGP) is developed for inferring the depth of a tree-based factor-analysis model. This new model is coupled with the nested Chinese restaurant process, to nonparametrically infer the depth and width (structure) of the tree. In addition to developing the model, theoretical properties of the TMGP are addressed, and a novel MCMC sampler is developed. The structure of the inferred tree is used to learn relationships between high-dimensional data, and the model is also applied to compressive sensing and interpolation of incomplete images. PMID:25279389

  10. BagReg: Protein inference through machine learning.

    PubMed

    Zhao, Can; Liu, Dao; Teng, Ben; He, Zengyou

    2015-08-01

    Protein inference from the identified peptides is of primary importance in the shotgun proteomics. The target of protein inference is to identify whether each candidate protein is truly present in the sample. To date, many computational methods have been proposed to solve this problem. However, there is still no method that can fully utilize the information hidden in the input data. In this article, we propose a learning-based method named BagReg for protein inference. The method first artificially extracts five features from the input data, and then chooses each feature as the class feature to separately build models to predict the presence probabilities of proteins. Finally, the weak results from five prediction models are aggregated to obtain the final result. We test our method on six publicly available data sets. The experimental results show that our method is superior to the state-of-the-art protein inference algorithms. Copyright © 2015 Elsevier Ltd. All rights reserved.

  11. Phylogenetic place of guinea pigs: no support of the rodent-polyphyly hypothesis from maximum-likelihood analyses of multiple protein sequences.

    PubMed

    Cao, Y; Adachi, J; Yano, T; Hasegawa, M

    1994-07-01

    Graur et al.'s (1991) hypothesis that the guinea pig-like rodents have an evolutionary origin within mammals that is separate from that of other rodents (the rodent-polyphyly hypothesis) was reexamined by the maximum-likelihood method for protein phylogeny, as well as by the maximum-parsimony and neighbor-joining methods. The overall evidence does not support Graur et al.'s hypothesis, which radically contradicts the traditional view of rodent monophyly. This work demonstrates that we must be careful in choosing a proper method for phylogenetic inference and that an argument based on a small data set (with respect to the length of the sequence and especially the number of species) may be unstable.

  12. Bayesian modeling and inference for diagnostic accuracy and probability of disease based on multiple diagnostic biomarkers with and without a perfect reference standard.

    PubMed

    Jafarzadeh, S Reza; Johnson, Wesley O; Gardner, Ian A

    2016-03-15

    The area under the receiver operating characteristic (ROC) curve (AUC) is used as a performance metric for quantitative tests. Although multiple biomarkers may be available for diagnostic or screening purposes, diagnostic accuracy is often assessed individually rather than in combination. In this paper, we consider the interesting problem of combining multiple biomarkers for use in a single diagnostic criterion with the goal of improving the diagnostic accuracy above that of an individual biomarker. The diagnostic criterion created from multiple biomarkers is based on the predictive probability of disease, conditional on given multiple biomarker outcomes. If the computed predictive probability exceeds a specified cutoff, the corresponding subject is allocated as 'diseased'. This defines a standard diagnostic criterion that has its own ROC curve, namely, the combined ROC (cROC). The AUC metric for cROC, namely, the combined AUC (cAUC), is used to compare the predictive criterion based on multiple biomarkers to one based on fewer biomarkers. A multivariate random-effects model is proposed for modeling multiple normally distributed dependent scores. Bayesian methods for estimating ROC curves and corresponding (marginal) AUCs are developed when a perfect reference standard is not available. In addition, cAUCs are computed to compare the accuracy of different combinations of biomarkers for diagnosis. The methods are evaluated using simulations and are applied to data for Johne's disease (paratuberculosis) in cattle. Copyright © 2015 John Wiley & Sons, Ltd.
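
    A rough sketch of the combined-criterion idea, with logistic regression standing in for the model-based predictive probability of disease (the Bayesian random-effects machinery and the imperfect-reference case are not reproduced; note also that the in-sample AUCs below are slightly optimistic):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(8)
n = 1000
d = rng.random(n) < 0.3                  # true disease status
# Two correlated biomarkers, each modestly informative on its own.
b1 = rng.normal(d * 1.0, 1.0)
b2 = 0.5 * b1 + rng.normal(d * 0.8, 1.0)
X = np.column_stack([b1, b2])

# Predictive probability of disease given both markers; thresholding
# this probability defines the combined diagnostic criterion (cROC).
p_hat = LogisticRegression().fit(X, d).predict_proba(X)[:, 1]

print("AUC biomarker 1:", roc_auc_score(d, b1))
print("AUC biomarker 2:", roc_auc_score(d, b2))
print("combined cAUC  :", roc_auc_score(d, p_hat))
```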

  13. Bayesian estimation of the transmissivity spatial structure from pumping test data

    NASA Astrophysics Data System (ADS)

    Demir, Mehmet Taner; Copty, Nadim K.; Trinchero, Paolo; Sanchez-Vila, Xavier

    2017-06-01

    Estimating the statistical parameters (mean, variance, and integral scale) that define the spatial structure of the transmissivity or hydraulic conductivity fields is a fundamental step for the accurate prediction of subsurface flow and contaminant transport. In practice, the determination of the spatial structure is a challenge because of spatial heterogeneity and data scarcity. In this paper, we describe a novel approach that uses time-drawdown data from multiple pumping tests to determine the transmissivity statistical spatial structure. The method builds on the pumping test interpretation procedure of Copty et al. (2011) (Continuous Derivation method, CD), which uses the time-drawdown data and its time derivative to estimate apparent transmissivity values as a function of radial distance from the pumping well. A Bayesian approach is then used to infer the statistical parameters of the transmissivity field by combining prior information about the parameters and the likelihood function expressed in terms of radially-dependent apparent transmissivities determined from pumping tests. A major advantage of the proposed Bayesian approach is that the likelihood function is readily determined from randomly generated multiple realizations of the transmissivity field, without the need to solve the groundwater flow equation. Applying the method to synthetically-generated pumping test data, we demonstrate that, through a relatively simple procedure, information on the spatial structure of the transmissivity may be inferred from pumping tests data. It is also shown that the prior parameter distribution has a significant influence on the estimation procedure, given the non-uniqueness of the estimation procedure. Results also indicate that the reliability of the estimated transmissivity statistical parameters increases with the number of available pumping tests.

  14. Statistical inference for extended or shortened phase II studies based on Simon's two-stage designs.

    PubMed

    Zhao, Junjun; Yu, Menggang; Feng, Xi-Ping

    2015-06-07

    Simon's two-stage designs are popular choices for conducting phase II clinical trials, especially in the oncology trials to reduce the number of patients placed on ineffective experimental therapies. Recently Koyama and Chen (2008) discussed how to conduct proper inference for such studies because they found that inference procedures used with Simon's designs almost always ignore the actual sampling plan used. In particular, they proposed an inference method for studies when the actual second stage sample sizes differ from planned ones. We consider an alternative inference method based on likelihood ratio. In particular, we order permissible sample paths under Simon's two-stage designs using their corresponding conditional likelihood. In this way, we can calculate p-values using the common definition: the probability of obtaining a test statistic value at least as extreme as that observed under the null hypothesis. In addition to providing inference for a couple of scenarios where Koyama and Chen's method can be difficult to apply, the resulting estimate based on our method appears to have certain advantages in terms of inference properties in many numerical simulations. It generally led to smaller biases and narrower confidence intervals while maintaining similar coverages. We also illustrated the two methods in a real data setting. Inference procedures used with Simon's designs almost always ignore the actual sampling plan. Reported P-values, point estimates and confidence intervals for the response rate are not usually adjusted for the design's adaptiveness. Proper statistical inference procedures should be used.
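
    A sketch of exact inference that respects the two-stage sampling plan, using the simple ordering by total number of responses (the article instead orders sample paths by conditional likelihood ratio; the design parameters below are hypothetical):

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def simon_pvalue(p0, n1, r1, n2, x1_obs, x2_obs):
    """Exact tail probability under a Simon two-stage design: enumerate
    every permissible sample path, accounting for early stopping, and
    sum the probability of paths with at least the observed total."""
    total_obs = x1_obs + x2_obs
    tail = 0.0
    for x1 in range(n1 + 1):
        if x1 <= r1:                     # trial stops after stage 1
            if x1 >= total_obs:
                tail += binom_pmf(x1, n1, p0)
        else:                            # continues to stage 2
            for x2 in range(n2 + 1):
                if x1 + x2 >= total_obs:
                    tail += binom_pmf(x1, n1, p0) * binom_pmf(x2, n2, p0)
    return tail

# Hypothetical design: n1=10, stop if <=1 response, else 19 more patients.
print(simon_pvalue(p0=0.1, n1=10, r1=1, n2=19, x1_obs=3, x2_obs=5))
```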

  15. A Test of Transitive Inferences in Free-Flying Honeybees: Unsuccessful Performance Due to Memory Constraints

    ERIC Educational Resources Information Center

    Benard, Julie; Giurfa, Martin

    2004-01-01

    We asked whether honeybees, "Apis mellifera," could solve a transitive inference problem. Individual free-flying bees were conditioned with four overlapping premise pairs of five visual patterns in a multiple discrimination task (A+ vs. B-, B+ vs. C-, C+ vs. D-, D+ vs. E-, where + and - indicate sucrose reward or absence of it,…

  16. Inference for Continuous-Time Probabilistic Programming

    DTIC Science & Technology

    2017-12-01

    Parzen window density estimator to jointly model the inter-camera travel time intervals, locations of exits/entrances, and velocities of objects... asked to travel across the scene multiple times. Even in such a scenario they formed groups and made social interactions... (Final technical report, University of California at Riverside, December 2017.)

  17. Non-Bayesian Noun Generalization in 3-to 5-Year-Old Children: Probing the Role of Prior Knowledge in the Suspicious Coincidence Effect

    ERIC Educational Resources Information Center

    Jenkins, Gavin W.; Samuelson, Larissa K.; Smith, Jodi R.; Spencer, John P.

    2015-01-01

    It is unclear how children learn labels for multiple overlapping categories such as "Labrador," "dog," and "animal." Xu and Tenenbaum (2007a) suggested that learners infer correct meanings with the help of Bayesian inference. They instantiated these claims in a Bayesian model, which they tested with preschoolers and…

  18. The Importance of Being Coherent: Category Coherence, Cross-Classification, and Reasoning

    ERIC Educational Resources Information Center

    Patalano, Andrea L.; Chin-Parker, Seth; Ross, Brian H.

    2006-01-01

    Category-based inference is crucial for using past experiences to make sense of new ones. One challenge to inference of this kind is that most entities in the world belong to multiple categories (e.g., a jogger, a professor, and a vegetarian). We tested the hypothesis that the "degree of coherence" of a category-the degree to which category…

  19. Functional characterization of somatic mutations in cancer using network-based inference of protein activity | Office of Cancer Genomics

    Cancer.gov

    Identifying the multiple dysregulated oncoproteins that contribute to tumorigenesis in a given patient is crucial for developing personalized treatment plans. However, accurate inference of aberrant protein activity in biological samples is still challenging as genetic alterations are only partially predictive and direct measurements of protein activity are generally not feasible.

  20. Modification of the USLE K factor for soil erodibility assessment on calcareous soils in Iran

    NASA Astrophysics Data System (ADS)

    Ostovari, Yaser; Ghorbani-Dashtaki, Shoja; Bahrami, Hossein-Ali; Naderi, Mehdi; Dematte, Jose Alexandre M.; Kerry, Ruth

    2016-11-01

    The measurement of soil erodibility (K) in the field is tedious, time-consuming and expensive; therefore, its prediction through pedotransfer functions (PTFs) could be far less costly and time-consuming. The aim of this study was to develop new PTFs to estimate the K factor using multiple linear regression, Mamdani fuzzy inference systems, and artificial neural networks. For this purpose, K was measured in 40 erosion plots with natural rainfall. Various soil properties including the soil particle size distribution, calcium carbonate equivalent, organic matter, permeability, and wet-aggregate stability were measured. The results showed that the mean measured K was 0.014 t h MJ^-1 mm^-1 and 2.08 times less than the estimated mean K (0.030 t h MJ^-1 mm^-1) using the USLE model. Permeability, wet-aggregate stability, very fine sand, and calcium carbonate were selected as independent variables by forward stepwise regression in order to assess the ability of multiple linear regression, Mamdani fuzzy inference systems and artificial neural networks to predict K. The calcium carbonate equivalent, which is not accounted for in the USLE model, had a significant impact on K in multiple linear regression due to its strong influence on the stability of aggregates and soil permeability. Statistical indices in validation and calibration datasets determined that the artificial neural networks method with the highest R^2, lowest RMSE, and lowest ME was the best model for estimating the K factor. A strong correlation (R^2 = 0.81, n = 40, p < 0.05) between the estimated K from multiple linear regression and measured K indicates that the use of calcium carbonate equivalent as a predictor variable gives a better estimation of K in areas with calcareous soils.
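
    A minimal sketch of fitting such a pedotransfer function by multiple linear regression. The data and coefficients below are synthetic inventions used only to show the mechanics, not values from the study.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(9)
n = 40
# Hypothetical plot-level predictors:
perm = rng.uniform(1, 6, n)            # permeability class
was = rng.uniform(0.2, 0.9, n)         # wet-aggregate stability
vfs = rng.uniform(2, 20, n)            # very fine sand (%)
cce = rng.uniform(5, 60, n)            # calcium carbonate equivalent (%)
# Synthetic K (t h MJ^-1 mm^-1), for illustration only.
K = (0.04 - 0.002 * perm - 0.01 * was + 0.0005 * vfs - 0.0001 * cce
     + rng.normal(0, 0.002, n))

X = np.column_stack([perm, was, vfs, cce])
ptf = LinearRegression().fit(X, K)
print("coefficients:", ptf.coef_)
print("R^2:", ptf.score(X, K))
```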

  1. Joint estimation over multiple individuals improves behavioural state inference from animal movement data.

    PubMed

    Jonsen, Ian

    2016-02-08

    State-space models provide a powerful way to scale up inference of movement behaviours from individuals to populations when the inference is made across multiple individuals. Here, I show how a joint estimation approach that assumes individuals share identical movement parameters can lead to improved inference of behavioural states associated with different movement processes. I use simulated movement paths with known behavioural states to compare estimation error between nonhierarchical and joint estimation formulations of an otherwise identical state-space model. Behavioural state estimation error was strongly affected by the degree of similarity between movement patterns characterising the behavioural states, with less error when movements were strongly dissimilar between states. The joint estimation model improved behavioural state estimation relative to the nonhierarchical model for simulated data with heavy-tailed Argos location errors. When applied to Argos telemetry datasets from 10 Weddell seals, the nonhierarchical model estimated highly uncertain behavioural state switching probabilities for most individuals whereas the joint estimation model yielded substantially less uncertainty. The joint estimation model better resolved the behavioural state sequences across all seals. Hierarchical or joint estimation models should be the preferred choice for estimating behavioural states from animal movement data, especially when location data are error-prone.

  2. Learning language from within: Children use semantic generalizations to infer word meanings.

    PubMed

    Srinivasan, Mahesh; Al-Mughairy, Sara; Foushee, Ruthe; Barner, David

    2017-02-01

    One reason that word learning presents a challenge for children is because pairings between word forms and meanings are arbitrary conventions that children must learn via observation - e.g., the fact that "shovel" labels shovels. The present studies explore cases in which children might bypass observational learning and spontaneously infer new word meanings: By exploiting the fact that many words are flexible and systematically encode multiple, related meanings. For example, words like shovel and hammer are nouns for instruments, and verbs for activities involving those instruments. The present studies explored whether 3- to 5-year-old children possess semantic generalizations about lexical flexibility, and can use these generalizations to infer new word meanings: Upon learning that dax labels an activity involving an instrument, do children spontaneously infer that dax can also label the instrument itself? Across four studies, we show that at least by age four, children spontaneously generalize instrument-activity flexibility to new words. Together, our findings point to a powerful way in which children may build their vocabulary, by leveraging the fact that words are linked to multiple meanings in systematic ways. Copyright © 2016 Elsevier B.V. All rights reserved.

  3. The average receiver operating characteristic curve in multireader multicase imaging studies

    PubMed Central

    Samuelson, F W

    2014-01-01

    Objective: In multireader, multicase (MRMC) receiver operating characteristic (ROC) studies for evaluating medical imaging systems, the area under the ROC curve (AUC) is often used as a summary metric. Owing to the limitations of AUC, plotting the average ROC curve to accompany the rigorous statistical inference on AUC is recommended. The objective of this article is to investigate methods for generating the average ROC curve from ROC curves of individual readers. Methods: We present both a non-parametric method and a parametric method for averaging ROC curves that produce a ROC curve, the area under which is equal to the average AUC of individual readers (a property we call area preserving). We use hypothetical examples, simulated data and a real-world imaging data set to illustrate these methods and their properties. Results: We show that our proposed methods are area preserving. We also show that the method of averaging the ROC parameters, either the conventional bi-normal parameters (a, b) or the proper bi-normal parameters (c, d_a), is generally not area preserving and may produce a ROC curve that is intuitively not an average of multiple curves. Conclusion: Our proposed methods are useful for making plots of average ROC curves in MRMC studies as a companion to the rigorous statistical inference on the AUC end point. The software implementing these methods is freely available from the authors. Advances in knowledge: Methods for generating the average ROC curve in MRMC ROC studies are formally investigated. The area-preserving criterion we defined is useful to evaluate such methods. PMID:24884728
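
    One simple construction with the area-preserving property is vertical averaging: average the reader TPRs at each FPR, so that by linearity of integration the area under the average equals the average of the individual AUCs. A sketch follows; the article's own non-parametric and parametric methods differ in their details.

```python
import numpy as np

def average_roc(roc_list, grid=np.linspace(0, 1, 101)):
    """Mean TPR at each FPR over readers (vertical averaging).

    roc_list: list of (fpr, tpr) pairs, each sorted by fpr and spanning
    (0, 0) to (1, 1). The area under the averaged curve equals the mean
    of the individual AUCs, so the construction is area preserving.
    """
    tprs = [np.interp(grid, fpr, tpr) for fpr, tpr in roc_list]
    return grid, np.mean(tprs, axis=0)

# Two hypothetical empirical reader ROC curves.
r1 = (np.array([0, .1, .4, 1]), np.array([0, .5, .8, 1]))
r2 = (np.array([0, .2, .6, 1]), np.array([0, .4, .9, 1]))
fpr, tpr = average_roc([r1, r2])
print("average AUC:", np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
```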

  4. A Weight of Evidence Framework for Environmental Assessments: Inferring Quantities

    EPA Science Inventory

    Environmental assessments require the generation of quantitative parameters such as degradation rates and assessment products may be quantities such as criterion values or magnitudes of effects. When multiple data sets or outputs of multiple models are available, it may be appro...

  5. Category-based predictions: influence of uncertainty and feature associations.

    PubMed

    Ross, B H; Murphy, G L

    1996-05-01

    Four experiments examined how people make inductive inferences using categories. Subjects read stories in which 2 categories were mentioned as possible identities of an object. The less likely category was varied to determine if people were using it, as well as the most likely category, in making predictions about the object. Experiment 1 showed that even when categorization uncertainty was emphasized, subjects used only 1 category as the basis for their prediction. Experiments 2-4 examined whether people would use multiple categories for making predictions when the feature to be predicted was associated with the less likely category. Multiple categories were used in this case, but only in limited circumstances; furthermore, using multiple categories in 1 prediction did not cause subjects to use them for subsequent predictions. The results increase the understanding of how categories are used in inductive inference.

  6. Identification of driving network of cellular differentiation from single sample time course gene expression data

    NASA Astrophysics Data System (ADS)

    Chen, Ye; Wolanyk, Nathaniel; Ilker, Tunc; Gao, Shouguo; Wang, Xujing

    Methods developed based on bifurcation theory have demonstrated their potential in driving network identification for complex human diseases, including the work by Chen et al. Recently, bifurcation theory has been successfully applied to model cellular differentiation. However, one often faces a technical challenge in driving network prediction: time course cellular differentiation studies often contain only one sample at each time point, while driving network prediction typically requires multiple samples at each time point to infer the variation and interaction structures of candidate genes for the driving network. In this study, we investigate several methods to identify both the critical time point and the driving network through examination of how each time point affects the autocorrelation and phase locking. We apply these methods to a high-throughput sequencing (RNA-Seq) dataset of 42 subsets of thymocytes and mature peripheral T cells at multiple time points during their differentiation (GSE48138 from GEO). We compare the predicted driving genes with known transcription regulators of cellular differentiation. We will discuss the advantages and limitations of our proposed methods, as well as potential further improvements.
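
    The abstract does not spell out the statistic used, but one plausible reading of "examining how each time point affects the autocorrelation" is a leave-one-out influence analysis on the lag-1 autocorrelation, a standard critical-transition indicator near bifurcations. A hypothetical sketch of that reading:

    ```python
    import numpy as np

    def lag1_autocorr(x):
        """Lag-1 autocorrelation of a 1-D series (critical-transition indicator)."""
        x = np.asarray(x, dtype=float)
        return np.corrcoef(x[:-1], x[1:])[0, 1]

    def time_point_influence(series):
        """Change in lag-1 autocorrelation when each time point is left out."""
        full = lag1_autocorr(series)
        return [full - lag1_autocorr(np.delete(series, i))
                for i in range(len(series))]
    ```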

  7. Linear time-varying models can reveal non-linear interactions of biomolecular regulatory networks using multiple time-series data.

    PubMed

    Kim, Jongrae; Bates, Declan G; Postlethwaite, Ian; Heslop-Harrison, Pat; Cho, Kwang-Hyun

    2008-05-15

    Inherent non-linearities in biomolecular interactions make the identification of network interactions difficult. One of the principal problems is that all methods based on the use of linear time-invariant models will have fundamental limitations in their capability to infer certain non-linear network interactions. Another difficulty is the multiplicity of possible solutions, since, for a given dataset, there may be many different possible networks which generate the same time-series expression profiles. A novel algorithm for the inference of biomolecular interaction networks from temporal expression data is presented. Linear time-varying models, which can represent a much wider class of time-series data than linear time-invariant models, are employed in the algorithm. From time-series expression profiles, the model parameters are identified by solving a non-linear optimization problem. In order to systematically reduce the set of possible solutions for the optimization problem, a filtering process is performed using a phase-portrait analysis with random numerical perturbations. The proposed approach has the advantages of not requiring the system to be in a stable steady state, of using time-series profiles which have been generated by a single experiment, and of allowing non-linear network interactions to be identified. The ability of the proposed algorithm to correctly infer network interactions is illustrated by its application to three examples: a non-linear model for cAMP oscillations in Dictyostelium discoideum, the cell-cycle data for Saccharomyces cerevisiae and a large-scale non-linear model of a group of synchronized Dictyostelium cells. The software used in this article is available from http://sbie.kaist.ac.kr/software
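
    The core of a linear time-varying formulation is that the interaction matrix is re-estimated over time rather than held constant. The sketch below shows only that least-squares core on a discrete-time surrogate, x_{t+1} ≈ A_t x_t, estimated in a sliding window; the paper's full algorithm additionally solves a non-linear optimization and filters solutions by phase-portrait analysis, which is not reproduced here.

    ```python
    import numpy as np

    def fit_ltv(X, window=5):
        """Windowed least-squares estimates of A_t in x_{t+1} = A_t x_t.

        X: (time points x genes) expression matrix; returns one A per window.
        """
        n, _ = X.shape
        As = []
        for t in range(n - window):
            X0 = X[t:t + window - 1]          # states within the window
            X1 = X[t + 1:t + window]          # their successors
            A, *_ = np.linalg.lstsq(X0, X1, rcond=None)
            As.append(A.T)                    # rows act on state vectors
        return As
    ```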

  8. On the use of Godhavn H-component as an indicator of the interplanetary sector polarity

    NASA Technical Reports Server (NTRS)

    Svalgaard, L.

    1974-01-01

    An objective method of inferring the polarity of the interplanetary magnetic field using the H-component at Godhavn is presented. The objectively inferred polarities are compared with a subjective index inferred earlier. It is concluded that no significant difference exists between the two methods. The polarities inferred from the Godhavn H-component are biased by the S_q^p signature, in the sense that during summer prolonged intervals of geomagnetic calm will result in inferred Away polarity regardless of the actual sector polarity. This bias does not significantly alter the large-scale structure of the inferred sector structure.

  9. Advanced statistics: linear regression, part I: simple linear regression.

    PubMed

    Marill, Keith A

    2004-01-01

    Simple linear regression is a mathematical technique used to model the relationship between a single independent predictor variable and a single dependent outcome variable. In this, the first of a two-part series exploring concepts in linear regression analysis, the four fundamental assumptions and the mechanics of simple linear regression are reviewed. The most common technique used to derive the regression line, the method of least squares, is described. The reader will be acquainted with other important concepts in simple linear regression, including: variable transformations, dummy variables, relationship to inference testing, and leverage. Simplified clinical examples with small datasets and graphic models are used to illustrate the points. This will provide a foundation for the second article in this series: a discussion of multiple linear regression, in which there are multiple predictor variables.
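
    As a concrete illustration of the method of least squares, in the spirit of the article's simplified clinical examples (the numbers below are hypothetical):

    ```python
    import numpy as np

    # Hypothetical clinical example: predict systolic blood pressure from age.
    age = np.array([30, 40, 50, 60, 70], dtype=float)
    sbp = np.array([118, 124, 131, 139, 148], dtype=float)

    # Method of least squares: slope and intercept minimizing squared residuals.
    b1 = np.sum((age - age.mean()) * (sbp - sbp.mean())) / np.sum((age - age.mean()) ** 2)
    b0 = sbp.mean() - b1 * age.mean()
    print(f"SBP = {b0:.1f} + {b1:.2f} * age")
    ```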

  10. Limited utility of residue masking for positive-selection inference.

    PubMed

    Spielman, Stephanie J; Dawson, Eric T; Wilke, Claus O

    2014-09-01

    Errors in multiple sequence alignments (MSAs) can reduce accuracy in positive-selection inference. Therefore, it has been suggested to filter MSAs before conducting further analyses. One widely used filter, Guidance, allows users to remove MSA positions aligned with low confidence. However, Guidance's utility in positive-selection inference has been disputed in the literature. We have conducted an extensive simulation-based study to characterize fully how Guidance impacts positive-selection inference, specifically for protein-coding sequences of realistic divergence levels. We also investigated whether novel scoring algorithms that phylogenetically correct the confidence scores, as well as a new gap-penalization score-normalization scheme, improved Guidance's performance. We found that no filter, including original Guidance, consistently benefitted positive-selection inferences. Moreover, all improvements detected were exceedingly minimal, and in certain circumstances, Guidance-based filters worsened inferences. © The Author 2014. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

  11. An improved method for bivariate meta-analysis when within-study correlations are unknown.

    PubMed

    Hong, Chuan; D Riley, Richard; Chen, Yong

    2018-03-01

    Multivariate meta-analysis, which jointly analyzes multiple and possibly correlated outcomes in a single analysis, has become increasingly popular in recent years. An attractive feature of multivariate meta-analysis is its ability to account for the dependence between multiple estimates from the same study. However, standard inference procedures for multivariate meta-analysis require knowledge of the within-study correlations, which are usually unavailable. This limits standard inference approaches in practice. Riley et al. proposed a working model and an overall synthesis correlation parameter to account for the marginal correlation between outcomes, where the only data needed are those required for a separate univariate random-effects meta-analysis. As within-study correlations are not required, the Riley method is applicable to a wide variety of evidence synthesis situations. However, the standard variance estimator of the Riley method is not entirely correct under many important settings. As a consequence, the coverage of a function of pooled estimates may not reach the nominal level even when the number of studies in the multivariate meta-analysis is large. In this paper, we improve the Riley method by proposing a robust variance estimator, which is asymptotically correct even when the model is misspecified (i.e., when the likelihood function is incorrect). Simulation studies of a bivariate meta-analysis, in a variety of settings, show that a function of pooled estimates has improved performance when using the proposed robust variance estimator. In terms of individual pooled estimates themselves, the standard variance estimator and robust variance estimator give similar results to the original method, with appropriate coverage. The proposed robust variance estimator performs well when the number of studies is relatively large. Therefore, we recommend the use of the robust method for meta-analyses with a relatively large number of studies (e.g., m ≥ 50). When the sample size is relatively small, we recommend the use of the robust method under the working independence assumption. We illustrate the proposed method through 2 meta-analyses. Copyright © 2017 John Wiley & Sons, Ltd.
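
    The abstract does not give the estimator's exact form, but robust variance estimators of this kind generally take the sandwich form for an estimating-equation estimator; schematically, in our notation (not necessarily the authors' exact parameterization):

    ```latex
    % Generic sandwich (robust) variance for an estimator \hat{\theta} solving
    % \sum_{i=1}^{m} U_i(\theta) = 0 over m studies (schematic, notation ours):
    \[
    \widehat{\operatorname{Var}}_R(\hat{\theta})
      = A^{-1}\Bigl(\sum_{i=1}^{m} U_i(\hat{\theta})\, U_i(\hat{\theta})^\top\Bigr) A^{-\top},
    \qquad
    A = -\sum_{i=1}^{m}
        \left.\frac{\partial U_i(\theta)}{\partial \theta^\top}\right|_{\theta=\hat{\theta}},
    \]
    % which remains asymptotically valid even when the working model
    % (and hence the likelihood) is misspecified.
    ```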

  12. A Hierarchical Poisson Log-Normal Model for Network Inference from RNA Sequencing Data

    PubMed Central

    Gallopin, Mélina; Rau, Andrea; Jaffrézic, Florence

    2013-01-01

    Gene network inference from transcriptomic data is an important methodological challenge and a key aspect of systems biology. Although several methods have been proposed to infer networks from microarray data, there is a need for inference methods able to model RNA-seq data, which are count-based and highly variable. In this work we propose a hierarchical Poisson log-normal model with a Lasso penalty to infer gene networks from RNA-seq data; this model has the advantage of directly modelling discrete data and accounting for inter-sample variance larger than the sample mean. Using real microRNA-seq data from breast cancer tumors and simulations, we compare this method to a regularized Gaussian graphical model on log-transformed data, and a Poisson log-linear graphical model with a Lasso penalty on power-transformed data. For data simulated with large inter-sample dispersion, the proposed model performs better than the other methods in terms of sensitivity, specificity and area under the ROC curve. These results show the necessity of methods specifically designed for gene network inference from RNA-seq data. PMID:24147011
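
    A schematic statement of a hierarchical Poisson log-normal graphical model of this kind (the notation is ours and may differ from the authors' exact parameterization): counts are Poisson given a latent Gaussian layer, and the network is read off a Lasso-penalized estimate of the latent precision matrix.

    ```latex
    % Schematic hierarchical Poisson log-normal model with a Lasso-penalized
    % latent precision matrix: counts Y_{ij} for sample i and gene j.
    \begin{align*}
    Y_{ij} \mid Z_{ij} &\sim \operatorname{Poisson}\!\bigl(\exp(\mu_j + Z_{ij})\bigr), \\
    Z_i = (Z_{i1},\dots,Z_{ip})^\top &\sim \mathcal{N}_p(0, \Sigma), \\
    \widehat{\Theta} &= \operatorname*{arg\,max}_{\Theta \succ 0}\;
        \log\det\Theta - \operatorname{tr}(S\Theta) - \lambda \lVert \Theta \rVert_1,
    \end{align*}
    % where \Theta = \Sigma^{-1}; zero entries of \Theta correspond to absent
    % network edges, and the \ell_1 (Lasso) penalty \lambda enforces sparsity.
    ```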

  13. Statistical Inference and Reverse Engineering of Gene Regulatory Networks from Observational Expression Data

    PubMed Central

    Emmert-Streib, Frank; Glazko, Galina V.; Altay, Gökmen; de Matos Simoes, Ricardo

    2012-01-01

    In this paper, we present a systematic and conceptual overview of methods for inferring gene regulatory networks from observational gene expression data. Further, we discuss two classic approaches to infer causal structures and compare them with contemporary methods by providing a conceptual categorization thereof. We complement the above by surveying global and local evaluation measures for assessing the performance of inference algorithms. PMID:22408642

  14. OncoNEM: inferring tumor evolution from single-cell sequencing data.

    PubMed

    Ross, Edith M; Markowetz, Florian

    2016-04-15

    Single-cell sequencing promises a high-resolution view of genetic heterogeneity and clonal evolution in cancer. However, methods to infer tumor evolution from single-cell sequencing data lag behind methods developed for bulk-sequencing data. Here, we present OncoNEM, a probabilistic method for inferring intra-tumor evolutionary lineage trees from somatic single nucleotide variants of single cells. OncoNEM identifies homogeneous cellular subpopulations and infers their genotypes as well as a tree describing their evolutionary relationships. In simulation studies, we assess OncoNEM's robustness and benchmark its performance against competing methods. Finally, we show its applicability in case studies of muscle-invasive bladder cancer and essential thrombocythemia.

  15. Inference of Vohradský's Models of Genetic Networks by Solving Two-Dimensional Function Optimization Problems

    PubMed Central

    Kimura, Shuhei; Sato, Masanao; Okada-Hatakeyama, Mariko

    2013-01-01

    The inference of a genetic network is a problem in which mutual interactions among genes are inferred from time-series of gene expression levels. While a number of models have been proposed to describe genetic networks, this study focuses on a mathematical model proposed by Vohradský. Because of its advantageous features, several researchers have proposed inference methods based on Vohradský's model. When trying to analyze large-scale networks consisting of dozens of genes, however, these methods must solve high-dimensional non-linear function optimization problems. In order to resolve the difficulty of estimating the parameters of Vohradský's model, this study proposes a new method that decomposes the problem into several two-dimensional function optimization problems. Through numerical experiments on artificial genetic network inference problems, we showed that, although the proposed method is not the fastest, it estimates the parameters of Vohradský's models more effectively while keeping computation times sufficiently short. This study then applied the proposed method to an actual inference problem of the bacterial SOS DNA repair system, and succeeded in finding several reasonable regulations. PMID:24386175
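
    For reference, the commonly cited form of Vohradský's model is a recurrent neural-network-style ODE; in the notation below (ours), the weights, biases, and rate constants are the quantities the two-dimensional subproblems estimate.

    ```latex
    % Commonly cited form of Vohradsky's model (notation ours):
    \[
    \frac{dz_i}{dt}
      = k_{1,i}\,\frac{1}{1 + \exp\!\bigl(-\textstyle\sum_{j} w_{ij} z_j - b_i\bigr)}
      - k_{2,i}\, z_i ,
    \]
    % z_i: expression of gene i; w_{ij}: regulatory weight of gene j on gene i;
    % b_i: bias; k_{1,i}, k_{2,i}: production and degradation rate constants.
    ```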

  16. MultiSETTER: web server for multiple RNA structure comparison.

    PubMed

    Čech, Petr; Hoksza, David; Svozil, Daniel

    2015-08-12

    Understanding the architecture and function of RNA molecules requires methods for comparing and analyzing their tertiary and quaternary structures. While structural superposition of short RNAs is achievable in a reasonable time, large structures represent a much bigger challenge. Therefore, we have developed a fast and accurate algorithm for RNA pairwise structure superposition called SETTER and implemented it in the SETTER web server. However, though biological relationships can be inferred by a pairwise structure alignment, key features preserved by evolution can be identified only from a multiple structure alignment. Thus, we extended the SETTER algorithm to the alignment of multiple RNA structures and developed the MultiSETTER algorithm. In this paper, we present the updated version of the SETTER web server that implements a user-friendly interface to the MultiSETTER algorithm. The server accepts RNA structures either as a list of PDB IDs or as user-defined PDB files. After the superposition is computed, structures are visualized in 3D and several reports and statistics are generated. To the best of our knowledge, the MultiSETTER web server is the first publicly available tool for multiple RNA structure alignment. The MultiSETTER server offers the visual inspection of an alignment in 3D space, which may reveal structural and functional relationships not captured by other multiple alignment methods based either on sequence or on secondary structure motifs.

  17. Assessing NARCCAP climate model effects using spatial confidence regions.

    PubMed

    French, Joshua P; McGinnis, Seth; Schwartzman, Armin

    2017-01-01

    We assess similarities and differences between model effects for the North American Regional Climate Change Assessment Program (NARCCAP) climate models using varying classes of linear regression models. Specifically, we consider how the average temperature effect differs for the various global and regional climate model combinations, including assessment of possible interaction between the effects of global and regional climate models. We use both pointwise and simultaneous inference procedures to identify regions where global and regional climate model effects differ. We also show conclusively that results from pointwise inference are misleading, and that accounting for multiple comparisons is important for making proper inference.

  18. A Detailed History of Intron-rich Eukaryotic Ancestors Inferred from a Global Survey of 100 Complete Genomes

    PubMed Central

    Csuros, Miklos; Rogozin, Igor B.; Koonin, Eugene V.

    2011-01-01

    Protein-coding genes in eukaryotes are interrupted by introns, but intron densities widely differ between eukaryotic lineages. Vertebrates, some invertebrates and green plants have intron-rich genes, with 6–7 introns per kilobase of coding sequence, whereas most of the other eukaryotes have intron-poor genes. We reconstructed the history of intron gain and loss using a probabilistic Markov model (Markov Chain Monte Carlo, MCMC) on 245 orthologous genes from 99 genomes representing three of the five supergroups of eukaryotes for which multiple genome sequences are available. Intron-rich ancestors are confidently reconstructed for each major group, with 53 to 74% of the human intron density inferred with 95% confidence for the Last Eukaryotic Common Ancestor (LECA). The results of the MCMC reconstruction are compared with the reconstructions obtained using Maximum Likelihood (ML) and Dollo parsimony methods. An excellent agreement between the MCMC and ML inferences is demonstrated whereas Dollo parsimony introduces a noticeable bias in the estimations, typically yielding lower ancestral intron densities than MCMC and ML. Evolution of eukaryotic genes was dominated by intron loss, with substantial gain only at the bases of several major branches including plants and animals. The highest intron density, 120 to 130% of the human value, is inferred for the last common ancestor of animals. The reconstruction shows that the entire line of descent from LECA to mammals was intron-rich, a state conducive to the evolution of alternative splicing. PMID:21935348

  19. Strong Inference in Mathematical Modeling: A Method for Robust Science in the Twenty-First Century.

    PubMed

    Ganusov, Vitaly V

    2016-01-01

    While there are many opinions on what mathematical modeling in biology is, in essence, modeling is a mathematical tool, like a microscope, which allows consequences to logically follow from a set of assumptions. Only when this tool is applied appropriately, as a microscope is used to look at small objects, can it help us understand the importance of specific mechanisms/assumptions in biological processes. Mathematical modeling can be less useful or even misleading if used inappropriately, for example, when a microscope is used to study stars. According to some philosophers (Oreskes et al., 1994), the best use of mathematical models is not when a model is used to confirm a hypothesis but rather when a model shows an inconsistency between the model (defined by a specific set of assumptions) and the data. Following the principle of strong inference for experimental sciences proposed by Platt (1964), I suggest "strong inference in mathematical modeling" as an effective and robust way of using mathematical modeling to understand mechanisms driving the dynamics of biological systems. The major steps of strong inference in mathematical modeling are (1) to develop multiple alternative models for the phenomenon in question; (2) to compare the models with available experimental data and to determine which of the models are not consistent with the data; (3) to determine reasons why rejected models failed to explain the data; and (4) to suggest experiments which would allow one to discriminate between the remaining alternative models. The use of strong inference is likely to provide more robust predictions from mathematical models, and it should be strongly encouraged in mathematical modeling-based publications in the Twenty-First century.
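
    Steps (1) and (2) amount to fitting several alternative models to the same data and ranking their consistency; below is a minimal sketch using least-squares fits and AIC on hypothetical decay data (the model forms and numbers are illustrative, not from the paper):

    ```python
    import numpy as np
    from scipy.optimize import curve_fit

    # Hypothetical time course and two alternative mechanistic models.
    t = np.array([0, 1, 2, 3, 4, 5], dtype=float)
    y = np.array([100, 55, 30, 17, 9, 5], dtype=float)

    models = {
        "exponential": (lambda t, a, k: a * np.exp(-k * t), [100.0, 0.5]),
        "power law":   (lambda t, a, p: a * (t + 1.0) ** (-p), [100.0, 1.0]),
    }

    for name, (f, p0) in models.items():
        popt, _ = curve_fit(f, t, y, p0=p0)
        rss = np.sum((y - f(t, *popt)) ** 2)
        n, k = len(y), len(popt)
        aic = n * np.log(rss / n) + 2 * k    # Gaussian-error AIC, up to a constant
        print(f"{name:12s} AIC = {aic:7.1f}")  # reject models that clearly lose
    ```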

  1. Multivariate and repeated measures (MRM): A new toolbox for dependent and multimodal group-level neuroimaging data

    PubMed Central

    McFarquhar, Martyn; McKie, Shane; Emsley, Richard; Suckling, John; Elliott, Rebecca; Williams, Stephen

    2016-01-01

    Repeated measurements and multimodal data are common in neuroimaging research. Despite this, conventional approaches to group-level analysis ignore these repeated measurements in favour of multiple between-subject models using contrasts of interest. This approach has a number of drawbacks, as certain designs and comparisons of interest are either not possible or complex to implement. Unfortunately, even when attempting to analyse group-level data within a repeated-measures framework, the methods implemented in popular software packages make potentially unrealistic assumptions about the covariance structure across the brain. In this paper, we describe how this issue can be addressed in a simple and efficient manner using the multivariate form of the familiar general linear model (GLM), as implemented in a new MATLAB toolbox. This multivariate framework is discussed, paying particular attention to methods of inference by permutation. Comparisons with existing approaches and software packages for dependent group-level neuroimaging data are made. We also demonstrate how this method is easily adapted for dependency at the group level when multiple modalities of imaging are collected from the same individuals. Follow-up of these multimodal models using linear discriminant analysis (LDA) is also discussed, with applications to future studies wishing to integrate multiple scanning techniques into investigations of populations of interest. PMID:26921716

  2. Phylogeny of the cycads based on multiple single-copy nuclear genes: congruence of concatenated parsimony, likelihood and species tree inference methods

    PubMed Central

    Salas-Leiva, Dayana E.; Meerow, Alan W.; Calonje, Michael; Griffith, M. Patrick; Francisco-Ortega, Javier; Nakamura, Kyoko; Stevenson, Dennis W.; Lewis, Carl E.; Namoff, Sandra

    2013-01-01

    Background and aims Despite a recent new classification, a stable phylogeny for the cycads has been elusive, particularly regarding resolution of Bowenia, Stangeria and Dioon. In this study, five single-copy nuclear genes (SCNGs) are applied to the phylogeny of the order Cycadales. The specific aim is to evaluate several gene tree–species tree reconciliation approaches for developing an accurate phylogeny of the order, to contrast them with concatenated parsimony analysis and to resolve the erstwhile problematic phylogenetic position of these three genera. Methods DNA sequences of five SCNGs were obtained for 20 cycad species representing all ten genera of Cycadales. These were analysed with parsimony, maximum likelihood (ML) and three Bayesian methods of gene tree–species tree reconciliation, using Cycas as the outgroup. A calibrated date estimation was developed with Bayesian methods, and biogeographic analysis was also conducted. Key Results Concatenated parsimony, ML and three species tree inference methods resolve exactly the same tree topology with high support at most nodes. Dioon and Bowenia are the first and second branches of Cycadales after Cycas, respectively, followed by an encephalartoid clade (Macrozamia–Lepidozamia–Encephalartos), which is sister to a zamioid clade, of which Ceratozamia is the first branch, and in which Stangeria is sister to Microcycas and Zamia. Conclusions A single, well-supported phylogenetic hypothesis of the generic relationships of the Cycadales is presented. However, massive extinction events inferred from the fossil record that eliminated broader ancestral distributions within Zamiaceae compromise accurate optimization of ancestral biogeographical areas for that hypothesis. While major lineages of Cycadales are ancient, crown ages of all modern genera are no older than 12 million years, supporting a recent hypothesis of mostly Miocene radiations. This phylogeny can contribute to an accurate infrafamilial classification of Zamiaceae. PMID:23997230

  3. Multimodel inference to quantify the relative importance of abiotic factors in the population dynamics of marine zooplankton

    NASA Astrophysics Data System (ADS)

    Everaert, Gert; Deschutter, Yana; De Troch, Marleen; Janssen, Colin R.; De Schamphelaere, Karel

    2018-05-01

    The effect of multiple stressors on marine ecosystems remains poorly understood and most of the knowledge available is related to phytoplankton. To partly address this knowledge gap, we tested if combining multimodel inference with generalized additive modelling could quantify the relative contribution of environmental variables to the population dynamics of a zooplankton species in the Belgian part of the North Sea. Hence, we have quantified the relative contribution of oceanographic variables (e.g. water temperature, salinity, nutrient concentrations, and chlorophyll a concentrations) and anthropogenic chemicals (i.e. polychlorinated biphenyls) to the density of Acartia clausi. We found that models with water temperature and chlorophyll a concentration explained ca. 73% of the population density of the marine copepod. Multimodel inference in combination with regression-based models is a generic way to disentangle and quantify multiple stressor-induced changes in marine ecosystems. Future-oriented simulations suggested increased copepod densities under predicted environmental changes.
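
    Multimodel inference of this kind typically weights each candidate model by its Akaike weight; below is a minimal sketch of that arithmetic only (the AIC values are hypothetical, and the paper's GAM fitting itself is not shown):

    ```python
    import numpy as np

    def akaike_weights(aics):
        """Akaike weights: relative support for each candidate model."""
        delta = np.asarray(aics, dtype=float) - np.min(aics)
        w = np.exp(-0.5 * delta)
        return w / w.sum()

    # Hypothetical AICs for models built from different stressor combinations:
    print(akaike_weights([1012.3, 1009.8, 1021.0]))
    ```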

  4. Sensing multiple ligands with single receptor

    NASA Astrophysics Data System (ADS)

    Singh, Vijay; Nemenman, Ilya

    2015-03-01

    Cells use surface receptors to measure concentrations of external ligand molecules. Limits on the accuracy of such sensing are well known for the scenario where the concentration of one molecular species is being determined by one receptor [Endres]. However, in more realistic scenarios, a cognate (high-affinity) ligand competes with many non-cognate (low-affinity) ligands for binding to the receptor. We analyze the effects of this competition on the accuracy of sensing. We show that maximum-likelihood statistical inference allows determination of the concentrations of multiple ligands, cognate and non-cognate, by the same receptor concurrently. While it is unclear if traditional biochemical circuitry downstream of the receptor can implement such inference exactly, we show that an approximate inference can be performed by coupling the receptor to a kinetic proofreading cascade. We characterize the accuracy of such kinetic proofreading sensing in comparison to the exact maximum-likelihood approach. We acknowledge the support from the James S. McDonnell Foundation and the Human Frontier Science Program.

  5. A Method of Trigonometric Modelling of Seasonal Variation Demonstrated with Multiple Sclerosis Relapse Data.

    PubMed

    Spelman, Tim; Gray, Orla; Lucas, Robyn; Butzkueven, Helmut

    2015-12-09

    This report describes a novel Stata-based application of trigonometric regression modelling to 55 years of multiple sclerosis relapse data from 46 clinical centers across 20 countries located in both hemispheres. Central to the success of this method was the strategic use of plot analysis to guide and corroborate the statistical regression modelling. Initial plot analysis was necessary for establishing realistic hypotheses regarding the presence and structural form of seasonal and latitudinal influences on relapse probability and then testing the performance of the resultant models. Trigonometric regression was then necessary to quantify these relationships, adjust for important confounders and provide a measure of certainty as to how plausible these associations were. Synchronization of graphing techniques with regression modelling permitted a systematic refinement of models until best-fit convergence was achieved, enabling novel inferences to be made regarding the independent influence of both season and latitude in predicting relapse onset timing in MS. These methods have the potential for application across other complex disease and epidemiological phenomena suspected or known to vary systematically with season and/or geographic location.
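
    The essence of trigonometric (harmonic) regression is to regress the outcome on sine and cosine terms of the seasonal period; below is a minimal numpy sketch with hypothetical monthly relapse counts (the report's Stata implementation, confounder adjustment, and latitude terms are not reproduced):

    ```python
    import numpy as np

    month = np.arange(1, 13, dtype=float)        # hypothetical monthly relapse counts
    relapses = np.array([80, 85, 95, 105, 110, 108, 100, 92, 88, 82, 78, 76.0])

    theta = 2 * np.pi * month / 12               # annual harmonic
    X = np.column_stack([np.ones_like(month), np.sin(theta), np.cos(theta)])
    beta, *_ = np.linalg.lstsq(X, relapses, rcond=None)

    # y = b0 + A*cos(theta - phi): amplitude and peak timing of the seasonal swing.
    amplitude = np.hypot(beta[1], beta[2])
    peak_month = (12 / (2 * np.pi)) * (np.arctan2(beta[1], beta[2]) % (2 * np.pi))
    print(f"amplitude = {amplitude:.1f}, peak near month {peak_month:.1f}")
    ```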

  6. Selection of species and sampling areas: The importance of inference

    Treesearch

    Paul Stephen Corn

    2009-01-01

    Inductive inference, the process of drawing general conclusions from specific observations, is fundamental to the scientific method. Platt (1964) termed conclusions obtained through rigorous application of the scientific method as "strong inference" and noted the following basic steps: generating alternative hypotheses; devising experiments, the...

  7. Multi-objective calibration and uncertainty analysis of hydrologic models; A comparative study between formal and informal methods

    NASA Astrophysics Data System (ADS)

    Shafii, M.; Tolson, B.; Matott, L. S.

    2012-04-01

    Hydrologic modeling has benefited from significant developments over the past two decades. This has resulted in building of higher levels of complexity into hydrologic models, which eventually makes the model evaluation process (parameter estimation via calibration and uncertainty analysis) more challenging. In order to avoid unreasonable parameter estimates, many researchers have suggested implementation of multi-criteria calibration schemes. Furthermore, for predictive hydrologic models to be useful, proper consideration of uncertainty is essential. Consequently, recent research has emphasized comprehensive model assessment procedures in which multi-criteria parameter estimation is combined with statistically-based uncertainty analysis routines such as Bayesian inference using Markov Chain Monte Carlo (MCMC) sampling. Such a procedure relies on the use of formal likelihood functions based on statistical assumptions, and moreover, the Bayesian inference structured on MCMC samplers requires a considerably large number of simulations. Due to these issues, especially in complex non-linear hydrological models, a variety of alternative informal approaches have been proposed for uncertainty analysis in the multi-criteria context. This study aims to explore a number of such informal uncertainty analysis techniques in multi-criteria calibration of hydrological models. The informal methods addressed in this study are (i) Pareto optimality, which quantifies the parameter uncertainty using the Pareto solutions, (ii) DDS-AU, which uses the weighted sum of objective functions to derive the prediction limits, and (iii) GLUE, which describes the total uncertainty through identification of behavioral solutions. The main objective is to compare such methods with MCMC-based Bayesian inference with respect to factors such as computational burden and predictive capacity, which are evaluated based on multiple comparative measures. The measures for comparison are calculated for both the calibration and evaluation periods. The uncertainty analysis methodologies are applied to a simple 5-parameter rainfall-runoff model called HYMOD.
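
    Of the three informal methods, GLUE is the simplest to sketch: an informal likelihood (here Nash-Sutcliffe efficiency) screens "behavioral" parameter sets, whose simulations then span the prediction limits. A simplified, unweighted version follows (the usual GLUE variant weights each run by its likelihood, and the threshold is an assumption):

    ```python
    import numpy as np

    def glue_prediction_limits(sims, obs, threshold=0.6, q=(5, 95)):
        """GLUE sketch: rows of `sims` are candidate-parameter-set runs over time.

        Keeps runs whose Nash-Sutcliffe efficiency exceeds `threshold` and
        returns lower/upper percentile prediction limits per time step.
        """
        obs = np.asarray(obs, dtype=float)
        nse = 1 - np.sum((sims - obs) ** 2, axis=1) / np.sum((obs - obs.mean()) ** 2)
        behavioral = sims[nse > threshold]
        return np.percentile(behavioral, q, axis=0)
    ```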

  8. Clumpak: a program for identifying clustering modes and packaging population structure inferences across K.

    PubMed

    Kopelman, Naama M; Mayzel, Jonathan; Jakobsson, Mattias; Rosenberg, Noah A; Mayrose, Itay

    2015-09-01

    The identification of the genetic structure of populations from multilocus genotype data has become a central component of modern population-genetic data analysis. Application of model-based clustering programs often entails a number of steps, in which the user considers different modelling assumptions, compares results across different predetermined values of the number of assumed clusters (a parameter typically denoted K), examines multiple independent runs for each fixed value of K, and distinguishes among runs belonging to substantially distinct clustering solutions. Here, we present Clumpak (Cluster Markov Packager Across K), a method that automates the postprocessing of results of model-based population structure analyses. For analysing multiple independent runs at a single K value, Clumpak identifies sets of highly similar runs, separating distinct groups of runs that represent distinct modes in the space of possible solutions. This procedure, which generates a consensus solution for each distinct mode, is performed by the use of a Markov clustering algorithm that relies on a similarity matrix between replicate runs, as computed by the software Clumpp. Next, Clumpak identifies an optimal alignment of inferred clusters across different values of K, extending a similar approach implemented for a fixed K in Clumpp and simplifying the comparison of clustering results across different K values. Clumpak incorporates additional features, such as implementations of methods for choosing K and comparing solutions obtained by different programs, models, or data subsets. Clumpak, available at http://clumpak.tau.ac.il, simplifies the use of model-based analyses of population structure in population genetics and molecular ecology. © 2015 John Wiley & Sons Ltd.

  9. A fuzzy-logic-based model to predict biogas and methane production rates in a pilot-scale mesophilic UASB reactor treating molasses wastewater.

    PubMed

    Turkdogan-Aydinol, F Ilter; Yetilmezsoy, Kaan

    2010-10-15

    A MIMO (multiple inputs and multiple outputs) fuzzy-logic-based model was developed to predict biogas and methane production rates in a pilot-scale 90-L mesophilic up-flow anaerobic sludge blanket (UASB) reactor treating molasses wastewater. Five input variables, namely volumetric organic loading rate (OLR), volumetric total chemical oxygen demand (TCOD) removal rate (R(V)), influent alkalinity, influent pH and effluent pH, were fuzzified by the use of an artificial intelligence-based approach. Trapezoidal membership functions with eight levels were constructed for the fuzzy subsets, and a Mamdani-type fuzzy inference system was used to implement a total of 134 rules in the IF-THEN format. The product (prod) and the centre of gravity (COG, centroid) methods were employed as the inference operator and defuzzification method, respectively. Fuzzy-logic predicted results were compared with the outputs of two exponential non-linear regression models derived in this study. The UASB reactor showed a remarkable performance on the treatment of molasses wastewater, with an average TCOD removal efficiency of 93 (+/-3)% and an average volumetric TCOD removal rate of 6.87 (+/-3.93) kg TCOD(removed)/m(3)-day. Findings of this study clearly indicated that, compared to non-linear regression models, the proposed MIMO fuzzy-logic-based model produced smaller deviations and exhibited a superior predictive performance in forecasting both biogas and methane production rates, with satisfactory determination coefficients over 0.98. 2010 Elsevier B.V. All rights reserved.
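
    A minimal sketch of the Mamdani machinery named in the abstract: trapezoidal membership functions, the product (prod) inference operator, and centre-of-gravity defuzzification. The two toy rules, universes, and numbers below are illustrative only, not the paper's 134-rule base:

    ```python
    import numpy as np

    def trapmf(x, a, b, c, d):
        """Trapezoidal membership: rises on [a,b], flat on [b,c], falls on [c,d]."""
        return np.clip(np.minimum((x - a) / (b - a), (d - x) / (d - c)), 0.0, 1.0)

    x = np.linspace(0.0, 60.0, 601)              # output universe (biogas rate)
    olr = 4.2                                    # crisp input (organic loading)

    fire_low = trapmf(olr, 0, 1, 3, 5)           # firing of "IF OLR is low ..."
    fire_high = trapmf(olr, 3, 5, 8, 10)         # firing of "IF OLR is high ..."

    # Product inference: scale each rule's output set, then aggregate with max.
    out = np.maximum(fire_low * trapmf(x, 0, 10, 20, 30),
                     fire_high * trapmf(x, 20, 30, 45, 60))

    cog = np.sum(x * out) / np.sum(out)          # centre-of-gravity defuzzification
    print(f"predicted biogas production rate ~ {cog:.1f}")
    ```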

  10. Simultaneous learning of instantaneous and time-delayed genetic interactions using novel information theoretic scoring technique

    PubMed Central

    2012-01-01

    Background Understanding gene interactions is a fundamental question in systems biology. Currently, modeling of gene regulation using the Bayesian Network (BN) formalism assumes that genes interact either instantaneously or with a certain amount of time delay. However in reality, biological regulations, both instantaneous and time-delayed, occur simultaneously. A framework that can detect and model both types of interactions simultaneously would represent gene regulatory networks more accurately. Results In this paper, we introduce a framework based on the Bayesian Network (BN) formalism that can represent both instantaneous and time-delayed interactions between genes simultaneously. A novel scoring metric having firm mathematical underpinnings is also proposed that, unlike other recent methods, can score both interactions concurrently and takes into account the reality that multiple regulators can regulate a gene jointly, rather than in an isolated pair-wise manner. Further, a gene regulatory network (GRN) inference method employing an evolutionary search that makes use of the framework and the scoring metric is also presented. Conclusion By taking into consideration the biological fact that both instantaneous and time-delayed regulations can occur among genes, our approach models gene interactions with greater accuracy. The proposed framework is efficient and can be used to infer gene networks having multiple orders of instantaneous and time-delayed regulations simultaneously. Experiments are carried out using three different synthetic networks (with three different mechanisms for generating synthetic data) as well as real-life networks of Saccharomyces cerevisiae, E. coli and cyanobacteria gene expression data. The results show the effectiveness of our approach. PMID:22691450

  11. Determination of hyporheic travel time distributions and other parameters from concurrent conservative and reactive tracer tests by local-in-global optimization

    NASA Astrophysics Data System (ADS)

    Knapp, Julia L. A.; Cirpka, Olaf A.

    2017-06-01

    The complexity of hyporheic flow paths requires reach-scale models of solute transport in streams that are flexible in their representation of the hyporheic passage. We use a model that couples advective-dispersive in-stream transport to hyporheic exchange with a shape-free distribution of hyporheic travel times. The model also accounts for two-site sorption and transformation of reactive solutes. The coefficients of the model are determined by fitting concurrent stream-tracer tests of conservative (fluorescein) and reactive (resazurin/resorufin) compounds. The flexibility of the shape-free models gives rise to multiple local minima of the objective function in parameter estimation, thus requiring global-search algorithms, whose application is hindered by the large number of parameter values to be estimated. We present a local-in-global optimization approach, in which we use a Markov-Chain Monte Carlo method as the global-search method to estimate a set of in-stream and hyporheic parameters. Nested therein, we infer the shape-free distribution of hyporheic travel times by a local Gauss-Newton method. The overall approach is independent of the initial guess and provides the joint posterior distribution of all parameters. We apply the described local-in-global optimization method to recorded tracer breakthrough curves of three consecutive stream sections, and infer section-wise hydraulic parameter distributions to analyze how hyporheic exchange processes differ between the stream sections.

  12. Real-Time Detection and Tracking of Multiple People in Laser Scan Frames

    NASA Astrophysics Data System (ADS)

    Cui, J.; Song, X.; Zhao, H.; Zha, H.; Shibasaki, R.

    This chapter presents an approach to detect and track multiple people robustly in real time using laser scan frames. The detection and tracking of people in real time is a problem that arises in a variety of different contexts. Examples include intelligent surveillance for security purposes, scene analysis for service robots, and crowd behavior analysis for human behavior study. Over the last several years, an increasing number of laser-based people-tracking systems have been developed on both mobile robotics platforms and fixed platforms using one or multiple laser scanners. It has been proved that processing laser scanner data makes the tracker much faster and more robust than a vision-only based one in complex situations. In this chapter, we present a novel robust tracker to detect and track multiple people in a crowded and open area in real time. First, raw data are obtained that measure the two legs of each person at a height of 16 cm from the horizontal ground with multiple registered laser scanners. A stable feature is extracted using the accumulated distribution of successive laser frames. In this way, the noise that generates split and merged measurements is smoothed well, and the pattern of rhythmic swinging legs is utilized to extract each leg. Second, a probabilistic tracking model is presented, and then a sequential inference process using a Bayesian rule is described. A sequential inference process is difficult to compute analytically, so two strategies are presented to simplify the computation. In the case of independent tracking, the Kalman filter is used with a more efficient measurement likelihood model based on a region coherency property. Finally, to deal with trajectory fragments, we present a concise approach to fuse a small amount of visual information from a synchronized video camera with the laser data. Evaluation with real data shows that the proposed method is robust and effective. It achieves a significant improvement compared with existing laser-based trackers.
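
    For the independent-tracking case mentioned above, the per-person state update is a standard constant-velocity Kalman filter; a minimal sketch follows (the chapter's region-coherency measurement likelihood and data association are not reproduced, and the noise values are illustrative):

    ```python
    import numpy as np

    dt = 0.1                                     # laser frame interval (s)
    F = np.array([[1, 0, dt, 0],                 # constant-velocity motion model
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    H = np.array([[1, 0, 0, 0],                  # only (x, y) is observed
                  [0, 1, 0, 0]], dtype=float)
    Q = 0.01 * np.eye(4)                         # process noise covariance
    R = 0.05 * np.eye(2)                         # measurement noise covariance

    def kalman_step(x, P, z):
        """One predict/update cycle for a single tracked person's leg center."""
        x, P = F @ x, F @ P @ F.T + Q            # predict
        S = H @ P @ H.T + R                      # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
        x = x + K @ (z - H @ x)                  # update with the laser hit z
        P = (np.eye(4) - K @ H) @ P
        return x, P
    ```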

  13. Multiple imputation to account for measurement error in marginal structural models

    PubMed Central

    Edwards, Jessie K.; Cole, Stephen R.; Westreich, Daniel; Crane, Heidi; Eron, Joseph J.; Mathews, W. Christopher; Moore, Richard; Boswell, Stephen L.; Lesko, Catherine R.; Mugavero, Michael J.

    2015-01-01

    Background Marginal structural models are an important tool for observational studies. These models typically assume that variables are measured without error. We describe a method to account for differential and non-differential measurement error in a marginal structural model. Methods We illustrate the method estimating the joint effects of antiretroviral therapy initiation and current smoking on all-cause mortality in a United States cohort of 12,290 patients with HIV followed for up to 5 years between 1998 and 2011. Smoking status was likely measured with error, but a subset of 3686 patients who reported smoking status on separate questionnaires composed an internal validation subgroup. We compared a standard joint marginal structural model fit using inverse probability weights to a model that also accounted for misclassification of smoking status using multiple imputation. Results In the standard analysis, current smoking was not associated with increased risk of mortality. After accounting for misclassification, current smoking without therapy was associated with increased mortality [hazard ratio (HR): 1.2; 95% CI: 0.6, 2.3]. The HR for current smoking and therapy (0.4; 95% CI: 0.2, 0.7) was similar to the HR for no smoking and therapy (0.4; 95% CI: 0.2, 0.6). Conclusions Multiple imputation can be used to account for measurement error in concert with methods for causal inference to strengthen results from observational studies. PMID:26214338
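
    After the imputation model generates M completed datasets, the per-imputation estimates are combined with Rubin's rules; a minimal sketch of that pooling step (the numbers below are hypothetical, not from the study):

    ```python
    import numpy as np

    def pool_rubin(estimates, variances):
        """Rubin's rules: pool point estimates and variances over M imputations."""
        q = np.asarray(estimates, dtype=float)
        u = np.asarray(variances, dtype=float)
        m = len(q)
        qbar = q.mean()                          # pooled point estimate
        within = u.mean()                        # within-imputation variance
        between = q.var(ddof=1)                  # between-imputation variance
        total = within + (1 + 1 / m) * between
        return qbar, np.sqrt(total)              # estimate and its standard error

    # e.g. five imputed-data log hazard ratios and their variances:
    print(pool_rubin([0.21, 0.18, 0.25, 0.19, 0.22],
                     [0.09, 0.08, 0.10, 0.09, 0.08]))
    ```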

  14. An algebra-based method for inferring gene regulatory networks.

    PubMed

    Vera-Licona, Paola; Jarrah, Abdul; Garcia-Puente, Luis David; McGee, John; Laubenbacher, Reinhard

    2014-03-26

    The inference of gene regulatory networks (GRNs) from experimental observations is at the heart of systems biology. This includes the inference of both the network topology and its dynamics. While there are many algorithms available to infer the network topology from experimental data, less emphasis has been placed on methods that infer network dynamics. Furthermore, since the network inference problem is typically underdetermined, it is essential to have the option of incorporating into the inference process, prior knowledge about the network, along with an effective description of the search space of dynamic models. Finally, it is also important to have an understanding of how a given inference method is affected by experimental and other noise in the data used. This paper contains a novel inference algorithm using the algebraic framework of Boolean polynomial dynamical systems (BPDS), meeting all these requirements. The algorithm takes as input time series data, including those from network perturbations, such as knock-out mutant strains and RNAi experiments. It allows for the incorporation of prior biological knowledge while being robust to significant levels of noise in the data used for inference. It uses an evolutionary algorithm for local optimization with an encoding of the mathematical models as BPDS. The BPDS framework allows an effective representation of the search space for algebraic dynamic models that improves computational performance. The algorithm is validated with both simulated and experimental microarray expression profile data. Robustness to noise is tested using a published mathematical model of the segment polarity gene network in Drosophila melanogaster. Benchmarking of the algorithm is done by comparison with a spectrum of state-of-the-art network inference methods on data from the synthetic IRMA network to demonstrate that our method has good precision and recall for the network reconstruction task, while also predicting several of the dynamic patterns present in the network. Boolean polynomial dynamical systems provide a powerful modeling framework for the reverse engineering of gene regulatory networks that enables a rich mathematical structure on the model search space. A C++ implementation of the method, distributed under the LGPL license, is available, together with the source code, at http://www.paola-vera-licona.net/Software/EARevEng/REACT.html.
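
    To make the model class concrete: a Boolean polynomial dynamical system updates each variable by a polynomial over the two-element field F_2, where addition is XOR and multiplication is AND. A toy three-gene system (invented for illustration, not from the paper):

    ```python
    # Toy Boolean polynomial dynamical system over F_2 (three genes).
    def step(state):
        x1, x2, x3 = state
        f1 = (x2 * x3) % 2        # f1 = x2*x3      (AND)
        f2 = (x1 + x3) % 2        # f2 = x1 + x3    (XOR)
        f3 = (1 + x1) % 2         # f3 = 1 + x1     (NOT)
        return (f1, f2, f3)

    state = (1, 0, 1)
    for _ in range(4):            # iterate the dynamics from an initial state
        state = step(state)
        print(state)
    ```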

  15. Detection of multiple damages employing best achievable eigenvectors under Bayesian inference

    NASA Astrophysics Data System (ADS)

    Prajapat, Kanta; Ray-Chaudhuri, Samit

    2018-05-01

    A novel approach is presented in this work to simultaneously localize multiple damaged elements in a structure and estimate the damage severity for each of the damaged elements. For detection of damaged elements, a best achievable eigenvector based formulation has been derived. To deal with noisy data, Bayesian inference is employed in the formulation, wherein the likelihood of the Bayesian algorithm is formed on the basis of errors between the best achievable eigenvectors and the measured modes. In this approach, the most probable damage locations are evaluated under Bayesian inference by generating combinations of various possible damaged elements. Once damage locations are identified, damage severities are estimated using a Bayesian inference Markov chain Monte Carlo simulation. The efficiency of the proposed approach has been demonstrated by carrying out a numerical study involving a 12-story shear building. It has been found from this study that damage scenarios involving as little as 10% loss of stiffness in multiple elements are accurately determined (localized and severities quantified) even when 2% noise-contaminated modal data are utilized. Further, this study introduces a term, parameter impact (evaluated based on the sensitivity of modal parameters to structural parameters), to decide the suitability of selecting a particular mode if some idea about the damaged elements is available. It has been demonstrated here that the accuracy and efficiency of the Bayesian quantification algorithm increase if damage localization is carried out a priori. An experimental study involving a laboratory-scale shear building and different stiffness modification scenarios shows that the proposed approach is efficient enough to localize the stories with stiffness modification.

  16. Meta-learning framework applied in bioinformatics inference system design.

    PubMed

    Arredondo, Tomás; Ormazábal, Wladimir

    2015-01-01

    This paper describes a meta-learner inference system development framework which is applied and tested in the implementation of bioinformatic inference systems. These inference systems are used for the systematic classification of the best candidates for inclusion in bacterial metabolic pathway maps. This meta-learner-based approach utilises a workflow where the user provides feedback with final classification decisions, which are stored in conjunction with analysed genetic sequences for periodic inference system training. The inference systems were trained and tested with three different data sets related to the bacterial degradation of aromatic compounds. The analysis of the meta-learner-based framework involved contrasting several optimisation methods with various parameter settings. The obtained inference systems were also contrasted with other standard classification methods, and accurate prediction capabilities were observed.

  17. A sub-space greedy search method for efficient Bayesian Network inference.

    PubMed

    Zhang, Qing; Cao, Yong; Li, Yong; Zhu, Yanming; Sun, Samuel S M; Guo, Dianjing

    2011-09-01

    Bayesian network (BN) has been successfully used to infer the regulatory relationships of genes from microarray datasets. However, one major limitation of the BN approach is the computational cost, because the calculation time grows more than exponentially with the dimension of the dataset. In this paper, we propose a sub-space greedy search method for efficient Bayesian Network inference. Particularly, this method limits the greedy search space by only selecting gene pairs with higher partial correlation coefficients. Using both synthetic and real data, we demonstrate that the proposed method achieved comparable results with the standard greedy search method yet saved ∼50% of the computational time. We believe that the sub-space search method can be widely used for efficient BN inference in systems biology. Copyright © 2011 Elsevier Ltd. All rights reserved.
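
    The pre-screening step can be made concrete: partial correlations between all gene pairs, given the rest, can be read off the inverse of the correlation matrix, and only high-scoring pairs enter the greedy search. A sketch of that standard computation (the paper's exact selection rule may differ, and the threshold is an assumption):

    ```python
    import numpy as np

    def candidate_pairs(expr, threshold=0.3):
        """Gene pairs with high partial correlation, as a reduced BN search space.

        expr: (samples x genes) expression matrix.
        """
        corr = np.corrcoef(expr, rowvar=False)
        prec = np.linalg.pinv(corr)              # precision matrix
        d = np.sqrt(np.diag(prec))
        pcor = -prec / np.outer(d, d)            # partial correlations
        np.fill_diagonal(pcor, 1.0)
        i, j = np.where(np.triu(np.abs(pcor) > threshold, k=1))
        return list(zip(i.tolist(), j.tolist()))
    ```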

  18. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Del Pozzo, Walter; Nikhef National Institute for Subatomic Physics, Science Park 105, 1098 XG Amsterdam; Veitch, John

    Second-generation interferometric gravitational-wave detectors, such as Advanced LIGO and Advanced Virgo, are expected to begin operation by 2015. Such instruments plan to reach sensitivities that will offer the unique possibility to test general relativity in the dynamical, strong-field regime and investigate departures from its predictions, in particular, using the signal from coalescing binary systems. We introduce a statistical framework based on Bayesian model selection in which the Bayes factor between two competing hypotheses measures which theory is favored by the data. Probability density functions of the model parameters are then used to quantify the inference on individual parameters. We also develop a method to combine the information coming from multiple independent observations of gravitational waves, and show how much stronger inference could be. As an introduction and illustration of this framework, and a practical numerical implementation through the Monte Carlo integration technique of nested sampling, we apply it to gravitational waves from the inspiral phase of coalescing binary systems as predicted by general relativity and a very simple alternative theory in which the graviton has a nonzero mass. This method can (and should) be extended to more realistic and physically motivated theories.
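
    Schematically, the combination over multiple detections works because, for independent events (and hypotheses with no parameters tied across events), per-event odds multiply, so log evidence accumulates additively; in our notation:

    ```latex
    % Combining N independent observations d_1, ..., d_N (no parameters tied
    % across events): per-event Bayes factors multiply, log-evidence adds.
    \[
    B^{(N)} = \prod_{i=1}^{N}
      \frac{P(d_i \mid \mathcal{H}_{\mathrm{alt}})}{P(d_i \mid \mathcal{H}_{\mathrm{GR}})}
    \quad\Longrightarrow\quad
    \ln B^{(N)} = \sum_{i=1}^{N} \ln B_i .
    \]
    % Each B_i is the Bayes factor computed (e.g. by nested sampling) for event i.
    ```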

  19. Predictive inference for best linear combination of biomarkers subject to limits of detection.

    PubMed

    Coolen-Maturi, Tahani

    2017-08-15

    Measuring the accuracy of diagnostic tests is crucial in many application areas including medicine, machine learning and credit scoring. The receiver operating characteristic (ROC) curve is a useful tool to assess the ability of a diagnostic test to discriminate between two classes or groups. In practice, multiple diagnostic tests or biomarkers are combined to improve diagnostic accuracy. Often, biomarker measurements are undetectable either below or above the so-called limits of detection (LoD). In this paper, nonparametric predictive inference (NPI) for best linear combination of two or more biomarkers subject to limits of detection is presented. NPI is a frequentist statistical method that is explicitly aimed at using few modelling assumptions, enabled through the use of lower and upper probabilities to quantify uncertainty. The NPI lower and upper bounds for the ROC curve subject to limits of detection are derived, where the objective function to maximize is the area under the ROC curve. In addition, the paper discusses the effect of restriction on the linear combination's coefficients on the analysis. Examples are provided to illustrate the proposed method. Copyright © 2017 John Wiley & Sons, Ltd.

  20. Quantification of downscaled precipitation uncertainties via Bayesian inference

    NASA Astrophysics Data System (ADS)

    Nury, A. H.; Sharma, A.; Marshall, L. A.

    2017-12-01

    Prediction of precipitation from global climate model (GCM) outputs remains critical to decision-making in water-stressed regions. In this regard, downscaling of GCM output has been a useful tool for analysing future hydro-climatological states. Several downscaling approaches have been developed for precipitation downscaling, including dynamical and statistical methods. Frequently, outputs from dynamical downscaling are not readily transferable across regions because of significant methodological and computational difficulties. Statistical downscaling approaches provide a flexible and efficient alternative, providing hydro-climatological outputs across multiple temporal and spatial scales in many locations. However, these approaches are subject to significant uncertainty, arising from uncertainty in the downscaled model parameters and from the use of different reanalysis products for inferring appropriate model parameters. Consequently, these uncertainties affect simulation performance at the catchment scale. This study develops a Bayesian framework for modelling downscaled daily precipitation from GCM outputs, and aims to characterize downscaling uncertainties by evaluating reanalysis datasets against observational rainfall data over Australia. A consistent technique for quantifying downscaling uncertainties by means of a Bayesian downscaling framework is proposed. The results suggest that there are differences in downscaled precipitation occurrences and extremes.

  1. Development of a Research Reactor Protocol for Neutron Multiplication Measurements

    DOE PAGES

    Arthur, Jennifer Ann; Bahran, Rian Mustafa; Hutchinson, Jesson D.; ...

    2018-03-20

    A new series of subcritical measurements has been conducted at the zero-power Walthousen Reactor Critical Facility (RCF) at Rensselaer Polytechnic Institute (RPI) using a 3He neutron multiplicity detector. The Critical and Subcritical 0-Power Experiment at Rensselaer (CaSPER) campaign establishes a protocol for advanced subcritical neutron multiplication measurements involving research reactors for validation of neutron multiplication inference techniques, Monte Carlo codes, and associated nuclear data. There has been increased attention and expanded efforts related to subcritical measurements and analyses, and this work provides yet another data set at known reactivity states that can be used in the validation of state-of-the-art Monte Carlo computer simulation tools. The diverse (mass, spatial, spectral) subcritical measurement configurations have been analyzed to produce parameters of interest such as singles rates, doubles rates, and leakage multiplication. MCNP®6.2 was used to simulate the experiment, and the resulting simulated data have been compared to the measured results. Comparison of the simulated and measured observables (singles rates, doubles rates, and leakage multiplication) shows good agreement. This work builds upon previous years of collaborative subcritical experiments and outlines a protocol for future subcritical neutron multiplication inference and subcriticality monitoring measurements on pool-type reactor systems.

  2. Development of a Research Reactor Protocol for Neutron Multiplication Measurements

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Arthur, Jennifer Ann; Bahran, Rian Mustafa; Hutchinson, Jesson D.

    A new series of subcritical measurements has been conducted at the zero-power Walthousen Reactor Critical Facility (RCF) at Rensselaer Polytechnic Institute (RPI) using a 3He neutron multiplicity detector. The Critical and Subcritical 0-Power Experiment at Rensselaer (CaSPER) campaign establishes a protocol for advanced subcritical neutron multiplication measurements involving research reactors for validation of neutron multiplication inference techniques, Monte Carlo codes, and associated nuclear data. There has been increased attention and expanded efforts related to subcritical measurements and analyses, and this work provides yet another data set at known reactivity states that can be used in the validation of state-of-the-art Monte Carlo computer simulation tools. The diverse (mass, spatial, spectral) subcritical measurement configurations have been analyzed to produce parameters of interest such as singles rates, doubles rates, and leakage multiplication. MCNP®6.2 was used to simulate the experiment, and the resulting simulated data have been compared to the measured results. Comparison of the simulated and measured observables (singles rates, doubles rates, and leakage multiplication) shows good agreement. This work builds upon previous years of collaborative subcritical experiments and outlines a protocol for future subcritical neutron multiplication inference and subcriticality monitoring measurements on pool-type reactor systems.

  3. Review of Recent Methodological Developments in Group-Randomized Trials: Part 2-Analysis.

    PubMed

    Turner, Elizabeth L; Prague, Melanie; Gallis, John A; Li, Fan; Murray, David M

    2017-07-01

    In 2004, Murray et al. reviewed methodological developments in the design and analysis of group-randomized trials (GRTs). We have updated that review with developments in analysis of the past 13 years, with a companion article to focus on developments in design. We discuss developments in the topics of the earlier review (e.g., methods for parallel-arm GRTs, individually randomized group-treatment trials, and missing data) and in new topics, including methods to account for multiple-level clustering and alternative estimation methods (e.g., augmented generalized estimating equations, targeted maximum likelihood, and quadratic inference functions). In addition, we describe developments in analysis of alternative group designs (including stepped-wedge GRTs, network-randomized trials, and pseudocluster randomized trials), which require clustering to be accounted for in their design and analysis.

  4. Brain Imaging, Forward Inference, and Theories of Reasoning

    PubMed Central

    Heit, Evan

    2015-01-01

    This review focuses on the issue of how neuroimaging studies address theoretical accounts of reasoning, through the lens of the method of forward inference (Henson, 2005, 2006). After theories of deductive and inductive reasoning are briefly presented, the method of forward inference for distinguishing between psychological theories based on brain imaging evidence is critically reviewed. Brain imaging studies of reasoning, comparing deductive and inductive arguments, comparing meaningful versus non-meaningful material, investigating hemispheric localization, and comparing conditional and relational arguments, are assessed in light of the method of forward inference. Finally, conclusions are drawn with regard to future research opportunities. PMID:25620926

  5. Brain imaging, forward inference, and theories of reasoning.

    PubMed

    Heit, Evan

    2014-01-01

    This review focuses on the issue of how neuroimaging studies address theoretical accounts of reasoning, through the lens of the method of forward inference (Henson, 2005, 2006). After theories of deductive and inductive reasoning are briefly presented, the method of forward inference for distinguishing between psychological theories based on brain imaging evidence is critically reviewed. Brain imaging studies of reasoning, comparing deductive and inductive arguments, comparing meaningful versus non-meaningful material, investigating hemispheric localization, and comparing conditional and relational arguments, are assessed in light of the method of forward inference. Finally, conclusions are drawn with regard to future research opportunities.

  6. CompareSVM: supervised, Support Vector Machine (SVM) inference of gene regulatory networks.

    PubMed

    Gillani, Zeeshan; Akash, Muhammad Sajid Hamid; Rahaman, M D Matiur; Chen, Ming

    2014-11-30

    Prediction of gene regulatory networks (GRNs) from expression data is a challenging task. Many methods have been developed to address this challenge, ranging from supervised to unsupervised approaches, and the most promising of these are based on support vector machines (SVMs). There is a need for a comprehensive analysis of the prediction accuracy of supervised SVM methods using different kernels under different biological experimental conditions and network sizes. We developed a tool (CompareSVM) based on SVM to compare different kernel methods for inference of GRNs. Using CompareSVM, we investigated and evaluated different SVM kernel methods in detail on simulated microarray datasets of different sizes. The results obtained from CompareSVM showed that the accuracy of an inference method depends upon the nature of the experimental condition and the size of the network. For small networks (<200 nodes), and on average over all network sizes, the SVM Gaussian kernel outperforms all other inference methods on knockout, knockdown, and multifactorial datasets. For networks with a large number of nodes (~500), the choice of inference method depends upon the nature of the experimental condition. CompareSVM is available at http://bis.zju.edu.cn/CompareSVM/ .
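
    A minimal sketch of the kind of kernel comparison CompareSVM automates, using scikit-learn on a synthetic edge-classification task (the tool's actual feature construction from expression profiles is not reproduced here):

    ```python
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    # Toy stand-in for edge features: each row describes a candidate
    # regulator-target pair, with label 1 for a true regulatory edge.
    X = rng.normal(size=(300, 20))
    w = rng.normal(size=20)
    y = (X @ w + 0.5 * rng.normal(size=300) > 0).astype(int)

    # Cross-validated accuracy for each SVM kernel.
    for kernel in ("linear", "poly", "rbf", "sigmoid"):
        acc = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
        print(f"{kernel:>8}: CV accuracy {acc:.3f}")
    ```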

  7. Current sequencing technology makes microhaplotypes a powerful new type of genetic marker for forensics.

    PubMed

    Kidd, Kenneth K; Pakstis, Andrew J; Speed, William C; Lagacé, Robert; Chang, Joseph; Wootton, Sharon; Haigh, Eva; Kidd, Judith R

    2014-09-01

    SNPs that are molecularly very close (<10 kb) will generally have extremely low recombination rates, much less than 10^-4. Multiple haplotypes will often exist because of the history of the origins of the variants at the different sites, rare recombinants, and the vagaries of random genetic drift and/or selection. Such multiallelic haplotype loci are potentially important in forensic work for individual identification, for defining ancestry, and for identifying familial relationships. The new DNA sequencing capabilities currently available make possible continuous runs of a few hundred base pairs so that we can now determine the allelic combination of multiple SNPs on each chromosome of an individual, i.e., the phase, for multiple SNPs within a small segment of DNA. Therefore, we have begun to identify regions, encompassing two to four SNPs with an extent of <200 bp, that define multiallelic haplotype loci. We have identified candidate regions and have collected pilot data on many candidate microhaplotype loci. Here we present 31 microhaplotype loci that have at least three alleles, have high heterozygosity, are globally informative, and are statistically independent at the population level. This study of microhaplotype loci (microhaps) provides proof of principle that such markers exist and validates their usefulness for ancestry inference, lineage-clan-family inference, and individual identification. The true value of microhaplotypes will come with sequencing methods that can establish alleles unambiguously, including disentangling of mixtures, because a single sequencing run on a single strand of DNA will encompass all of the SNPs. Copyright © 2014 The Authors. Published by Elsevier Ireland Ltd. All rights reserved.
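
    The informativeness criteria above rest on standard population-genetic quantities. A small sketch of the expected heterozygosity of a multiallelic locus, H = 1 - sum_i p_i^2, with hypothetical haplotype frequencies:

    ```python
    # Expected heterozygosity of a multiallelic (microhaplotype) locus:
    # H = 1 - sum_i p_i^2, where p_i are population haplotype frequencies.
    def heterozygosity(freqs):
        assert abs(sum(freqs) - 1.0) < 1e-9, "frequencies must sum to 1"
        return 1.0 - sum(p * p for p in freqs)

    # Hypothetical three-allele microhaplotype.
    p = [0.45, 0.35, 0.20]
    h = heterozygosity(p)
    print(f"H = {h:.3f}, effective number of alleles = {1 / (1 - h):.2f}")
    ```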

  8. Differences in the rotational properties of multiple stellar populations in M13: a faster rotation for the `extreme' chemical subpopulation

    NASA Astrophysics Data System (ADS)

    Cordero, M. J.; Hénault-Brunet, V.; Pilachowski, C. A.; Balbinot, E.; Johnson, C. I.; Varri, A. L.

    2017-03-01

    We use radial velocities from spectra of giants obtained with the WIYN telescope, coupled with existing chemical abundance measurements of Na and O for the same stars, to probe the presence of kinematic differences among the multiple populations of the globular cluster (GC) M13. To characterize the kinematics of various chemical subsamples, we introduce a method using Bayesian inference along with a Markov chain Monte Carlo algorithm to fit a six-parameter kinematic model (including rotation) to these subsamples. We find that the so-called extreme population (Na-enhanced and extremely O-depleted) exhibits faster rotation around the centre of the cluster than the other cluster stars, in particular, when compared with the dominant `intermediate' population (moderately Na-enhanced and O-depleted). The most likely difference between the rotational amplitude of this extreme population and that of the intermediate population is found to be ~4 km s^-1, with a 98.4 per cent probability that the rotational amplitude of the extreme population is larger than that of the intermediate population. We argue that the observed difference in rotational amplitudes, obtained when splitting subsamples according to their chemistry, is not a product of the long-term dynamical evolution of the cluster, but more likely a surviving feature imprinted early in the formation history of this GC and its multiple populations. We also find an agreement (within uncertainties) in the inferred position angle of the rotation axis of the different subpopulations considered. We discuss the constraints that these results may place on various formation scenarios.

  9. Multiple-try differential evolution adaptive Metropolis for efficient solution of highly parameterized models

    NASA Astrophysics Data System (ADS)

    Eric, L.; Vrugt, J. A.

    2010-12-01

    Spatially distributed hydrologic models potentially contain hundreds of parameters that need to be derived by calibration against a historical record of input-output data. The quality of this calibration strongly determines the predictive capability of the model and thus its usefulness for science-based decision making and forecasting. Unfortunately, high-dimensional optimization problems are typically difficult to solve. Here we present our recent developments to the Differential Evolution Adaptive Metropolis (DREAM) algorithm (Vrugt et al., 2009) to enable efficient solution of high-dimensional parameter estimation problems. The algorithm samples from an archive of past states (Ter Braak and Vrugt, 2008), and uses multiple-try Metropolis sampling (Liu et al., 2000) to decrease the required burn-in time for each individual chain and increase the efficiency of posterior sampling. This approach is hereafter referred to as MT-DREAM. We present results for 2 synthetic mathematical case studies and 2 real-world examples involving 10 to 240 parameters. Results for those cases show that our multiple-try sampler, MT-DREAM, can consistently find better solutions than other Bayesian MCMC methods. Moreover, MT-DREAM is admirably suited to be implemented and run on a parallel machine and is therefore a powerful method for posterior inference.
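
    MT-DREAM couples DREAM's differential-evolution proposals with the multiple-try scheme; the sketch below shows only a generic multiple-try Metropolis step (Liu et al., 2000) with a symmetric Gaussian proposal on a toy one-dimensional target, not the full algorithm:

    ```python
    import numpy as np

    rng = np.random.default_rng(2)

    def target(x):
        # Unnormalised target density (a standard normal as a stand-in
        # for a high-dimensional hydrologic posterior).
        return np.exp(-0.5 * x**2)

    def mtm_step(x, k=5, step=2.0):
        # Multiple-try Metropolis with a symmetric Gaussian proposal
        # and weights w(y) = pi(y).
        ys = x + step * rng.normal(size=k)          # k candidate points
        wy = target(ys)
        y = rng.choice(ys, p=wy / wy.sum())         # select one candidate
        xs = y + step * rng.normal(size=k - 1)      # reference points
        wx = target(np.append(xs, x))
        alpha = min(1.0, wy.sum() / wx.sum())       # generalised MH ratio
        return y if rng.random() < alpha else x

    chain = [0.0]
    for _ in range(5000):
        chain.append(mtm_step(chain[-1]))
    print(f"sample mean {np.mean(chain):.3f}, sample std {np.std(chain):.3f}")
    ```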

  10. Accounting for Movement between Childcare Classrooms: Does it Change Teacher Effects Interpretations?

    PubMed Central

    Messan Setodji, Claude; Le, Vi-Nhuan; Schaack, Diana

    2012-01-01

    Child care studies that have examined links between teachers' qualifications and children's outcomes often ignore teachers’ and children’s transitions between classrooms at a center throughout the day and only take into account head teacher qualifications. The objective of this investigation was to examine these traditional assumptions and to compare inferences made from these traditional models to methods accounting for transitions between classrooms and multiple teachers in a classroom. The study examined the receptive language, letter-word identification, and passage comprehension skills of 307 children enrolled in 49 community-based childcare centers serving primarily low-income families in Colorado. Results suggest that nearly one-third of children and over 80% of teachers moved daily between classrooms. Findings also reveal that failure to account for daily transitions between classrooms can affect interpretations of the relationship between teacher qualifications and child outcomes, with the model accounting for movement providing significant improvements in model fit and inference. PMID:22389546

  11. The 4D hyperspherical diffusion wavelet: A new method for the detection of localized anatomical variation.

    PubMed

    Hosseinbor, Ameer Pasha; Kim, Won Hwa; Adluru, Nagesh; Acharya, Amit; Vorperian, Houri K; Chung, Moo K

    2014-01-01

    Recently, the HyperSPHARM algorithm was proposed to parameterize multiple disjoint objects in a holistic manner using the 4D hyperspherical harmonics. The HyperSPHARM coefficients are global; they cannot be used to directly infer localized variations in signal. In this paper, we present a unified wavelet framework that links HyperSPHARM to the diffusion wavelet transform. Specifically, we will show that the HyperSPHARM basis forms a subset of a wavelet-based multiscale representation of surface-based signals. This wavelet, termed the hyperspherical diffusion wavelet, is a consequence of the equivalence of isotropic heat diffusion smoothing and the diffusion wavelet transform on the hypersphere. Our framework allows for the statistical inference of highly localized anatomical changes, which we demonstrate in the first-ever developmental study on the hyoid bone investigating gender and age effects. We also show that the hyperspherical wavelet successfully picks up group-wise differences that are barely detectable using SPHARM.

  12. The 4D Hyperspherical Diffusion Wavelet: A New Method for the Detection of Localized Anatomical Variation

    PubMed Central

    Hosseinbor, A. Pasha; Kim, Won Hwa; Adluru, Nagesh; Acharya, Amit; Vorperian, Houri K.; Chung, Moo K.

    2014-01-01

    Recently, the HyperSPHARM algorithm was proposed to parameterize multiple disjoint objects in a holistic manner using the 4D hyperspherical harmonics. The HyperSPHARM coefficients are global; they cannot be used to directly infer localized variations in signal. In this paper, we present a unified wavelet framework that links HyperSPHARM to the diffusion wavelet transform. Specifically, we will show that the HyperSPHARM basis forms a subset of a wavelet-based multiscale representation of surface-based signals. This wavelet, termed the hyperspherical diffusion wavelet, is a consequence of the equivalence of isotropic heat diffusion smoothing and the diffusion wavelet transform on the hypersphere. Our framework allows for the statistical inference of highly localized anatomical changes, which we demonstrate in the first-ever developmental study on the hyoid bone investigating gender and age effects. We also show that the hyperspherical wavelet successfully picks up group-wise differences that are barely detectable using SPHARM. PMID:25320783

  13. Inferring the temperature dependence of population parameters: the effects of experimental design and inference algorithm

    PubMed Central

    Palamara, Gian Marco; Childs, Dylan Z; Clements, Christopher F; Petchey, Owen L; Plebani, Marco; Smith, Matthew J

    2014-01-01

    Understanding and quantifying the temperature dependence of population parameters, such as intrinsic growth rate and carrying capacity, is critical for predicting the ecological responses to environmental change. Many studies provide empirical estimates of such temperature dependencies, but a thorough investigation of the methods used to infer them has not been performed yet. We created artificial population time series using a stochastic logistic model parameterized with the Arrhenius equation, so that activation energy drives the temperature dependence of population parameters. We simulated different experimental designs and used different inference methods, varying the likelihood functions and other aspects of the parameter estimation methods. Finally, we applied the best performing inference methods to real data for the species Paramecium caudatum. The relative error of the estimates of activation energy varied between 5% and 30%. The fraction of habitat sampled played the most important role in determining the relative error; sampling at least 1% of the habitat kept it below 50%. We found that methods that simultaneously use all time series data (direct methods) and methods that estimate population parameters separately for each temperature (indirect methods) are complementary. Indirect methods provide a clearer insight into the shape of the functional form describing the temperature dependence of population parameters; direct methods enable a more accurate estimation of the parameters of such functional forms. Using both methods, we found that growth rate and carrying capacity of Paramecium caudatum scale with temperature according to different activation energies. Our study shows how careful choice of experimental design and inference methods can increase the accuracy of the inferred relationships between temperature and population parameters. The comparison of estimation methods provided here can increase the accuracy of model predictions, with important implications in understanding and predicting the effects of temperature on the dynamics of populations. PMID:25558365
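
    The "indirect" route described above reduces, under an assumed Arrhenius form r(T) = r0 exp(-E / (k_B T)), to a linear regression of log growth rate on inverse temperature. A sketch with synthetic rate estimates:

    ```python
    import numpy as np

    k_B = 8.617e-5  # Boltzmann constant, eV/K

    rng = np.random.default_rng(3)
    # Synthetic growth-rate estimates at several temperatures, generated
    # from an Arrhenius relation r(T) = r0 * exp(-E / (k_B * T)).
    E_true, r0 = 0.65, 1.0e10
    T = np.linspace(283.0, 303.0, 8)  # kelvin
    r = r0 * np.exp(-E_true / (k_B * T)) * np.exp(0.05 * rng.normal(size=T.size))

    # "Indirect" estimate: regress log r on -1/(k_B T); the slope is
    # the activation energy.
    slope, intercept = np.polyfit(-1.0 / (k_B * T), np.log(r), 1)
    print(f"estimated activation energy: {slope:.3f} eV (true {E_true} eV)")
    ```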

  14. Modeling the Evolution of Beliefs Using an Attentional Focus Mechanism

    PubMed Central

    Marković, Dimitrije; Gläscher, Jan; Bossaerts, Peter; O’Doherty, John; Kiebel, Stefan J.

    2015-01-01

    For making decisions in everyday life we often have first to infer the set of environmental features that are relevant for the current task. Here we investigated the computational mechanisms underlying the evolution of beliefs about the relevance of environmental features in a dynamical and noisy environment. For this purpose we designed a probabilistic Wisconsin card sorting task (WCST) with belief solicitation, in which subjects were presented with stimuli composed of multiple visual features. At each moment in time a particular feature was relevant for obtaining reward, and participants had to infer which feature was relevant and report their beliefs accordingly. To test the hypothesis that attentional focus modulates the belief update process, we derived and fitted several probabilistic and non-probabilistic behavioral models, which either incorporate a dynamical model of attentional focus, in the form of a hierarchical winner-take-all neuronal network, or a diffusive model, without attention-like features. We used Bayesian model selection to identify the most likely generative model of subjects’ behavior and found that attention-like features in the behavioral model are essential for explaining subjects’ responses. Furthermore, we demonstrate a method for integrating both connectionist and Bayesian models of decision making within a single framework that allowed us to infer hidden belief processes of human subjects. PMID:26495984

  15. Irrational exuberance for resolved species trees.

    PubMed

    Hahn, Matthew W; Nakhleh, Luay

    2016-01-01

    Phylogenomics has largely succeeded in its aim of accurately inferring species trees, even when there are high levels of discordance among individual gene trees. These resolved species trees can be used to ask many questions about trait evolution, including the direction of change and number of times traits have evolved. However, the mapping of traits onto trees generally uses only a single representation of the species tree, ignoring variation in the gene trees used to construct it. Recognizing that genes underlie traits, these results imply that many traits follow topologies that are discordant with the species topology. As a consequence, standard methods for character mapping will incorrectly infer the number of times a trait has evolved. This phenomenon, dubbed "hemiplasy," poses many problems in analyses of character evolution. Here we outline these problems, explaining where and when they are likely to occur. We offer several ways in which the possible presence of hemiplasy can be diagnosed, and discuss multiple approaches to dealing with the problems presented by underlying gene tree discordance when carrying out character mapping. Finally, we discuss the implications of hemiplasy for general phylogenetic inference, including the possible drawbacks of the widespread push for "resolved" species trees. © 2015 The Author(s). Evolution © 2015 The Society for the Study of Evolution.

  16. Nuclear and cpDNA sequences combined provide strong inference of higher phylogenetic relationships in the phlox family (Polemoniaceae).

    PubMed

    Johnson, Leigh A; Chan, Lauren M; Weese, Terri L; Busby, Lisa D; McMurry, Samuel

    2008-09-01

    Members of the phlox family (Polemoniaceae) serve as useful models for studying various evolutionary and biological processes. Despite its biological importance, no family-wide phylogenetic estimate based on multiple DNA regions with complete generic sampling is available. Here, we analyze one nuclear and five chloroplast DNA sequence regions (nuclear ITS, chloroplast matK, trnL intron plus trnL-trnF intergeneric spacer, and the trnS-trnG, trnD-trnT, and psbM-trnD intergenic spacers) using parsimony and Bayesian methods, as well as assessments of congruence and long branch attraction, to explore phylogenetic relationships among 84 ingroup species representing all currently recognized Polemoniaceae genera. Relationships inferred from the ITS and concatenated chloroplast regions are similar overall. A combined analysis provides strong support for the monophyly of Polemoniaceae and subfamilies Acanthogilioideae, Cobaeoideae, and Polemonioideae. Relationships among subfamilies, and thus for the precise root of Polemoniaceae, remain poorly supported. Within the largest subfamily, Polemonioideae, four clades corresponding to tribes Polemonieae, Phlocideae, Gilieae, and Loeselieae receive strong support. The monogeneric Polemonieae appears sister to Phlocideae. Relationships within Polemonieae, Phlocideae, and Gilieae are mostly consistent between analyses and data permutations. Many relationships within Loeselieae remain uncertain. Overall, inferred phylogenetic relationships support a higher-level classification for Polemoniaceae proposed in 2000.

  17. And the first one now will later be last: Time-reversal in Cormack-Jolly-Seber models

    USGS Publications Warehouse

    Nichols, James D.

    2016-01-01

    The models of Cormack, Jolly and Seber (CJS) are remarkable in providing a rich set of inferences about population survival, recruitment, abundance and even sampling probabilities from a seemingly limited data source: a matrix of 1's and 0's reflecting animal captures and recaptures at multiple sampling occasions. Survival and sampling probabilities are estimated directly in CJS models, whereas estimators for recruitment and abundance were initially obtained as derived quantities. Various investigators have noted that just as standard modeling provides direct inferences about survival, reversing the time order of capture history data permits direct modeling and inference about recruitment. Here we review the development of reverse-time modeling efforts, emphasizing the kinds of inferences and questions to which they seem well suited.

  18. Robust EM Continual Reassessment Method in Oncology Dose Finding

    PubMed Central

    Yuan, Ying; Yin, Guosheng

    2012-01-01

    The continual reassessment method (CRM) is a commonly used dose-finding design for phase I clinical trials. Practical applications of this method have been restricted by two limitations: (1) the requirement that the toxicity outcome needs to be observed shortly after the initiation of the treatment; and (2) the potential sensitivity to the prespecified toxicity probability at each dose. To overcome these limitations, we naturally treat the unobserved toxicity outcomes as missing data, and use the expectation-maximization (EM) algorithm to estimate the dose toxicity probabilities based on the incomplete data to direct dose assignment. To enhance the robustness of the design, we propose prespecifying multiple sets of toxicity probabilities, each set corresponding to an individual CRM model. We carry out these multiple CRMs in parallel, across which model selection and model averaging procedures are used to make more robust inference. We evaluate the operating characteristics of the proposed robust EM-CRM designs through simulation studies and show that the proposed methods satisfactorily resolve both limitations of the CRM. Besides improving the maximum tolerated dose (MTD) selection percentage, the new designs dramatically shorten the duration of the trial, and are robust to the prespecification of the toxicity probabilities. PMID:22375092
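
    A sketch of the model-averaging idea on hypothetical trial data: each skeleton defines a one-parameter power-model CRM, and dose-toxicity estimates are averaged with weights proportional to the marginal likelihoods (the EM handling of unobserved late-onset toxicities is not shown):

    ```python
    import numpy as np

    # Two hypothetical toxicity "skeletons" (prior guesses per dose).
    skeletons = [np.array([0.05, 0.10, 0.20, 0.35]),
                 np.array([0.10, 0.20, 0.35, 0.50])]
    # Hypothetical data per dose: number treated and number with toxicity.
    n = np.array([3, 6, 3, 0])
    y = np.array([0, 1, 1, 0])

    a_grid = np.linspace(-3, 3, 601)
    prior_a = np.exp(-0.5 * a_grid**2 / 2.0)  # N(0, 2) prior on a
    prior_a /= np.trapz(prior_a, a_grid)

    post_tox, marg_lik = [], []
    for sk in skeletons:
        # Power-model CRM: p_d(a) = skeleton_d ** exp(a).
        p = sk[None, :] ** np.exp(a_grid)[:, None]        # (grid, doses)
        lik = np.prod(p**y * (1 - p) ** (n - y), axis=1)  # binomial kernel
        marg_lik.append(np.trapz(lik * prior_a, a_grid))
        post = lik * prior_a / marg_lik[-1]
        post_tox.append(np.trapz(post[:, None] * p, a_grid, axis=0))

    # Model averaging with equal prior model weights.
    w = np.array(marg_lik) / sum(marg_lik)
    p_avg = sum(wi * pi for wi, pi in zip(w, post_tox))
    print("model weights:", np.round(w, 3))
    print("averaged toxicity estimates:", np.round(p_avg, 3))
    ```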

  19. Bias Characterization in Probabilistic Genotype Data and Improved Signal Detection with Multiple Imputation

    PubMed Central

    Palmer, Cameron; Pe’er, Itsik

    2016-01-01

    Missing data are an unavoidable component of modern statistical genetics. Different array or sequencing technologies cover different single nucleotide polymorphisms (SNPs), leading to a complicated mosaic pattern of missingness where both individual genotypes and entire SNPs are sporadically absent. Such missing data patterns cannot be ignored without introducing bias, yet cannot be inferred exclusively from nonmissing data. In genome-wide association studies, the accepted solution to missingness is to impute missing data using external reference haplotypes. The resulting probabilistic genotypes may be analyzed in the place of genotype calls. A general-purpose paradigm, called Multiple Imputation (MI), is known to model uncertainty in many contexts, yet it is not widely used in association studies. Here, we undertake a systematic evaluation of existing imputed data analysis methods and MI. We characterize biases related to uncertainty in association studies, and find that bias is introduced both at the imputation level, when imputation algorithms generate inconsistent genotype probabilities, and at the association level, when analysis methods inadequately model genotype uncertainty. We find that MI performs at least as well as existing methods or in some cases much better, and provides a straightforward paradigm for adapting existing genotype association methods to uncertain data. PMID:27310603
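
    The MI paradigm pools analyses of m completed datasets with Rubin's rules: the pooled estimate is the mean of the per-imputation estimates, and the total variance adds the within- and between-imputation components. A minimal sketch with hypothetical per-imputation effect sizes:

    ```python
    import numpy as np

    def rubins_rules(estimates, variances):
        """Combine per-imputation estimates with Rubin's rules.

        estimates, variances: arrays of length m (one per imputed dataset),
        e.g. per-SNP effect sizes and their squared standard errors.
        """
        m = len(estimates)
        q_bar = np.mean(estimates)          # pooled point estimate
        u_bar = np.mean(variances)          # within-imputation variance
        b = np.var(estimates, ddof=1)       # between-imputation variance
        t = u_bar + (1 + 1 / m) * b         # total variance
        return q_bar, np.sqrt(t)

    # Hypothetical log-odds ratios from m = 5 imputed genotype datasets.
    est = np.array([0.21, 0.25, 0.19, 0.23, 0.22])
    var = np.array([0.010, 0.011, 0.009, 0.010, 0.010])
    beta, se = rubins_rules(est, var)
    print(f"pooled estimate {beta:.3f} +/- {se:.3f}")
    ```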

  20. Spatiotemporal patterns of precipitation inferred from streamflow observations across the Sierra Nevada mountain range

    NASA Astrophysics Data System (ADS)

    Henn, Brian; Clark, Martyn P.; Kavetski, Dmitri; Newman, Andrew J.; Hughes, Mimi; McGurk, Bruce; Lundquist, Jessica D.

    2018-01-01

    Given uncertainty in precipitation gauge-based gridded datasets over complex terrain, we use multiple streamflow observations as an additional source of information about precipitation, in order to identify spatial and temporal differences between a gridded precipitation dataset and precipitation inferred from streamflow. We test whether gridded datasets capture across-crest and regional spatial patterns of variability, as well as year-to-year variability and trends in precipitation, in comparison to precipitation inferred from streamflow. We use a Bayesian model calibration routine with multiple lumped hydrologic model structures to infer the most likely basin-mean, water-year total precipitation for 56 basins with long-term (>30 year) streamflow records in the Sierra Nevada mountain range of California. We compare basin-mean precipitation derived from this approach with basin-mean precipitation from a precipitation gauge-based, 1/16° gridded dataset that has been used to simulate and evaluate trends in Western United States streamflow and snowpack over the 20th century. We find that the long-term average spatial patterns differ: in particular, there is less precipitation in the gridded dataset in higher-elevation basins whose aspect faces prevailing cool-season winds, as compared to precipitation inferred from streamflow. In a few years and basins, there is less gridded precipitation than there is observed streamflow. Lower-elevation, southern, and east-of-crest basins show better agreement between gridded and inferred precipitation. Implied actual evapotranspiration (calculated as precipitation minus streamflow) then also varies between the streamflow-based estimates and the gridded dataset. Absolute uncertainty in precipitation inferred from streamflow is substantial, but the signals of basin-to-basin and year-to-year differences are likely more robust. The findings suggest that considering streamflow when spatially distributing precipitation in complex terrain may improve its representation, particularly for basins whose orientations (e.g., windward-facing) are favored for orographic precipitation enhancement.

  1. Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer.

    PubMed

    Bernard, Guillaume; Chan, Cheong Xin; Ragan, Mark A

    2016-07-01

    Alignment-free (AF) approaches have recently been highlighted as alternatives to methods based on multiple sequence alignment in phylogenetic inference. However, the sensitivity of AF methods to genome-scale evolutionary scenarios is little known. Here, using simulated microbial genome data we systematically assess the sensitivity of nine AF methods to three important evolutionary scenarios: sequence divergence, lateral genetic transfer (LGT) and genome rearrangement. Among these, AF methods are most sensitive to the extent of sequence divergence, less sensitive to low and moderate frequencies of LGT, and most robust against genome rearrangement. We describe the application of AF methods to three well-studied empirical genome datasets, and introduce a new application of the jackknife to assess node support. Our results demonstrate that AF phylogenomics is computationally scalable to multi-genome data and can generate biologically meaningful phylogenies and insights into microbial evolution.
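
    AF methods generally replace alignment with comparisons of k-mer statistics. A toy sketch of one common flavour, a Euclidean distance between k-mer frequency profiles (the nine methods assessed in the paper differ in their exact statistics and distances):

    ```python
    from collections import Counter
    import math

    def kmer_profile(seq, k=4):
        # Relative frequencies of all overlapping k-mers in the sequence.
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        total = sum(counts.values())
        return {kmer: c / total for kmer, c in counts.items()}

    def af_distance(p, q):
        # Euclidean distance between two k-mer frequency profiles.
        keys = set(p) | set(q)
        return math.sqrt(sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in keys))

    # Toy genomes; real AF pipelines use much longer sequences.
    g1 = "ATGCGTACGTTAGCGGATCCATGCGT" * 20
    g2 = "ATGCGTACGATAGCGGATCCTTGCGT" * 20
    print(f"AF distance: {af_distance(kmer_profile(g1), kmer_profile(g2)):.4f}")
    ```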

  2. Component isolation for multi-component signal analysis using a non-parametric gaussian latent feature model

    NASA Astrophysics Data System (ADS)

    Yang, Yang; Peng, Zhike; Dong, Xingjian; Zhang, Wenming; Clifton, David A.

    2018-03-01

    A challenge in analysing non-stationary multi-component signals is to isolate nonlinearly time-varying components, especially when they overlap in the time-frequency plane. In this paper, a framework integrating time-frequency analysis-based demodulation and a non-parametric Gaussian latent feature model is proposed to isolate and recover components of such signals. The former aims to remove high-order frequency modulation (FM) so that the latter is able to infer demodulated components while simultaneously discovering the number of target components. The proposed method is effective in isolating multiple components that have the same FM behavior. In addition, the results show that the proposed method is superior to the generalised demodulation method with singular-value decomposition, the parametric time-frequency analysis method with filtering, and the empirical mode decomposition-based method in recovering the amplitude and phase of superimposed components.
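
    The demodulation half of the framework can be sketched in a few lines: multiplying the analytic signal by exp(-j*phi(t)) removes an estimated phase law phi(t), shifting the matching component to baseband where it is easily isolated. Here the phase law is assumed known, whereas the paper estimates it via time-frequency analysis:

    ```python
    import numpy as np
    from scipy.signal import hilbert

    fs = 1000.0
    t = np.arange(0, 2.0, 1 / fs)
    # A component with quadratic (nonlinear) frequency modulation.
    phase = 2 * np.pi * (50 * t + 20 * t**2)
    x = np.cos(phase) + 0.1 * np.random.default_rng(4).normal(size=t.size)

    # Demodulate with the (assumed known) phase law: the matching
    # component moves to ~0 Hz, where a crude low-pass isolates it.
    z = hilbert(x) * np.exp(-1j * phase)
    baseband = np.convolve(z, np.ones(101) / 101, mode="same")
    recovered = np.real(baseband * np.exp(1j * phase))  # re-modulate
    print(f"correlation with clean component: "
          f"{np.corrcoef(recovered, np.cos(phase))[0, 1]:.3f}")
    ```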

  3. Sufficient Forecasting Using Factor Models

    PubMed Central

    Fan, Jianqing; Xue, Lingzhou; Yao, Jiawei

    2017-01-01

    We consider forecasting a single time series when there is a large number of predictors and a possible nonlinear effect. The dimensionality is first reduced via a high-dimensional (approximate) factor model implemented by principal component analysis. Using the extracted factors, we develop a novel forecasting method called the sufficient forecasting, which provides a set of sufficient predictive indices, inferred from high-dimensional predictors, to deliver additional predictive power. The projected principal component analysis will be employed to enhance the accuracy of inferred factors when a semi-parametric (approximate) factor model is assumed. Our method is also applicable to cross-sectional sufficient regression using extracted factors. The connection between the sufficient forecasting and the deep learning architecture is explicitly stated. The sufficient forecasting correctly estimates projection indices of the underlying factors even in the presence of a nonparametric forecasting function. The proposed method extends the sufficient dimension reduction to high-dimensional regimes by condensing the cross-sectional information through factor models. We derive asymptotic properties for the estimate of the central subspace spanned by these projection directions as well as the estimates of the sufficient predictive indices. We further show that the natural method of running multiple regression of target on estimated factors yields a linear estimate that actually falls into this central subspace. Our method and theory allow the number of predictors to be larger than the number of observations. We finally demonstrate that the sufficient forecasting improves upon the linear forecasting in both simulation studies and an empirical study of forecasting macroeconomic variables. PMID:29731537
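
    The first stage of the pipeline, extracting factors by principal components and regressing the target on them, is easy to sketch on synthetic data (the sufficient-dimension-reduction refinement that follows it is not shown):

    ```python
    import numpy as np

    rng = np.random.default_rng(5)
    n, p, r = 200, 100, 3                      # observations, predictors, factors
    F = rng.normal(size=(n, r))                # latent factors
    L = rng.normal(size=(p, r))                # loadings
    X = F @ L.T + rng.normal(size=(n, p))      # high-dimensional predictors
    y = F @ np.array([1.0, -0.5, 0.8]) + 0.3 * rng.normal(size=n)

    # Factor extraction via PCA on standardized predictors.
    Xc = (X - X.mean(0)) / X.std(0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    F_hat = Xc @ Vt[:r].T                      # estimated factors

    # Baseline "linear forecasting": regress target on estimated factors.
    beta, *_ = np.linalg.lstsq(F_hat, y, rcond=None)
    resid = y - F_hat @ beta
    print(f"in-sample R^2: {1 - resid.var() / y.var():.3f}")
    ```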

  4. Light Scattering by Fractal Dust Aggregates. II. Opacity and Asymmetry Parameter

    NASA Astrophysics Data System (ADS)

    Tazaki, Ryo; Tanaka, Hidekazu

    2018-06-01

    Optical properties of dust aggregates are important at various astrophysical environments. To find a reliable approximation method for optical properties of dust aggregates, we calculate the opacity and the asymmetry parameter of dust aggregates by using a rigorous numerical method, the T-Matrix Method, and then the results are compared to those obtained by approximate methods: the Rayleigh–Gans–Debye (RGD) theory, the effective medium theory (EMT), and the distribution of hollow spheres method (DHS). First of all, we confirm that the RGD theory breaks down when multiple scattering is important. In addition, we find that both EMT and DHS fail to reproduce the optical properties of dust aggregates with fractal dimensions of 2 when the incident wavelength is shorter than the aggregate radius. In order to solve these problems, we test the mean field theory (MFT), where multiple scattering can be taken into account. We show that the extinction opacity of dust aggregates can be well reproduced by MFT. However, it is also shown that MFT is not able to reproduce the scattering and absorption opacities when multiple scattering is important. We successfully resolve this weak point of MFT, by newly developing a modified mean field theory (MMF). Hence, we conclude that MMF can be a useful tool to investigate radiative transfer properties of various astrophysical environments. We also point out an enhancement of the absorption opacity of dust aggregates in the Rayleigh domain, which would be important to explain the large millimeter-wave opacity inferred from observations of protoplanetary disks.

  5. Statistical testing and power analysis for brain-wide association study.

    PubMed

    Gong, Weikang; Wan, Lin; Lu, Wenlian; Ma, Liang; Cheng, Fan; Cheng, Wei; Grünewald, Stefan; Feng, Jianfeng

    2018-04-05

    The identification of connexel-wise associations, which involves examining functional connectivities between pairwise voxels across the whole brain, is both statistically and computationally challenging. Although such a connexel-wise methodology has recently been adopted by brain-wide association studies (BWAS) to identify connectivity changes in several mental disorders, such as schizophrenia, autism and depression, the multiple correction and power analysis methods designed specifically for connexel-wise analysis are still lacking. Therefore, we herein report the development of a rigorous statistical framework for connexel-wise significance testing based on the Gaussian random field theory. It includes controlling the family-wise error rate (FWER) of multiple hypothesis testings using topological inference methods, and calculating power and sample size for a connexel-wise study. Our theoretical framework can control the false-positive rate accurately, as validated empirically using two resting-state fMRI datasets. Compared with Bonferroni correction and false discovery rate (FDR), it can reduce false-positive rate and increase statistical power by appropriately utilizing the spatial information of fMRI data. Importantly, our method bypasses the need of non-parametric permutation to correct for multiple comparison, thus, it can efficiently tackle large datasets with high resolution fMRI images. The utility of our method is shown in a case-control study. Our approach can identify altered functional connectivities in a major depression disorder dataset, whereas existing methods fail. A software package is available at https://github.com/weikanggong/BWAS. Copyright © 2018 Elsevier B.V. All rights reserved.
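
    The paper's Gaussian-random-field machinery is not reproduced here, but the baseline corrections it is compared against are one line each with statsmodels; a sketch on synthetic connexel-wise p-values:

    ```python
    import numpy as np
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(6)
    # 10,000 hypothetical connexel-wise p-values: mostly null, a few signals.
    p = np.concatenate([rng.uniform(size=9950),
                        rng.uniform(0, 1e-5, size=50)])

    for method, label in [("bonferroni", "Bonferroni (FWER)"),
                          ("fdr_bh", "Benjamini-Hochberg (FDR)")]:
        reject, _, _, _ = multipletests(p, alpha=0.05, method=method)
        print(f"{label}: {reject.sum()} discoveries")
    ```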

  6. A review of causal inference for biomedical informatics

    PubMed Central

    Kleinberg, Samantha; Hripcsak, George

    2011-01-01

    Causality is an important concept throughout the health sciences and is particularly vital for informatics work such as finding adverse drug events or risk factors for disease using electronic health records. While philosophers and scientists working for centuries on formalizing what makes something a cause have not reached a consensus, new methods for inference show that we can make progress in this area in many practical cases. This article reviews core concepts in understanding and identifying causality and then reviews current computational methods for inference and explanation, focusing on inference from large-scale observational data. While the problem is not fully solved, we show that graphical models and Granger causality provide useful frameworks for inference and that a more recent approach based on temporal logic addresses some of the limitations of these methods. PMID:21782035
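
    Of the frameworks reviewed, Granger causality is the most direct to demonstrate: a variable x "Granger-causes" y if past values of x improve prediction of y beyond y's own past. A toy sketch with statsmodels (the coefficients and series are invented):

    ```python
    import numpy as np
    from statsmodels.tsa.stattools import grangercausalitytests

    rng = np.random.default_rng(7)
    n = 500
    x = np.zeros(n)
    y = np.zeros(n)
    for t in range(1, n):
        x[t] = 0.6 * x[t - 1] + rng.normal()
        y[t] = 0.5 * y[t - 1] + 0.4 * x[t - 1] + rng.normal()  # x drives y

    # Does x Granger-cause y?  Columns: [effect, putative cause].
    res = grangercausalitytests(np.column_stack([y, x]), maxlag=2, verbose=False)
    p_value = res[1][0]["ssr_ftest"][1]
    print(f"p-value (lag 1): {p_value:.2e}")
    ```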

  7. Integrative approach for inference of gene regulatory networks using lasso-based random featuring and application to psychiatric disorders.

    PubMed

    Kim, Dongchul; Kang, Mingon; Biswas, Ashis; Liu, Chunyu; Gao, Jean

    2016-08-10

    Inferring gene regulatory networks is one of the most interesting research areas in systems biology. Many inference methods have been developed using a variety of computational models and approaches. However, there are two issues to solve. First, depending on the structural or computational model of an inference method, the results tend to be inconsistent because of the innately different advantages and limitations of the methods, so the combination of dissimilar approaches is needed as a way to overcome the limitations of standalone methods through complementary integration. Second, sparse linear regression penalized by a regularization parameter (lasso) and bootstrapping-based sparse linear regression have been suggested as state-of-the-art methods for network inference, but they are not effective for small-sample data, and a true regulator can be missed if the target gene is strongly affected by an indirect regulator with high correlation or by another true regulator. We present two novel network inference methods based on the integration of three different criteria: (i) z-score, to measure the variation of gene expression from knockout data; (ii) mutual information, for the dependency between two genes; and (iii) linear regression-based feature selection. Based on these criteria, we propose a lasso-based random feature selection algorithm (LARF) to achieve better performance and overcome the limitations of bootstrapping mentioned above. In this work, there are three main contributions. First, our z-score-based method to measure gene expression variations from knockout data is more effective than similar criteria in related works. Second, we confirmed that true regulator selection can be effectively improved by LARF. Lastly, we verified that an integrative approach can clearly outperform a single method when two different methods are effectively joined. In the experiments, our methods were validated by outperforming the state-of-the-art methods on DREAM challenge data, and LARF was then applied to inference of gene regulatory networks associated with psychiatric disorders.
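
    In the spirit of LARF (not the published algorithm itself), regulators can be scored by how often the lasso selects them across random feature subsets, which mitigates the masking of true regulators by correlated indirect ones. A sketch on synthetic data:

    ```python
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(8)
    n_samples, n_genes = 50, 40
    X = rng.normal(size=(n_samples, n_genes))   # candidate regulators
    # Target gene driven by regulators 3 and 17 (hypothetical).
    y = 1.5 * X[:, 3] - 1.0 * X[:, 17] + 0.3 * rng.normal(size=n_samples)

    scores = np.zeros(n_genes)
    for _ in range(200):
        subset = rng.choice(n_genes, size=15, replace=False)  # random featuring
        coef = Lasso(alpha=0.1).fit(X[:, subset], y).coef_
        scores[subset] += (np.abs(coef) > 1e-8)  # count lasso selections

    print("top-ranked regulators:", np.argsort(scores)[::-1][:5])
    ```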

  8. Bayesian Inference for Functional Dynamics Exploring in fMRI Data.

    PubMed

    Guo, Xuan; Liu, Bing; Chen, Le; Chen, Guantao; Pan, Yi; Zhang, Jing

    2016-01-01

    This paper aims to review state-of-the-art Bayesian-inference-based methods applied to functional magnetic resonance imaging (fMRI) data. Particularly, we focus on one specific long-standing challenge in the computational modeling of fMRI datasets: how to effectively explore typical functional interactions from fMRI time series and the corresponding boundaries of temporal segments. Bayesian inference is a method of statistical inference which has been shown to be a powerful tool to encode dependence relationships among the variables with uncertainty. Here we provide an introduction to a group of Bayesian-inference-based methods for fMRI data analysis, which were designed to detect magnitude or functional connectivity change points and to infer their functional interaction patterns based on corresponding temporal boundaries. We also provide a comparison of three popular Bayesian models, that is, Bayesian Magnitude Change Point Model (BMCPM), Bayesian Connectivity Change Point Model (BCCPM), and Dynamic Bayesian Variable Partition Model (DBVPM), and give a summary of their applications. We envision that more delicate Bayesian inference models will be emerging and play increasingly important roles in modeling brain functions in the years to come.

  9. Updating Parameters for Volcanic Hazard Assessment Using Multi-parameter Monitoring Data Streams And Bayesian Belief Networks

    NASA Astrophysics Data System (ADS)

    Odbert, Henry; Aspinall, Willy

    2014-05-01

    Evidence-based hazard assessment at volcanoes assimilates knowledge about the physical processes of hazardous phenomena and observations that indicate the current state of a volcano. Incorporating both these lines of evidence can inform our belief about the likelihood (probability) and consequences (impact) of possible hazardous scenarios, forming a basis for formal quantitative hazard assessment. However, such evidence is often uncertain, indirect or incomplete. Approaches to volcano monitoring have advanced substantially in recent decades, increasing the variety and resolution of multi-parameter timeseries data recorded at volcanoes. Interpreting these multiple strands of parallel, partial evidence thus becomes increasingly complex. In practice, interpreting many timeseries requires an individual to be familiar with the idiosyncrasies of the volcano, monitoring techniques, configuration of recording instruments, observations from other datasets, and so on. In making such interpretations, an individual must consider how different volcanic processes may manifest as measurable observations, and then infer from the available data what can or cannot be deduced about those processes. We examine how parts of this process may be synthesised algorithmically using Bayesian inference. Bayesian Belief Networks (BBNs) use probability theory to treat and evaluate uncertainties in a rational and auditable scientific manner, but only to the extent warranted by the strength of the available evidence. The concept is a suitable framework for marshalling multiple strands of evidence (e.g. observations, model results and interpretations) and their associated uncertainties in a methodical manner. BBNs are usually implemented in graphical form and could be developed as a tool for near real-time, ongoing use in a volcano observatory, for example. We explore the application of BBNs in analysing volcanic data from the long-lived eruption at Soufriere Hills Volcano, Montserrat. We discuss the uncertainty of inferences, and how our method provides a route to formal propagation of uncertainties in hazard models. Such approaches provide an attractive route to developing an interface between volcano monitoring analyses and probabilistic hazard scenario analysis. We discuss the use of BBNs in hazard analysis as a tractable and traceable tool for fast, rational assimilation of complex, multi-parameter data sets in the context of timely volcanic crisis decision support.

  10. Robust inference in summary data Mendelian randomization via the zero modal pleiotropy assumption.

    PubMed

    Hartwig, Fernando Pires; Davey Smith, George; Bowden, Jack

    2017-12-01

    Mendelian randomization (MR) is being increasingly used to strengthen causal inference in observational studies. Availability of summary data of genetic associations for a variety of phenotypes from large genome-wide association studies (GWAS) allows straightforward application of MR using summary data methods, typically in a two-sample design. In addition to the conventional inverse variance weighting (IVW) method, recently developed summary data MR methods, such as the MR-Egger and weighted median approaches, allow a relaxation of the instrumental variable assumptions. Here, a new method - the mode-based estimate (MBE) - is proposed to obtain a single causal effect estimate from multiple genetic instruments. The MBE is consistent when the largest number of similar (identical in infinite samples) individual-instrument causal effect estimates comes from valid instruments, even if the majority of instruments are invalid. We evaluate the performance of the method in simulations designed to mimic the two-sample summary data setting, and demonstrate its use by investigating the causal effect of plasma lipid fractions and urate levels on coronary heart disease risk. The MBE presented less bias and lower type-I error rates than other methods under the null in many situations. Its power to detect a causal effect was smaller compared with the IVW and weighted median methods, but was larger than that of MR-Egger regression, with sample size requirements typically smaller than those available from GWAS consortia. The MBE relaxes the instrumental variable assumptions, and should be used in combination with other approaches in sensitivity analyses. © The Author 2017. Published by Oxford University Press on behalf of the International Epidemiological Association
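
    A simplified, unweighted sketch of the MBE idea: take the mode of the smoothed distribution of per-variant ratio estimates, so that a minority of pleiotropic instruments does not drag the estimate the way a mean would (the published method uses inverse-variance weighting and specific bandwidth rules):

    ```python
    import numpy as np
    from scipy.stats import gaussian_kde

    rng = np.random.default_rng(9)
    # Hypothetical per-variant Wald ratio estimates: most instruments are
    # valid (true effect 0.3), a minority are pleiotropic and biased.
    ratios = np.concatenate([rng.normal(0.3, 0.05, size=12),
                             rng.normal(0.8, 0.10, size=8)])

    # Mode-based estimate: the mode of the smoothed ratio distribution.
    kde = gaussian_kde(ratios)
    grid = np.linspace(ratios.min(), ratios.max(), 2000)
    mbe = grid[np.argmax(kde(grid))]
    print(f"MBE: {mbe:.3f}  (IVW-style mean: {ratios.mean():.3f})")
    ```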

  11. Multiple imputation for multivariate data with missing and below-threshold measurements: time-series concentrations of pollutants in the Arctic.

    PubMed

    Hopke, P K; Liu, C; Rubin, D B

    2001-03-01

    Many chemical and environmental data sets are complicated by the existence of fully missing values or censored values known to lie below detection thresholds. For example, week-long samples of airborne particulate matter were obtained at Alert, NWT, Canada, between 1980 and 1991, where some of the concentrations of 24 particulate constituents were coarsened in the sense of being either fully missing or below detection limits. To facilitate scientific analysis, it is appealing to create complete data by filling in missing values so that standard complete-data methods can be applied. We briefly review commonly used strategies for handling missing values and focus on the multiple-imputation approach, which generally leads to valid inferences when faced with missing data. Three statistical models are developed for multiply imputing the missing values of airborne particulate matter. We expect that these models are useful for creating multiple imputations in a variety of incomplete multivariate time series data sets.
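
    The censoring model behind such imputations is easy to sketch: values below the detection limit are drawn from the assumed distribution truncated above at the LoD. The single naive pass below fixes the parameters at crude estimates; a full multiple-imputation analysis would draw them from their posterior and repeat m times, pooling results with Rubin's rules:

    ```python
    import numpy as np
    from scipy.stats import truncnorm

    rng = np.random.default_rng(10)
    # Hypothetical log-concentrations with a detection limit.
    mu, sigma, lod = 1.0, 0.8, 0.5
    x = rng.normal(mu, sigma, size=300)
    observed = np.where(x >= lod, x, np.nan)  # below-LoD values censored

    # One imputation pass: draw censored values from the normal
    # distribution truncated above at the LoD (parameters estimated
    # naively from the observed data).
    m_hat, s_hat = np.nanmean(observed), np.nanstd(observed)
    n_cens = int(np.isnan(observed).sum())
    b = (lod - m_hat) / s_hat
    draws = truncnorm.rvs(-np.inf, b, loc=m_hat, scale=s_hat,
                          size=n_cens, random_state=11)

    imputed = observed.copy()
    imputed[np.isnan(imputed)] = draws
    print(f"mean before/after imputation: {m_hat:.3f} / {imputed.mean():.3f}")
    ```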

  12. Network Analysis of Rodent Transcriptomes in Spaceflight

    NASA Technical Reports Server (NTRS)

    Ramachandran, Maya; Fogle, Homer; Costes, Sylvain

    2017-01-01

    Network analysis methods leverage prior knowledge of cellular systems and the statistical and conceptual relationships between analyte measurements to determine gene connectivity. Correlation and conditional metrics are used to infer a network topology and provide a systems-level context for cellular responses. Integration across multiple experimental conditions and omics domains can reveal the regulatory mechanisms that underlie gene expression. GeneLab has assembled rich multi-omic (transcriptomics, proteomics, epigenomics, and epitranscriptomics) datasets for multiple murine tissues from the Rodent Research 1 (RR-1) experiment. RR-1 assesses the impact of 37 days of spaceflight on gene expression across a variety of tissue types, such as adrenal glands, quadriceps, gastrocnemius, tibialis anterior, extensor digitorum longus, soleus, eye, and kidney. Network analysis is particularly useful for RR-1 omics datasets because it reinforces subtle relationships that may be overlooked in isolated analyses and subdues confounding factors. Our objective is to use network analysis to determine potential target nodes for therapeutic intervention and identify similarities with existing disease models. Multiple network algorithms are used for a higher confidence consensus.

  13. Bayesian Estimation of Pneumonia Etiology: Epidemiologic Considerations and Applications to the Pneumonia Etiology Research for Child Health Study

    PubMed Central

    Fu, Wei; Shi, Qiyuan; Prosperi, Christine; Wu, Zhenke; Hammitt, Laura L.; Feikin, Daniel R.; Baggett, Henry C.; Howie, Stephen R.C.; Scott, J. Anthony G.; Murdoch, David R.; Madhi, Shabir A.; Thea, Donald M.; Brooks, W. Abdullah; Kotloff, Karen L.; Li, Mengying; Park, Daniel E.; Lin, Wenyi; Levine, Orin S.; O’Brien, Katherine L.; Zeger, Scott L.

    2017-01-01

    In pneumonia, specimens are rarely obtained directly from the infection site, the lung, so the pathogen causing infection is determined indirectly from multiple tests on peripheral clinical specimens. Because these tests may have imperfect and uncertain sensitivity and specificity, inference about the cause is complex. Analytic approaches have included expert review of case-only results, case–control logistic regression, latent class analysis, and attributable fraction, but each has serious limitations and none naturally integrates multiple test results. The Pneumonia Etiology Research for Child Health (PERCH) study required an analytic solution appropriate for a case–control design that could incorporate evidence from multiple specimens from cases and controls and that accounted for measurement error. We describe a Bayesian integrated approach we developed that combined and extended elements of attributable fraction and latent class analyses to meet some of these challenges and illustrate the advantage it confers regarding the challenges identified for other methods. PMID:28575370

  14. PyPanda: a Python package for gene regulatory network reconstruction

    PubMed Central

    van IJzendoorn, David G.P.; Glass, Kimberly; Quackenbush, John; Kuijjer, Marieke L.

    2016-01-01

    Summary: PANDA (Passing Attributes between Networks for Data Assimilation) is a gene regulatory network inference method that uses message-passing to integrate multiple sources of ‘omics data. PANDA was originally coded in C++. In this application note we describe PyPanda, the Python version of PANDA. PyPanda runs considerably faster than the C++ version and includes additional features for network analysis. Availability and implementation: The open source PyPanda Python package is freely available at http://github.com/davidvi/pypanda. Contact: mkuijjer@jimmy.harvard.edu or d.g.p.van_ijzendoorn@lumc.nl PMID:27402905

  15. PyPanda: a Python package for gene regulatory network reconstruction.

    PubMed

    van IJzendoorn, David G P; Glass, Kimberly; Quackenbush, John; Kuijjer, Marieke L

    2016-11-01

    PANDA (Passing Attributes between Networks for Data Assimilation) is a gene regulatory network inference method that uses message-passing to integrate multiple sources of 'omics data. PANDA was originally coded in C++. In this application note we describe PyPanda, the Python version of PANDA. PyPanda runs considerably faster than the C++ version and includes additional features for network analysis. The open source PyPanda Python package is freely available at http://github.com/davidvi/pypanda. Contact: mkuijjer@jimmy.harvard.edu or d.g.p.van_ijzendoorn@lumc.nl. © The Author 2016. Published by Oxford University Press.

  16. Intelligence's likelihood and evolutionary time frame

    NASA Astrophysics Data System (ADS)

    Bogonovich, Marc

    2011-04-01

    This paper outlines hypotheses relevant to the evolution of intelligent life and encephalization in the Phanerozoic. If general principles are inferable from patterns of Earth life, implications could be drawn for astrobiology. Many of the outlined hypotheses, relevant data, and associated evolutionary and ecological theory are not frequently cited in astrobiological journals. Thus opportunity exists to evaluate reviewed hypotheses with an astrobiological perspective. A quantitative method is presented for testing one of the reviewed hypotheses (hypothesis i; the diffusion hypothesis). Questions are presented throughout, which illustrate that the question of intelligent life's likelihood can be expressed as multiple, broadly ranging, more tractable questions.

  17. Assessment of User Home Location Geoinference Methods

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Harrison, Joshua J.; Bell, Eric B.; Corley, Courtney D.

    2015-05-29

    This study presents an assessment of multiple approaches to determine the home and/or other important locations to a Twitter user. In this study, we present a unique approach to the problem of geotagged data sparsity in social media when performing geoinferencing tasks. Given the sparsity of explicitly geotagged Twitter data, the ability to perform accurate and reliable user geolocation from a limited number of geotagged posts has proven to be quite useful. In our survey, we have achieved accuracy rates of over 86% in matching Twitter user profile locations with their inferred home locations derived from geotagged posts.
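
    One simple baseline among the approaches assessed is to snap a user's geotagged posts to a coarse grid and take the modal cell as the inferred home; a sketch with invented coordinates:

    ```python
    from collections import Counter

    # Hypothetical geotagged posts (lat, lon) from one Twitter user.
    posts = [(40.7130, -74.0060), (40.7128, -74.0059), (40.7129, -74.0061),
             (34.0522, -118.2437), (40.7131, -74.0058)]

    # Snap posts to a ~1 km grid; the modal cell is the inferred home.
    def grid_cell(lat, lon, res=0.01):
        return (round(lat / res) * res, round(lon / res) * res)

    cells = Counter(grid_cell(lat, lon) for lat, lon in posts)
    home, count = cells.most_common(1)[0]
    print(f"inferred home cell: {home} ({count}/{len(posts)} posts)")
    ```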

  18. Assessing NARCCAP climate model effects using spatial confidence regions

    PubMed Central

    French, Joshua P.; McGinnis, Seth; Schwartzman, Armin

    2017-01-01

    We assess similarities and differences between model effects for the North American Regional Climate Change Assessment Program (NARCCAP) climate models using varying classes of linear regression models. Specifically, we consider how the average temperature effect differs for the various global and regional climate model combinations, including assessment of possible interaction between the effects of global and regional climate models. We use both pointwise and simultaneous inference procedures to identify regions where global and regional climate model effects differ. We also show conclusively that results from pointwise inference are misleading, and that accounting for multiple comparisons is important for making proper inference. PMID:28936474

  19. Prescription-induced jump distributions in multiplicative Poisson processes.

    PubMed

    Suweis, Samir; Porporato, Amilcare; Rinaldo, Andrea; Maritan, Amos

    2011-06-01

    Generalized Langevin equations (GLE) with multiplicative white Poisson noise pose the usual prescription dilemma leading to different evolution equations (master equations) for the probability distribution. Contrary to the case of multiplicative Gaussian white noise, the Stratonovich prescription does not correspond to the well-known midpoint (or any other intermediate) prescription. By introducing an inertial term in the GLE, we show that the Itô and Stratonovich prescriptions naturally arise depending on two time scales, one induced by the inertial term and the other determined by the jump event. We also show that, when the multiplicative noise is linear in the random variable, one prescription can be made equivalent to the other by a suitable transformation in the jump probability distribution. We apply these results to a recently proposed stochastic model describing the dynamics of primary soil salinization, in which the salt mass balance within the soil root zone requires the analysis of different prescriptions arising from the resulting stochastic differential equation forced by multiplicative white Poisson noise, whose features are tailored to the character of the daily precipitation. Finally, a method is suggested to infer the most appropriate prescription from the data.

  20. Prescription-induced jump distributions in multiplicative Poisson processes

    NASA Astrophysics Data System (ADS)

    Suweis, Samir; Porporato, Amilcare; Rinaldo, Andrea; Maritan, Amos

    2011-06-01

    Generalized Langevin equations (GLE) with multiplicative white Poisson noise pose the usual prescription dilemma leading to different evolution equations (master equations) for the probability distribution. Contrary to the case of multiplicative Gaussian white noise, the Stratonovich prescription does not correspond to the well-known midpoint (or any other intermediate) prescription. By introducing an inertial term in the GLE, we show that the Itô and Stratonovich prescriptions naturally arise depending on two time scales, one induced by the inertial term and the other determined by the jump event. We also show that, when the multiplicative noise is linear in the random variable, one prescription can be made equivalent to the other by a suitable transformation in the jump probability distribution. We apply these results to a recently proposed stochastic model describing the dynamics of primary soil salinization, in which the salt mass balance within the soil root zone requires the analysis of different prescriptions arising from the resulting stochastic differential equation forced by multiplicative white Poisson noise, whose features are tailored to the character of the daily precipitation. Finally, a method is suggested to infer the most appropriate prescription from the data.
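
    As a toy illustration of why the prescription matters, the sketch below simulates a linear-multiplicative jump process under two update rules: an Itô-style rule that evaluates the noise at the pre-jump state, and a midpoint-style rule that solves x' = x + h(x + x')/2 at each jump. All parameter values are hypothetical and the scheme is a deliberate simplification, not the salinization model of the paper.

        import numpy as np

        rng = np.random.default_rng(0)

        def simulate(prescription, k=1.0, b=0.5, lam=2.0, h=0.3,
                     dt=1e-3, n_steps=100_000):
            # Euler scheme for dx/dt = -k*x + b + x*xi(t), where xi(t) is white
            # Poisson noise of rate lam and jump amplitude h. The prescription
            # decides at which state the multiplicative factor x is evaluated.
            x = 1.0
            out = np.empty(n_steps)
            for i in range(n_steps):
                x += (-k * x + b) * dt               # deterministic drift
                if rng.random() < lam * dt:          # a jump event fires
                    if prescription == "ito":
                        x += h * x                   # noise evaluated pre-jump
                    else:                            # midpoint: x' = x + h*(x + x')/2
                        x = x * (1 + h / 2) / (1 - h / 2)
                out[i] = x
            return out[n_steps // 2:]                # discard the transient

        for p in ("ito", "midpoint"):
            s = simulate(p)
            print(p, "mean =", round(s.mean(), 3), "std =", round(s.std(), 3))

    The two rules settle to visibly different stationary statistics, which is the prescription dependence the abstract describes.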

  1. Detection of multiple perturbations in multi-omics biological networks.

    PubMed

    Griffin, Paula J; Zhang, Yuqing; Johnson, William Evan; Kolaczyk, Eric D

    2018-05-17

    Cellular mechanism-of-action is of fundamental concern in many biological studies. It is of particular interest for identifying the cause of disease and learning the way in which treatments act against disease. However, pinpointing such mechanisms is difficult, because small perturbations to the cell can have wide-ranging downstream effects. Given a snapshot of cellular activity, it can be challenging to tell where a disturbance originated. The presence of an ever-greater variety of high-throughput biological data offers an opportunity to examine cellular behavior from multiple angles, but also presents the statistical challenge of how to effectively analyze data from multiple sources. In this setting, we propose a method for mechanism-of-action inference by extending network filtering to multi-attribute data. We first estimate a joint Gaussian graphical model across multiple data types using penalized regression and filter for network effects. We then apply a set of likelihood ratio tests to identify the most likely site of the original perturbation. In addition, we propose a conditional testing procedure to allow for detection of multiple perturbations. We demonstrate this methodology on paired gene expression and methylation data from The Cancer Genome Atlas (TCGA). © 2018, The International Biometric Society.
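
    The following is a minimal sketch of the network-filtering idea on synthetic data: estimate a sparse precision matrix on control samples with the graphical lasso, replace each node by its residual given its neighbours, and score nodes by the shift of the filtered signal. It stands in for, and is much cruder than, the authors' likelihood-ratio procedure; the planted perturbation and all dimensions are hypothetical.

        import numpy as np
        from sklearn.covariance import GraphicalLassoCV

        rng = np.random.default_rng(1)

        # Hypothetical data: rows are samples, columns are features (e.g. the
        # expression and methylation values of a set of genes, stacked).
        control = rng.normal(size=(300, 10))
        case = rng.normal(size=(60, 10))
        case[:, 3] += 1.5                     # a planted perturbation at node 3

        glasso = GraphicalLassoCV().fit(control)
        theta = glasso.precision_             # sparse joint precision matrix

        # Network filtering: replace each node by its residual given its
        # neighbours, (Theta x)_j / Theta_jj, which attenuates the downstream
        # ripples of a perturbation and leaves the origin standing out.
        d = np.diag(theta)
        f_case = (case @ theta) / d
        f_ctrl = (control @ theta) / d

        # Score each node by the standardized shift of its filtered signal;
        # the top-scoring node is the candidate perturbation site.
        z = (f_case.mean(0) - f_ctrl.mean(0)) / f_ctrl.std(0)
        print("candidate perturbation site:", int(np.argmax(np.abs(z))))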

  2. Topology, divergence dates, and macroevolutionary inferences vary between different tip-dating approaches applied to fossil theropods (Dinosauria).

    PubMed

    Bapst, D W; Wright, A M; Matzke, N J; Lloyd, G T

    2016-07-01

    Dated phylogenies of fossil taxa allow palaeobiologists to estimate the timing of major divergences and placement of extinct lineages, and to test macroevolutionary hypotheses. Recently developed Bayesian 'tip-dating' methods simultaneously infer and date the branching relationships among fossil taxa, and infer putative ancestral relationships. Using a previously published dataset for extinct theropod dinosaurs, we contrast the dated relationships inferred by several tip-dating approaches and evaluate potential downstream effects on phylogenetic comparative methods. We also compare tip-dating analyses to maximum-parsimony trees time-scaled via alternative a posteriori approaches including via the probabilistic cal3 method. Among tip-dating analyses, we find opposing but strongly supported relationships, despite similarity in inferred ancestors. Overall, tip-dating methods infer divergence dates often millions (or tens of millions) of years older than the earliest stratigraphic appearance of that clade. Model-comparison analyses of the pattern of body-size evolution found that the support for evolutionary mode can vary across and between tree samples from cal3 and tip-dating approaches. These differences suggest that model and software choice in dating analyses can have a substantial impact on the dated phylogenies obtained and broader evolutionary inferences. © 2016 The Author(s).

  3. A new theory of phylogeny inference through construction of multidimensional vector space.

    PubMed

    Kitazoe, Y; Kurihara, Y; Narita, Y; Okuhara, Y; Tominaga, A; Suzuki, T

    2001-05-01

    Here, a new theory of molecular phylogeny is developed in a multidimensional vector space (MVS). The molecular evolution is represented as a successive splitting of branch vectors in the MVS. The end points of these vectors are the extant species and indicate the specific directions reflected by their individual histories of evolution in the past. This representation makes it possible to infer the phylogeny (evolutionary histories) from the spatial positions of the end points. Search vectors are introduced to draw out the groups of species distributed around them. These groups are classified according to the nearby order of branches with them. A law of physics is applied to determine the species positions in the MVS. The species are regarded as the particles moving in time according to the equation of motion, finally falling into the lowest-energy state in spite of their randomly distributed initial condition. This falling into the ground state results in the construction of an MVS in which the relative distances between two particles are equal to the substitution distances. The species positions are obtained prior to the phylogeny inference. Therefore, as the number of species increases, the species vectors can be more specific in an MVS of a larger size, such that the vector analysis gives a more stable and reliable topology. The efficacy of the present method was examined by using computer simulations of molecular evolution in which all the branch- and end-point sequences of the trees are known in advance. In phylogeny inference from the end points of 100 replicate data sets, the present method consistently reconstructed the correct topologies, in contrast to standard methods. In an application to alpha-hemoglobin sequences from 185 vertebrates, the vector analysis drew out the two lineage groups of birds and mammals. A core member of the mammalian radiation appeared at the base of the mammalian lineage. Squamates were isolated from the bird lineage to compose the outgroup, while the other living reptilians were directly coupled with birds without forming any sister groups. This result is in contrast to the morphological phylogeny and is also different from those of recent molecular analyses.
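
    The construction described (placing species so that pairwise Euclidean distances in the vector space reproduce substitution distances) is closely related to classical multidimensional scaling, which solves the same embedding problem in closed form. A minimal sketch with a hypothetical four-taxon distance matrix:

        import numpy as np

        def classical_mds(D, dim=3):
            # Embed points in `dim` dimensions so that Euclidean distances
            # approximate the entries of the distance matrix D.
            n = D.shape[0]
            J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
            B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
            w, V = np.linalg.eigh(B)
            idx = np.argsort(w)[::-1][:dim]           # top eigenpairs
            return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

        # Hypothetical substitution-distance matrix for four taxa.
        D = np.array([[0.0, 0.3, 0.7, 0.8],
                      [0.3, 0.0, 0.6, 0.7],
                      [0.7, 0.6, 0.0, 0.2],
                      [0.8, 0.7, 0.2, 0.0]])
        X = classical_mds(D, dim=2)
        print(np.round(X, 3))   # species positions; grouping them mimics the search-vector step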

  4. Connectivity inference from neural recording data: Challenges, mathematical bases and research directions.

    PubMed

    Magrans de Abril, Ildefons; Yoshimoto, Junichiro; Doya, Kenji

    2018-06-01

    This article presents a review of computational methods for connectivity inference from neural activity data derived from multi-electrode recordings or fluorescence imaging. We first identify biophysical and technical challenges in connectivity inference along the data processing pipeline. We then review connectivity inference methods based on two major mathematical foundations, namely, descriptive model-free approaches and generative model-based approaches. We investigate representative studies in both categories and clarify which challenges have been addressed by which method. We further identify critical open issues and possible research directions. Copyright © 2018 The Author(s). Published by Elsevier Ltd. All rights reserved.

  5. An analysis of multi-type relational interactions in FMA using graph motifs with disjointness constraints.

    PubMed

    Zhang, Guo-Qiang; Luo, Lingyun; Ogbuji, Chime; Joslyn, Cliff; Mejino, Jose; Sahoo, Satya S

    2012-01-01

    The interaction of multiple types of relationships among anatomical classes in the Foundational Model of Anatomy (FMA) can provide inferred information valuable for quality assurance. This paper introduces a method called Motif Checking (MOCH) to study the effects of such multi-relation type interactions for detecting logical inconsistencies as well as other anomalies represented by the motifs. MOCH represents patterns of multi-type interaction as small labeled (with multiple types of edges) sub-graph motifs, whose nodes represent class variables, and labeled edges represent relational types. By representing FMA as an RDF graph and motifs as SPARQL queries, fragments of FMA are automatically obtained as auditing candidates. Leveraging the scalability and reconfigurability of Semantic Web Technology, we performed exhaustive analyses of a variety of labeled sub-graph motifs. The quality assurance feature of MOCH comes from the distinct use of a subset of the edges of the graph motifs as constraints for disjointness, thereby bringing a rule-based flavor to the approach as well. With possible disjointness implied by antonyms, we performed manual inspection of the resulting FMA fragments and tracked down sources of abnormal inferred conclusions (logical inconsistencies), which are amenable to programmatic revision of the FMA. Our results demonstrate that MOCH provides a unique source of valuable information for quality assurance. Since our approach is general, it is applicable to any ontological system with an OWL representation.

  6. An Analysis of Multi-type Relational Interactions in FMA Using Graph Motifs with Disjointness Constraints

    PubMed Central

    Zhang, Guo-Qiang; Luo, Lingyun; Ogbuji, Chime; Joslyn, Cliff; Mejino, Jose; Sahoo, Satya S

    2012-01-01

    The interaction of multiple types of relationships among anatomical classes in the Foundational Model of Anatomy (FMA) can provide inferred information valuable for quality assurance. This paper introduces a method called Motif Checking (MOCH) to study the effects of such multi-relation type interactions for detecting logical inconsistencies as well as other anomalies represented by the motifs. MOCH represents patterns of multi-type interaction as small labeled (with multiple types of edges) sub-graph motifs, whose nodes represent class variables, and labeled edges represent relational types. By representing FMA as an RDF graph and motifs as SPARQL queries, fragments of FMA are automatically obtained as auditing candidates. Leveraging the scalability and reconfigurability of Semantic Web Technology, we performed exhaustive analyses of a variety of labeled sub-graph motifs. The quality assurance feature of MOCH comes from the distinct use of a subset of the edges of the graph motifs as constraints for disjointness, thereby bringing a rule-based flavor to the approach as well. With possible disjointness implied by antonyms, we performed manual inspection of the resulting FMA fragments and tracked down sources of abnormal inferred conclusions (logical inconsistencies), which are amenable to programmatic revision of the FMA. Our results demonstrate that MOCH provides a unique source of valuable information for quality assurance. Since our approach is general, it is applicable to any ontological system with an OWL representation. PMID:23304382
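
    A schematic of the MOCH idea using rdflib: a two-edge motif expressed as a SPARQL query retrieves graph fragments in which two relation types interact, and each match becomes an auditing candidate if those types are taken as disjoint. The tiny graph and the predicate IRIs below are illustrative placeholders, not the actual FMA/OWL vocabulary.

        import rdflib

        # Hypothetical FMA-like fragment in Turtle; predicate names are
        # illustrative only.
        g = rdflib.Graph()
        g.parse(data="""
        @prefix fma: <http://example.org/fma#> .
        fma:LeftLung  fma:part_of   fma:RespiratorySystem .
        fma:LeftLung  fma:branch_of fma:RightLung .
        fma:RightLung fma:part_of   fma:RespiratorySystem .
        """, format="turtle")

        # A two-edge motif: ?x and ?y share a part_of parent AND are linked
        # by branch_of; if the two relation types are meant to be disjoint
        # here, every match is an auditing candidate.
        motif = """
        PREFIX fma: <http://example.org/fma#>
        SELECT ?x ?y WHERE {
            ?x fma:part_of   ?p .
            ?y fma:part_of   ?p .
            ?x fma:branch_of ?y .
        }
        """
        for row in g.query(motif):
            print("audit candidate:", row.x, row.y)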

  7. Is the extremely rare Iberian endemic plant species Castrilanthemum debeauxii (Compositae, Anthemideae) a 'living fossil'? Evidence from a multi-locus species tree reconstruction.

    PubMed

    Tomasello, Salvatore; Álvarez, Inés; Vargas, Pablo; Oberprieler, Christoph

    2015-01-01

    The present study provides results of multi-species coalescent species tree analyses of DNA sequences sampled from multiple nuclear and plastid regions to infer the phylogenetic relationships among the members of the subtribe Leucanthemopsidinae (Compositae, Anthemideae), to which, besides the annual Castrilanthemum debeauxii (Degen, Hervier & É.Rev.) Vogt & Oberp., one of the rarest flowering plant species of the Iberian Peninsula, two other unispecific genera (Hymenostemma, Prolongoa) and the polyploid complex of the genus Leucanthemopsis belong. Based on sequence information from two single- to low-copy nuclear regions (C16, D35, characterised by Chapman et al. (2007)), the multi-copy region of the nrDNA internal transcribed spacer regions ITS1 and ITS2, and two intergenic spacer regions of the cpDNA, gene trees were reconstructed using Bayesian inference methods. For the reconstruction of a multi-locus species tree we applied three different methods: (a) analysis of concatenated sequences using Bayesian inference (MrBayes), (b) a tree reconciliation approach by minimizing the number of deep coalescences (PhyloNet), and (c) a coalescent-based species-tree method in a Bayesian framework ((∗)BEAST). All three species tree reconstruction methods unequivocally support the close relationship of the subtribe with the hitherto unclassified genus Phalacrocarpum, the sister-group relationship of Castrilanthemum with the three remaining genera of the subtribe, and the further sister-group relationship of the clade of Hymenostemma+Prolongoa with a monophyletic genus Leucanthemopsis. Dating of the (∗)BEAST phylogeny supports the long-lasting (Early Miocene, 15-22 Ma) taxonomical independence and the switch from the plesiomorphic perennial to the apomorphic annual life-form assumed for the Castrilanthemum lineage that may have occurred not earlier than in the Pliocene (3 Ma), when the establishment of a Mediterranean climate with summer droughts triggered evolution towards annuality. Copyright © 2014 Elsevier Inc. All rights reserved.

  8. An adaptive sparse-grid high-order stochastic collocation method for Bayesian inference in groundwater reactive transport modeling

    NASA Astrophysics Data System (ADS)

    Zhang, Guannan; Lu, Dan; Ye, Ming; Gunzburger, Max; Webster, Clayton

    2013-10-01

    Bayesian analysis has become vital to uncertainty quantification in groundwater modeling, but its application has been hindered by the computational cost associated with the numerous model executions required to explore the posterior probability density function (PPDF) of model parameters. This is particularly the case when the PPDF is estimated using Markov Chain Monte Carlo (MCMC) sampling. In this study, a new approach is developed to improve the computational efficiency of Bayesian inference by constructing a surrogate of the PPDF, using an adaptive sparse-grid high-order stochastic collocation (aSG-hSC) method. Unlike previous works using a first-order hierarchical basis, this paper utilizes a compactly supported higher-order hierarchical basis to construct the surrogate system, resulting in a significant reduction in the number of required model executions. In addition, using the hierarchical surplus as an error indicator allows locally adaptive refinement of sparse grids in the parameter space, which further improves computational efficiency. To efficiently build the surrogate system for the PPDF with multiple significant modes, optimization techniques are used to identify the modes, for which high-probability regions are defined and components of the aSG-hSC approximation are constructed. After the surrogate is determined, the PPDF can be evaluated by sampling the surrogate system directly without model execution, resulting in improved efficiency of the surrogate-based MCMC compared with conventional MCMC. The developed method is evaluated using two synthetic groundwater reactive transport models. The first example involves coupled linear reactions and demonstrates the accuracy of our high-order hierarchical basis approach in approximating the high-dimensional posterior distribution. The second example is highly nonlinear because of the reactions of uranium surface complexation, and demonstrates how the iterative aSG-hSC method is able to capture multimodal and non-Gaussian features of the PPDF caused by model nonlinearity. Both experiments show that aSG-hSC is an effective and efficient tool for Bayesian inference.

  9. Design-based and model-based inference in surveys of freshwater mollusks

    USGS Publications Warehouse

    Dorazio, R.M.

    1999-01-01

    Well-known concepts in statistical inference and sampling theory are used to develop recommendations for planning and analyzing the results of quantitative surveys of freshwater mollusks. Two methods of inference commonly used in survey sampling (design-based and model-based) are described and illustrated using examples relevant in surveys of freshwater mollusks. The particular objectives of a survey and the type of information observed in each unit of sampling can be used to help select the sampling design and the method of inference. For example, the mean density of a sparsely distributed population of mollusks can be estimated with higher precision by using model-based inference or by using design-based inference with adaptive cluster sampling than by using design-based inference with conventional sampling. More experience with quantitative surveys of natural assemblages of freshwater mollusks is needed to determine the actual benefits of different sampling designs and inferential procedures.

  10. Inference of the sparse kinetic Ising model using the decimation method

    NASA Astrophysics Data System (ADS)

    Decelle, Aurélien; Zhang, Pan

    2015-05-01

    In this paper we study the inference of the kinetic Ising model on sparse graphs by the decimation method. The decimation method, which was first proposed in Decelle and Ricci-Tersenghi [Phys. Rev. Lett. 112, 070603 (2014), 10.1103/PhysRevLett.112.070603] for the static inverse Ising problem, tries to recover the topology of the inferred system by setting the weakest couplings to zero iteratively. During the decimation process the likelihood function is maximized over the remaining couplings. Unlike the ℓ1-optimization-based methods, the decimation method does not use the Laplace distribution as a heuristic choice of prior to select a sparse solution. In our case, the whole process can be done automatically without fixing any parameters by hand. We show that in the dynamical inference problem, where the task is to reconstruct the couplings of an Ising model given the data, the decimation process can be applied naturally within a maximum-likelihood optimization algorithm, as opposed to the static case where the pseudolikelihood method needs to be adopted. We also use extensive numerical studies to validate the accuracy of our methods in dynamical inference problems. Our results illustrate that, on various topologies and with different distributions of couplings, the decimation method outperforms the widely used ℓ1-optimization-based methods.
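
    A minimal sketch of decimation on a synthetic one-spin inference problem: fit the couplings of a kinetic Ising spin by maximum likelihood (a logistic-regression-type gradient ascent), then repeatedly zero the weakest remaining coupling and refit. The stopping rule here (a fixed number of survivors) is a simplification of the paper's criterion, and all sizes are hypothetical.

        import numpy as np

        rng = np.random.default_rng(0)

        # Synthetic kinetic Ising data: spins 1..n-1 flip at random, while spin 0
        # follows Glauber dynamics P(s_0(t+1) = +1) = sigmoid(2 * J_true . s(t)).
        n, T = 8, 20000
        J_true = np.zeros(n)
        J_true[1], J_true[4] = 0.8, -0.6          # sparse ground truth
        S = rng.choice([-1.0, 1.0], size=(T, n))
        for t in range(1, T):
            h = S[t - 1] @ J_true
            S[t, 0] = 1.0 if rng.random() < 1.0 / (1.0 + np.exp(-2.0 * h)) else -1.0

        def fit(S, mask, lr=0.2, n_iter=500):
            # Gradient ascent on the log-likelihood of spin 0's transitions,
            # with decimated couplings frozen at zero by the mask.
            X, y = S[:-1], (S[1:, 0] + 1) / 2
            J = np.zeros(n)
            for _ in range(n_iter):
                p = 1.0 / (1.0 + np.exp(-2.0 * X @ J))
                J += lr * 2.0 * X.T @ (y - p) / len(y)
                J *= mask
            return J

        # Decimation: repeatedly refit, then set the weakest coupling to zero.
        mask = np.ones(n)
        for _ in range(n - 2):
            J = fit(S, mask)
            alive = np.flatnonzero(mask)
            mask[alive[np.argmin(np.abs(J[alive]))]] = 0.0
        print("surviving couplings:", np.flatnonzero(mask))  # ideally indices 1 and 4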

  11. An algebra-based method for inferring gene regulatory networks

    PubMed Central

    2014-01-01

    Background: The inference of gene regulatory networks (GRNs) from experimental observations is at the heart of systems biology. This includes the inference of both the network topology and its dynamics. While there are many algorithms available to infer the network topology from experimental data, less emphasis has been placed on methods that infer network dynamics. Furthermore, since the network inference problem is typically underdetermined, it is essential to have the option of incorporating prior knowledge about the network into the inference process, along with an effective description of the search space of dynamic models. Finally, it is also important to have an understanding of how a given inference method is affected by experimental and other noise in the data used. Results: This paper contains a novel inference algorithm using the algebraic framework of Boolean polynomial dynamical systems (BPDS), meeting all these requirements. The algorithm takes as input time series data, including those from network perturbations, such as knock-out mutant strains and RNAi experiments. It allows for the incorporation of prior biological knowledge while being robust to significant levels of noise in the data used for inference. It uses an evolutionary algorithm for local optimization with an encoding of the mathematical models as BPDS. The BPDS framework allows an effective representation of the search space for algebraic dynamic models that improves computational performance. The algorithm is validated with both simulated and experimental microarray expression profile data. Robustness to noise is tested using a published mathematical model of the segment polarity gene network in Drosophila melanogaster. Benchmarking of the algorithm is done by comparison with a spectrum of state-of-the-art network inference methods on data from the synthetic IRMA network to demonstrate that our method has good precision and recall for the network reconstruction task, while also predicting several of the dynamic patterns present in the network. Conclusions: Boolean polynomial dynamical systems provide a powerful modeling framework for the reverse engineering of gene regulatory networks, that enables a rich mathematical structure on the model search space. A C++ implementation of the method, distributed under the LGPL license, is available, together with the source code, at http://www.paola-vera-licona.net/Software/EARevEng/REACT.html. PMID:24669835

  12. Toward Webscale, Rule-Based Inference on the Semantic Web Via Data Parallelism

    DTIC Science & Technology

    2013-02-01

    Another work distinct from its peers is the work on approximate reasoning by Rudolph et al. [34] in which multiple inference systems were combined not... Workshop Scalable Semantic Web Knowledge Base Systems, 2010, pp. 17–31. [34] S. Rudolph, T. Tserendorj, and P. Hitzler, “What is approximate reasoning... 2013] [55] M. Duerst and M. Suignard. (2005, Jan.). RFC 3987 – internationalized resource identifiers (IRIs). IETF. [Online]. Available: http

  13. Terminal-Area Aircraft Intent Inference Approach Based on Online Trajectory Clustering.

    PubMed

    Yang, Yang; Zhang, Jun; Cai, Kai-quan

    2015-01-01

    Terminal-area aircraft intent inference (T-AII) is a prerequisite to detecting and avoiding potential aircraft conflicts in the terminal airspace. T-AII challenges state-of-the-art AII approaches due to the uncertainties of the air traffic situation, in particular the undefined flight routes and frequent maneuvers. In this paper, a novel T-AII approach is introduced that addresses these limitations by solving the problem in two steps: intent modeling and intent inference. In the modeling step, an online trajectory clustering procedure is designed to recognize the routes available in real time, in place of the missing planned routes. In the inference step, we then present a probabilistic T-AII approach based on multiple flight attributes to improve the inference performance in maneuvering scenarios. The proposed approach is validated with 34 days of real radar trajectory and flight attribute data collected from the Chengdu terminal area in China. Preliminary results show the efficacy of the presented approach.

  14. The influence of category coherence on inference about cross-classified entities.

    PubMed

    Patalano, Andrea L; Wengrovitz, Steven M; Sharpes, Kirsten M

    2009-01-01

    A critical function of categories is their use in property inference (Heit, 2000). However, one challenge to using categories in inference is that most entities in the world belong to multiple categories (e.g., Fido could be a dog, a pet, a mammal, or a security system). Building on Patalano, Chin-Parker, and Ross (2006), we tested the hypothesis that category coherence (the extent to which category features go together in light of prior knowledge) influences the selection of categories for use in property inference about cross-classified entities. In two experiments, we directly contrasted coherent and incoherent categories, both of which included cross-classified entities as members, and we found that the coherent categories were used more readily as the source of both property transfer and property extension. We conclude that category coherence, which has been found to be a potent influence on strength of inference for singly classified entities (Rehder & Hastie, 2004), is also central to category use in reasoning about novel cross-classified ones.

  15. Unconscious relational inference recruits the hippocampus.

    PubMed

    Reber, Thomas P; Luechinger, Roger; Boesiger, Peter; Henke, Katharina

    2012-05-02

    Relational inference denotes the capacity to encode, flexibly retrieve, and integrate multiple memories to combine past experiences to update knowledge and improve decision-making in new situations. Although relational inference is thought to depend on the hippocampus and consciousness, we now show in young, healthy men that it may occur outside consciousness but still recruits the hippocampus. In temporally distinct and unique subliminal episodes, we presented word pairs that either overlapped ("winter-red", "red-computer") or not. Effects of unconscious relational inference emerged in reaction times recorded during unconscious encoding and in the outcome of decisions made 1 min later at test, when participants judged the semantic relatedness of two supraliminal words. These words were either episodically related through a common word ("winter-computer" related through "red") or unrelated. Hippocampal activity increased during the unconscious encoding of overlapping versus nonoverlapping word pairs and during the unconscious retrieval of episodically related versus unrelated words. Furthermore, hippocampal activity during unconscious encoding predicted the outcome of decisions made at test. Hence, unconscious inference may influence decision-making in new situations.

  16. Tidally induced variations in vertical and horizontal motion on Rutford Ice Stream, West Antarctica, inferred from remotely sensed observations

    NASA Astrophysics Data System (ADS)

    Minchew, B. M.; Simons, M.; Riel, B.; Milillo, P.

    2017-01-01

    To better understand the influence of stress changes over floating ice shelves on grounded ice streams, we develop a Bayesian method for inferring time-dependent 3-D surface velocity fields from synthetic aperture radar (SAR) and optical remote sensing data. Our specific goal is to observe ocean tide-induced variability in vertical ice shelf position and horizontal ice stream flow. Thus, we consider the special case where observed surface displacement at a given location can be defined by a 3-D secular velocity vector, a family of 3-D sinusoidal functions, and a correction to the digital elevation model used to process the SAR data. Using nearly 9 months of SAR data collected from multiple satellite viewing geometries with the COSMO-SkyMed 4-satellite constellation, we infer the spatiotemporal response of Rutford Ice Stream, West Antarctica, to ocean tidal forcing. Consistent with expected tidal uplift, inferred vertical motion over the ice shelf is dominated by semidiurnal and diurnal tidal constituents. Horizontal ice flow variability, on the other hand, occurs primarily at the fortnightly spring-neap tidal period (Msf). We propose that periodic grounding of the ice shelf is the primary mechanism for translating vertical tidal motion into horizontal flow variability, causing ice flow to accelerate first and most strongly over the ice shelf. Flow variations then propagate through the grounded ice stream at a mean rate of ~29 km/d and decay quasi-linearly with distance over ~85 km upstream of the grounding zone.
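
    The time-dependent decomposition described (a secular velocity plus sinusoids at known tidal periods) can be illustrated with an ordinary least-squares fit; the Bayesian machinery of the paper additionally propagates uncertainties and merges viewing geometries. The data, noise level, and driving constituent below are synthetic.

        import numpy as np

        rng = np.random.default_rng(2)

        # Hypothetical displacement time series at irregular times (days).
        t = np.sort(rng.uniform(0, 270, 300))
        periods = [0.5175, 1.0758, 14.765]   # M2, O1, Msf tidal periods in days
        d = 1.2 * t + 0.4 * np.sin(2 * np.pi * t / periods[2]) \
            + rng.normal(0, 0.05, t.size)    # secular flow + fortnightly modulation

        # Design matrix: intercept, secular term, and a sine/cosine pair per
        # tidal constituent.
        cols = [np.ones_like(t), t]
        for P in periods:
            cols += [np.sin(2 * np.pi * t / P), np.cos(2 * np.pi * t / P)]
        G = np.column_stack(cols)

        m, *_ = np.linalg.lstsq(G, d, rcond=None)
        for i, P in enumerate(periods):
            a, b = m[2 + 2 * i], m[3 + 2 * i]
            print(f"period {P:7.3f} d  amplitude {np.hypot(a, b):.3f}")
        print(f"secular rate {m[1]:.3f} (units/day)")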

  17. Inferring the mode of origin of polyploid species from next-generation sequence data.

    PubMed

    Roux, Camille; Pannell, John R

    2015-03-01

    Many eukaryote organisms are polyploid. However, despite their importance, evolutionary inference of polyploid origins and modes of inheritance has been limited by a need for analyses of allele segregation at multiple loci using crosses. The increasing availability of sequence data for nonmodel species now allows the application of established approaches for the analysis of genomic data in polyploids. Here, we ask whether approximate Bayesian computation (ABC), applied to realistic traditional and next-generation sequence data, allows correct inference of the evolutionary and demographic history of polyploids. Using simulations, we evaluate the robustness of evolutionary inference by ABC for tetraploid species as a function of the number of individuals and loci sampled, and the presence or absence of an outgroup. We find that ABC adequately retrieves the recent evolutionary history of polyploid species on the basis of both old and new sequencing technologies. The application of ABC to sequence data from diploid and polyploid species of the plant genus Capsella confirms its utility. Our analysis strongly supports an allopolyploid origin of C. bursa-pastoris about 80 000 years ago. This conclusion runs contrary to previous findings based on the same data set but using an alternative approach and is in agreement with recent findings based on whole-genome sequencing. Our results indicate that ABC is a promising and powerful method for revealing the evolution of polyploid species, without the need to attribute alleles to a homeologous chromosome pair. The approach can readily be extended to more complex scenarios involving higher ploidy levels. © 2015 John Wiley & Sons Ltd.
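
    A minimal sketch of rejection ABC, the simplest member of the family of methods evaluated here: draw parameters from the prior, simulate summary statistics, and keep the draws whose summaries land near the observed ones. The toy simulator below stands in for coalescent simulation of sequence data.

        import numpy as np

        rng = np.random.default_rng(3)

        def simulate(theta):
            # Toy simulator standing in for a coalescent simulation;
            # returns summary statistics of the simulated data.
            x = rng.normal(theta, 1.0, size=100)
            return np.array([x.mean(), x.std()])

        observed = np.array([0.8, 1.0])          # summaries of the observed data
        prior = rng.uniform(-5, 5, size=50_000)  # draws from the prior on theta

        # Rejection step: accept the parameter draws whose simulated summaries
        # fall closest to the observed summaries.
        dists = np.array([np.linalg.norm(simulate(th) - observed) for th in prior])
        eps = np.quantile(dists, 0.01)           # keep the closest 1%
        posterior = prior[dists <= eps]
        print(f"posterior mean {posterior.mean():.2f} "
              f"(95% interval {np.quantile(posterior, [0.025, 0.975]).round(2)})")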

  18. On the Inference of Functional Circadian Networks Using Granger Causality

    PubMed Central

    Pourzanjani, Arya; Herzog, Erik D.; Petzold, Linda R.

    2015-01-01

    Being able to infer one-way direct connections in an oscillatory network such as the suprachiasmatic nucleus (SCN) of the mammalian brain using time series data is difficult but crucial to understanding network dynamics. Although techniques have been developed for inferring networks from time series data, there have been no attempts to adapt these techniques to infer directional connections in oscillatory time series, while accurately distinguishing between direct and indirect connections. In this paper an adaptation of Granger Causality, called Adaptive Frequency Granger Causality (AFGC), is proposed that allows for inference of circadian networks and oscillatory networks in general. Additionally, an extension of this method, called LASSO AFGC, is proposed to infer networks with large numbers of cells. The method was validated using simulated data from several different networks. For the smaller networks the method was able to identify all one-way direct connections without identifying connections that were not present. For larger networks of up to twenty cells the method shows excellent performance in identifying true and false connections; this is quantified by an area under the curve (AUC) of 96.88%. We note that this method, like other Granger Causality-based methods, is based on the detection of high frequency signals propagating between cell traces. Thus it requires a relatively high sampling rate and a network that can propagate high frequency signals. PMID:26413748
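
    For reference, a bare-bones (non-adaptive) Granger causality test of the kind AFGC builds on: compare an autoregression of y on its own lags against the same model augmented with lags of x, via an F-statistic. It is implemented directly with least squares, and the synthetic oscillatory traces are hypothetical.

        import numpy as np

        def granger_F(x, y, lag=2):
            # F-statistic for "x Granger-causes y".
            T = len(y)
            Y = y[lag:]
            own = np.column_stack([y[lag - k:T - k] for k in range(1, lag + 1)])
            full = np.column_stack([own] +
                                   [x[lag - k:T - k] for k in range(1, lag + 1)])
            def rss(Xd):
                X1 = np.column_stack([np.ones(len(Y)), Xd])
                beta, *_ = np.linalg.lstsq(X1, Y, rcond=None)
                r = Y - X1 @ beta
                return r @ r, X1.shape[1]
            rss0, k0 = rss(own)        # restricted model: own lags only
            rss1, k1 = rss(full)       # unrestricted model: plus lags of x
            return ((rss0 - rss1) / (k1 - k0)) / (rss1 / (len(Y) - k1))

        # Synthetic oscillatory traces in which x drives y at a one-step lag.
        rng = np.random.default_rng(4)
        t = np.arange(500)
        x = np.sin(0.3 * t) + 0.3 * rng.normal(size=t.size)
        y = np.roll(x, 1) + 0.3 * rng.normal(size=t.size)
        print("x->y F =", round(granger_F(x, y), 1),
              " y->x F =", round(granger_F(y, x), 1))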

  19. Ensemble gene function prediction database reveals genes important for complex I formation in Arabidopsis thaliana.

    PubMed

    Hansen, Bjoern Oest; Meyer, Etienne H; Ferrari, Camilla; Vaid, Neha; Movahedi, Sara; Vandepoele, Klaas; Nikoloski, Zoran; Mutwil, Marek

    2018-03-01

    Recent advances in gene function prediction rely on ensemble approaches that integrate results from multiple inference methods to produce superior predictions. Yet, these developments remain largely unexplored in plants. We have explored and compared two methods to integrate 10 gene co-function networks for Arabidopsis thaliana and demonstrate how the integration of these networks produces more accurate gene function predictions for a larger fraction of genes with unknown function. These predictions were used to identify genes involved in mitochondrial complex I formation, and for five of them, we confirmed the predictions experimentally. The ensemble predictions are provided as a user-friendly online database, EnsembleNet. The methods presented here demonstrate that ensemble gene function prediction is a powerful method to boost prediction performance, whereas the EnsembleNet database provides a cutting-edge community tool to guide experimentalists. © 2017 The Authors. New Phytologist © 2017 New Phytologist Trust.

  20. Screening Models of Aquifer Heterogeneity Using the Flow Dimension

    NASA Astrophysics Data System (ADS)

    Walker, D. D.; Cello, P. A.; Roberts, R. M.; Valocchi, A. J.

    2007-12-01

    Despite advances in test interpretation and modeling, typical groundwater modeling studies only indirectly use the parameters and information inferred from hydraulic tests. In particular, the Generalized Radial Flow approach to test interpretation infers the flow dimension, a parameter describing the geometry of the flow field during a hydraulic test. Noninteger values of the flow dimension are often inferred for tests in highly heterogeneous aquifers, yet subsequent modeling studies typically ignore the flow dimension. Monte Carlo analyses of detailed numerical models of aquifer tests examine the flow dimension for several stochastic models of heterogeneous transmissivity, T(x). These include multivariate lognormal, fractional Brownian motion, a site percolation network, and discrete linear features with lengths distributed as a power law. The behavior of the simulated flow dimensions is compared to the flow dimensions observed for multiple aquifer tests in a fractured dolomite aquifer in the Great Lakes region of North America. The combination of multiple hydraulic tests, observed fracture patterns, and the Monte Carlo results is used to screen models of heterogeneity and their parameters for subsequent groundwater flow modeling.
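
    A sketch of the flow-dimension estimate the Generalized Radial Flow interpretation rests on: for a constant-rate test with flow dimension n < 2, late-time drawdown grows as t^(1 - n/2), so the log-log slope of the drawdown recovers n. The synthetic drawdown below is an idealized noiseless asymptote, not field data.

        import numpy as np

        t = np.logspace(1, 4, 60)                  # time (s)
        n_true = 1.5                               # sub-radial flow dimension
        s = 0.01 * t ** (1 - n_true / 2)           # synthetic drawdown (m)

        # Slope of log s vs log t equals nu = 1 - n/2 on the power-law asymptote.
        nu = np.gradient(np.log(s), np.log(t))[-20:].mean()   # late-time average
        print("inferred flow dimension n =", round(2 * (1 - nu), 2))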

  1. An empirical evaluation of two-stage species tree inference strategies using a multilocus dataset from North American pines

    Treesearch

    Michael DeGiorgio; John Syring; Andrew J. Eckert; Aaron Liston; Richard Cronn; David B. Neale; Noah A. Rosenberg

    2014-01-01

    Background: As it becomes increasingly possible to obtain DNA sequences of orthologous genes from diverse sets of taxa, species trees are frequently being inferred from multilocus data. However, the behavior of many methods for performing this inference has remained largely unexplored. Some methods have been proven to be consistent given certain evolutionary models,...

  2. How to select combination operators for fuzzy expert systems using CRI

    NASA Technical Reports Server (NTRS)

    Turksen, I. B.; Tian, Y.

    1992-01-01

    A method to select combination operators for fuzzy expert systems using the Compositional Rule of Inference (CRI) is proposed. First, fuzzy inference processes based on CRI are classified into three categories in terms of their inference results: the Expansion Type Inference, the Reduction Type Inference, and Other Type Inferences. Further, implication operators under Sup-T composition are classified as the Expansion Type Operator, the Reduction Type Operator, and the Other Type Operators. Finally, the combination of rules or their consequences is investigated for inference processes based on CRI.
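
    A minimal sketch of CRI on discrete universes, using the Mamdani (min) implication and sup-min composition (one instance of the sup-T compositions classified above); all membership values are hypothetical.

        import numpy as np

        # Membership vectors on small discrete universes X and Y.
        A  = np.array([1.0, 0.8, 0.3, 0.0])        # antecedent "x is A"
        B  = np.array([0.2, 0.9, 1.0])             # consequent "y is B"
        Ap = np.array([0.9, 1.0, 0.4, 0.1])        # observed fact "x is A'"

        # Build the implication relation R(x, y); here the Mamdani (min)
        # operator, one of the many choices the selection method compares.
        R = np.minimum.outer(A, B)

        # Compositional Rule of Inference with sup-min composition:
        # B'(y) = sup_x min(A'(x), R(x, y)).
        Bp = np.max(np.minimum(Ap[:, None], R), axis=0)
        print("inferred consequent B':", Bp)

    Swapping `np.minimum` in R for another implication operator, or min/max in the composition for another t-norm pair, reproduces the alternatives whose classification the paper addresses.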

  3. Bayesian CP Factorization of Incomplete Tensors with Automatic Rank Determination.

    PubMed

    Zhao, Qibin; Zhang, Liqing; Cichocki, Andrzej

    2015-09-01

    CANDECOMP/PARAFAC (CP) tensor factorization of incomplete data is a powerful technique for tensor completion through explicitly capturing the multilinear latent factors. The existing CP algorithms require the tensor rank to be manually specified; however, the determination of tensor rank remains a challenging problem, especially for the CP rank. In addition, existing approaches do not take into account uncertainty information of latent factors or of missing entries. To address these issues, we formulate CP factorization using a hierarchical probabilistic model and employ a fully Bayesian treatment by incorporating a sparsity-inducing prior over multiple latent factors and the appropriate hyperpriors over all hyperparameters, resulting in automatic rank determination. To learn the model, we develop an efficient deterministic Bayesian inference algorithm, which scales linearly with data size. Our method is characterized as a tuning parameter-free approach, which can effectively infer underlying multilinear factors with a low-rank constraint, while also providing predictive distributions over missing entries. Extensive simulations on synthetic data illustrate the intrinsic capability of our method to recover the ground-truth CP rank and prevent the overfitting problem, even when a large number of entries are missing. Moreover, the results from real-world applications, including image inpainting and facial image synthesis, demonstrate that our method outperforms state-of-the-art approaches for both tensor factorization and tensor completion in terms of predictive performance.
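
    For contrast with the Bayesian treatment, the sketch below implements plain alternating least squares (ALS) for CP factorization of a complete 3-way tensor; unlike the paper's method, the rank must be specified by hand and no uncertainty estimates are produced.

        import numpy as np

        def khatri_rao(A, B):
            # Column-wise Kronecker product used in CP least-squares updates.
            return (A[:, None, :] * B[None, :, :]).reshape(-1, A.shape[1])

        def cp_als(T, rank, n_iter=200, seed=0):
            # Alternating least squares for T[i,j,k] ~ sum_r A[i,r]B[j,r]C[k,r].
            rng = np.random.default_rng(seed)
            I, J, K = T.shape
            A = rng.normal(size=(I, rank))
            B = rng.normal(size=(J, rank))
            C = rng.normal(size=(K, rank))
            T0 = T.reshape(I, -1)                     # mode-0 unfolding
            T1 = np.moveaxis(T, 1, 0).reshape(J, -1)  # mode-1 unfolding
            T2 = np.moveaxis(T, 2, 0).reshape(K, -1)  # mode-2 unfolding
            for _ in range(n_iter):
                A = T0 @ np.linalg.pinv(khatri_rao(B, C)).T
                B = T1 @ np.linalg.pinv(khatri_rao(A, C)).T
                C = T2 @ np.linalg.pinv(khatri_rao(A, B)).T
            return A, B, C

        # Recover a synthetic rank-2 tensor.
        rng = np.random.default_rng(1)
        A, B, C = (rng.normal(size=(s, 2)) for s in (4, 5, 6))
        T = np.einsum('ir,jr,kr->ijk', A, B, C)
        Ah, Bh, Ch = cp_als(T, rank=2)
        err = np.linalg.norm(T - np.einsum('ir,jr,kr->ijk', Ah, Bh, Ch))
        print("relative reconstruction error:", err / np.linalg.norm(T))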

  4. Modelling hourly dissolved oxygen concentration (DO) using dynamic evolving neural-fuzzy inference system (DENFIS)-based approach: case study of Klamath River at Miller Island Boat Ramp, OR, USA.

    PubMed

    Heddam, Salim

    2014-01-01

    In this study, we present an application of an artificial intelligence (AI) technique called the dynamic evolving neural-fuzzy inference system (DENFIS), based on an evolving clustering method (ECM), for modelling dissolved oxygen concentration in a river. To demonstrate the forecasting capability of DENFIS, one year (1 January 2009 to 30 December 2009) of hourly experimental water quality data collected by the United States Geological Survey (USGS Station No: 420853121505500) at Klamath River at Miller Island Boat Ramp, OR, USA, was used for model development. Two DENFIS-based models are presented and compared: (1) an offline system, named DENFIS-OF, and (2) an online system, named DENFIS-ON. The input variables used for the two models are water pH, temperature, specific conductance, and sensor depth. The performances of the models are evaluated using the root mean square error (RMSE), mean absolute error (MAE), Willmott index of agreement (d) and correlation coefficient (CC) statistics. The lowest root mean square error and highest correlation coefficient values were obtained with the DENFIS-ON method. The results obtained with the DENFIS models are compared with linear (multiple linear regression, MLR) and nonlinear (multi-layer perceptron neural network, MLPNN) methods. This study demonstrates that the DENFIS-ON system investigated herein outperforms all the other techniques considered for DO modelling.
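
    The four evaluation statistics are easy to state in code; a minimal sketch on hypothetical hourly DO data follows, with Willmott's d computed as one minus the ratio of squared errors to squared potential errors.

        import numpy as np

        def scores(obs, pred):
            # RMSE, MAE, Willmott's index of agreement d, and correlation CC.
            e = pred - obs
            rmse = np.sqrt(np.mean(e ** 2))
            mae = np.mean(np.abs(e))
            om = obs.mean()
            d = 1 - np.sum(e ** 2) / np.sum((np.abs(pred - om) + np.abs(obs - om)) ** 2)
            cc = np.corrcoef(obs, pred)[0, 1]
            return rmse, mae, d, cc

        # Hypothetical hourly DO observations vs. model output (mg/L).
        rng = np.random.default_rng(5)
        obs = 8 + np.sin(np.linspace(0, 20, 200)) + 0.2 * rng.normal(size=200)
        pred = obs + 0.3 * rng.normal(size=200)
        print("RMSE=%.3f MAE=%.3f d=%.3f CC=%.3f" % scores(obs, pred))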

  5. Structural interpretation in composite systems using powder X-ray diffraction: applications of error propagation to the pair distribution function.

    PubMed

    Moore, Michael D; Shi, Zhenqi; Wildfong, Peter L D

    2010-12-01

    To develop a method for drawing statistical inferences from differences between multiple experimental pair distribution function (PDF) transforms of powder X-ray diffraction (PXRD) data. The appropriate treatment of initial PXRD error estimates using traditional error propagation algorithms was tested using Monte Carlo simulations on amorphous ketoconazole. An amorphous felodipine:polyvinyl pyrrolidone:vinyl acetate (PVPva) physical mixture was prepared to define an error threshold. Co-solidified products of felodipine:PVPva and terfenadine:PVPva were prepared using a melt-quench method and subsequently analyzed using PXRD and PDF. Differential scanning calorimetry (DSC) was used as an additional characterization method. The appropriate manipulation of initial PXRD error estimates through the PDF transform was confirmed using the Monte Carlo simulations for amorphous ketoconazole. The felodipine:PVPva physical mixture PDF analysis determined ±3σ to be an appropriate error threshold. Using the PDF and error propagation principles, the felodipine:PVPva co-solidified product was determined to be completely miscible, and the terfenadine:PVPva co-solidified product, although having appearances of an amorphous molecular solid dispersion by DSC, was determined to be phase-separated. Statistically based inferences were successfully drawn from PDF transforms of PXRD patterns obtained from composite systems. The principles applied herein may be universally adapted to many different systems and provide a fundamentally sound basis for drawing structural conclusions from PDF studies.

  6. Bayesian inference of spectral induced polarization parameters for laboratory complex resistivity measurements of rocks and soils

    NASA Astrophysics Data System (ADS)

    Bérubé, Charles L.; Chouteau, Michel; Shamsipour, Pejman; Enkin, Randolph J.; Olivo, Gema R.

    2017-08-01

    Spectral induced polarization (SIP) measurements are now widely used to infer mineralogical or hydrogeological properties from the low-frequency electrical properties of the subsurface in both mineral exploration and environmental sciences. We present an open-source program that performs fast multi-model inversion of laboratory complex resistivity measurements using Markov-chain Monte Carlo simulation. Using this stochastic method, SIP parameters and their uncertainties may be obtained from the Cole-Cole and Dias models, or from the Debye and Warburg decomposition approaches. The program is tested on synthetic and laboratory data to show that the posterior distribution of a multiple Cole-Cole model is multimodal in particular cases. The Warburg and Debye decomposition approaches yield unique solutions in all cases. It is shown that an adaptive Metropolis algorithm performs faster and is less dependent on the initial parameter values than the Metropolis-Hastings step method when inverting SIP data through the decomposition schemes. There are no advantages in using an adaptive step method for well-defined Cole-Cole inversion. Finally, the influence of measurement noise on the recovered relaxation time distribution is explored. We provide the geophysics community with an open-source platform that can serve as a base for further developments in stochastic SIP data inversion and that may be used to perform parameter analysis with various SIP models.
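
    A minimal sketch of stochastic SIP inversion in the spirit described: a single Cole-Cole forward model and a plain random-walk Metropolis sampler with bounded flat priors. The paper's program uses adaptive Metropolis and several model families; the data, noise level, and step sizes below are all synthetic.

        import numpy as np

        def cole_cole(w, rho0, m, tau, c):
            # Complex resistivity of a single Cole-Cole dispersion.
            return rho0 * (1 - m * (1 - 1 / (1 + (1j * w * tau) ** c)))

        rng = np.random.default_rng(6)
        w = 2 * np.pi * np.logspace(-2, 3, 30)            # angular frequencies
        true = np.array([100.0, 0.4, 0.01, 0.6])          # rho0, m, tau, c
        obs = cole_cole(w, *true) \
            + rng.normal(0, 0.5, w.size) + 1j * rng.normal(0, 0.5, w.size)

        def log_post(p):
            rho0, m, tau, c = p
            if not (rho0 > 0 and 0 < m < 1 and tau > 0 and 0 < c <= 1):
                return -np.inf                            # flat priors with bounds
            r = obs - cole_cole(w, rho0, m, tau, c)
            return -0.5 * np.sum(r.real ** 2 + r.imag ** 2) / 0.5 ** 2

        # Random-walk Metropolis (the paper favours an adaptive variant).
        p = np.array([80.0, 0.3, 0.02, 0.5])
        lp, chain = log_post(p), []
        step = np.array([2.0, 0.02, 0.005, 0.02])
        for _ in range(20_000):
            q = p + step * rng.normal(size=4)
            lq = log_post(q)
            if np.log(rng.random()) < lq - lp:
                p, lp = q, lq
            chain.append(p.copy())
        chain = np.array(chain[10_000:])                  # drop burn-in
        print("posterior means (rho0, m, tau, c):", chain.mean(0).round(3))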

  7. Using genetic data to strengthen causal inference in observational research.

    PubMed

    Pingault, Jean-Baptiste; O'Reilly, Paul F; Schoeler, Tabea; Ploubidis, George B; Rijsdijk, Frühling; Dudbridge, Frank

    2018-06-05

    Causal inference is essential across the biomedical, behavioural and social sciences. By progressing from confounded statistical associations to evidence of causal relationships, causal inference can reveal complex pathways underlying traits and diseases and help to prioritize targets for intervention. Recent progress in genetic epidemiology - including statistical innovation, massive genotyped data sets and novel computational tools for deep data mining - has fostered the intense development of methods exploiting genetic data and relatedness to strengthen causal inference in observational research. In this Review, we describe how such genetically informed methods differ in their rationale, applicability and inherent limitations and outline how they should be integrated in the future to offer a rich causal inference toolbox.

  8. Assessing Species Boundaries Using Multilocus Species Delimitation in a Morphologically Conserved Group of Neotropical Freshwater Fishes, the Poecilia sphenops Species Complex (Poeciliidae)

    PubMed Central

    Bagley, Justin C.; Alda, Fernando; Breitman, M. Florencia; Bermingham, Eldredge; van den Berghe, Eric P.; Johnson, Jerald B.

    2015-01-01

    Accurately delimiting species is fundamentally important for understanding species diversity and distributions and devising effective strategies to conserve biodiversity. However, species delimitation is problematic in many taxa, including ‘non-adaptive radiations’ containing morphologically cryptic lineages. Fortunately, coalescent-based species delimitation methods hold promise for objectively estimating species limits in such radiations, using multilocus genetic data. Using coalescent-based approaches, we delimit species and infer evolutionary relationships in a morphologically conserved group of Central American freshwater fishes, the Poecilia sphenops species complex. Phylogenetic analyses of multiple genetic markers (sequences of two mitochondrial DNA genes and five nuclear loci) from 10/15 species and genetic lineages recognized in the group support the P. sphenops species complex as monophyletic with respect to outgroups, with eight mitochondrial ‘major-lineages’ diverged by ≥2% pairwise genetic distances. From general mixed Yule-coalescent models, we discovered (conservatively) 10 species within our concatenated mitochondrial DNA dataset, 9 of which were strongly supported by subsequent multilocus Bayesian species delimitation and species tree analyses. Results suggested species-level diversity is underestimated or overestimated by at least ~15% in different lineages in the complex. Nonparametric statistics and coalescent simulations indicate genealogical discordance among our gene tree results has mainly derived from interspecific hybridization in the nuclear genome. However, mitochondrial DNA show little evidence for introgression, and our species delimitation results appear robust to effects of this process. Overall, our findings support the utility of combining multiple lines of genetic evidence and broad phylogeographical sampling to discover and validate species using coalescent-based methods. Our study also highlights the importance of testing for hybridization versus incomplete lineage sorting, which aids inference of not only species limits but also evolutionary processes influencing genetic diversity. PMID:25849959

  9. Assessing species boundaries using multilocus species delimitation in a morphologically conserved group of neotropical freshwater fishes, the Poecilia sphenops species complex (Poeciliidae).

    PubMed

    Bagley, Justin C; Alda, Fernando; Breitman, M Florencia; Bermingham, Eldredge; van den Berghe, Eric P; Johnson, Jerald B

    2015-01-01

    Accurately delimiting species is fundamentally important for understanding species diversity and distributions and devising effective strategies to conserve biodiversity. However, species delimitation is problematic in many taxa, including 'non-adaptive radiations' containing morphologically cryptic lineages. Fortunately, coalescent-based species delimitation methods hold promise for objectively estimating species limits in such radiations, using multilocus genetic data. Using coalescent-based approaches, we delimit species and infer evolutionary relationships in a morphologically conserved group of Central American freshwater fishes, the Poecilia sphenops species complex. Phylogenetic analyses of multiple genetic markers (sequences of two mitochondrial DNA genes and five nuclear loci) from 10/15 species and genetic lineages recognized in the group support the P. sphenops species complex as monophyletic with respect to outgroups, with eight mitochondrial 'major-lineages' diverged by ≥2% pairwise genetic distances. From general mixed Yule-coalescent models, we discovered (conservatively) 10 species within our concatenated mitochondrial DNA dataset, 9 of which were strongly supported by subsequent multilocus Bayesian species delimitation and species tree analyses. Results suggested species-level diversity is underestimated or overestimated by at least ~15% in different lineages in the complex. Nonparametric statistics and coalescent simulations indicate genealogical discordance among our gene tree results has mainly derived from interspecific hybridization in the nuclear genome. However, mitochondrial DNA show little evidence for introgression, and our species delimitation results appear robust to effects of this process. Overall, our findings support the utility of combining multiple lines of genetic evidence and broad phylogeographical sampling to discover and validate species using coalescent-based methods. Our study also highlights the importance of testing for hybridization versus incomplete lineage sorting, which aids inference of not only species limits but also evolutionary processes influencing genetic diversity.

  10. Integration of Multiple Genomic and Phenotype Data to Infer Novel miRNA-Disease Associations

    PubMed Central

    Zhou, Meng; Cheng, Liang; Yang, Haixiu; Wang, Jing; Sun, Jie; Wang, Zhenzhen

    2016-01-01

    MicroRNAs (miRNAs) play an important role in the development and progression of human diseases. The identification of disease-associated miRNAs will be helpful for understanding the molecular mechanisms of diseases at the post-transcriptional level. Based on different types of genomic data sources, computational methods for miRNA-disease association prediction have been proposed. However, individual source of genomic data tends to be incomplete and noisy; therefore, the integration of various types of genomic data for inferring reliable miRNA-disease associations is urgently needed. In this study, we present a computational framework, CHNmiRD, for identifying miRNA-disease associations by integrating multiple genomic and phenotype data, including protein-protein interaction data, gene ontology data, experimentally verified miRNA-target relationships, disease phenotype information and known miRNA-disease connections. The performance of CHNmiRD was evaluated by experimentally verified miRNA-disease associations, which achieved an area under the ROC curve (AUC) of 0.834 for 5-fold cross-validation. In particular, CHNmiRD displayed excellent performance for diseases without any known related miRNAs. The results of case studies for three human diseases (glioblastoma, myocardial infarction and type 1 diabetes) showed that all of the top 10 ranked miRNAs having no known associations with these three diseases in existing miRNA-disease databases were directly or indirectly confirmed by our latest literature mining. All these results demonstrated the reliability and efficiency of CHNmiRD, and it is anticipated that CHNmiRD will serve as a powerful bioinformatics method for mining novel disease-related miRNAs and providing a new perspective into molecular mechanisms underlying human diseases at the post-transcriptional level. CHNmiRD is freely available at http://www.bio-bigdata.com/CHNmiRD. PMID:26849207

  11. Bayesian LASSO, scale space and decision making in association genetics.

    PubMed

    Pasanen, Leena; Holmström, Lasse; Sillanpää, Mikko J

    2015-01-01

    LASSO is a penalized regression method that facilitates model fitting in situations where there are as many, or even more explanatory variables than observations, and only a few variables are relevant in explaining the data. We focus on the Bayesian version of LASSO and consider four problems that need special attention: (i) controlling false positives, (ii) multiple comparisons, (iii) collinearity among explanatory variables, and (iv) the choice of the tuning parameter that controls the amount of shrinkage and the sparsity of the estimates. The particular application considered is association genetics, where LASSO regression can be used to find links between chromosome locations and phenotypic traits in a biological organism. However, the proposed techniques are relevant also in other contexts where LASSO is used for variable selection. We separate the true associations from false positives using the posterior distribution of the effects (regression coefficients) provided by Bayesian LASSO. We propose to solve the multiple comparisons problem by using simultaneous inference based on the joint posterior distribution of the effects. Bayesian LASSO also tends to distribute an effect among collinear variables, making detection of an association difficult. We propose to solve this problem by considering not only individual effects but also their functionals (i.e. sums and differences). Finally, whereas in Bayesian LASSO the tuning parameter is often regarded as a random variable, we adopt a scale space view and consider a whole range of fixed tuning parameters, instead. The effect estimates and the associated inference are considered for all tuning parameters in the selected range and the results are visualized with color maps that provide useful insights into data and the association problem considered. The methods are illustrated using two sets of artificial data and one real data set, all representing typical settings in association genetics.
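
    One common way to operationalize simultaneous inference from the joint posterior (not necessarily the authors' exact construction) is a max-statistic credible band: calibrate a single constant so that the whole effect vector lies inside its band with 95% joint posterior probability. A sketch on hypothetical posterior draws:

        import numpy as np

        rng = np.random.default_rng(7)

        # Hypothetical matrix of posterior draws from a Bayesian LASSO fit:
        # rows are MCMC samples, columns are marker effects.
        n_draws, p = 4000, 50
        beta = rng.normal(0, 0.05, size=(n_draws, p))
        beta[:, [7, 23]] += [0.8, -0.6]          # two planted true effects

        # Simultaneous 95% band: find k so that the standardized vector of
        # effects stays within mean +/- k*sd in 95% of the joint draws.
        mu, sd = beta.mean(0), beta.std(0)
        max_stat = np.abs((beta - mu) / sd).max(axis=1)
        k = np.quantile(max_stat, 0.95)

        signif = np.abs(mu) > k * sd             # the band excludes zero
        print("simultaneously significant effects:", np.flatnonzero(signif))

    Because one constant k covers all p effects jointly, the band controls the family-wise error in the posterior sense, which is the multiple-comparisons protection the abstract refers to.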

  12. Integration of Multiple Genomic and Phenotype Data to Infer Novel miRNA-Disease Associations.

    PubMed

    Shi, Hongbo; Zhang, Guangde; Zhou, Meng; Cheng, Liang; Yang, Haixiu; Wang, Jing; Sun, Jie; Wang, Zhenzhen

    2016-01-01

    MicroRNAs (miRNAs) play an important role in the development and progression of human diseases. The identification of disease-associated miRNAs will be helpful for understanding the molecular mechanisms of diseases at the post-transcriptional level. Based on different types of genomic data sources, computational methods for miRNA-disease association prediction have been proposed. However, individual source of genomic data tends to be incomplete and noisy; therefore, the integration of various types of genomic data for inferring reliable miRNA-disease associations is urgently needed. In this study, we present a computational framework, CHNmiRD, for identifying miRNA-disease associations by integrating multiple genomic and phenotype data, including protein-protein interaction data, gene ontology data, experimentally verified miRNA-target relationships, disease phenotype information and known miRNA-disease connections. The performance of CHNmiRD was evaluated by experimentally verified miRNA-disease associations, which achieved an area under the ROC curve (AUC) of 0.834 for 5-fold cross-validation. In particular, CHNmiRD displayed excellent performance for diseases without any known related miRNAs. The results of case studies for three human diseases (glioblastoma, myocardial infarction and type 1 diabetes) showed that all of the top 10 ranked miRNAs having no known associations with these three diseases in existing miRNA-disease databases were directly or indirectly confirmed by our latest literature mining. All these results demonstrated the reliability and efficiency of CHNmiRD, and it is anticipated that CHNmiRD will serve as a powerful bioinformatics method for mining novel disease-related miRNAs and providing a new perspective into molecular mechanisms underlying human diseases at the post-transcriptional level. CHNmiRD is freely available at http://www.bio-bigdata.com/CHNmiRD.

  13. Two approaches to incorporate clinical data uncertainty into multiple criteria decision analysis for benefit-risk assessment of medicinal products.

    PubMed

    Wen, Shihua; Zhang, Lanju; Yang, Bo

    2014-07-01

    The Problem formulation, Objectives, Alternatives, Consequences, Trade-offs, Uncertainties, Risk attitude, and Linked decisions (PrOACT-URL) framework and multiple criteria decision analysis (MCDA) have been recommended by the European Medicines Agency for structured benefit-risk assessment of medicinal products undergoing regulatory review. The objective of this article was to provide solutions for incorporating the uncertainty from clinical data into the MCDA model when evaluating the overall benefit-risk profiles of different treatment options. Two statistical approaches, the δ-method approach and the Monte-Carlo approach, were proposed to construct the confidence interval of the overall benefit-risk score from the MCDA model, as well as other probabilistic measures for comparing the benefit-risk profiles between treatment options. Both approaches can incorporate the correlation structure between clinical parameters (criteria) in the MCDA model and are straightforward to implement. The two proposed approaches were applied to a case study evaluating the benefit-risk profile of an add-on therapy for rheumatoid arthritis (drug X) relative to placebo. The case study demonstrated a straightforward way to quantify the impact of the uncertainty from clinical data on the benefit-risk assessment and enabled statistical inference when evaluating the overall benefit-risk profiles of different treatment options. The δ-method approach provides a closed form for the variability of the overall benefit-risk score in the MCDA model, whereas the Monte-Carlo approach is more computationally intensive but can yield the score's true sampling distribution for statistical inference. The obtained confidence intervals and other probabilistic measures from the two approaches enhance benefit-risk decision making for medicinal products. Copyright © 2014 International Society for Pharmacoeconomics and Outcomes Research (ISPOR). Published by Elsevier Inc. All rights reserved.
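
    A minimal sketch of the Monte-Carlo approach follows, assuming invented weights, criterion estimates and covariances rather than the case study's actual values: criterion estimates are drawn from a correlated multivariate normal and propagated through the weighted MCDA score.

```python
# Monte-Carlo propagation of clinical-data uncertainty through an MCDA
# score. All numbers (weights, means, covariances) are illustrative.
import numpy as np

rng = np.random.default_rng(2)
weights = np.array([0.5, 0.3, 0.2])          # criterion weights
mean = np.array([0.60, 0.20, 0.10])          # estimated criterion values (drug X)
cov = np.array([[0.0025, 0.0010, 0.0],
                [0.0010, 0.0016, 0.0],
                [0.0,    0.0,    0.0009]])   # sampling covariance of estimates

draws = rng.multivariate_normal(mean, cov, size=100_000)
scores = draws @ weights                     # overall benefit-risk score per draw
lo, hi = np.percentile(scores, [2.5, 97.5])
print(f"score {scores.mean():.3f}, 95% CI ({lo:.3f}, {hi:.3f})")

# A probabilistic comparison against a (simulated) placebo arm.
placebo = rng.multivariate_normal([0.40, 0.10, 0.05], cov, size=100_000) @ weights
print(f"P(drug beats placebo) = {(scores > placebo).mean():.3f}")
```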

  14. Multivariate and repeated measures (MRM): A new toolbox for dependent and multimodal group-level neuroimaging data.

    PubMed

    McFarquhar, Martyn; McKie, Shane; Emsley, Richard; Suckling, John; Elliott, Rebecca; Williams, Stephen

    2016-05-15

    Repeated measurements and multimodal data are common in neuroimaging research. Despite this, conventional approaches to group level analysis ignore these repeated measurements in favour of multiple between-subject models using contrasts of interest. This approach has a number of drawbacks as certain designs and comparisons of interest are either not possible or complex to implement. Unfortunately, even when attempting to analyse group level data within a repeated-measures framework, the methods implemented in popular software packages make potentially unrealistic assumptions about the covariance structure across the brain. In this paper, we describe how this issue can be addressed in a simple and efficient manner using the multivariate form of the familiar general linear model (GLM), as implemented in a new MATLAB toolbox. This multivariate framework is discussed, paying particular attention to methods of inference by permutation. Comparisons with existing approaches and software packages for dependent group-level neuroimaging data are made. We also demonstrate how this method is easily adapted for dependency at the group level when multiple modalities of imaging are collected from the same individuals. Follow-up of these multimodal models using linear discriminant functions (LDA) is also discussed, with applications to future studies wishing to integrate multiple scanning techniques into investigating populations of interest. Copyright © 2016 The Authors. Published by Elsevier Inc. All rights reserved.
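
    The permutation-inference idea can be sketched compactly; the example below (simulated data, not the MATLAB toolbox) permutes group labels and uses a max-statistic across outcome variables so that the resulting p-value controls familywise error.

```python
# Minimal permutation-inference sketch in the spirit of group-level GLM
# testing: the max t-statistic over outcome variables is recomputed under
# label permutations. All data are simulated.
import numpy as np

rng = np.random.default_rng(3)
n, q = 20, 5                                  # subjects per group, modalities
a = rng.standard_normal((n, q)) + [0.8, 0, 0, 0, 0]   # group A (effect in var 0)
b = rng.standard_normal((n, q))                        # group B
data = np.vstack([a, b])
labels = np.array([0] * n + [1] * n)

def max_t(x, g):
    d = x[g == 0].mean(0) - x[g == 1].mean(0)
    se = np.sqrt(x[g == 0].var(0, ddof=1) / n + x[g == 1].var(0, ddof=1) / n)
    return np.max(np.abs(d / se))             # max over outcome variables

obs = max_t(data, labels)
null = np.array([max_t(data, rng.permutation(labels)) for _ in range(5000)])
print(f"max-t = {obs:.2f}, permutation p = {(np.sum(null >= obs) + 1) / 5001:.4f}")
```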

  15. A nonlinear control method based on ANFIS and multiple models for a class of SISO nonlinear systems and its application.

    PubMed

    Zhang, Yajun; Chai, Tianyou; Wang, Hong

    2011-11-01

    This paper presents a novel nonlinear control strategy for a class of uncertain single-input and single-output discrete-time nonlinear systems with unstable zero-dynamics. The proposed method combines an adaptive-network-based fuzzy inference system (ANFIS) with multiple models, where a linear robust controller, an ANFIS-based nonlinear controller and a switching mechanism are integrated using the multiple-models technique. It has been shown that the linear controller can ensure the boundedness of the input and output signals and that the nonlinear controller can improve the dynamic performance of the closed-loop system. Moreover, it has also been shown that the use of the switching mechanism can simultaneously guarantee closed-loop stability and improve performance. As a result, the controller has the following three outstanding features compared with existing control strategies. First, this method relaxes the commonly used assumption of uniform boundedness on the unmodeled dynamics and thus enhances its applicability. Second, since ANFIS is used to estimate and compensate for the effect caused by the unmodeled dynamics, the convergence rate of neural network learning is increased. Third, a "one-to-one mapping" technique is adopted to guarantee the universal approximation property of ANFIS. The proposed controller is applied to a numerical example and to a pulverizing process of an alumina sintering system, where its effectiveness has been demonstrated.

  16. Convex Clustering: An Attractive Alternative to Hierarchical Clustering

    PubMed Central

    Chen, Gary K.; Chi, Eric C.; Ranola, John Michael O.; Lange, Kenneth

    2015-01-01

    The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical clustering visually appealing and in tune with their evolutionary perspective. Hierarchical clustering operates on multiple scales simultaneously. This is essential, for instance, in transcriptome data, where one may be interested in making qualitative inferences about how lower-order relationships like gene modules lead to higher-order relationships like pathways or biological processes. The recently developed method of convex clustering preserves the visual appeal of hierarchical clustering while ameliorating its propensity to make false inferences in the presence of outliers and noise. The solution paths generated by convex clustering reveal relationships between clusters that are hidden by static methods such as k-means clustering. The current paper derives and tests a novel proximal distance algorithm for minimizing the objective function of convex clustering. The algorithm separates parameters, accommodates missing data, and supports prior information on relationships. Our program CONVEXCLUSTER incorporating the algorithm is implemented on ATI and nVidia graphics processing units (GPUs) for maximal speed. Several biological examples illustrate the strengths of convex clustering and the ability of the proximal distance algorithm to handle high-dimensional problems. CONVEXCLUSTER can be freely downloaded from the UCLA Human Genetics web site at http://www.genetics.ucla.edu/software/ PMID:25965340
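
    For illustration only, the convex clustering objective can be handed to a generic convex solver such as cvxpy (assumed installed); this sketch is not the paper's proximal distance algorithm or GPU implementation, and the data are simulated.

```python
# The convex clustering objective: 0.5 * sum_i ||x_i - u_i||^2 plus a
# gamma-weighted sum of pairwise centroid norms. Solved with cvxpy.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(3, 0.3, (5, 2))])
n = X.shape[0]
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]

for gamma in (0.01, 0.5, 5.0):
    U = cp.Variable(X.shape)                       # one centroid per point
    fit = 0.5 * cp.sum_squares(X - U)
    fuse = sum(cp.norm(U[i] - U[j], 2) for i, j in pairs)
    cp.Problem(cp.Minimize(fit + gamma * fuse)).solve()
    # Count distinct (rounded) centroids to see cluster fusion.
    k = len({tuple(np.round(u, 1)) for u in U.value})
    print(f"gamma={gamma}: {k} distinct centroids")
```

    Increasing gamma fuses the centroids, tracing the solution path that distinguishes convex clustering from static methods such as k-means.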

  17. Inference of time-delayed gene regulatory networks based on dynamic Bayesian network hybrid learning method

    PubMed Central

    Yu, Bin; Xu, Jia-Meng; Li, Shan; Chen, Cheng; Chen, Rui-Xin; Wang, Lei; Zhang, Yan; Wang, Ming-Hui

    2017-01-01

    Gene regulatory networks (GRNs) research reveals complex life phenomena from the perspective of gene interaction, and is an important research field in systems biology. Traditional Bayesian networks have a high computational complexity, and the network structure scoring model has a single feature. Information-based approaches cannot identify the direction of regulation. In order to make up for the shortcomings of the above methods, this paper presents a novel hybrid learning method (DBNCS) based on dynamic Bayesian networks (DBN) to construct multiple time-delayed GRNs for the first time, combining the comprehensive score (CS) with the DBN model. The DBNCS algorithm first uses the CMI2NI (conditional mutual inclusive information-based network inference) algorithm to learn the network structure profiles, i.e. to construct the search space. Redundant regulations are then removed using a recursive optimization (RO) algorithm, thereby reducing the false positive rate. Next, the network structure profiles are decomposed without loss into a set of cliques, which significantly reduces the computational complexity. Finally, the DBN model is used to identify the direction of gene regulation within the cliques and to search for the optimal network structure. The performance of the DBNCS algorithm is evaluated on benchmark GRN datasets from the DREAM challenge as well as the SOS DNA repair network in Escherichia coli, and compared with other state-of-the-art methods. The experimental results demonstrate the soundness of the algorithm design and its strong performance in GRN inference. PMID:29113310
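
    The first-pass construction of the search space can be approximated with a plain mutual-information edge score, as sketched below on simulated expression data; the conditional "inclusive" correction of CMI2NI and the downstream DBN scoring are not reproduced here.

```python
# Edge scoring by (unconditional) mutual information between discretized
# expression profiles, a simplified stand-in for the search-space step.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(5)
t, g = 200, 4                          # time points, genes
expr = rng.standard_normal((t, g))
expr[:, 1] += 0.8 * expr[:, 0]         # gene 0 regulates gene 1

def mi(x, y, bins=8):
    """Mutual information between two profiles after equal-width binning."""
    xb = np.digitize(x, np.histogram_bin_edges(x, bins))
    yb = np.digitize(y, np.histogram_bin_edges(y, bins))
    return mutual_info_score(xb, yb)

for i in range(g):
    for j in range(i + 1, g):
        print(f"genes {i}-{j}: MI = {mi(expr[:, i], expr[:, j]):.3f}")
```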

  18. Convex clustering: an attractive alternative to hierarchical clustering.

    PubMed

    Chen, Gary K; Chi, Eric C; Ranola, John Michael O; Lange, Kenneth

    2015-05-01

    The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical clustering visually appealing and in tune with their evolutionary perspective. Hierarchical clustering operates on multiple scales simultaneously. This is essential, for instance, in transcriptome data, where one may be interested in making qualitative inferences about how lower-order relationships like gene modules lead to higher-order relationships like pathways or biological processes. The recently developed method of convex clustering preserves the visual appeal of hierarchical clustering while ameliorating its propensity to make false inferences in the presence of outliers and noise. The solution paths generated by convex clustering reveal relationships between clusters that are hidden by static methods such as k-means clustering. The current paper derives and tests a novel proximal distance algorithm for minimizing the objective function of convex clustering. The algorithm separates parameters, accommodates missing data, and supports prior information on relationships. Our program CONVEXCLUSTER incorporating the algorithm is implemented on ATI and nVidia graphics processing units (GPUs) for maximal speed. Several biological examples illustrate the strengths of convex clustering and the ability of the proximal distance algorithm to handle high-dimensional problems. CONVEXCLUSTER can be freely downloaded from the UCLA Human Genetics web site at http://www.genetics.ucla.edu/software/.

  19. Inference of time-delayed gene regulatory networks based on dynamic Bayesian network hybrid learning method.

    PubMed

    Yu, Bin; Xu, Jia-Meng; Li, Shan; Chen, Cheng; Chen, Rui-Xin; Wang, Lei; Zhang, Yan; Wang, Ming-Hui

    2017-10-06

    Gene regulatory networks (GRNs) research reveals complex life phenomena from the perspective of gene interaction, and is an important research field in systems biology. Traditional Bayesian networks have a high computational complexity, and the network structure scoring model has a single feature. Information-based approaches cannot identify the direction of regulation. In order to make up for the shortcomings of the above methods, this paper presents a novel hybrid learning method (DBNCS) based on dynamic Bayesian networks (DBN) to construct multiple time-delayed GRNs for the first time, combining the comprehensive score (CS) with the DBN model. The DBNCS algorithm first uses the CMI2NI (conditional mutual inclusive information-based network inference) algorithm to learn the network structure profiles, i.e. to construct the search space. Redundant regulations are then removed using a recursive optimization (RO) algorithm, thereby reducing the false positive rate. Next, the network structure profiles are decomposed without loss into a set of cliques, which significantly reduces the computational complexity. Finally, the DBN model is used to identify the direction of gene regulation within the cliques and to search for the optimal network structure. The performance of the DBNCS algorithm is evaluated on benchmark GRN datasets from the DREAM challenge as well as the SOS DNA repair network in Escherichia coli, and compared with other state-of-the-art methods. The experimental results demonstrate the soundness of the algorithm design and its strong performance in GRN inference.

  20. Inferring causal relationships between phenotypes using summary statistics from genome-wide association studies.

    PubMed

    Meng, Xiang-He; Shen, Hui; Chen, Xiang-Ding; Xiao, Hong-Mei; Deng, Hong-Wen

    2018-03-01

    Genome-wide association studies (GWAS) have successfully identified numerous genetic variants associated with diverse complex phenotypes and diseases, and have provided tremendous opportunities for further analyses using summary association statistics. Recently, Pickrell et al. developed a robust method for causal inference using independent putative causal SNPs. However, this method may fail to infer the causal relationship between two phenotypes when only a limited number of independent putative causal SNPs are identified. Here, we extended Pickrell's method to make it more applicable to general situations. We extended the causal inference method by replacing the putative causal SNPs with the lead SNPs (the set of the most significant SNPs in each independent locus) and tested the performance of our extended method using both simulation and empirical data. Simulations suggested that when the same number of genetic variants is used, our extended method has a similar distribution of the test statistic under the null model and comparable power under the causal model compared with the original method by Pickrell et al. In practice, however, our extended method will generally be more powerful because the number of independent lead SNPs is often larger than the number of independent putative causal SNPs; including more SNPs, on the other hand, did not cause more false positives. By applying our extended method to summary statistics from GWAS for blood metabolites and femoral neck bone mineral density (FN-BMD), we successfully identified ten blood metabolites that may causally influence FN-BMD. We extended a causal inference method for inferring putative causal relationships between two phenotypes using summary statistics from GWAS, and identified a number of potential causal metabolites for FN-BMD, which may provide novel insights into the pathophysiological mechanisms underlying osteoporosis.
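
    The flavor of inference from summary statistics at independent lead SNPs can be conveyed with an inverse-variance-weighted regression through the origin (an MR-style estimator, not Pickrell et al.'s likelihood-based method); the effect sizes below are simulated under a causal model.

```python
# IVW-style illustration: do lead-SNP effects on trait 1 proportionally
# predict effects on trait 2? Summary statistics here are simulated.
import numpy as np

rng = np.random.default_rng(6)
m = 40                                       # independent lead SNPs
beta1 = rng.normal(0, 0.1, m)                # effects on, e.g., a metabolite
se2 = np.full(m, 0.02)                       # standard errors for trait 2
beta2 = 0.3 * beta1 + rng.normal(0, se2)     # effects on, e.g., FN-BMD

w = 1.0 / se2**2
est = np.sum(w * beta1 * beta2) / np.sum(w * beta1**2)   # IVW slope
se = np.sqrt(1.0 / np.sum(w * beta1**2))
print(f"causal effect estimate {est:.3f} +/- {se:.3f}, z = {est / se:.1f}")
```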

  1. An Application of Robust Method in Multiple Linear Regression Model toward Credit Card Debt

    NASA Astrophysics Data System (ADS)

    Amira Azmi, Nur; Saifullah Rusiman, Mohd; Khalid, Kamil; Roslan, Rozaini; Sufahani, Suliadi; Mohamad, Mahathir; Salleh, Rohayu Mohd; Hamzah, Nur Shamsidah Amir

    2018-04-01

    The credit card is a convenient alternative to cash or cheques, and it is an essential component of electronic and internet commerce. In this study, the researchers attempt to determine the relationship between credit card debt and demographic variables such as age, household income, education level, years with current employer, years at current address, debt-to-income ratio and other debt, and to identify which variables are significant. The data cover information on 850 customers. Three methods were applied to the credit card debt data: multiple linear regression (MLR) models, MLR models with the least quartile difference (LQD) method, and MLR models with the mean absolute deviation method. Comparing the three methods, the MLR model with the LQD method was found to be the best model, with the lowest mean square error (MSE). The final model shows that years with current employer, years at current address, household income in thousands and debt-to-income ratio are positively associated with the amount of credit card debt, whereas age, level of education and other debt are negatively associated with it. This study may serve as a reference for banks using robust methods, so that they can better understand their options and choose the one best aligned with their goals for inference regarding credit card debt.
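
    The model comparison can be sketched with standard tools; since the LQD estimator is not available in common libraries, scikit-learn's Huber regression stands in as the robust method in this simulated example.

```python
# Compare ordinary least squares with a robust fit on data containing
# outliers; Huber regression is a stand-in for the study's LQD method.
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(7)
X = rng.uniform(0, 10, (200, 3))             # e.g. income, years employed, ratio
y = X @ [2.0, 1.0, -0.5] + rng.normal(0, 1, 200)
y[:10] += 40                                 # a few extreme debt values

for model in (LinearRegression(), HuberRegressor(max_iter=500)):
    pred = model.fit(X, y).predict(X)
    print(f"{type(model).__name__}: MSE on clean points = "
          f"{mean_squared_error(y[10:], pred[10:]):.2f}")
```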

  2. PhySIC_IST: cleaning source trees to infer more informative supertrees

    PubMed Central

    Scornavacca, Celine; Berry, Vincent; Lefort, Vincent; Douzery, Emmanuel JP; Ranwez, Vincent

    2008-01-01

    Background Supertree methods combine phylogenies with overlapping sets of taxa into a larger one. Topological conflicts frequently arise among source trees for methodological or biological reasons, such as long branch attraction, lateral gene transfers, gene duplication/loss or deep gene coalescence. When topological conflicts occur among source trees, liberal methods infer supertrees containing the most frequent alternative, while veto methods infer supertrees not contradicting any source tree, i.e. discard all conflicting resolutions. When the source trees host a significant number of topological conflicts or have a small taxon overlap, supertree methods of both kinds can propose poorly resolved, hence uninformative, supertrees. Results To overcome this problem, we propose to infer non-plenary supertrees, i.e. supertrees that do not necessarily contain all the taxa present in the source trees, discarding those whose position greatly differs among source trees or for which insufficient information is provided. We detail a variant of the PhySIC veto method called PhySIC_IST that can infer non-plenary supertrees. PhySIC_IST aims at inferring supertrees that satisfy the same appealing theoretical properties as PhySIC, while being as informative as possible under this constraint. The informativeness of a supertree is estimated using a variation of the CIC (Cladistic Information Content) criterion that takes into account both the presence of multifurcations and the absence of some taxa. Additionally, we propose a statistical preprocessing step called STC (Source Trees Correction) to correct the source trees prior to the supertree inference. STC is a liberal step that removes the parts of each source tree that significantly conflict with other source trees. Combining STC with a veto method allows an explicit trade-off between veto and liberal approaches, tuned by a single parameter. Performing large-scale simulations, we observe that STC+PhySIC_IST infers much more informative supertrees than PhySIC, while preserving a low type I error compared to the well-known MRP method. Two biological case studies on animals confirm that the STC preprocess successfully detects anomalies in the source trees while STC+PhySIC_IST provides well-resolved supertrees agreeing with current knowledge in systematics. Conclusion The paper introduces and tests two new methodologies, PhySIC_IST and STC, that demonstrate the value of inferring non-plenary supertrees as well as of preprocessing the source trees. An implementation of the methods is available at: http://www.atgc-montpellier.fr/physic_ist/. PMID:18834542

  3. PhySIC_IST: cleaning source trees to infer more informative supertrees.

    PubMed

    Scornavacca, Celine; Berry, Vincent; Lefort, Vincent; Douzery, Emmanuel J P; Ranwez, Vincent

    2008-10-04

    Supertree methods combine phylogenies with overlapping sets of taxa into a larger one. Topological conflicts frequently arise among source trees for methodological or biological reasons, such as long branch attraction, lateral gene transfers, gene duplication/loss or deep gene coalescence. When topological conflicts occur among source trees, liberal methods infer supertrees containing the most frequent alternative, while veto methods infer supertrees not contradicting any source tree, i.e. discard all conflicting resolutions. When the source trees host a significant number of topological conflicts or have a small taxon overlap, supertree methods of both kinds can propose poorly resolved, hence uninformative, supertrees. To overcome this problem, we propose to infer non-plenary supertrees, i.e. supertrees that do not necessarily contain all the taxa present in the source trees, discarding those whose position greatly differs among source trees or for which insufficient information is provided. We detail a variant of the PhySIC veto method called PhySIC_IST that can infer non-plenary supertrees. PhySIC_IST aims at inferring supertrees that satisfy the same appealing theoretical properties as PhySIC, while being as informative as possible under this constraint. The informativeness of a supertree is estimated using a variation of the CIC (Cladistic Information Content) criterion that takes into account both the presence of multifurcations and the absence of some taxa. Additionally, we propose a statistical preprocessing step called STC (Source Trees Correction) to correct the source trees prior to the supertree inference. STC is a liberal step that removes the parts of each source tree that significantly conflict with other source trees. Combining STC with a veto method allows an explicit trade-off between veto and liberal approaches, tuned by a single parameter. Performing large-scale simulations, we observe that STC+PhySIC_IST infers much more informative supertrees than PhySIC, while preserving a low type I error compared to the well-known MRP method. Two biological case studies on animals confirm that the STC preprocess successfully detects anomalies in the source trees while STC+PhySIC_IST provides well-resolved supertrees agreeing with current knowledge in systematics. The paper introduces and tests two new methodologies, PhySIC_IST and STC, that demonstrate the value of inferring non-plenary supertrees as well as of preprocessing the source trees. An implementation of the methods is available at: http://www.atgc-montpellier.fr/physic_ist/.

  4. Bayesian Networks Improve Causal Environmental Assessments for Evidence-Based Policy.

    PubMed

    Carriger, John F; Barron, Mace G; Newman, Michael C

    2016-12-20

    Rule-based weight of evidence approaches to ecological risk assessment may not account for uncertainties and generally lack probabilistic integration of lines of evidence. Bayesian networks allow causal inferences to be made from evidence by including causal knowledge about the problem, using this knowledge with probabilistic calculus to combine multiple lines of evidence, and minimizing biases in predicting or diagnosing causal relationships. Too often, conventional weight of evidence approaches ignore sources of uncertainty that Bayesian networks can account for. Specifying and propagating uncertainties improves the ability of models to incorporate the strength of the evidence in the risk management phase of an assessment. Probabilistic inference from a Bayesian network allows evaluation of changes in uncertainty for variables given the evidence. The network structure and probabilistic framework of a Bayesian approach provide advantages over qualitative weight of evidence approaches in capturing the impacts of multiple sources of quantifiable uncertainty on predictions of ecological risk. Bayesian networks can facilitate the development of evidence-based policy under conditions of uncertainty by incorporating analytical inaccuracies or the implications of imperfect information, structuring and communicating causal issues through qualitative directed graph formulations, and quantitatively comparing the causal power of multiple stressors on valued ecological resources. These aspects are demonstrated through hypothetical problem scenarios that explore some major benefits of using Bayesian networks for reasoning and making inferences in evidence-based policy.
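
    The probabilistic calculus behind such a network can be shown in miniature: with invented probabilities, two conditionally independent lines of evidence about a binary impairment state are combined by Bayes' rule.

```python
# Hand-rolled numerical sketch of combining two lines of evidence about a
# binary cause ("the stressor impairs the stream reach"); all probabilities
# are illustrative, not drawn from any real assessment.
import numpy as np

prior = np.array([0.5, 0.5])                 # P(impaired), P(not impaired)
# Likelihoods P(evidence | state) for two independent lines of evidence.
lik_biosurvey = np.array([0.80, 0.30])       # degraded invertebrate community
lik_chemistry = np.array([0.70, 0.20])       # elevated contaminant levels

posterior = prior * lik_biosurvey * lik_chemistry
posterior /= posterior.sum()
print(f"P(impaired | both lines of evidence) = {posterior[0]:.3f}")
```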

  5. Teaching machines to find mantle composition

    NASA Astrophysics Data System (ADS)

    Atkins, Suzanne; Tackley, Paul; Trampert, Jeannot; Valentine, Andrew

    2017-04-01

    The composition of the mantle affects many geodynamical processes by altering factors such as the density, the location of phase changes, and the melting temperature. The inferences we make about mantle composition also determine how we interpret the changes in velocity, reflections, attenuation and scattering seen by seismologists. However, the bulk composition of the mantle is very poorly constrained. Inferences are made from meteorite samples, rock samples from the Earth, and geophysical data. All of these approaches require significant assumptions, and the inferences made are subject to large uncertainties. Here we present a new method for inferring mantle composition, based on pattern-recognition machine learning, which uses large-scale in situ observations of the mantle to make fully probabilistic inferences of composition for convection simulations. Our method has an advantage over other petrological approaches because we use large-scale geophysical observations. This means that we average over much greater length scales and do not need to rely on extrapolating from localised samples of the mantle or planetary disk. Another major advantage of our method is that it is fully probabilistic. This allows us to include all of the uncertainties inherent in the inference process, giving us far more information about the reliability of the result than other methods. Finally, our method includes the impact of composition on mantle convection. This allows us to make much more precise inferences from geophysical data than other geophysical approaches, which attempt to invert one observation with no consideration of the relationship between convection and composition. We use a sampling-based inversion method, using hundreds of convection simulations run using StagYY with self-consistent mineral physics properties calculated using the PerpleX package. The observations from these simulations are used to train a neural network to make a probabilistic inference for the major element oxide composition of the mantle. We find we can constrain bulk mantle FeO molar percent, FeO/MgO and FeO/SiO2 using observations of the temperature and density structure of the mantle in convection simulations.
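
    A toy version of the workflow, with an arbitrary synthetic forward function standing in for the StagYY/PerpleX physics, trains a small network ensemble to invert observations and reports the ensemble spread as a rough uncertainty.

```python
# Simulate (composition -> observables) pairs, train networks to invert
# them, and use ensemble disagreement as a crude probabilistic statement.
# The forward function is invented; this is not the paper's pipeline.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(8)
comp = rng.uniform(5, 15, (2000, 1))                  # e.g. FeO molar percent
obs = np.hstack([np.sin(comp / 3), comp**0.5]) + rng.normal(0, 0.05, (2000, 2))

nets = [MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                     random_state=s).fit(obs, comp.ravel()) for s in range(5)]
target = obs[:1]                                      # a new "observation"
preds = np.array([n.predict(target)[0] for n in nets])
print(f"inferred composition {preds.mean():.2f} +/- {preds.std():.2f}"
      f" (true {comp[0, 0]:.2f})")
```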

  6. Single-Cell-Based Platform for Copy Number Variation Profiling through Digital Counting of Amplified Genomic DNA Fragments.

    PubMed

    Li, Chunmei; Yu, Zhilong; Fu, Yusi; Pang, Yuhong; Huang, Yanyi

    2017-04-26

    We develop a novel single-cell-based platform, named multifraction amplification (mfA), that detects copy number variations (CNVs) in a single cell through digital counting of amplified genomic DNA fragments. Amplification is required to acquire genomic information from a single cell but introduces unavoidable bias. Unlike prevalent methods that directly infer CNV profiles from the pattern of sequencing depth, our mfA platform denatures and separates the DNA molecules from a single cell into multiple fractions of a reaction mix before amplification. By examining the sequencing result of each fraction for a specific fragment and applying a segment-merge maximum likelihood algorithm to the calculation of copy number, we digitize the sequencing-depth-based CNV identification and thus provide a method that is less sensitive to amplification bias. In this paper, we demonstrate the mfA platform with multiple displacement amplification (MDA) chemistry. With the mfA platform, the noise of MDA is reduced; therefore, the resolution of single-cell CNV identification can be improved to 100 kb. We can also determine genomic regions free of allelic drop-out with the mfA platform, which is impossible for conventional single-cell amplification methods.
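
    The digital-counting idea can be illustrated with a brute-force simulated likelihood (the paper's segment-merge maximum likelihood algorithm is not reproduced): n template copies fall at random into F fractions, and the copy number is read off from how many fractions test positive.

```python
# Simulated-likelihood sketch: infer copy number from the number of
# positive fractions. F and the observed count are invented for the demo.
import numpy as np

rng = np.random.default_rng(9)
F = 24                                        # fractions per cell

def p_occupied(n, sims=20000):
    """Simulated distribution of #positive fractions given copy number n."""
    hits = rng.integers(0, F, (sims, n))      # fraction receiving each copy
    occ = [len(set(row)) for row in hits]
    return np.bincount(occ, minlength=F + 1) / sims

observed_positive = 4                         # fractions containing the locus
probs = {n: p_occupied(n)[observed_positive] for n in range(1, 9)}
mle = max(probs, key=probs.get)
print(f"ML copy number given {observed_positive}/{F} positive fractions: {mle}")
```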

  7. Visual Analysis of Multiple Baseline across Participants Graphs when Change Is Delayed

    ERIC Educational Resources Information Center

    Lieberman, Rebecca G.; Yoder, Paul J.; Reichow, Brian; Wolery, Mark

    2010-01-01

    A within-subjects group experimental design was used to test whether three manipulated characteristics of multiple baseline across participants (MBL-P) data showing at least a month delayed change in slope affected experts' inference of a functional relation and agreement on this judgment. Thirty-six experts completed a survey composed of 16 MBL-P…

  8. Effect of Rate Reduction and Increased Loudness on Acoustic Measures of Anticipatory Coarticulation in Multiple Sclerosis and Parkinson's Disease

    ERIC Educational Resources Information Center

    Tjaden, Kris; Wilding, Gregory E.

    2005-01-01

    The present study compared patterns of anticipatory coarticulation for utterances produced in habitual, loud, and slow conditions by 17 individuals with multiple sclerosis (MS), 12 individuals with Parkinson's disease (PD), and 15 healthy controls. Coarticulation was inferred from vowel F2 frequencies and consonant first-moment coefficients.…

  9. Expectation propagation for large scale Bayesian inference of non-linear molecular networks from perturbation data.

    PubMed

    Narimani, Zahra; Beigy, Hamid; Ahmad, Ashar; Masoudi-Nejad, Ali; Fröhlich, Holger

    2017-01-01

    Inferring the structure of molecular networks from time series protein or gene expression data provides valuable information about the complex biological processes of the cell. Causal network structure inference has been approached using different methods in the past. Most causal network inference techniques, such as Dynamic Bayesian Networks and ordinary differential equations, are limited by their computational complexity, which makes large-scale inference infeasible. This is especially true if a Bayesian framework is applied in order to deal with the unavoidable uncertainty about the correct model. We devise a novel Bayesian network reverse engineering approach using ordinary differential equations with the ability to include non-linearity. Besides modeling arbitrary, possibly combinatorial and time-dependent perturbations with unknown targets, one of our main contributions is the use of Expectation Propagation, an algorithm for approximate Bayesian inference over large-scale network structures in short computation time. We further explore the possibility of integrating prior knowledge into network inference. We evaluate the proposed model on DREAM4 and DREAM8 data and find it competitive with several state-of-the-art network inference methods.

  10. Integration of heterogeneous molecular networks to unravel gene-regulation in Mycobacterium tuberculosis.

    PubMed

    van Dam, Jesse C J; Schaap, Peter J; Martins dos Santos, Vitor A P; Suárez-Diez, María

    2014-09-26

    Different methods have been developed to infer regulatory networks from heterogeneous omics datasets and to construct co-expression networks. Each algorithm produces different networks, and efforts have been devoted to automatically integrating them into consensus sets. However, each separate set has an intrinsic value that is diluted and partly lost when building a consensus network. Here we present a methodology to generate co-expression networks and, instead of a consensus network, we propose an integration framework where the different networks are kept and analysed with additional tools to efficiently combine the information extracted from each network. We developed a workflow to efficiently analyse information generated by different inference and prediction methods. Our methodology relies on providing the user the means to simultaneously visualise and analyse the coexisting networks generated by different algorithms, heterogeneous datasets, and a suite of analysis tools. As a showcase, we have analysed the gene co-expression networks of Mycobacterium tuberculosis generated using over 600 expression experiments. Regarding DNA damage repair, we identified SigC as a key control element, 12 new targets for LexA, an updated LexA binding motif, and a potential mismatch repair system. We expanded the DevR regulon with 27 genes while identifying 9 targets wrongly assigned to this regulon. We discovered 10 new genes linked to zinc uptake and a new regulatory mechanism for ZuR. The use of co-expression networks to perform system-level analysis allows the development of custom-made methodologies. As showcases, we implemented a pipeline to integrate ChIP-seq data and another method to uncover multiple regulatory layers. Our workflow is based on representing the multiple types of information as network representations and presenting these networks in a synchronous framework that allows their simultaneous visualization while keeping specific associations from the different networks. By simultaneously exploring these networks and metadata, we gained insights into regulatory mechanisms in M. tuberculosis that could not be obtained through the separate analysis of each data type.

  11. Approximation and inference methods for stochastic biochemical kinetics—a tutorial review

    NASA Astrophysics Data System (ADS)

    Schnoerr, David; Sanguinetti, Guido; Grima, Ramon

    2017-03-01

    Stochastic fluctuations of molecule numbers are ubiquitous in biological systems. Important examples include gene expression and enzymatic processes in living cells. Such systems are typically modelled as chemical reaction networks whose dynamics are governed by the chemical master equation. Despite its simple structure, no analytic solutions to the chemical master equation are known for most systems. Moreover, stochastic simulations are computationally expensive, making systematic analysis and statistical inference a challenging task. Consequently, significant effort has been spent in recent decades on the development of efficient approximation and inference methods. This article gives an introduction to basic modelling concepts as well as an overview of state-of-the-art methods. First, we motivate and introduce deterministic and stochastic methods for modelling chemical networks, and give an overview of simulation and exact solution methods. Next, we discuss several approximation methods, including the chemical Langevin equation, the system size expansion, moment closure approximations, time-scale separation approximations and hybrid methods. We discuss their various properties and review recent advances and remaining challenges for these methods. We present a comparison of several of these methods by means of a numerical case study and highlight some of their respective advantages and disadvantages. Finally, we discuss the problem of inference from experimental data in the Bayesian framework and review recent methods developed in the literature. In summary, this review gives a self-contained introduction to modelling, approximations and inference methods for stochastic chemical kinetics.
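
    As a concrete example of the stochastic simulation the review discusses, here is Gillespie's algorithm for the simplest gene expression model (constant production, linear degradation), with illustrative rate constants:

```python
# Gillespie's stochastic simulation algorithm for a birth-death process,
# the kind of system whose master equation the review analyses.
import numpy as np

rng = np.random.default_rng(10)
k_prod, k_deg = 10.0, 0.5        # production and degradation rates
t, x, t_end = 0.0, 0, 20.0
trajectory = [(t, x)]

while t < t_end:
    rates = np.array([k_prod, k_deg * x])
    total = rates.sum()
    t += rng.exponential(1.0 / total)        # waiting time to next reaction
    x += 1 if rng.random() < rates[0] / total else -1
    trajectory.append((t, x))

print(f"steady-state mean should be k_prod/k_deg = {k_prod / k_deg:.0f}; "
      f"final copy number x = {x}")
```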

  12. Improving Causal Inferences in Meta-analyses of Longitudinal Studies: Spanking as an Illustration.

    PubMed

    Larzelere, Robert E; Gunnoe, Marjorie Lindner; Ferguson, Christopher J

    2018-05-24

    To evaluate and improve the validity of causal inferences from meta-analyses of longitudinal studies, two adjustments for Time-1 outcome scores and a temporally backwards test are demonstrated. Causal inferences would be supported by robust results across both adjustment methods, distinct from results run backwards. A systematic strategy for evaluating potential confounds is also introduced. The methods are illustrated by assessing the impact of spanking on subsequent externalizing problems (child age: 18 months to 11 years). Significant results indicated a small risk or a small benefit of spanking, depending on the adjustment method. These meta-analytic methods are applicable for research on alternatives to spanking and other developmental science topics. The underlying principles can also improve causal inferences in individual studies. © 2018 Society for Research in Child Development.
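
    The two adjustment strategies can be sketched on simulated longitudinal data (all effect sizes invented): an ANCOVA-style model that regresses the Time-2 outcome on the predictor plus the Time-1 outcome, versus a change-score analysis. Divergence between the two estimates is the kind of non-robustness the paper warns about.

```python
# Two Time-1 adjustments on simulated longitudinal data; effects invented.
import numpy as np

rng = np.random.default_rng(11)
n = 500
t1 = rng.normal(0, 1, n)                        # externalizing at Time 1
spank = (t1 + rng.normal(0, 1, n) > 0).astype(float)   # confounded predictor
t2 = 0.6 * t1 + 0.1 * spank + rng.normal(0, 1, n)      # externalizing at Time 2

# ANCOVA adjustment: coefficient on `spank` controlling for Time-1 scores.
Xa = np.column_stack([np.ones(n), spank, t1])
print("ANCOVA estimate:", np.linalg.lstsq(Xa, t2, rcond=None)[0][1].round(3))

# Change-score analysis: regress (t2 - t1) on the predictor.
Xc = np.column_stack([np.ones(n), spank])
print("change-score estimate:",
      np.linalg.lstsq(Xc, t2 - t1, rcond=None)[0][1].round(3))
```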

  13. The effect of multiple encounters on short period comet orbits

    NASA Technical Reports Server (NTRS)

    Lowrey, B. E.

    1972-01-01

    The observed orbital elements of short period comets are found to be consistent with the hypothesis of derivation from long period comets, as long as two assumptions are made: first, that the distribution of short period comets has been randomized by multiple encounters with Jupiter, and second, that the short period comets have lower velocities of encounter with Jupiter than is generally expected. Some 16% of the observed short period comets have lower encounter velocities than is allowed mathematically using Laplace's method. This may be due to double encounter processes with Jupiter and Saturn, or to prolonged encounters. The distribution of unobservable short period comets can be inferred in part from the observed comets. Many have orbits between Jupiter and Saturn with somewhat higher inclinations than those with perihelia near the Earth. Debris from those comets may form the major component of the zodiacal dust.

  14. A Fast Numerical Method for Max-Convolution and the Application to Efficient Max-Product Inference in Bayesian Networks.

    PubMed

    Serang, Oliver

    2015-08-01

    Observations depending on sums of random variables are common throughout many fields; however, no efficient solution is currently known for performing max-product inference on these sums of general discrete distributions (max-product inference can be used to obtain maximum a posteriori estimates). The limiting step to max-product inference is the max-convolution problem (sometimes presented in log-transformed form and denoted as "infimal convolution," "min-convolution," or "convolution on the tropical semiring"), for which no O(k log(k)) method is currently known. Presented here is an O(k log(k)) numerical method for estimating the max-convolution of two nonnegative vectors (e.g., two probability mass functions), where k is the length of the larger vector. This numerical max-convolution method is then demonstrated by performing fast max-product inference on a convolution tree, a data structure for performing fast inference given information on the sum of n discrete random variables in O(nk log(nk)log(n)) steps (where each random variable has an arbitrary prior distribution on k contiguous possible states). The numerical max-convolution method can be applied to specialized classes of hidden Markov models to reduce the runtime of computing the Viterbi path from nk^2 to nk log(k), and has potential application to the all-pairs shortest paths problem.
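
    For reference, the naive O(k^2) max-convolution that the paper's O(k log(k)) numerical method approximates looks as follows:

```python
# Naive max-convolution baseline: out[k] = max over i of f[i] * g[k - i].
import numpy as np

def max_convolve(f, g):
    n = len(f) + len(g) - 1
    out = np.zeros(n)
    for k in range(n):
        lo, hi = max(0, k - len(g) + 1), min(k, len(f) - 1)
        out[k] = max(f[i] * g[k - i] for i in range(lo, hi + 1))
    return out

f = np.array([0.1, 0.6, 0.3])      # two probability mass functions
g = np.array([0.5, 0.5])
print(max_convolve(f, g))          # max-product "distribution" of the sum
```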

  15. Fast Realistic MRI Simulations Based on Generalized Multi-Pool Exchange Tissue Model.

    PubMed

    Liu, Fang; Velikina, Julia V; Block, Walter F; Kijowski, Richard; Samsonov, Alexey A

    2017-02-01

    We present MRiLab, a new comprehensive simulator for large-scale realistic MRI simulations on a regular PC equipped with a modern graphical processing unit (GPU). MRiLab combines realistic tissue modeling with numerical virtualization of an MRI system and scanning experiment to enable assessment of a broad range of MRI approaches including advanced quantitative MRI methods inferring microstructure on a sub-voxel level. A flexible representation of tissue microstructure is achieved in MRiLab by employing the generalized tissue model with multiple exchanging water and macromolecular proton pools rather than a system of independent proton isochromats typically used in previous simulators. The computational power needed for simulation of the biologically relevant tissue models in large 3D objects is gained using parallelized execution on GPU. Three simulated experiments and one actual MRI experiment were performed to demonstrate the ability of the new simulator to accommodate a wide variety of voxel composition scenarios and demonstrate detrimental effects of simplified treatment of tissue micro-organization adapted in previous simulators. GPU execution allowed ~200× improvement in computational speed over a standard CPU. As a cross-platform, open-source, extensible environment for customizing virtual MRI experiments, MRiLab streamlines the development of new MRI methods, especially those aiming to infer quantitatively tissue composition and microstructure.

  16. Fast Realistic MRI Simulations Based on Generalized Multi-Pool Exchange Tissue Model

    PubMed Central

    Velikina, Julia V.; Block, Walter F.; Kijowski, Richard; Samsonov, Alexey A.

    2017-01-01

    We present MRiLab, a new comprehensive simulator for large-scale realistic MRI simulations on a regular PC equipped with a modern graphical processing unit (GPU). MRiLab combines realistic tissue modeling with numerical virtualization of an MRI system and scanning experiment to enable assessment of a broad range of MRI approaches including advanced quantitative MRI methods inferring microstructure on a sub-voxel level. A flexible representation of tissue microstructure is achieved in MRiLab by employing the generalized tissue model with multiple exchanging water and macromolecular proton pools rather than a system of independent proton isochromats typically used in previous simulators. The computational power needed for simulation of the biologically relevant tissue models in large 3D objects is gained using parallelized execution on GPU. Three simulated experiments and one actual MRI experiment were performed to demonstrate the ability of the new simulator to accommodate a wide variety of voxel composition scenarios and demonstrate detrimental effects of simplified treatment of tissue micro-organization adapted in previous simulators. GPU execution allowed ~200× improvement in computational speed over a standard CPU. As a cross-platform, open-source, extensible environment for customizing virtual MRI experiments, MRiLab streamlines the development of new MRI methods, especially those aiming to infer quantitatively tissue composition and microstructure. PMID:28113746

  17. Automatic inference of geometric camera parameters and inter-camera topology in uncalibrated disjoint surveillance cameras

    NASA Astrophysics Data System (ADS)

    den Hollander, Richard J. M.; Bouma, Henri; Baan, Jan; Eendebak, Pieter T.; van Rest, Jeroen H. C.

    2015-10-01

    Person tracking across non-overlapping cameras and other types of video analytics benefit from spatial calibration information that allows an estimation of the distance between cameras and a relation between pixel coordinates and world coordinates within a camera. In a large environment with many cameras, or for frequent ad-hoc deployments of cameras, the cost of this calibration is high. This creates a barrier for the use of video analytics. Automating the calibration allows for a short configuration time, and the use of video analytics in a wider range of scenarios, including ad-hoc crisis situations and large scale surveillance systems. We show an autocalibration method based entirely on pedestrian detections in surveillance video from multiple non-overlapping cameras. In this paper, we present the two main components of automatic calibration. The first is intra-camera geometry estimation, which yields estimates of the tilt angle, focal length and camera height; this is important for the conversion from pixels to meters and vice versa. The second is inter-camera topology inference, which yields an estimate of the distance between cameras; this is important for spatio-temporal analysis of multi-camera tracking. This paper describes each of these methods and provides results on realistic video data.
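
    Once the intra-camera parameters are recovered, the pixel-to-meter conversion is elementary pinhole geometry; the sketch below uses made-up parameter values for a camera of assumed height and tilt.

```python
# Convert a pedestrian's foot position (pixel row) to a ground distance,
# assuming a pinhole camera at height h tilted down by `tilt_rad`.
import math

def ground_distance(row, principal_row, focal_px, cam_height_m, tilt_rad):
    """Distance along the ground to a point imaged at `row` (pixels)."""
    ray_below_horizon = tilt_rad + math.atan((row - principal_row) / focal_px)
    return cam_height_m / math.tan(ray_below_horizon)

# 1080p camera, ~1000 px focal length, mounted 4 m high, tilted 10 degrees.
for row in (700, 800, 900):
    d = ground_distance(row, 540, 1000.0, 4.0, math.radians(10))
    print(f"pixel row {row}: ~{d:.1f} m from camera")
```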

  18. Data-driven inference for the spatial scan statistic.

    PubMed

    Almeida, Alexandre C L; Duarte, Anderson R; Duczmal, Luiz H; Oliveira, Fernando L P; Takahashi, Ricardo H C

    2011-08-02

    Kulldorff's spatial scan statistic for aggregated area maps searches for clusters of cases without specifying their size (number of areas) or geographic location in advance. Their statistical significance is tested while adjusting for the multiple testing inherent in such a procedure. However, as is shown in this work, this adjustment is not done in an even manner for all possible cluster sizes. A modification is proposed to the usual inference test of the spatial scan statistic, incorporating additional information about the size of the most likely cluster found. A new interpretation of the results of the spatial scan statistic is proposed, posing a modified inference question: what is the probability that the null hypothesis is rejected for the original observed cases map with a most likely cluster of size k, taking into account only those most likely clusters of size k found under the null hypothesis for comparison? This question is especially important when the p-value computed by the usual inference process is near the alpha significance level, as it bears on the correctness of the decision based on this inference. A practical procedure is provided to make more accurate inferences about the most likely cluster found by the spatial scan statistic.
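
    A toy one-dimensional version of the size-conditioned inference is sketched below: the Monte-Carlo p-value of the scan statistic is computed against only those null replicates whose most likely cluster has the same size k as the observed one. The Poisson likelihood-ratio statistic over contiguous windows is a simplification of Kulldorff's spatial statistic, and all data are simulated.

```python
# Size-conditioned Monte-Carlo inference for a simplified 1-D scan statistic.
import numpy as np

rng = np.random.default_rng(12)
n_areas, mu = 30, 5.0                          # areas, expected cases per area

def best_cluster(counts):
    """Max Poisson log-likelihood-ratio and size over contiguous windows."""
    csum = np.concatenate([[0], np.cumsum(counts)])
    total, best, size = counts.sum(), -np.inf, 0
    for i in range(len(counts)):
        for j in range(i + 1, len(counts) + 1):
            c = csum[j] - csum[i]
            e = total * (j - i) / len(counts)  # expected share of all cases
            if e < c < total:                  # elevated, well-defined LLR
                llr = (c * np.log(c / e)
                       + (total - c) * np.log((total - c) / (total - e)))
                if llr > best:
                    best, size = llr, j - i
    return best, size

obs = rng.poisson(mu, n_areas)
obs[10:13] += 6                                # planted cluster of 3 areas
obs_llr, k = best_cluster(obs)
null = [best_cluster(rng.poisson(mu, n_areas)) for _ in range(999)]
same_k = [llr for llr, s in null if s == k]
p = (sum(llr >= obs_llr for llr in same_k) + 1) / (len(same_k) + 1)
print(f"most likely cluster size {k}, size-conditioned p = {p:.3f} "
      f"(based on {len(same_k)} null replicates of size {k})")
```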

  19. P values in display items are ubiquitous and almost invariably significant: A survey of top science journals

    PubMed Central

    Cristea, Ioana Alina; Ioannidis, John P. A.

    2018-01-01

    P values represent a widely used, but pervasively misunderstood and fiercely contested method of scientific inference. Display items, such as figures and tables, often containing the main results, are an important source of P values. We conducted a survey comparing the overall use of P values and the occurrence of significant P values in display items of a sample of articles in the three top multidisciplinary journals (Nature, Science, PNAS) in 2017 and, respectively, in 1997. We also examined the reporting of multiplicity corrections and its potential influence on the proportion of statistically significant P values. Our findings demonstrated substantial and growing reliance on P values in display items, with increases of 2.5 to 14.5 times in 2017 compared to 1997. The overwhelming majority of P values (94%, 95% confidence interval [CI] 92% to 96%) were statistically significant. Methods to adjust for multiplicity were almost non-existent in 1997, but reported in many articles relying on P values in 2017 (Nature 68%, Science 48%, PNAS 38%). In their absence, almost all reported P values were statistically significant (98%, 95% CI 96% to 99%). Conversely, when any multiplicity corrections were described, 88% (95% CI 82% to 93%) of reported P values were statistically significant. Use of Bayesian methods was scant (2.5%), and articles rarely (0.7%) relied exclusively on Bayesian statistics. Overall, wider appreciation of the need for multiplicity corrections is a welcome evolution, but the rapid growth of reliance on P values and implausibly high rates of reported statistical significance are worrisome. PMID:29763472

  20. P values in display items are ubiquitous and almost invariably significant: A survey of top science journals.

    PubMed

    Cristea, Ioana Alina; Ioannidis, John P A

    2018-01-01

    P values represent a widely used, but pervasively misunderstood and fiercely contested method of scientific inference. Display items, such as figures and tables, often containing the main results, are an important source of P values. We conducted a survey comparing the overall use of P values and the occurrence of significant P values in display items of a sample of articles in the three top multidisciplinary journals (Nature, Science, PNAS) in 2017 and, respectively, in 1997. We also examined the reporting of multiplicity corrections and its potential influence on the proportion of statistically significant P values. Our findings demonstrated substantial and growing reliance on P values in display items, with increases of 2.5 to 14.5 times in 2017 compared to 1997. The overwhelming majority of P values (94%, 95% confidence interval [CI] 92% to 96%) were statistically significant. Methods to adjust for multiplicity were almost non-existent in 1997, but reported in many articles relying on P values in 2017 (Nature 68%, Science 48%, PNAS 38%). In their absence, almost all reported P values were statistically significant (98%, 95% CI 96% to 99%). Conversely, when any multiplicity corrections were described, 88% (95% CI 82% to 93%) of reported P values were statistically significant. Use of Bayesian methods was scant (2.5%), and articles rarely (0.7%) relied exclusively on Bayesian statistics. Overall, wider appreciation of the need for multiplicity corrections is a welcome evolution, but the rapid growth of reliance on P values and implausibly high rates of reported statistical significance are worrisome.

  1. Comparing population structure as inferred from genealogical versus genetic information.

    PubMed

    Colonna, Vincenza; Nutile, Teresa; Ferrucci, Ronald R; Fardella, Giulio; Aversano, Mario; Barbujani, Guido; Ciullo, Marina

    2009-12-01

    Algorithms for inferring population structure from genetic data (i.e., population assignment methods) have been shown to effectively recognize genetic clusters in human populations. However, their performance in identifying groups of genealogically related individuals, especially in scantily differentiated populations, has not been tested empirically thus far. For this study, we had access to both genealogical and genetic data from two closely related, isolated villages in southern Italy. We found that nearly all living individuals were included in a single pedigree, with multiple inbreeding loops. Despite Fst between villages being a low 0.008, genetic clustering analysis identified two clusters roughly corresponding to the two villages. Average kinship between individuals (estimated from genealogies) increased with increasing values of group membership (estimated from the genetic data), showing that the observed genetic clusters represent individuals who are more closely related to each other than to random members of the population. Further, average kinship within clusters and Fst between clusters increase with increasingly stringent membership threshold requirements. We conclude that a limited number of genetic markers is sufficient to detect structuring, and that the results of genetic analyses faithfully mirror the structuring inferred from detailed analyses of population genealogies, even when Fst values are low, as in the case of the two villages. We then estimate the impact of observed levels of population structure on association studies using simulated data.
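
    The differentiation measure used above is easy to compute; this sketch evaluates Wright's Fst per locus from allele frequencies of two nearly identical simulated populations (not the villages' data).

```python
# Wright's Fst per locus: (Ht - Hs) / Ht from two populations' allele
# frequencies. The frequencies below are simulated to be nearly identical.
import numpy as np

rng = np.random.default_rng(13)
loci = 200
p1 = np.clip(rng.uniform(0.1, 0.9, loci), 0.01, 0.99)
p2 = np.clip(p1 + rng.normal(0, 0.03, loci), 0.01, 0.99)   # slight drift

p_bar = (p1 + p2) / 2
h_t = 2 * p_bar * (1 - p_bar)                 # expected total heterozygosity
h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
fst = ((h_t - h_s) / h_t).mean()
print(f"mean Fst across loci: {fst:.4f}")     # small, as in the study
```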

  2. Comparing population structure as inferred from genealogical versus genetic information

    PubMed Central

    Colonna, Vincenza; Nutile, Teresa; Ferrucci, Ronald R; Fardella, Giulio; Aversano, Mario; Barbujani, Guido; Ciullo, Marina

    2009-01-01

    Algorithms for inferring population structure from genetic data (i.e., population assignment methods) have been shown to effectively recognize genetic clusters in human populations. However, their performance in identifying groups of genealogically related individuals, especially in scantily differentiated populations, has not been tested empirically thus far. For this study, we had access to both genealogical and genetic data from two closely related, isolated villages in southern Italy. We found that nearly all living individuals were included in a single pedigree, with multiple inbreeding loops. Despite Fst between villages being a low 0.008, genetic clustering analysis identified two clusters roughly corresponding to the two villages. Average kinship between individuals (estimated from genealogies) increased with increasing values of group membership (estimated from the genetic data), showing that the observed genetic clusters represent individuals who are more closely related to each other than to random members of the population. Further, average kinship within clusters and Fst between clusters increase with increasingly stringent membership threshold requirements. We conclude that a limited number of genetic markers is sufficient to detect structuring, and that the results of genetic analyses faithfully mirror the structuring inferred from detailed analyses of population genealogies, even when Fst values are low, as in the case of the two villages. We then estimate the impact of observed levels of population structure on association studies using simulated data. PMID:19550436

  3. Clustering network layers with the strata multilayer stochastic block model.

    PubMed

    Stanley, Natalie; Shai, Saray; Taylor, Dane; Mucha, Peter J

    2016-01-01

    Multilayer networks are a useful data structure for simultaneously capturing multiple types of relationships between a set of nodes. In such networks, each relational definition gives rise to a layer. While each layer provides its own set of information, community structure across layers can be collectively utilized to discover and quantify underlying relational patterns between nodes. To concisely extract information from a multilayer network, we propose to identify and combine sets of layers with meaningful similarities in community structure. In this paper, we describe the "strata multilayer stochastic block model" (sMLSBM), a probabilistic model for multilayer community structure. The central extension of the model is that there exist groups of layers, called "strata", which are defined such that all layers in a given stratum have community structure described by a common stochastic block model (SBM). That is, layers in a stratum exhibit similar node-to-community assignments and SBM probability parameters. Fitting the sMLSBM to a multilayer network provides a joint clustering that yields node-to-community and layer-to-stratum assignments, which cooperatively aid one another during inference. We describe an algorithm for separating layers into their appropriate strata and an inference technique for estimating the SBM parameters for each stratum. We demonstrate our method using synthetic networks and a multilayer network inferred from data collected in the Human Microbiome Project.
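
    A reduced sketch of the strata idea follows: with node-to-community labels assumed known, each layer's SBM connection probabilities are estimated and the layers are clustered by those parameters. The joint re-estimation of communities performed by sMLSBM is not reproduced.

```python
# Cluster layers of a synthetic multilayer network by their estimated SBM
# block probabilities. Community labels are fixed and known for simplicity.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(14)
n = 40
comms = np.repeat([0, 1], 20)                 # fixed 2-community labels

def sample_layer(p_in, p_out):
    probs = np.where(comms[:, None] == comms[None, :], p_in, p_out)
    a = rng.random((n, n)) < probs
    return np.triu(a, 1)                      # undirected, no self-loops

def block_params(adj):
    same = comms[:, None] == comms[None, :]
    mask = np.triu(np.ones((n, n), bool), 1)
    return [adj[mask & same].mean(), adj[mask & ~same].mean()]

# Two strata: 3 assortative layers and 3 near-random layers.
layers = ([sample_layer(0.6, 0.1) for _ in range(3)]
          + [sample_layer(0.3, 0.25) for _ in range(3)])
params = np.array([block_params(a) for a in layers])
strata = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(params)
print("layer-to-stratum assignments:", strata)
```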

  4. Mapping causal functional contributions derived from the clinical assessment of brain damage after stroke

    PubMed Central

    Zavaglia, Melissa; Forkert, Nils D.; Cheng, Bastian; Gerloff, Christian; Thomalla, Götz; Hilgetag, Claus C.

    2015-01-01

    Lesion analysis reveals causal contributions of brain regions to mental functions, aiding the understanding of normal brain function as well as rehabilitation of brain-damaged patients. We applied a novel lesion inference technique based on game theory, Multi-perturbation Shapley value Analysis (MSA), to a large clinical lesion dataset. We used MSA to analyze the lesion patterns of 148 acute stroke patients together with their neurological deficits, as assessed by the National Institutes of Health Stroke Scale (NIHSS). The results revealed regional functional contributions to essential behavioral and cognitive functions as reflected in the NIHSS, particularly by subcortical structures. There were also side specific differences of functional contributions between the right and left hemispheric brain regions which may reflect the dominance of the left hemispheric syndrome aphasia in the NIHSS. Comparison of MSA to established lesion inference methods demonstrated the feasibility of the approach for analyzing clinical data and indicated its capability for objectively inferring functional contributions from multiple injured, potentially interacting sites, at the cost of having to predict the outcome of unknown lesion configurations. The analysis of regional functional contributions to neurological symptoms measured by the NIHSS contributes to the interpretation of this widely used standardized stroke scale in clinical practice as well as clinical trials and provides a first approximation of a ‘map of stroke’. PMID:26448908
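
    The game-theoretic core of MSA is the Shapley value, which can be estimated by Monte Carlo as the average marginal contribution of each region over random orderings; the performance function below is an invented stand-in for a model fitted to lesion/NIHSS data.

```python
# Monte-Carlo Shapley value estimation over a toy "brain performance"
# function; regions 0 and 1 interact, mimicking multi-site contributions.
import numpy as np

rng = np.random.default_rng(15)
regions = list(range(4))

def performance(intact):
    """Toy performance of a brain with the given set of intact regions."""
    base = 0.2 * len(intact)
    synergy = 1.0 if {0, 1} <= set(intact) else 0.0
    return base + synergy

shapley = np.zeros(len(regions))
n_perm = 5000
for _ in range(n_perm):
    order = rng.permutation(regions)
    intact, prev = [], performance([])
    for r in order:
        intact.append(r)
        cur = performance(intact)
        shapley[r] += cur - prev               # marginal contribution of r
        prev = cur
shapley /= n_perm
print("estimated regional contributions:", shapley.round(3))
```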

  5. A grammar inference approach for predicting kinase specific phosphorylation sites.

    PubMed

    Datta, Sutapa; Mukhopadhyay, Subhasis

    2015-01-01

    Kinase-mediated phosphorylation is a key post-translational modification that plays an important role in regulating various cellular processes and phenotypes. Many diseases, such as cancer, are related to signaling defects associated with protein phosphorylation. Characterizing protein kinases and their substrates enhances our ability to understand the mechanism of protein phosphorylation and extends our knowledge of signaling networks, thereby helping us to treat such diseases. Experimental methods for identifying phosphorylation sites are labour intensive and expensive. Also, the manifold increase of protein sequences in the databanks over the years necessitates fast and accurate computational methods for predicting phosphorylation sites in protein sequences. To date, a number of computational methods have been proposed by various researchers for predicting phosphorylation sites, but there remains much scope for improvement. In this communication, we present a simple and novel method based on a Grammatical Inference (GI) approach to automate the prediction of kinase-specific phosphorylation sites. In this regard, we have used the popular GI algorithm Alergia to infer Deterministic Stochastic Finite State Automata (DSFA), which represent the regular grammar corresponding to the phosphorylation sites. Extensive experiments on several datasets generated by us reveal that our inferred grammar successfully predicts phosphorylation sites in a kinase-specific manner. It performs significantly better when compared with the other existing phosphorylation site prediction methods. We have also compared our inferred DSFA with those produced by two other GI algorithms; the DSFA generated by our method performs better, which indicates that our method is robust and has potential for predicting phosphorylation sites in a kinase-specific manner.
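
    As context for the Alergia algorithm mentioned above, the sketch below shows the Hoeffding-bound compatibility test at its core, which decides whether two automaton states are statistically indistinguishable and may be merged; the prefix-tree construction and recursive merging are omitted, and the function name is ours.

    ```python
    import math

    def alergia_compatible(f1, n1, f2, n2, alpha=0.05):
        """True if observed frequencies f1/n1 and f2/n2 agree within the
        Hoeffding bound used by Alergia (alpha = significance parameter)."""
        if n1 == 0 or n2 == 0:
            return True
        bound = (math.sqrt(1.0 / n1) + math.sqrt(1.0 / n2)) \
                * math.sqrt(0.5 * math.log(2.0 / alpha))
        return abs(f1 / n1 - f2 / n2) < bound
    ```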

  6. Formalizing the role of agent-based modeling in causal inference and epidemiology.

    PubMed

    Marshall, Brandon D L; Galea, Sandro

    2015-01-15

    Calls for the adoption of complex systems approaches, including agent-based modeling, in the field of epidemiology have largely centered on the potential for such methods to examine complex disease etiologies, which are characterized by feedback behavior, interference, threshold dynamics, and multiple interacting causal effects. However, considerable theoretical and practical issues impede the capacity of agent-based methods to examine and evaluate causal effects and thus illuminate new areas for intervention. We build on this work by describing how agent-based models can be used to simulate counterfactual outcomes in the presence of complexity. We show that these models are of particular utility when the hypothesized causal mechanisms exhibit a high degree of interdependence between multiple causal effects and when interference (i.e., one person's exposure affects the outcome of others) is present and of intrinsic scientific interest. Although not without challenges, agent-based modeling (and complex systems methods broadly) represent a promising novel approach to identify and evaluate complex causal effects, and they are thus well suited to complement other modern epidemiologic methods of etiologic inquiry. © The Author 2014. Published by Oxford University Press on behalf of the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
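
    To make the counterfactual-simulation idea concrete, here is a toy sketch (every number invented) in which each agent's outcome depends on its neighbours' exposures, so the population-level causal contrast cannot be obtained by summing independent individual effects.

    ```python
    import random

    def simulate(n=1000, exposed_frac=0.5, seed=1):
        """Toy ABM: infection risk has a direct term and a neighbour spillover."""
        rng = random.Random(seed)
        exposed = [rng.random() < exposed_frac for _ in range(n)]
        cases = 0
        for i in range(n):
            neighbours = [(i - 1) % n, (i + 1) % n]           # ring contact network
            spillover = sum(exposed[j] for j in neighbours)   # interference term
            risk = 0.05 + (0.10 if exposed[i] else 0.0) + 0.03 * spillover
            cases += rng.random() < risk
        return cases / n

    # Counterfactual contrast: everyone exposed vs. no one exposed.
    effect = simulate(exposed_frac=1.0) - simulate(exposed_frac=0.0)
    ```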

  7. Accelerated Profile HMM Searches

    PubMed Central

    Eddy, Sean R.

    2011-01-01

    Profile hidden Markov models (profile HMMs) and probabilistic inference methods have made important contributions to the theory of sequence database homology search. However, practical use of profile HMM methods has been hindered by the computational expense of existing software implementations. Here I describe an acceleration heuristic for profile HMMs, the “multiple segment Viterbi” (MSV) algorithm. The MSV algorithm computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment. MSV scores follow the same statistical distribution as gapped optimal local alignment scores, allowing rapid evaluation of significance of an MSV score and thus facilitating its use as a heuristic filter. I also describe a 20-fold acceleration of the standard profile HMM Forward/Backward algorithms using a method I call “sparse rescaling”. These methods are assembled in a pipeline in which high-scoring MSV hits are passed on for reanalysis with the full HMM Forward/Backward algorithm. This accelerated pipeline is implemented in the freely available HMMER3 software package. Performance benchmarks show that the use of the heuristic MSV filter sacrifices negligible sensitivity compared to unaccelerated profile HMM searches. HMMER3 is substantially more sensitive and 100- to 1000-fold faster than HMMER2. HMMER3 is now about as fast as BLAST for protein searches. PMID:22039361
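
    The "optimal sum of multiple ungapped segments" can be computed with a simple dynamic program; the scalar sketch below is ours (not HMMER3's probabilistic, striped SIMD implementation) and scores one diagonal of match scores with a fixed penalty per segment opened.

    ```python
    def multi_segment_score(scores, seg_penalty=2.0):
        """Best total over disjoint ungapped segments along one diagonal,
        paying seg_penalty each time a new segment is opened."""
        best_open = float("-inf")   # best solution whose last segment ends here
        best_closed = 0.0           # best solution with all segments closed
        for s in scores:
            best_open = max(best_open + s, best_closed - seg_penalty + s)
            best_closed = max(best_closed, best_open)
        return best_closed
    ```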

  8. Multi-objective Decision Based Available Transfer Capability in Deregulated Power System Using Heuristic Approaches

    NASA Astrophysics Data System (ADS)

    Pasam, Gopi Krishna; Manohar, T. Gowri

    2016-09-01

    Determination of available transfer capability (ATC) requires experience, intuition and exact judgment in order to meet several significant aspects of the deregulated environment. On this basis, this paper proposes two heuristic approaches to compute ATC. The first proposed heuristic algorithm integrates five methods, namely continuation repeated power flow, repeated optimal power flow, radial basis function neural network, back-propagation neural network and adaptive neuro-fuzzy inference system, to obtain ATC. The second proposed heuristic model is used to obtain multiple ATC values, from which a specific ATC value is selected based on social, economic and deregulated environmental constraints and on specific applications such as optimization, on-line monitoring and ATC forecasting; this is termed multi-objective decision based optimal ATC. The validity of the results obtained through the proposed methods is scrupulously verified on various buses of the IEEE 24-bus reliable test system. The results and conclusions presented in this paper are useful for planning, operating and maintaining reliable power in any power system, as well as for on-line monitoring in a deregulated environment. In this way, the proposed heuristic methods provide a practical approach to assessing multi-objective ATC using integrated methods.

  9. Multiple Kernel Learning with Data Augmentation

    DTIC Science & Technology

    2016-11-22

    model, we further show how to make inference and learn parameters in the following section. Note that we still need hyper-parameters (i.e., κ0, θ0, µ0...parameter α and sparsity tuning parameter β, these hyper-parameters are not sensitive to data. As in the BEMKL method, we fix the values of these... hyper-parameters for all datasets. [Graphical-model residue; the recoverable priors are α ∼ G(κ0, θ0), β ∼ N(µ0, σ0²), with (w_mf, λ_mf) | α, β and y_n | x_n, w_mf as given by Equation (8).]

  10. Online Emotional Inferences in Written and Auditory Texts: A Study with Children and Adults

    ERIC Educational Resources Information Center

    Diergarten, Anna Katharina; Nieding, Gerhild

    2016-01-01

    Emotional inferences are conclusions that a reader draws about the emotional state of a story's protagonist. In this study, we examined whether children and adults draw emotional inferences while reading short stories or listening to an aural presentation of short stories. We used an online method that assesses inferences during reading with a…

  11. Bayesian pedigree inference with small numbers of single nucleotide polymorphisms via a factor-graph representation.

    PubMed

    Anderson, Eric C; Ng, Thomas C

    2016-02-01

    We develop a computational framework for addressing pedigree inference problems using small numbers (80-400) of single nucleotide polymorphisms (SNPs). Our approach relaxes the assumptions, which are commonly made, that sampling is complete with respect to the pedigree and that there is no genotyping error. It relies on representing the inferred pedigree as a factor graph and invoking the Sum-Product algorithm to compute and store quantities that allow the joint probability of the data to be rapidly computed under a large class of rearrangements of the pedigree structure. This allows efficient MCMC sampling over the space of pedigrees, and, hence, Bayesian inference of pedigree structure. In this paper we restrict ourselves to inference of pedigrees without loops using SNPs assumed to be unlinked. We present the methodology in general for multigenerational inference, and we illustrate the method by applying it to the inference of full sibling groups in a large sample (n=1157) of Chinook salmon typed at 95 SNPs. The results show that our method provides a better point estimate and estimate of uncertainty than the currently best-available maximum-likelihood sibling reconstruction method. Extensions of this work to more complex scenarios are briefly discussed. Published by Elsevier Inc.
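
    The sum-product computation on a pedigree factor graph reduces, at each unlinked SNP, to sums over unobserved founder genotypes. As a hedged illustration (the allele frequency p and error rate eps are invented), the sketch below computes the likelihood that a set of observed genotypes comes from full siblings by marginalizing over the two parents.

    ```python
    import itertools

    GENOS = (0, 1, 2)  # copies of the reference allele

    def prior(g, p):                       # Hardy-Weinberg founder factor
        return [(1 - p) ** 2, 2 * p * (1 - p), p * p][g]

    def transmit(gp):                      # P(transmitted allele | parent genotype)
        return {0: (1.0, 0.0), 1: (0.5, 0.5), 2: (0.0, 1.0)}[gp]

    def mendel(gc, gm, gf):                # P(child genotype | parent genotypes)
        pm, pf = transmit(gm), transmit(gf)
        return sum(pm[a] * pf[b] for a in (0, 1) for b in (0, 1) if a + b == gc)

    def obs(g_true, g_obs, eps):           # genotyping-error factor
        return 1.0 - eps if g_true == g_obs else eps / 2.0

    def full_sib_likelihood(obs_kids, p=0.3, eps=0.01):
        """P(observed genotypes | kids are full sibs), summing out the parents."""
        total = 0.0
        for gm, gf in itertools.product(GENOS, repeat=2):
            term = prior(gm, p) * prior(gf, p)
            for gk in obs_kids:
                term *= sum(mendel(g, gm, gf) * obs(g, gk, eps) for g in GENOS)
            total += term
        return total
    ```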

  12. Soft sensor modeling based on variable partition ensemble method for nonlinear batch processes

    NASA Astrophysics Data System (ADS)

    Wang, Li; Chen, Xiangguang; Yang, Kai; Jin, Huaiping

    2017-01-01

    Batch processes are typically characterized by nonlinearity and system uncertainty, so a conventional single model may be ill-suited. A soft sensor using a local learning strategy, based on a variable partition ensemble method, is developed for quality prediction in nonlinear and non-Gaussian batch processes. A set of input variable subsets is obtained by bootstrapping and the PMI criterion. Multiple local GPR models are then developed, one for each local input variable subset. When a new test sample arrives, the posterior probability of each best-performing local model is estimated by Bayesian inference and used to combine the local GPR models into the final prediction. The proposed soft sensor is demonstrated on an industrial fed-batch chlortetracycline fermentation process.
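
    As a rough sketch of the ensemble step (not the authors' implementation), the class below trains one scikit-learn GPR per input-variable subset and combines predictions with precision weights derived from each model's predictive variance, a simple stand-in for the Bayesian posterior weights described above.

    ```python
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    class VariablePartitionEnsemble:
        """Local GPR models on different variable subsets, combined by weights
        inversely proportional to each model's predictive variance."""
        def __init__(self, subsets):
            self.subsets = subsets                        # list of column-index lists
            self.models = [GaussianProcessRegressor() for _ in subsets]

        def fit(self, X, y):
            for m, cols in zip(self.models, self.subsets):
                m.fit(X[:, cols], y)
            return self

        def predict(self, X):
            mus, sigmas = zip(*(m.predict(X[:, cols], return_std=True)
                                for m, cols in zip(self.models, self.subsets)))
            w = 1.0 / (np.array(sigmas) ** 2 + 1e-12)     # precision weights
            return (w * np.array(mus)).sum(axis=0) / w.sum(axis=0)
    ```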

  13. Extinction spectra of suspensions of microspheres: determination of the spectral refractive index and particle size distribution with nanometer accuracy.

    PubMed

    Gienger, Jonas; Bär, Markus; Neukammer, Jörg

    2018-01-10

    A method is presented to infer simultaneously the wavelength-dependent real refractive index (RI) of the material of microspheres and their size distribution from extinction measurements of particle suspensions. To derive the averaged spectral optical extinction cross section of the microspheres from such ensemble measurements, we determined the particle concentration by flow cytometry to an accuracy of typically 2% and adjusted the particle concentration to ensure that perturbations due to multiple scattering are negligible. For analysis of the extinction spectra, we employ Mie theory, a series-expansion representation of the refractive index and nonlinear numerical optimization. In contrast to other approaches, our method offers the advantage of simultaneously determining the size, size distribution, and spectral refractive index of ensembles of microparticles, including uncertainty estimation.

  14. Retrieval practice with short-answer, multiple-choice, and hybrid tests.

    PubMed

    Smith, Megan A; Karpicke, Jeffrey D

    2014-01-01

    Retrieval practice improves meaningful learning, and the most frequent way of implementing retrieval practice in classrooms is to have students answer questions. In four experiments (N=372) we investigated the effects of different question formats on learning. Students read educational texts and practised retrieval by answering short-answer, multiple-choice, or hybrid questions. In hybrid conditions students first attempted to recall answers in short-answer format, then identified answers in multiple-choice format. We measured learning 1 week later using a final assessment with two types of questions: those that could be answered by recalling information verbatim from the texts and those that required inferences. Practising retrieval in all format conditions enhanced retention, relative to a study-only control condition, on both verbatim and inference questions. However, there were little or no advantages of answering short-answer or hybrid format questions over multiple-choice questions in three experiments. In Experiment 4, when retrieval success was improved under initial short-answer conditions, there was an advantage of answering short-answer or hybrid questions over multiple-choice questions. The results challenge the simple conclusion that short-answer questions always produce the best learning, due to increased retrieval effort or difficulty, and demonstrate the importance of retrieval success for retrieval-based learning activities.

  15. Optimization of analytical parameters for inferring relationships among Escherichia coli isolates from repetitive-element PCR by maximizing correspondence with multilocus sequence typing data.

    PubMed

    Goldberg, Tony L; Gillespie, Thomas R; Singer, Randall S

    2006-09-01

    Repetitive-element PCR (rep-PCR) is a method for genotyping bacteria based on the selective amplification of repetitive genetic elements dispersed throughout bacterial chromosomes. The method has great potential for large-scale epidemiological studies because of its speed and simplicity; however, objective guidelines for inferring relationships among bacterial isolates from rep-PCR data are lacking. We used multilocus sequence typing (MLST) as a "gold standard" to optimize the analytical parameters for inferring relationships among Escherichia coli isolates from rep-PCR data. We chose 12 isolates from a large database to represent a wide range of pairwise genetic distances, based on the initial evaluation of their rep-PCR fingerprints. We conducted MLST with these same isolates and systematically varied the analytical parameters to maximize the correspondence between the relationships inferred from rep-PCR and those inferred from MLST. Methods that compared the shapes of densitometric profiles ("curve-based" methods) yielded consistently higher correspondence values between data types than did methods that calculated indices of similarity based on shared and different bands (maximum correspondences of 84.5% and 80.3%, respectively). Curve-based methods were also markedly more robust in accommodating variations in user-specified analytical parameter values than were "band-sharing coefficient" methods, and they enhanced the reproducibility of rep-PCR. Phylogenetic analyses of rep-PCR data yielded trees with high topological correspondence to trees based on MLST and high statistical support for major clades. These results indicate that rep-PCR yields accurate information for inferring relationships among E. coli isolates and that accuracy can be enhanced with the use of analytical methods that consider the shapes of densitometric profiles.
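
    A "curve-based" comparison treats each lane's whole densitometric trace as a curve and scores similarity by correlation rather than by shared bands. A minimal sketch (function names ours):

    ```python
    import numpy as np

    def curve_distance(a, b):
        """1 - Pearson r between two equal-length densitometric profiles."""
        za = (a - a.mean()) / a.std()
        zb = (b - b.mean()) / b.std()
        return 1.0 - float(np.mean(za * zb))

    def distance_matrix(profiles):
        """Pairwise curve distances; feed the result to a tree-building method."""
        n = len(profiles)
        d = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                d[i, j] = d[j, i] = curve_distance(profiles[i], profiles[j])
        return d
    ```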

  16. What time is it? Deep learning approaches for circadian rhythms.

    PubMed

    Agostinelli, Forest; Ceglia, Nicholas; Shahbaba, Babak; Sassone-Corsi, Paolo; Baldi, Pierre

    2016-06-15

    Circadian rhythms date back to the origins of life, are found in virtually every species and every cell, and play fundamental roles in functions ranging from metabolism to cognition. Modern high-throughput technologies allow the measurement of concentrations of transcripts, metabolites and other species along the circadian cycle, creating novel computational challenges and opportunities, including the problems of inferring whether a given species oscillates in a circadian fashion or not, and inferring the time at which a set of measurements was taken. We first curate several large synthetic and biological time series datasets containing labels for both periodic and aperiodic signals. We then use deep learning methods to develop and train BIO_CYCLE, a system to robustly estimate which signals are periodic in high-throughput circadian experiments, producing estimates of amplitudes, periods, phases, as well as several statistical significance measures. Using the curated data, BIO_CYCLE is compared to other approaches and shown to achieve state-of-the-art performance across multiple metrics. We then use deep learning methods to develop and train BIO_CLOCK to robustly estimate the time at which a particular single-time-point transcriptomic experiment was carried out. In most cases, BIO_CLOCK can reliably predict time, within approximately 1 h, using the expression levels of only a small number of core clock genes. BIO_CLOCK is shown to work reasonably well across tissue types, and often with only small degradation across conditions. BIO_CLOCK is used to annotate most mouse experiments found in the GEO database with an inferred time stamp. All data and software are publicly available on the CircadiOmics web portal: circadiomics.igb.uci.edu/. Contact: fagostin@uci.edu or pfbaldi@uci.edu. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
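
    BIO_CYCLE itself is a trained deep network, but the question it answers can be illustrated with a classical baseline: a cosinor regression that tests for a ~24 h rhythm and recovers amplitude and peak time. A minimal sketch (ours, not the authors' model):

    ```python
    import numpy as np

    def cosinor(t_hours, y, period=24.0):
        """Least-squares fit of y ~ m + a*cos(wt) + b*sin(wt).
        Returns (amplitude, acrophase in hours)."""
        w = 2 * np.pi / period
        X = np.column_stack([np.ones_like(t_hours),
                             np.cos(w * t_hours), np.sin(w * t_hours)])
        m, a, b = np.linalg.lstsq(X, y, rcond=None)[0]
        return np.hypot(a, b), (np.arctan2(b, a) / w) % period
    ```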

  17. Missing data imputation and haplotype phase inference for genome-wide association studies

    PubMed Central

    Browning, Sharon R.

    2009-01-01

    Imputation of missing data and the use of haplotype-based association tests can improve the power of genome-wide association studies (GWAS). In this article, I review methods for haplotype inference and missing data imputation, and discuss their application to GWAS. I discuss common features of the best algorithms for haplotype phase inference and missing data imputation in large-scale data sets, as well as some important differences between classes of methods, and highlight the methods that provide the highest accuracy and fastest computational performance. PMID:18850115

  1. Method of fuzzy inference for one class of MISO-structure systems with non-singleton inputs

    NASA Astrophysics Data System (ADS)

    Sinuk, V. G.; Panchenko, M. V.

    2018-03-01

    In fuzzy modeling, the inputs of the simulated system can receive both crisp (singleton) and non-singleton values. The computational complexity of fuzzy inference with fuzzy non-singleton inputs is exponential. This paper describes a new inference method based on a theorem on the decomposition of a multidimensional fuzzy implication and on the fuzzy truth value. The method handles fuzzy inputs with polynomial complexity, which makes it possible to model large-dimensional MISO-structure systems.
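
    The difference between singleton and non-singleton inputs can be made concrete: with a fuzzy input A', the firing degree of an antecedent A is sup_x min(A'(x), A(x)), which collapses to A(x0) for a crisp input x0. A generic grid-based sketch (not the authors' decomposition method):

    ```python
    import numpy as np

    def tri(a, b, c):
        """Triangular membership function on [a, c] peaking at b."""
        return lambda x: np.clip(np.minimum((x - a) / (b - a),
                                            (c - x) / (c - b)), 0.0, 1.0)

    def firing_degree(mu_input, mu_antecedent, grid):
        """sup-min compatibility of a fuzzy input with a rule antecedent."""
        return float(np.max(np.minimum(mu_input(grid), mu_antecedent(grid))))

    grid = np.linspace(0.0, 10.0, 1001)
    w = firing_degree(tri(3, 4, 5), tri(2, 5, 8), grid)  # non-singleton activation
    ```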

  2. A machine-learning approach reveals that alignment properties alone can accurately predict inference of lateral gene transfer from discordant phylogenies.

    PubMed

    Roettger, Mayo; Martin, William; Dagan, Tal

    2009-09-01

    Among the methods currently used in phylogenomic practice to detect the presence of lateral gene transfer (LGT), one of the most frequently employed is the comparison of gene tree topologies for different genes. In cases where the phylogenies for different genes are incompatible, or discordant, for well-supported branches there are three simple interpretations for the result: 1) gene duplications (paralogy) followed by many independent gene losses have occurred, 2) LGT has occurred, or 3) the phylogeny is well supported but for reasons unknown is nonetheless incorrect. Here, we focus on the third possibility by examining the properties of 22,437 published multiple sequence alignments, the Bayesian maximum likelihood trees for which either do or do not suggest the occurrence of LGT by the criterion of discordant branches. The alignments that produce discordant phylogenies differ significantly in several salient alignment properties from those that do not. Using a support vector machine, we were able to predict the inference of discordant tree topologies with up to 80% accuracy from alignment properties alone.
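
    The experimental set-up lends itself to a few lines of scikit-learn; in the sketch below the feature matrix is assumed to hold alignment summary properties (e.g. length, gap fraction, mean pairwise identity; the exact properties used in the paper differ) and the label marks alignments whose trees were discordant.

    ```python
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def discordance_classifier(X, y):
        """X: (n_alignments, n_properties); y: 1 = discordant topology."""
        clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
        print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
        return clf.fit(X, y)
    ```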

  3. Co-activation patterns in resting-state fMRI signals.

    PubMed

    Liu, Xiao; Zhang, Nanyin; Chang, Catie; Duyn, Jeff H

    2018-02-08

    The brain is a complex system that integrates and processes information across multiple time scales by dynamically coordinating activities over brain regions and circuits. Correlations in resting-state functional magnetic resonance imaging (rsfMRI) signals have been widely used to infer functional connectivity of the brain, providing a metric of functional associations that reflects a temporal average over an entire scan (typically several minutes or longer). Not until recently was the study of dynamic brain interactions at much shorter time scales (seconds to minutes) considered for inference of functional connectivity. One method proposed for this objective seeks to identify and extract recurring co-activation patterns (CAPs) that represent instantaneous brain configurations at single time points. Here, we review the development and recent advancement of CAP methodology and other closely related approaches, as well as their applications and associated findings. We also discuss the potential neural origins and behavioral relevance of CAPs, along with methodological issues and future research directions in the analysis of fMRI co-activation patterns. Copyright © 2018 Elsevier Inc. All rights reserved.
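
    The core CAP recipe is short enough to sketch: keep only the time frames where a seed region's signal is high, then cluster those whole-brain frames; each cluster centroid is one co-activation pattern. A minimal version (parameter values are illustrative):

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    def extract_caps(data, seed_ts, n_caps=8, top_frac=0.15):
        """data: (n_timepoints, n_voxels) z-scored rsfMRI; seed_ts: seed signal."""
        thresh = np.quantile(seed_ts, 1.0 - top_frac)
        frames = data[seed_ts >= thresh]        # supra-threshold frames only
        km = KMeans(n_clusters=n_caps, n_init=10).fit(frames)
        return km.cluster_centers_              # one spatial map per CAP
    ```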

  4. Phylogenetic trees in bioinformatics

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Burr, Tom L

    2008-01-01

    Genetic data is often used to infer evolutionary relationships among a collection of viruses, bacteria, animal or plant species, or other operational taxonomic units (OTU). A phylogenetic tree depicts such relationships and provides a visual representation of the estimated branching order of the OTUs. Tree estimation is unique for several reasons, including: the types of data used to represent each OTU; the use of probabilistic nucleotide substitution models; the inference goals involving both tree topology and branch length; and the huge number of possible trees for a given sample of a very modest number of OTUs, which implies that finding the best tree(s) to describe the genetic data for each OTU is computationally demanding. Bioinformatics is too large a field to review here. We focus on that aspect of bioinformatics that includes study of similarities in genetic data from multiple OTUs. Although research questions are diverse, a common underlying challenge is to estimate the evolutionary history of the OTUs. Therefore, this paper reviews the role of phylogenetic tree estimation in bioinformatics, available methods and software, and identifies areas for additional research and development.

  5. Comparison of an adaptive neuro-fuzzy inference system and an artificial neural network in the cross-talk correction of simultaneous 99mTc/201Tl SPECT imaging using a GATE Monte-Carlo simulation

    NASA Astrophysics Data System (ADS)

    Heidary, Saeed; Setayeshi, Saeed; Ghannadi-Maragheh, Mohammad

    2014-09-01

    The aim of this study is to compare the adaptive neuro-fuzzy inference system (ANFIS) and the artificial neural network (ANN) to estimate the cross-talk contamination of 99mTc/201Tl image acquisition in the 201Tl energy window (77 keV ± 15%). GATE (Geant4 Application in Emission and Tomography) is employed due to its ability to simulate multiple radioactive sources concurrently. Two kinds of phantoms, including two digital and one physical phantom, are used. In the real and the simulation studies, data acquisition is carried out using eight energy windows. The ANN and the ANFIS are prepared in MATLAB, and the GATE results are used as a training data set. Three indications are evaluated and compared. The ANFIS method yields better outcomes for two indications (Spearman's rank correlation coefficient and contrast) and the two phantom results in each category. The maximum image biasing, which is the third indication, is found to be 6% more than that for the ANN.

  6. The first iguanian lizard from the Mesozoic of Africa

    NASA Astrophysics Data System (ADS)

    Apesteguía, Sebastián; Daza, Juan D.; Simões, Tiago R.; Rage, Jean Claude

    2016-09-01

    The fossil record shows that iguanian lizards were widely distributed during the Late Cretaceous. However, the biogeographic history and early evolution of one of their most diverse and peculiar clades (acrodontans) remain poorly known. Here, we present the first Mesozoic acrodontan from Africa, which also represents the oldest iguanian lizard from that continent. The new taxon comes from the Kem Kem Beds in Morocco (Cenomanian, Late Cretaceous) and is based on a partial lower jaw. The new taxon presents a number of features that are found only among acrodontan lizards and shares greatest similarities with uromastycines specifically. Using a combined-evidence phylogenetic dataset comprising all major acrodontan lineages and multiple tree inference methods (traditional and implied-weighting maximum parsimony, and Bayesian inference), we found support for the placement of the new species within uromastycines, along with Gueragama sulamericana (Late Cretaceous of Brazil). The new fossil supports the previously hypothesized widespread geographical distribution of acrodontans in Gondwana during the Mesozoic. Additionally, it provides the first fossil evidence of uromastycines in the Cretaceous, and the ancestry of acrodontan iguanians in Africa.

  7. Corpus Callosum Function in Verbal Dichotic Listening: Inferences from a Longitudinal Follow-Up of Relapsing-Remitting Multiple Sclerosis Patients

    ERIC Educational Resources Information Center

    Gadea, Marien; Marti-Bonmati, Luis; Arana, Estanislao; Espert, Raul; Salvador, Alicia; Casanova, Bonaventura

    2009-01-01

    This study conducted a follow-up of 13 early-onset, slightly disabled Relapsing-Remitting Multiple Sclerosis (RRMS) patients within a year, evaluating both CC area measurements in a midsagittal Magnetic Resonance (MR) image, and Dichotic Listening (DL) testing with stop consonant-vowel (C-V) syllables. Patients showed a significant progressive…

  8. Designing a Stage-Sensitive Written Assessment of Elementary Students' Scheme for Multiplicative Reasoning

    ERIC Educational Resources Information Center

    Hodkowski, Nicola M.; Gardner, Amber; Jorgensen, Cody; Hornbein, Peter; Johnson, Heather L.; Tzur, Ron

    2016-01-01

    In this paper we examine the application of Tzur's (2007) fine-grained assessment to the design of an assessment measure of a particular multiplicative scheme so that non-interview, good enough data can be obtained (on a large scale) to infer into elementary students' reasoning. We outline three design principles that surfaced through our recent…

  9. Inverse Bayesian inference as a key of consciousness featuring a macroscopic quantum logical structure.

    PubMed

    Gunji, Yukio-Pegio; Shinohara, Shuji; Haruna, Taichi; Basios, Vasileios

    2017-02-01

    To overcome the dualism between mind and matter and to implement consciousness in science, a physical entity has to be embedded with a measurement process. Although quantum mechanics has been regarded as a candidate for implementing consciousness, nature at its macroscopic level is inconsistent with quantum mechanics. We propose a measurement-oriented inference system comprising Bayesian and inverse Bayesian inferences. While Bayesian inference contracts probability space, the newly defined inverse one relaxes the space. These two inferences allow an agent to make a decision corresponding to an immediate change in their environment. They generate a particular pattern of joint probability for data and hypotheses, comprising multiple diagonal and noisy matrices. This is expressed as a nondistributive orthomodular lattice equivalent to quantum logic. We also show that an orthomodular lattice can reveal information generated by inverse syllogism as well as the solutions to the frame and symbol-grounding problems. Our model is the first to connect macroscopic cognitive processes with the mathematical structure of quantum mechanics with no additional assumptions. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.

  10. A Self-Organizing State-Space-Model Approach for Parameter Estimation in Hodgkin-Huxley-Type Models of Single Neurons

    PubMed Central

    Vavoulis, Dimitrios V.; Straub, Volko A.; Aston, John A. D.; Feng, Jianfeng

    2012-01-01

    Traditional approaches to the problem of parameter estimation in biophysical models of neurons and neural networks usually adopt a global search algorithm (for example, an evolutionary algorithm), often in combination with a local search method (such as gradient descent) in order to minimize the value of a cost function, which measures the discrepancy between various features of the available experimental data and model output. In this study, we approach the problem of parameter estimation in conductance-based models of single neurons from a different perspective. By adopting a hidden-dynamical-systems formalism, we expressed parameter estimation as an inference problem in these systems, which can then be tackled using a range of well-established statistical inference methods. The particular method we used was Kitagawa's self-organizing state-space model, which was applied on a number of Hodgkin-Huxley-type models using simulated or actual electrophysiological data. We showed that the algorithm can be used to estimate a large number of parameters, including maximal conductances, reversal potentials, kinetics of ionic currents, measurement and intrinsic noise, based on low-dimensional experimental data and sufficiently informative priors in the form of pre-defined constraints imposed on model parameters. The algorithm remained operational even when very noisy experimental data were used. Importantly, by combining the self-organizing state-space model with an adaptive sampling algorithm akin to the Covariance Matrix Adaptation Evolution Strategy, we achieved a significant reduction in the variance of parameter estimates. The algorithm did not require the explicit formulation of a cost function and it was straightforward to apply on compartmental models and multiple data sets. Overall, the proposed methodology is particularly suitable for resolving high-dimensional inference problems based on noisy electrophysiological data and, therefore, a potentially useful tool in the construction of biophysical neuron models. PMID:22396632

  11. A New Modified Histogram Matching Normalization for Time Series Microarray Analysis.

    PubMed

    Astola, Laura; Molenaar, Jaap

    2014-07-01

    Microarray data is often utilized in inferring regulatory networks. Quantile normalization (QN) is a popular method to reduce array-to-array variation. We show that in the context of time series measurements QN may not be the best choice for this task, especially not if the inference is based on a continuous-time ODE model. We propose an alternative normalization method that is better suited for network inference from time series data.
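
    For reference, the quantile normalization that the paper argues is ill-suited to time series fits in a few lines: each array's sorted values are replaced by the across-array mean at the same rank (ties are ignored in this sketch).

    ```python
    import numpy as np

    def quantile_normalize(X):
        """X: (n_genes, n_arrays). Returns the QN-transformed matrix."""
        ranks = np.argsort(np.argsort(X, axis=0), axis=0)   # per-array ranks
        mean_sorted = np.sort(X, axis=0).mean(axis=1)       # reference distribution
        return mean_sorted[ranks]
    ```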

  12. Empirical comparison study of approximate methods for structure selection in binary graphical models.

    PubMed

    Viallon, Vivian; Banerjee, Onureena; Jougla, Eric; Rey, Grégoire; Coste, Joel

    2014-03-01

    Looking for associations among multiple variables is a topical issue in statistics due to the increasing amount of data encountered in biology, medicine, and many other domains involving statistical applications. Graphical models have recently gained popularity for this purpose in the statistical literature. In the binary case, however, exact inference is generally very slow or even intractable because of the form of the so-called log-partition function. In this paper, we review various approximate methods for structure selection in binary graphical models that have recently been proposed in the literature and compare them through an extensive simulation study. We also propose a modification of one existing method, that is shown to achieve good performance and to be generally very fast. We conclude with an application in which we search for associations among causes of death recorded on French death certificates. © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
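
    One widely used approximate method of the kind compared in such studies is neighbourhood selection: regress each binary variable on all the others with an l1 penalty and keep an edge when either coefficient survives. A sketch (the regularization strength C is illustrative, and each column must contain both classes):

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def neighbourhood_selection(X, C=0.1):
        """X: (n_samples, p) binary matrix. Returns a boolean adjacency matrix."""
        n, p = X.shape
        coefs = np.zeros((p, p))
        for j in range(p):
            others = np.delete(np.arange(p), j)
            lr = LogisticRegression(penalty="l1", solver="liblinear", C=C)
            lr.fit(X[:, others], X[:, j])
            coefs[j, others] = lr.coef_.ravel()
        return (np.abs(coefs) > 1e-8) | (np.abs(coefs.T) > 1e-8)  # OR rule
    ```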

  13. A Decentralized Eigenvalue Computation Method for Spectrum Sensing Based on Average Consensus

    NASA Astrophysics Data System (ADS)

    Mohammadi, Jafar; Limmer, Steffen; Stańczak, Sławomir

    2016-07-01

    This paper considers eigenvalue estimation for the decentralized inference problem in spectrum sensing. We propose a decentralized eigenvalue computation algorithm based on the power method, referred to as the generalized power method (GPM); it is capable of estimating the eigenvalues of a given covariance matrix under certain conditions. Furthermore, we have developed a decentralized implementation of GPM by splitting the iterative operations into local and global computation tasks. The global tasks require data exchange to be performed among the nodes. For this task, we apply an average consensus algorithm to efficiently perform the global computations. As a special case, we consider a structured graph that is a tree with clusters of nodes at its leaves. For an accelerated distributed implementation, we propose to use computation over multiple access channel (CoMAC) as a building block of the algorithm. Numerical simulations are provided to illustrate the performance of the two algorithms.
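
    The decentralization argument can be illustrated with an idealized sketch: each power-method iteration needs only local matrix-vector products plus agreement on a single scalar (the squared norm), which is exactly what average consensus provides. Here consensus is idealized as an exact mean and all names are ours, so this is a sketch of the idea rather than the paper's algorithm:

    ```python
    import numpy as np

    def consensus_mean(values):
        """Stand-in for a gossip-based average-consensus protocol."""
        return sum(values) / len(values)

    def decentralized_power_method(cov_rows, n_iter=100, seed=0):
        """cov_rows[i]: the block of covariance-matrix rows held by node i.
        Returns an estimate of the largest eigenvalue."""
        rng = np.random.default_rng(seed)
        n = sum(r.shape[0] for r in cov_rows)
        v = rng.standard_normal(n)
        for _ in range(n_iter):
            local_w = [r @ v for r in cov_rows]             # local products
            partials = [float(w @ w) for w in local_w]      # each node's ||w_i||^2
            norm_sq = consensus_mean(partials) * len(partials)
            v = np.concatenate(local_w) / np.sqrt(norm_sq)  # agreed normalization
        return float(v @ np.concatenate([r @ v for r in cov_rows]))
    ```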

  14. Analysis of cohort studies with multivariate and partially observed disease classification data.

    PubMed

    Chatterjee, Nilanjan; Sinha, Samiran; Diver, W Ryan; Feigelson, Heather Spencer

    2010-09-01

    Complex diseases like cancers can often be classified into subtypes using various pathological and molecular traits of the disease. In this article, we develop methods for analysis of disease incidence in cohort studies incorporating data on multiple disease traits using a two-stage semiparametric Cox proportional hazards regression model that allows one to examine the heterogeneity in the effect of the covariates by the levels of the different disease traits. For inference in the presence of missing disease traits, we propose a generalization of an estimating equation approach for handling missing cause of failure in competing-risk data. We prove asymptotic unbiasedness of the estimating equation method under a general missing-at-random assumption and propose a novel influence-function-based sandwich variance estimator. The methods are illustrated using simulation studies and a real data application involving the Cancer Prevention Study II nutrition cohort.

  15. Resolution of structural heterogeneity in dynamic crystallography

    PubMed Central

    Ren, Zhong; Chan, Peter W. Y.; Moffat, Keith; Pai, Emil F.; Royer, William E.; Šrajer, Vukica; Yang, Xiaojing

    2013-01-01

    Dynamic behavior of proteins is critical to their function. X-ray crystallography, a powerful yet mostly static technique, faces inherent challenges in acquiring dynamic information despite decades of effort. Dynamic ‘structural changes’ are often indirectly inferred from ‘structural differences’ by comparing related static structures. In contrast, the direct observation of dynamic structural changes requires the initiation of a biochemical reaction or process in a crystal. Both the direct and the indirect approaches share a common challenge in analysis: how to interpret the structural heterogeneity intrinsic to all dynamic processes. This paper presents a real-space approach to this challenge, in which a suite of analytical methods and tools to identify and refine the mixed structural species present in multiple crystallographic data sets have been developed. These methods have been applied to representative scenarios in dynamic crystallography, and reveal structural information that is otherwise difficult to interpret or inaccessible using conventional methods. PMID:23695239

  16. Joint Inference of Population Assignment and Demographic History

    PubMed Central

    Choi, Sang Chul; Hey, Jody

    2011-01-01

    A new approach to assigning individuals to populations using genetic data is described. Most existing methods work by maximizing Hardy–Weinberg and linkage equilibrium within populations, neither of which will apply for many demographic histories. By including a demographic model, within a likelihood framework based on coalescent theory, we can jointly study demographic history and population assignment. Genealogies and population assignments are sampled from a posterior distribution using a general isolation-with-migration model for multiple populations. A measure of partition distance between assignments facilitates not only the summary of a posterior sample of assignments, but also the estimation of the posterior density for the demographic history. It is shown that joint estimates of assignment and demographic history are possible, including estimation of population phylogeny for samples from three populations. The new method is compared to results of a widely used assignment method, using simulated and published empirical data sets. PMID:21775468

  17. Designing Studies That Would Address the Multilayered Nature of Health Care

    PubMed Central

    Pennell, Michael; Rhoda, Dale; Hade, Erinn M.; Paskett, Electra D.

    2010-01-01

    We review design and analytic methods available for multilevel interventions in cancer research with particular attention to study design, sample size requirements, and potential to provide statistical evidence for causal inference. The most appropriate methods will depend on the stage of development of the research and whether randomization is possible. Early on, fractional factorial designs may be used to screen intervention components, particularly when randomization of individuals is possible. Quasi-experimental designs, including time-series and multiple baseline designs, can be useful once the intervention is designed because they require few sites and can provide the preliminary evidence to plan efficacy studies. In efficacy and effectiveness studies, group-randomized trials are preferred when randomization is possible and regression discontinuity designs are preferred otherwise if assignment based on a quantitative score is possible. Quasi-experimental designs may be used, especially when combined with recent developments in analytic methods to reduce bias in effect estimates. PMID:20386057

  18. A comparison of gantry-mounted x-ray-based real-time target tracking methods.

    PubMed

    Montanaro, Tim; Nguyen, Doan Trang; Keall, Paul J; Booth, Jeremy; Caillet, Vincent; Eade, Thomas; Haddad, Carol; Shieh, Chun-Chien

    2018-03-01

    Most modern radiotherapy machines are built with a 2D kV imaging system. Combining this imaging system with a 2D-3D inference method would allow for a ready-made option for real-time 3D tumor tracking. This work investigates and compares the accuracy of four existing 2D-3D inference methods using both motion traces inferred from external surrogates and measured internally from implanted beacons. Tumor motion data from 160 fractions (46 thoracic/abdominal patients) of Synchrony traces (inferred traces), and 28 fractions (7 lung patients) of Calypso traces (internal traces) from the LIGHT SABR trial (NCT02514512) were used in this study. The motion traces were used as the ground truth. The ground truth trajectories were used in silico to generate 2D positions projected on the kV detector. These 2D traces were then passed to the 2D-3D inference methods: interdimensional correlation, Gaussian probability density function (PDF), arbitrary-shape PDF, and the Kalman filter. The inferred 3D positions were compared with the ground truth to determine tracking errors. The relationships between tracking error and motion magnitude, interdimensional correlation, and breathing periodicity index (BPI) were also investigated. Larger tracking errors were observed from the Calypso traces, with RMS and 95th percentile 3D errors of 0.84-1.25 mm and 1.72-2.64 mm, compared to 0.45-0.68 mm and 0.74-1.13 mm from the Synchrony traces. The Gaussian PDF method was found to be the most accurate, followed by the Kalman filter, the interdimensional correlation method, and the arbitrary-shape PDF method. Tracking error was found to strongly and positively correlate with motion magnitude for both the Synchrony and Calypso traces and for all four methods. Interdimensional correlation and BPI were found to negatively correlate with tracking error only for the Synchrony traces. The Synchrony traces exhibited higher interdimensional correlation than the Calypso traces, especially in the anterior-posterior direction. Inferred traces often exhibit higher interdimensional correlation, which is not a true representation of thoracic/abdominal motion and may lead to underestimated kV-based tracking errors. The use of internal traces acquired from systems such as Calypso is advised for future kV-based tracking studies. The Gaussian PDF method is the most accurate 2D-3D inference method for tracking thoracic/abdominal targets. Motion magnitude has a significant impact on 2D-3D inference error and should be considered when estimating kV-based tracking error. © 2018 American Association of Physicists in Medicine.
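
    The "Gaussian PDF" style of 2D-3D inference can be sketched as conditioning a joint Gaussian: fit the mean and covariance of (x, y, z) target positions from training traces, then estimate the coordinate unresolved on the kV detector as its conditional mean given the two observed coordinates. A minimal version (ours, not the evaluated implementation):

    ```python
    import numpy as np

    def fit_gaussian(xyz):
        """xyz: (n_timepoints, 3) training positions."""
        return xyz.mean(axis=0), np.cov(xyz.T)

    def infer_hidden(mu, cov, seen_idx, seen_vals):
        """Conditional mean of the unresolved coordinate given the observed two,
        e.g. seen_idx = [0, 2] when the coordinate with index 1 is unresolved."""
        hid = [i for i in range(3) if i not in seen_idx]
        S_oo = cov[np.ix_(seen_idx, seen_idx)]
        S_ho = cov[np.ix_(hid, seen_idx)]
        return mu[hid] + S_ho @ np.linalg.solve(S_oo, seen_vals - mu[seen_idx])
    ```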

  19. Investigation of Statistical Inference Methodologies Through Scale Model Propagation Experiments

    DTIC Science & Technology

    2015-09-30

    statistical inference methodologies for ocean-acoustic problems by investigating and applying statistical methods to data collected from scale-model...to begin planning experiments for statistical inference applications. APPROACH: In the ocean acoustics community over the past two decades...solutions for waveguide parameters. With the introduction of statistical inference to the field of ocean acoustics came the desire to interpret marginal
