NASA Astrophysics Data System (ADS)
Saputra, K. V. I.; Cahyadi, L.; Sembiring, U. A.
2018-01-01
Start in this paper, we assess our traditional elementary statistics education and also we introduce elementary statistics with simulation-based inference. To assess our statistical class, we adapt the well-known CAOS (Comprehensive Assessment of Outcomes in Statistics) test that serves as an external measure to assess the student’s basic statistical literacy. This test generally represents as an accepted measure of statistical literacy. We also introduce a new teaching method on elementary statistics class. Different from the traditional elementary statistics course, we will introduce a simulation-based inference method to conduct hypothesis testing. From the literature, it has shown that this new teaching method works very well in increasing student’s understanding of statistics.
Design-based Sample and Probability Law-Assumed Sample: Their Role in Scientific Investigation.
ERIC Educational Resources Information Center
Ojeda, Mario Miguel; Sahai, Hardeo
2002-01-01
Discusses some key statistical concepts in probabilistic and non-probabilistic sampling to provide an overview for understanding the inference process. Suggests a statistical model constituting the basis of statistical inference and provides a brief review of the finite population descriptive inference and a quota sampling inferential theory.…
The Importance of Statistical Modeling in Data Analysis and Inference
ERIC Educational Resources Information Center
Rollins, Derrick, Sr.
2017-01-01
Statistical inference simply means to draw a conclusion based on information that comes from data. Error bars are the most commonly used tool for data analysis and inference in chemical engineering data studies. This work demonstrates, using common types of data collection studies, the importance of specifying the statistical model for sound…
Inferring Demographic History Using Two-Locus Statistics.
Ragsdale, Aaron P; Gutenkunst, Ryan N
2017-06-01
Population demographic history may be learned from contemporary genetic variation data. Methods based on aggregating the statistics of many single loci into an allele frequency spectrum (AFS) have proven powerful, but such methods ignore potentially informative patterns of linkage disequilibrium (LD) between neighboring loci. To leverage such patterns, we developed a composite-likelihood framework for inferring demographic history from aggregated statistics of pairs of loci. Using this framework, we show that two-locus statistics are more sensitive to demographic history than single-locus statistics such as the AFS. In particular, two-locus statistics escape the notorious confounding of depth and duration of a bottleneck, and they provide a means to estimate effective population size based on the recombination rather than mutation rate. We applied our approach to a Zambian population of Drosophila melanogaster Notably, using both single- and two-locus statistics, we inferred a substantially lower ancestral effective population size than previous works and did not infer a bottleneck history. Together, our results demonstrate the broad potential for two-locus statistics to enable powerful population genetic inference. Copyright © 2017 by the Genetics Society of America.
Apes are intuitive statisticians.
Rakoczy, Hannes; Clüver, Annette; Saucke, Liane; Stoffregen, Nicole; Gräbener, Alice; Migura, Judith; Call, Josep
2014-04-01
Inductive learning and reasoning, as we use it both in everyday life and in science, is characterized by flexible inferences based on statistical information: inferences from populations to samples and vice versa. Many forms of such statistical reasoning have been found to develop late in human ontogeny, depending on formal education and language, and to be fragile even in adults. New revolutionary research, however, suggests that even preverbal human infants make use of intuitive statistics. Here, we conducted the first investigation of such intuitive statistical reasoning with non-human primates. In a series of 7 experiments, Bonobos, Chimpanzees, Gorillas and Orangutans drew flexible statistical inferences from populations to samples. These inferences, furthermore, were truly based on statistical information regarding the relative frequency distributions in a population, and not on absolute frequencies. Intuitive statistics in its most basic form is thus an evolutionarily more ancient rather than a uniquely human capacity. Copyright © 2014 Elsevier B.V. All rights reserved.
"Magnitude-based inference": a statistical review.
Welsh, Alan H; Knight, Emma J
2015-04-01
We consider "magnitude-based inference" and its interpretation by examining in detail its use in the problem of comparing two means. We extract from the spreadsheets, which are provided to users of the analysis (http://www.sportsci.org/), a precise description of how "magnitude-based inference" is implemented. We compare the implemented version of the method with general descriptions of it and interpret the method in familiar statistical terms. We show that "magnitude-based inference" is not a progressive improvement on modern statistics. The additional probabilities introduced are not directly related to the confidence interval but, rather, are interpretable either as P values for two different nonstandard tests (for different null hypotheses) or as approximate Bayesian calculations, which also lead to a type of test. We also discuss sample size calculations associated with "magnitude-based inference" and show that the substantial reduction in sample sizes claimed for the method (30% of the sample size obtained from standard frequentist calculations) is not justifiable so the sample size calculations should not be used. Rather than using "magnitude-based inference," a better solution is to be realistic about the limitations of the data and use either confidence intervals or a fully Bayesian analysis.
Statistical inference for remote sensing-based estimates of net deforestation
Ronald E. McRoberts; Brian F. Walters
2012-01-01
Statistical inference requires expression of an estimate in probabilistic terms, usually in the form of a confidence interval. An approach to constructing confidence intervals for remote sensing-based estimates of net deforestation is illustrated. The approach is based on post-classification methods using two independent forest/non-forest classifications because...
In defence of model-based inference in phylogeography
Beaumont, Mark A.; Nielsen, Rasmus; Robert, Christian; Hey, Jody; Gaggiotti, Oscar; Knowles, Lacey; Estoup, Arnaud; Panchal, Mahesh; Corander, Jukka; Hickerson, Mike; Sisson, Scott A.; Fagundes, Nelson; Chikhi, Lounès; Beerli, Peter; Vitalis, Renaud; Cornuet, Jean-Marie; Huelsenbeck, John; Foll, Matthieu; Yang, Ziheng; Rousset, Francois; Balding, David; Excoffier, Laurent
2017-01-01
Recent papers have promoted the view that model-based methods in general, and those based on Approximate Bayesian Computation (ABC) in particular, are flawed in a number of ways, and are therefore inappropriate for the analysis of phylogeographic data. These papers further argue that Nested Clade Phylogeographic Analysis (NCPA) offers the best approach in statistical phylogeography. In order to remove the confusion and misconceptions introduced by these papers, we justify and explain the reasoning behind model-based inference. We argue that ABC is a statistically valid approach, alongside other computational statistical techniques that have been successfully used to infer parameters and compare models in population genetics. We also examine the NCPA method and highlight numerous deficiencies, either when used with single or multiple loci. We further show that the ages of clades are carelessly used to infer ages of demographic events, that these ages are estimated under a simple model of panmixia and population stationarity but are then used under different and unspecified models to test hypotheses, a usage the invalidates these testing procedures. We conclude by encouraging researchers to study and use model-based inference in population genetics. PMID:29284924
Inference and the Introductory Statistics Course
ERIC Educational Resources Information Center
Pfannkuch, Maxine; Regan, Matt; Wild, Chris; Budgett, Stephanie; Forbes, Sharleen; Harraway, John; Parsonage, Ross
2011-01-01
This article sets out some of the rationale and arguments for making major changes to the teaching and learning of statistical inference in introductory courses at our universities by changing from a norm-based, mathematical approach to more conceptually accessible computer-based approaches. The core problem of the inferential argument with its…
On Some Assumptions of the Null Hypothesis Statistical Testing
ERIC Educational Resources Information Center
Patriota, Alexandre Galvão
2017-01-01
Bayesian and classical statistical approaches are based on different types of logical principles. In order to avoid mistaken inferences and misguided interpretations, the practitioner must respect the inference rules embedded into each statistical method. Ignoring these principles leads to the paradoxical conclusions that the hypothesis…
Statistical Inferences from Formaldehyde Dna-Protein Cross-Link Data
Physiologically-based pharmacokinetic (PBPK) modeling has reached considerable sophistication in its application in the pharmacological and environmental health areas. Yet, mature methodologies for making statistical inferences have not been routinely incorporated in these applic...
Sensitivity to the Sampling Process Emerges From the Principle of Efficiency.
Jara-Ettinger, Julian; Sun, Felix; Schulz, Laura; Tenenbaum, Joshua B
2018-05-01
Humans can seamlessly infer other people's preferences, based on what they do. Broadly, two types of accounts have been proposed to explain different aspects of this ability. The first account focuses on spatial information: Agents' efficient navigation in space reveals what they like. The second account focuses on statistical information: Uncommon choices reveal stronger preferences. Together, these two lines of research suggest that we have two distinct capacities for inferring preferences. Here we propose that this is not the case, and that spatial-based and statistical-based preference inferences can be explained by the assumption that agents are efficient alone. We show that people's sensitivity to spatial and statistical information when they infer preferences is best predicted by a computational model of the principle of efficiency, and that this model outperforms dual-system models, even when the latter are fit to participant judgments. Our results suggest that, as adults, a unified understanding of agency under the principle of efficiency underlies our ability to infer preferences. Copyright © 2018 Cognitive Science Society, Inc.
ERIC Educational Resources Information Center
Thompson, Bruce
Web-based statistical instruction, like all statistical instruction, ought to focus on teaching the essence of the research endeavor: the exercise of reflective judgment. Using the framework of the recent report of the American Psychological Association (APA) Task Force on Statistical Inference (Wilkinson and the APA Task Force on Statistical…
Introducing Statistical Inference to Biology Students through Bootstrapping and Randomization
ERIC Educational Resources Information Center
Lock, Robin H.; Lock, Patti Frazer
2008-01-01
Bootstrap methods and randomization tests are increasingly being used as alternatives to standard statistical procedures in biology. They also serve as an effective introduction to the key ideas of statistical inference in introductory courses for biology students. We discuss the use of such simulation based procedures in an integrated curriculum…
Statistical inference for extended or shortened phase II studies based on Simon's two-stage designs.
Zhao, Junjun; Yu, Menggang; Feng, Xi-Ping
2015-06-07
Simon's two-stage designs are popular choices for conducting phase II clinical trials, especially in the oncology trials to reduce the number of patients placed on ineffective experimental therapies. Recently Koyama and Chen (2008) discussed how to conduct proper inference for such studies because they found that inference procedures used with Simon's designs almost always ignore the actual sampling plan used. In particular, they proposed an inference method for studies when the actual second stage sample sizes differ from planned ones. We consider an alternative inference method based on likelihood ratio. In particular, we order permissible sample paths under Simon's two-stage designs using their corresponding conditional likelihood. In this way, we can calculate p-values using the common definition: the probability of obtaining a test statistic value at least as extreme as that observed under the null hypothesis. In addition to providing inference for a couple of scenarios where Koyama and Chen's method can be difficult to apply, the resulting estimate based on our method appears to have certain advantage in terms of inference properties in many numerical simulations. It generally led to smaller biases and narrower confidence intervals while maintaining similar coverages. We also illustrated the two methods in a real data setting. Inference procedures used with Simon's designs almost always ignore the actual sampling plan. Reported P-values, point estimates and confidence intervals for the response rate are not usually adjusted for the design's adaptiveness. Proper statistical inference procedures should be used.
Cluster mass inference via random field theory.
Zhang, Hui; Nichols, Thomas E; Johnson, Timothy D
2009-01-01
Cluster extent and voxel intensity are two widely used statistics in neuroimaging inference. Cluster extent is sensitive to spatially extended signals while voxel intensity is better for intense but focal signals. In order to leverage strength from both statistics, several nonparametric permutation methods have been proposed to combine the two methods. Simulation studies have shown that of the different cluster permutation methods, the cluster mass statistic is generally the best. However, to date, there is no parametric cluster mass inference available. In this paper, we propose a cluster mass inference method based on random field theory (RFT). We develop this method for Gaussian images, evaluate it on Gaussian and Gaussianized t-statistic images and investigate its statistical properties via simulation studies and real data. Simulation results show that the method is valid under the null hypothesis and demonstrate that it can be more powerful than the cluster extent inference method. Further, analyses with a single subject and a group fMRI dataset demonstrate better power than traditional cluster size inference, and good accuracy relative to a gold-standard permutation test.
The Role of Probability-Based Inference in an Intelligent Tutoring System.
ERIC Educational Resources Information Center
Mislevy, Robert J.; Gitomer, Drew H.
Probability-based inference in complex networks of interdependent variables is an active topic in statistical research, spurred by such diverse applications as forecasting, pedigree analysis, troubleshooting, and medical diagnosis. This paper concerns the role of Bayesian inference networks for updating student models in intelligent tutoring…
Data-driven inference for the spatial scan statistic.
Almeida, Alexandre C L; Duarte, Anderson R; Duczmal, Luiz H; Oliveira, Fernando L P; Takahashi, Ricardo H C
2011-08-02
Kulldorff's spatial scan statistic for aggregated area maps searches for clusters of cases without specifying their size (number of areas) or geographic location in advance. Their statistical significance is tested while adjusting for the multiple testing inherent in such a procedure. However, as is shown in this work, this adjustment is not done in an even manner for all possible cluster sizes. A modification is proposed to the usual inference test of the spatial scan statistic, incorporating additional information about the size of the most likely cluster found. A new interpretation of the results of the spatial scan statistic is done, posing a modified inference question: what is the probability that the null hypothesis is rejected for the original observed cases map with a most likely cluster of size k, taking into account only those most likely clusters of size k found under null hypothesis for comparison? This question is especially important when the p-value computed by the usual inference process is near the alpha significance level, regarding the correctness of the decision based in this inference. A practical procedure is provided to make more accurate inferences about the most likely cluster found by the spatial scan statistic.
ERIC Educational Resources Information Center
Denbleyker, John Nickolas
2012-01-01
The shortcomings of the proportion above cut (PAC) statistic used so prominently in the educational landscape renders it a very problematic measure for making correct inferences with student test data. The limitations of PAC-based statistics are more pronounced with cross-test comparisons due to their dependency on cut-score locations. A better…
ERIC Educational Resources Information Center
Trumpower, David L.
2015-01-01
Making inferences about population differences based on samples of data, that is, performing intuitive analysis of variance (IANOVA), is common in everyday life. However, the intuitive reasoning of individuals when making such inferences (even following statistics instruction), often differs from the normative logic of formal statistics. The…
Intuitive statistics by 8-month-old infants
Xu, Fei; Garcia, Vashti
2008-01-01
Human learners make inductive inferences based on small amounts of data: we generalize from samples to populations and vice versa. The academic discipline of statistics formalizes these intuitive statistical inferences. What is the origin of this ability? We report six experiments investigating whether 8-month-old infants are “intuitive statisticians.” Our results showed that, given a sample, the infants were able to make inferences about the population from which the sample had been drawn. Conversely, given information about the entire population of relatively small size, the infants were able to make predictions about the sample. Our findings provide evidence that infants possess a powerful mechanism for inductive learning, either using heuristics or basic principles of probability. This ability to make inferences based on samples or information about the population develops early and in the absence of schooling or explicit teaching. Human infants may be rational learners from very early in development. PMID:18378901
Spurious correlations and inference in landscape genetics
Samuel A. Cushman; Erin L. Landguth
2010-01-01
Reliable interpretation of landscape genetic analyses depends on statistical methods that have high power to identify the correct process driving gene flow while rejecting incorrect alternative hypotheses. Little is known about statistical power and inference in individual-based landscape genetics. Our objective was to evaluate the power of causalmodelling with partial...
Ensemble stacking mitigates biases in inference of synaptic connectivity.
Chambers, Brendan; Levy, Maayan; Dechery, Joseph B; MacLean, Jason N
2018-01-01
A promising alternative to directly measuring the anatomical connections in a neuronal population is inferring the connections from the activity. We employ simulated spiking neuronal networks to compare and contrast commonly used inference methods that identify likely excitatory synaptic connections using statistical regularities in spike timing. We find that simple adjustments to standard algorithms improve inference accuracy: A signing procedure improves the power of unsigned mutual-information-based approaches and a correction that accounts for differences in mean and variance of background timing relationships, such as those expected to be induced by heterogeneous firing rates, increases the sensitivity of frequency-based methods. We also find that different inference methods reveal distinct subsets of the synaptic network and each method exhibits different biases in the accurate detection of reciprocity and local clustering. To correct for errors and biases specific to single inference algorithms, we combine methods into an ensemble. Ensemble predictions, generated as a linear combination of multiple inference algorithms, are more sensitive than the best individual measures alone, and are more faithful to ground-truth statistics of connectivity, mitigating biases specific to single inference methods. These weightings generalize across simulated datasets, emphasizing the potential for the broad utility of ensemble-based approaches.
Difference to Inference: teaching logical and statistical reasoning through on-line interactivity.
Malloy, T E
2001-05-01
Difference to Inference is an on-line JAVA program that simulates theory testing and falsification through research design and data collection in a game format. The program, based on cognitive and epistemological principles, is designed to support learning of the thinking skills underlying deductive and inductive logic and statistical reasoning. Difference to Inference has database connectivity so that game scores can be counted as part of course grades.
Statistically optimal perception and learning: from behavior to neural representations
Fiser, József; Berkes, Pietro; Orbán, Gergő; Lengyel, Máté
2010-01-01
Human perception has recently been characterized as statistical inference based on noisy and ambiguous sensory inputs. Moreover, suitable neural representations of uncertainty have been identified that could underlie such probabilistic computations. In this review, we argue that learning an internal model of the sensory environment is another key aspect of the same statistical inference procedure and thus perception and learning need to be treated jointly. We review evidence for statistically optimal learning in humans and animals, and reevaluate possible neural representations of uncertainty based on their potential to support statistically optimal learning. We propose that spontaneous activity can have a functional role in such representations leading to a new, sampling-based, framework of how the cortex represents information and uncertainty. PMID:20153683
Design-based and model-based inference in surveys of freshwater mollusks
Dorazio, R.M.
1999-01-01
Well-known concepts in statistical inference and sampling theory are used to develop recommendations for planning and analyzing the results of quantitative surveys of freshwater mollusks. Two methods of inference commonly used in survey sampling (design-based and model-based) are described and illustrated using examples relevant in surveys of freshwater mollusks. The particular objectives of a survey and the type of information observed in each unit of sampling can be used to help select the sampling design and the method of inference. For example, the mean density of a sparsely distributed population of mollusks can be estimated with higher precision by using model-based inference or by using design-based inference with adaptive cluster sampling than by using design-based inference with conventional sampling. More experience with quantitative surveys of natural assemblages of freshwater mollusks is needed to determine the actual benefits of different sampling designs and inferential procedures.
Statistical inference and Aristotle's Rhetoric.
Macdonald, Ranald R
2004-11-01
Formal logic operates in a closed system where all the information relevant to any conclusion is present, whereas this is not the case when one reasons about events and states of the world. Pollard and Richardson drew attention to the fact that the reasoning behind statistical tests does not lead to logically justifiable conclusions. In this paper statistical inferences are defended not by logic but by the standards of everyday reasoning. Aristotle invented formal logic, but argued that people mostly get at the truth with the aid of enthymemes--incomplete syllogisms which include arguing from examples, analogies and signs. It is proposed that statistical tests work in the same way--in that they are based on examples, invoke the analogy of a model and use the size of the effect under test as a sign that the chance hypothesis is unlikely. Of existing theories of statistical inference only a weak version of Fisher's takes this into account. Aristotle anticipated Fisher by producing an argument of the form that there were too many cases in which an outcome went in a particular direction for that direction to be plausibly attributed to chance. We can therefore conclude that Aristotle would have approved of statistical inference and there is a good reason for calling this form of statistical inference classical.
Serang, Oliver; Noble, William Stafford
2012-01-01
The problem of identifying the proteins in a complex mixture using tandem mass spectrometry can be framed as an inference problem on a graph that connects peptides to proteins. Several existing protein identification methods make use of statistical inference methods for graphical models, including expectation maximization, Markov chain Monte Carlo, and full marginalization coupled with approximation heuristics. We show that, for this problem, the majority of the cost of inference usually comes from a few highly connected subgraphs. Furthermore, we evaluate three different statistical inference methods using a common graphical model, and we demonstrate that junction tree inference substantially improves rates of convergence compared to existing methods. The python code used for this paper is available at http://noble.gs.washington.edu/proj/fido. PMID:22331862
Empirical evidence for acceleration-dependent amplification factors
Borcherdt, R.D.
2002-01-01
Site-specific amplification factors, Fa and Fv, used in current U.S. building codes decrease with increasing base acceleration level as implied by the Loma Prieta earthquake at 0.1g and extrapolated using numerical models and laboratory results. The Northridge earthquake recordings of 17 January 1994 and subsequent geotechnical data permit empirical estimates of amplification at base acceleration levels up to 0.5g. Distance measures and normalization procedures used to infer amplification ratios from soil-rock pairs in predetermined azimuth-distance bins significantly influence the dependence of amplification estimates on base acceleration. Factors inferred using a hypocentral distance norm do not show a statistically significant dependence on base acceleration. Factors inferred using norms implied by the attenuation functions of Abrahamson and Silva show a statistically significant decrease with increasing base acceleration. The decrease is statistically more significant for stiff clay and sandy soil (site class D) sites than for stiffer sites underlain by gravely soils and soft rock (site class C). The decrease in amplification with increasing base acceleration is more pronounced for the short-period amplification factor, Fa, than for the midperiod factor, Fv.
Statistical Inference at Work: Statistical Process Control as an Example
ERIC Educational Resources Information Center
Bakker, Arthur; Kent, Phillip; Derry, Jan; Noss, Richard; Hoyles, Celia
2008-01-01
To characterise statistical inference in the workplace this paper compares a prototypical type of statistical inference at work, statistical process control (SPC), with a type of statistical inference that is better known in educational settings, hypothesis testing. Although there are some similarities between the reasoning structure involved in…
The role of familiarity in binary choice inferences.
Honda, Hidehito; Abe, Keiga; Matsuka, Toshihiko; Yamagishi, Kimihiko
2011-07-01
In research on the recognition heuristic (Goldstein & Gigerenzer, Psychological Review, 109, 75-90, 2002), knowledge of recognized objects has been categorized as "recognized" or "unrecognized" without regard to the degree of familiarity of the recognized object. In the present article, we propose a new inference model--familiarity-based inference. We hypothesize that when subjective knowledge levels (familiarity) of recognized objects differ, the degree of familiarity of recognized objects will influence inferences. Specifically, people are predicted to infer that the more familiar object in a pair of two objects has a higher criterion value on the to-be-judged dimension. In two experiments, using a binary choice task, we examined inferences about populations in a pair of two cities. Results support predictions of familiarity-based inference. Participants inferred that the more familiar city in a pair was more populous. Statistical modeling showed that individual differences in familiarity-based inference lie in the sensitivity to differences in familiarity. In addition, we found that familiarity-based inference can be generally regarded as an ecologically rational inference. Furthermore, when cue knowledge about the inference criterion was available, participants made inferences based on the cue knowledge about population instead of familiarity. Implications of the role of familiarity in psychological processes are discussed.
Royle, J. Andrew; Dorazio, Robert M.
2008-01-01
A guide to data collection, modeling and inference strategies for biological survey data using Bayesian and classical statistical methods. This book describes a general and flexible framework for modeling and inference in ecological systems based on hierarchical models, with a strict focus on the use of probability models and parametric inference. Hierarchical models represent a paradigm shift in the application of statistics to ecological inference problems because they combine explicit models of ecological system structure or dynamics with models of how ecological systems are observed. The principles of hierarchical modeling are developed and applied to problems in population, metapopulation, community, and metacommunity systems. The book provides the first synthetic treatment of many recent methodological advances in ecological modeling and unifies disparate methods and procedures. The authors apply principles of hierarchical modeling to ecological problems, including * occurrence or occupancy models for estimating species distribution * abundance models based on many sampling protocols, including distance sampling * capture-recapture models with individual effects * spatial capture-recapture models based on camera trapping and related methods * population and metapopulation dynamic models * models of biodiversity, community structure and dynamics.
Feder, Paul I; Ma, Zhenxu J; Bull, Richard J; Teuschler, Linda K; Rice, Glenn
2009-01-01
In chemical mixtures risk assessment, the use of dose-response data developed for one mixture to estimate risk posed by a second mixture depends on whether the two mixtures are sufficiently similar. While evaluations of similarity may be made using qualitative judgments, this article uses nonparametric statistical methods based on the "bootstrap" resampling technique to address the question of similarity among mixtures of chemical disinfectant by-products (DBP) in drinking water. The bootstrap resampling technique is a general-purpose, computer-intensive approach to statistical inference that substitutes empirical sampling for theoretically based parametric mathematical modeling. Nonparametric, bootstrap-based inference involves fewer assumptions than parametric normal theory based inference. The bootstrap procedure is appropriate, at least in an asymptotic sense, whether or not the parametric, distributional assumptions hold, even approximately. The statistical analysis procedures in this article are initially illustrated with data from 5 water treatment plants (Schenck et al., 2009), and then extended using data developed from a study of 35 drinking-water utilities (U.S. EPA/AMWA, 1989), which permits inclusion of a greater number of water constituents and increased structure in the statistical models.
Protein and gene model inference based on statistical modeling in k-partite graphs.
Gerster, Sarah; Qeli, Ermir; Ahrens, Christian H; Bühlmann, Peter
2010-07-06
One of the major goals of proteomics is the comprehensive and accurate description of a proteome. Shotgun proteomics, the method of choice for the analysis of complex protein mixtures, requires that experimentally observed peptides are mapped back to the proteins they were derived from. This process is also known as protein inference. We present Markovian Inference of Proteins and Gene Models (MIPGEM), a statistical model based on clearly stated assumptions to address the problem of protein and gene model inference for shotgun proteomics data. In particular, we are dealing with dependencies among peptides and proteins using a Markovian assumption on k-partite graphs. We are also addressing the problems of shared peptides and ambiguous proteins by scoring the encoding gene models. Empirical results on two control datasets with synthetic mixtures of proteins and on complex protein samples of Saccharomyces cerevisiae, Drosophila melanogaster, and Arabidopsis thaliana suggest that the results with MIPGEM are competitive with existing tools for protein inference.
ERIC Educational Resources Information Center
Meiser, Thorsten; Rummel, Jan; Fleig, Hanna
2018-01-01
Pseudocontingencies are inferences about correlations in the environment that are formed on the basis of statistical regularities like skewed base rates or varying base rates across environmental contexts. Previous research has demonstrated that pseudocontingencies provide a pervasive mechanism of inductive inference in numerous social judgment…
Cognitive Transfer Outcomes for a Simulation-Based Introductory Statistics Curriculum
ERIC Educational Resources Information Center
Backman, Matthew D.; Delmas, Robert C.; Garfield, Joan
2017-01-01
Cognitive transfer is the ability to apply learned skills and knowledge to new applications and contexts. This investigation evaluates cognitive transfer outcomes for a tertiary-level introductory statistics course using the CATALST curriculum, which exclusively used simulation-based methods to develop foundations of statistical inference. A…
Theory-based Bayesian Models of Inductive Inference
2010-07-19
Subjective randomness and natural scene statistics. Psychonomic Bulletin & Review . http://cocosci.berkeley.edu/tom/papers/randscenes.pdf Page 1...in press). Exemplar models as a mechanism for performing Bayesian inference. Psychonomic Bulletin & Review . http://cocosci.berkeley.edu/tom
Statistical numeracy as a moderator of (pseudo)contingency effects on decision behavior.
Fleig, Hanna; Meiser, Thorsten; Ettlin, Florence; Rummel, Jan
2017-03-01
Pseudocontingencies denote contingency estimates inferred from base rates rather than from cell frequencies. We examined the role of statistical numeracy for effects of such fallible but adaptive inferences on choice behavior. In Experiment 1, we provided information on single observations as well as on base rates and tracked participants' eye movements. In Experiment 2, we manipulated the availability of information on cell frequencies and base rates between conditions. Our results demonstrate that a focus on base rates rather than cell frequencies benefits pseudocontingency effects. Learners who are more proficient in (conditional) probability calculation prefer to rely on cell frequencies in order to judge contingencies, though, as was evident from their gaze behavior. If cell frequencies are available in summarized format, they may infer the true contingency between options and outcomes. Otherwise, however, even highly numerate learners are susceptible to pseudocontingency effects. Copyright © 2017 Elsevier B.V. All rights reserved.
Anchoring quartet-based phylogenetic distances and applications to species tree reconstruction.
Sayyari, Erfan; Mirarab, Siavash
2016-11-11
Inferring species trees from gene trees using the coalescent-based summary methods has been the subject of much attention, yet new scalable and accurate methods are needed. We introduce DISTIQUE, a new statistically consistent summary method for inferring species trees from gene trees under the coalescent model. We generalize our results to arbitrary phylogenetic inference problems; we show that two arbitrarily chosen leaves, called anchors, can be used to estimate relative distances between all other pairs of leaves by inferring relevant quartet trees. This results in a family of distance-based tree inference methods, with running times ranging between quadratic to quartic in the number of leaves. We show in simulated studies that DISTIQUE has comparable accuracy to leading coalescent-based summary methods and reduced running times.
Theory-based Bayesian models of inductive learning and reasoning.
Tenenbaum, Joshua B; Griffiths, Thomas L; Kemp, Charles
2006-07-01
Inductive inference allows humans to make powerful generalizations from sparse data when learning about word meanings, unobserved properties, causal relationships, and many other aspects of the world. Traditional accounts of induction emphasize either the power of statistical learning, or the importance of strong constraints from structured domain knowledge, intuitive theories or schemas. We argue that both components are necessary to explain the nature, use and acquisition of human knowledge, and we introduce a theory-based Bayesian framework for modeling inductive learning and reasoning as statistical inferences over structured knowledge representations.
Investigation of Statistical Inference Methodologies Through Scale Model Propagation Experiments
2015-09-30
statistical inference methodologies for ocean- acoustic problems by investigating and applying statistical methods to data collected from scale-model...to begin planning experiments for statistical inference applications. APPROACH In the ocean acoustics community over the past two decades...solutions for waveguide parameters. With the introduction of statistical inference to the field of ocean acoustics came the desire to interpret marginal
A Coalitional Game for Distributed Inference in Sensor Networks With Dependent Observations
NASA Astrophysics Data System (ADS)
He, Hao; Varshney, Pramod K.
2016-04-01
We consider the problem of collaborative inference in a sensor network with heterogeneous and statistically dependent sensor observations. Each sensor aims to maximize its inference performance by forming a coalition with other sensors and sharing information within the coalition. It is proved that the inference performance is a nondecreasing function of the coalition size. However, in an energy constrained network, the energy consumption of inter-sensor communication also increases with increasing coalition size, which discourages the formation of the grand coalition (the set of all sensors). In this paper, the formation of non-overlapping coalitions with statistically dependent sensors is investigated under a specific communication constraint. We apply a game theoretical approach to fully explore and utilize the information contained in the spatial dependence among sensors to maximize individual sensor performance. Before formulating the distributed inference problem as a coalition formation game, we first quantify the gain and loss in forming a coalition by introducing the concepts of diversity gain and redundancy loss for both estimation and detection problems. These definitions, enabled by the statistical theory of copulas, allow us to characterize the influence of statistical dependence among sensor observations on inference performance. An iterative algorithm based on merge-and-split operations is proposed for the solution and the stability of the proposed algorithm is analyzed. Numerical results are provided to demonstrate the superiority of our proposed game theoretical approach.
Computational statistics using the Bayesian Inference Engine
NASA Astrophysics Data System (ADS)
Weinberg, Martin D.
2013-09-01
This paper introduces the Bayesian Inference Engine (BIE), a general parallel, optimized software package for parameter inference and model selection. This package is motivated by the analysis needs of modern astronomical surveys and the need to organize and reuse expensive derived data. The BIE is the first platform for computational statistics designed explicitly to enable Bayesian update and model comparison for astronomical problems. Bayesian update is based on the representation of high-dimensional posterior distributions using metric-ball-tree based kernel density estimation. Among its algorithmic offerings, the BIE emphasizes hybrid tempered Markov chain Monte Carlo schemes that robustly sample multimodal posterior distributions in high-dimensional parameter spaces. Moreover, the BIE implements a full persistence or serialization system that stores the full byte-level image of the running inference and previously characterized posterior distributions for later use. Two new algorithms to compute the marginal likelihood from the posterior distribution, developed for and implemented in the BIE, enable model comparison for complex models and data sets. Finally, the BIE was designed to be a collaborative platform for applying Bayesian methodology to astronomy. It includes an extensible object-oriented and easily extended framework that implements every aspect of the Bayesian inference. By providing a variety of statistical algorithms for all phases of the inference problem, a scientist may explore a variety of approaches with a single model and data implementation. Additional technical details and download details are available from http://www.astro.umass.edu/bie. The BIE is distributed under the GNU General Public License.
NASA Astrophysics Data System (ADS)
Stan Development Team
2018-01-01
Stan facilitates statistical inference at the frontiers of applied statistics and provides both a modeling language for specifying complex statistical models and a library of statistical algorithms for computing inferences with those models. These components are exposed through interfaces in environments such as R, Python, and the command line.
ERIC Educational Resources Information Center
Jacob, Bridgette L.
2013-01-01
The difficulties introductory statistics students have with formal statistical inference are well known in the field of statistics education. "Informal" statistical inference has been studied as a means to introduce inferential reasoning well before and without the formalities of formal statistical inference. This mixed methods study…
ERIC Educational Resources Information Center
Braham, Hana Manor; Ben-Zvi, Dani
2017-01-01
A fundamental aspect of statistical inference is representation of real-world data using statistical models. This article analyzes students' articulations of statistical models and modeling during their first steps in making informal statistical inferences. An integrated modeling approach (IMA) was designed and implemented to help students…
Nabi, Razieh; Shpitser, Ilya
2017-01-01
In this paper, we consider the problem of fair statistical inference involving outcome variables. Examples include classification and regression problems, and estimating treatment effects in randomized trials or observational data. The issue of fairness arises in such problems where some covariates or treatments are “sensitive,” in the sense of having potential of creating discrimination. In this paper, we argue that the presence of discrimination can be formalized in a sensible way as the presence of an effect of a sensitive covariate on the outcome along certain causal pathways, a view which generalizes (Pearl 2009). A fair outcome model can then be learned by solving a constrained optimization problem. We discuss a number of complications that arise in classical statistical inference due to this view and provide workarounds based on recent work in causal and semi-parametric inference.
The role of causal criteria in causal inferences: Bradford Hill's "aspects of association".
Ward, Andrew C
2009-06-17
As noted by Wesley Salmon and many others, causal concepts are ubiquitous in every branch of theoretical science, in the practical disciplines and in everyday life. In the theoretical and practical sciences especially, people often base claims about causal relations on applications of statistical methods to data. However, the source and type of data place important constraints on the choice of statistical methods as well as on the warrant attributed to the causal claims based on the use of such methods. For example, much of the data used by people interested in making causal claims come from non-experimental, observational studies in which random allocations to treatment and control groups are not present. Thus, one of the most important problems in the social and health sciences concerns making justified causal inferences using non-experimental, observational data. In this paper, I examine one method of justifying such inferences that is especially widespread in epidemiology and the health sciences generally - the use of causal criteria. I argue that while the use of causal criteria is not appropriate for either deductive or inductive inferences, they do have an important role to play in inferences to the best explanation. As such, causal criteria, exemplified by what Bradford Hill referred to as "aspects of [statistical] associations", have an indispensible part to play in the goal of making justified causal claims.
The role of causal criteria in causal inferences: Bradford Hill's "aspects of association"
Ward, Andrew C
2009-01-01
As noted by Wesley Salmon and many others, causal concepts are ubiquitous in every branch of theoretical science, in the practical disciplines and in everyday life. In the theoretical and practical sciences especially, people often base claims about causal relations on applications of statistical methods to data. However, the source and type of data place important constraints on the choice of statistical methods as well as on the warrant attributed to the causal claims based on the use of such methods. For example, much of the data used by people interested in making causal claims come from non-experimental, observational studies in which random allocations to treatment and control groups are not present. Thus, one of the most important problems in the social and health sciences concerns making justified causal inferences using non-experimental, observational data. In this paper, I examine one method of justifying such inferences that is especially widespread in epidemiology and the health sciences generally – the use of causal criteria. I argue that while the use of causal criteria is not appropriate for either deductive or inductive inferences, they do have an important role to play in inferences to the best explanation. As such, causal criteria, exemplified by what Bradford Hill referred to as "aspects of [statistical] associations", have an indispensible part to play in the goal of making justified causal claims. PMID:19534788
Data-driven sensitivity inference for Thomson scattering electron density measurement systems.
Fujii, Keisuke; Yamada, Ichihiro; Hasuo, Masahiro
2017-01-01
We developed a method to infer the calibration parameters of multichannel measurement systems, such as channel variations of sensitivity and noise amplitude, from experimental data. We regard such uncertainties of the calibration parameters as dependent noise. The statistical properties of the dependent noise and that of the latent functions were modeled and implemented in the Gaussian process kernel. Based on their statistical difference, both parameters were inferred from the data. We applied this method to the electron density measurement system by Thomson scattering for the Large Helical Device plasma, which is equipped with 141 spatial channels. Based on the 210 sets of experimental data, we evaluated the correction factor of the sensitivity and noise amplitude for each channel. The correction factor varies by ≈10%, and the random noise amplitude is ≈2%, i.e., the measurement accuracy increases by a factor of 5 after this sensitivity correction. The certainty improvement in the spatial derivative inference was demonstrated.
Logical reasoning versus information processing in the dual-strategy model of reasoning.
Markovits, Henry; Brisson, Janie; de Chantal, Pier-Luc
2017-01-01
One of the major debates concerning the nature of inferential reasoning is between counterexample-based strategies such as mental model theory and statistical strategies underlying probabilistic models. The dual-strategy model, proposed by Verschueren, Schaeken, & d'Ydewalle (2005a, 2005b), which suggests that people might have access to both kinds of strategy has been supported by several recent studies. These have shown that statistical reasoners make inferences based on using information about premises in order to generate a likelihood estimate of conclusion probability. However, while results concerning counterexample reasoners are consistent with a counterexample detection model, these results could equally be interpreted as indicating a greater sensitivity to logical form. In order to distinguish these 2 interpretations, in Studies 1 and 2, we presented reasoners with Modus ponens (MP) inferences with statistical information about premise strength and in Studies 3 and 4, naturalistic MP inferences with premises having many disabling conditions. Statistical reasoners accepted the MP inference more often than counterexample reasoners in Studies 1 and 2, while the opposite pattern was observed in Studies 3 and 4. Results show that these strategies must be defined in terms of information processing, with no clear relations to "logical" reasoning. These results have additional implications for the underlying debate about the nature of human reasoning. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
Sumner, Jeremy G; Taylor, Amelia; Holland, Barbara R; Jarvis, Peter D
2017-12-01
Recently there has been renewed interest in phylogenetic inference methods based on phylogenetic invariants, alongside the related Markov invariants. Broadly speaking, both these approaches give rise to polynomial functions of sequence site patterns that, in expectation value, either vanish for particular evolutionary trees (in the case of phylogenetic invariants) or have well understood transformation properties (in the case of Markov invariants). While both approaches have been valued for their intrinsic mathematical interest, it is not clear how they relate to each other, and to what extent they can be used as practical tools for inference of phylogenetic trees. In this paper, by focusing on the special case of binary sequence data and quartets of taxa, we are able to view these two different polynomial-based approaches within a common framework. To motivate the discussion, we present three desirable statistical properties that we argue any invariant-based phylogenetic method should satisfy: (1) sensible behaviour under reordering of input sequences; (2) stability as the taxa evolve independently according to a Markov process; and (3) explicit dependence on the assumption of a continuous-time process. Motivated by these statistical properties, we develop and explore several new phylogenetic inference methods. In particular, we develop a statistically bias-corrected version of the Markov invariants approach which satisfies all three properties. We also extend previous work by showing that the phylogenetic invariants can be implemented in such a way as to satisfy property (3). A simulation study shows that, in comparison to other methods, our new proposed approach based on bias-corrected Markov invariants is extremely powerful for phylogenetic inference. The binary case is of particular theoretical interest as-in this case only-the Markov invariants can be expressed as linear combinations of the phylogenetic invariants. A wider implication of this is that, for models with more than two states-for example DNA sequence alignments with four-state models-we find that methods which rely on phylogenetic invariants are incapable of satisfying all three of the stated statistical properties. This is because in these cases the relevant Markov invariants belong to a class of polynomials independent from the phylogenetic invariants.
Assessing colour-dependent occupation statistics inferred from galaxy group catalogues
NASA Astrophysics Data System (ADS)
Campbell, Duncan; van den Bosch, Frank C.; Hearin, Andrew; Padmanabhan, Nikhil; Berlind, Andreas; Mo, H. J.; Tinker, Jeremy; Yang, Xiaohu
2015-09-01
We investigate the ability of current implementations of galaxy group finders to recover colour-dependent halo occupation statistics. To test the fidelity of group catalogue inferred statistics, we run three different group finders used in the literature over a mock that includes galaxy colours in a realistic manner. Overall, the resulting mock group catalogues are remarkably similar, and most colour-dependent statistics are recovered with reasonable accuracy. However, it is also clear that certain systematic errors arise as a consequence of correlated errors in group membership determination, central/satellite designation, and halo mass assignment. We introduce a new statistic, the halo transition probability (HTP), which captures the combined impact of all these errors. As a rule of thumb, errors tend to equalize the properties of distinct galaxy populations (i.e. red versus blue galaxies or centrals versus satellites), and to result in inferred occupation statistics that are more accurate for red galaxies than for blue galaxies. A statistic that is particularly poorly recovered from the group catalogues is the red fraction of central galaxies as a function of halo mass. Group finders do a good job in recovering galactic conformity, but also have a tendency to introduce weak conformity when none is present. We conclude that proper inference of colour-dependent statistics from group catalogues is best achieved using forward modelling (i.e. running group finders over mock data) or by implementing a correction scheme based on the HTP, as long as the latter is not too strongly model dependent.
Building Intuitions about Statistical Inference Based on Resampling
ERIC Educational Resources Information Center
Watson, Jane; Chance, Beth
2012-01-01
Formal inference, which makes theoretical assumptions about distributions and applies hypothesis testing procedures with null and alternative hypotheses, is notoriously difficult for tertiary students to master. The debate about whether this content should appear in Years 11 and 12 of the "Australian Curriculum: Mathematics" has gone on…
In silico model-based inference: a contemporary approach for hypothesis testing in network biology
Klinke, David J.
2014-01-01
Inductive inference plays a central role in the study of biological systems where one aims to increase their understanding of the system by reasoning backwards from uncertain observations to identify causal relationships among components of the system. These causal relationships are postulated from prior knowledge as a hypothesis or simply a model. Experiments are designed to test the model. Inferential statistics are used to establish a level of confidence in how well our postulated model explains the acquired data. This iterative process, commonly referred to as the scientific method, either improves our confidence in a model or suggests that we revisit our prior knowledge to develop a new model. Advances in technology impact how we use prior knowledge and data to formulate models of biological networks and how we observe cellular behavior. However, the approach for model-based inference has remained largely unchanged since Fisher, Neyman and Pearson developed the ideas in the early 1900’s that gave rise to what is now known as classical statistical hypothesis (model) testing. Here, I will summarize conventional methods for model-based inference and suggest a contemporary approach to aid in our quest to discover how cells dynamically interpret and transmit information for therapeutic aims that integrates ideas drawn from high performance computing, Bayesian statistics, and chemical kinetics. PMID:25139179
In silico model-based inference: a contemporary approach for hypothesis testing in network biology.
Klinke, David J
2014-01-01
Inductive inference plays a central role in the study of biological systems where one aims to increase their understanding of the system by reasoning backwards from uncertain observations to identify causal relationships among components of the system. These causal relationships are postulated from prior knowledge as a hypothesis or simply a model. Experiments are designed to test the model. Inferential statistics are used to establish a level of confidence in how well our postulated model explains the acquired data. This iterative process, commonly referred to as the scientific method, either improves our confidence in a model or suggests that we revisit our prior knowledge to develop a new model. Advances in technology impact how we use prior knowledge and data to formulate models of biological networks and how we observe cellular behavior. However, the approach for model-based inference has remained largely unchanged since Fisher, Neyman and Pearson developed the ideas in the early 1900s that gave rise to what is now known as classical statistical hypothesis (model) testing. Here, I will summarize conventional methods for model-based inference and suggest a contemporary approach to aid in our quest to discover how cells dynamically interpret and transmit information for therapeutic aims that integrates ideas drawn from high performance computing, Bayesian statistics, and chemical kinetics. © 2014 American Institute of Chemical Engineers.
Emura, Takeshi; Konno, Yoshihiko; Michimae, Hirofumi
2015-07-01
Doubly truncated data consist of samples whose observed values fall between the right- and left- truncation limits. With such samples, the distribution function of interest is estimated using the nonparametric maximum likelihood estimator (NPMLE) that is obtained through a self-consistency algorithm. Owing to the complicated asymptotic distribution of the NPMLE, the bootstrap method has been suggested for statistical inference. This paper proposes a closed-form estimator for the asymptotic covariance function of the NPMLE, which is computationally attractive alternative to bootstrapping. Furthermore, we develop various statistical inference procedures, such as confidence interval, goodness-of-fit tests, and confidence bands to demonstrate the usefulness of the proposed covariance estimator. Simulations are performed to compare the proposed method with both the bootstrap and jackknife methods. The methods are illustrated using the childhood cancer dataset.
The Reasoning behind Informal Statistical Inference
ERIC Educational Resources Information Center
Makar, Katie; Bakker, Arthur; Ben-Zvi, Dani
2011-01-01
Informal statistical inference (ISI) has been a frequent focus of recent research in statistics education. Considering the role that context plays in developing ISI calls into question the need to be more explicit about the reasoning that underpins ISI. This paper uses educational literature on informal statistical inference and philosophical…
Variation in reaction norms: Statistical considerations and biological interpretation.
Morrissey, Michael B; Liefting, Maartje
2016-09-01
Analysis of reaction norms, the functions by which the phenotype produced by a given genotype depends on the environment, is critical to studying many aspects of phenotypic evolution. Different techniques are available for quantifying different aspects of reaction norm variation. We examine what biological inferences can be drawn from some of the more readily applicable analyses for studying reaction norms. We adopt a strongly biologically motivated view, but draw on statistical theory to highlight strengths and drawbacks of different techniques. In particular, consideration of some formal statistical theory leads to revision of some recently, and forcefully, advocated opinions on reaction norm analysis. We clarify what simple analysis of the slope between mean phenotype in two environments can tell us about reaction norms, explore the conditions under which polynomial regression can provide robust inferences about reaction norm shape, and explore how different existing approaches may be used to draw inferences about variation in reaction norm shape. We show how mixed model-based approaches can provide more robust inferences than more commonly used multistep statistical approaches, and derive new metrics of the relative importance of variation in reaction norm intercepts, slopes, and curvatures. © 2016 The Author(s). Evolution © 2016 The Society for the Study of Evolution.
Ganju, Jitendra; Yu, Xinxin; Ma, Guoguang Julie
2013-01-01
Formal inference in randomized clinical trials is based on controlling the type I error rate associated with a single pre-specified statistic. The deficiency of using just one method of analysis is that it depends on assumptions that may not be met. For robust inference, we propose pre-specifying multiple test statistics and relying on the minimum p-value for testing the null hypothesis of no treatment effect. The null hypothesis associated with the various test statistics is that the treatment groups are indistinguishable. The critical value for hypothesis testing comes from permutation distributions. Rejection of the null hypothesis when the smallest p-value is less than the critical value controls the type I error rate at its designated value. Even if one of the candidate test statistics has low power, the adverse effect on the power of the minimum p-value statistic is not much. Its use is illustrated with examples. We conclude that it is better to rely on the minimum p-value rather than a single statistic particularly when that single statistic is the logrank test, because of the cost and complexity of many survival trials. Copyright © 2013 John Wiley & Sons, Ltd.
Li, Huanjie; Nickerson, Lisa D; Nichols, Thomas E; Gao, Jia-Hong
2017-03-01
Two powerful methods for statistical inference on MRI brain images have been proposed recently, a non-stationary voxelation-corrected cluster-size test (CST) based on random field theory and threshold-free cluster enhancement (TFCE) based on calculating the level of local support for a cluster, then using permutation testing for inference. Unlike other statistical approaches, these two methods do not rest on the assumptions of a uniform and high degree of spatial smoothness of the statistic image. Thus, they are strongly recommended for group-level fMRI analysis compared to other statistical methods. In this work, the non-stationary voxelation-corrected CST and TFCE methods for group-level analysis were evaluated for both stationary and non-stationary images under varying smoothness levels, degrees of freedom and signal to noise ratios. Our results suggest that, both methods provide adequate control for the number of voxel-wise statistical tests being performed during inference on fMRI data and they are both superior to current CSTs implemented in popular MRI data analysis software packages. However, TFCE is more sensitive and stable for group-level analysis of VBM data. Thus, the voxelation-corrected CST approach may confer some advantages by being computationally less demanding for fMRI data analysis than TFCE with permutation testing and by also being applicable for single-subject fMRI analyses, while the TFCE approach is advantageous for VBM data. Hum Brain Mapp 38:1269-1280, 2017. © 2016 Wiley Periodicals, Inc. © 2016 Wiley Periodicals, Inc.
A statistical method for lung tumor segmentation uncertainty in PET images based on user inference.
Zheng, Chaojie; Wang, Xiuying; Feng, Dagan
2015-01-01
PET has been widely accepted as an effective imaging modality for lung tumor diagnosis and treatment. However, standard criteria for delineating tumor boundary from PET are yet to develop largely due to relatively low quality of PET images, uncertain tumor boundary definition, and variety of tumor characteristics. In this paper, we propose a statistical solution to segmentation uncertainty on the basis of user inference. We firstly define the uncertainty segmentation band on the basis of segmentation probability map constructed from Random Walks (RW) algorithm; and then based on the extracted features of the user inference, we use Principle Component Analysis (PCA) to formulate the statistical model for labeling the uncertainty band. We validated our method on 10 lung PET-CT phantom studies from the public RIDER collections [1] and 16 clinical PET studies where tumors were manually delineated by two experienced radiologists. The methods were validated using Dice similarity coefficient (DSC) to measure the spatial volume overlap. Our method achieved an average DSC of 0.878 ± 0.078 on phantom studies and 0.835 ± 0.039 on clinical studies.
NASA Astrophysics Data System (ADS)
Ma, Chuang; Chen, Han-Shuang; Lai, Ying-Cheng; Zhang, Hai-Feng
2018-02-01
Complex networks hosting binary-state dynamics arise in a variety of contexts. In spite of previous works, to fully reconstruct the network structure from observed binary data remains challenging. We articulate a statistical inference based approach to this problem. In particular, exploiting the expectation-maximization (EM) algorithm, we develop a method to ascertain the neighbors of any node in the network based solely on binary data, thereby recovering the full topology of the network. A key ingredient of our method is the maximum-likelihood estimation of the probabilities associated with actual or nonexistent links, and we show that the EM algorithm can distinguish the two kinds of probability values without any ambiguity, insofar as the length of the available binary time series is reasonably long. Our method does not require any a priori knowledge of the detailed dynamical processes, is parameter-free, and is capable of accurate reconstruction even in the presence of noise. We demonstrate the method using combinations of distinct types of binary dynamical processes and network topologies, and provide a physical understanding of the underlying reconstruction mechanism. Our statistical inference based reconstruction method contributes an additional piece to the rapidly expanding "toolbox" of data based reverse engineering of complex networked systems.
Ma, Chuang; Chen, Han-Shuang; Lai, Ying-Cheng; Zhang, Hai-Feng
2018-02-01
Complex networks hosting binary-state dynamics arise in a variety of contexts. In spite of previous works, to fully reconstruct the network structure from observed binary data remains challenging. We articulate a statistical inference based approach to this problem. In particular, exploiting the expectation-maximization (EM) algorithm, we develop a method to ascertain the neighbors of any node in the network based solely on binary data, thereby recovering the full topology of the network. A key ingredient of our method is the maximum-likelihood estimation of the probabilities associated with actual or nonexistent links, and we show that the EM algorithm can distinguish the two kinds of probability values without any ambiguity, insofar as the length of the available binary time series is reasonably long. Our method does not require any a priori knowledge of the detailed dynamical processes, is parameter-free, and is capable of accurate reconstruction even in the presence of noise. We demonstrate the method using combinations of distinct types of binary dynamical processes and network topologies, and provide a physical understanding of the underlying reconstruction mechanism. Our statistical inference based reconstruction method contributes an additional piece to the rapidly expanding "toolbox" of data based reverse engineering of complex networked systems.
Conceptual Challenges in Coordinating Theoretical and Data-Centered Estimates of Probability
ERIC Educational Resources Information Center
Konold, Cliff; Madden, Sandra; Pollatsek, Alexander; Pfannkuch, Maxine; Wild, Chris; Ziedins, Ilze; Finzer, William; Horton, Nicholas J.; Kazak, Sibel
2011-01-01
A core component of informal statistical inference is the recognition that judgments based on sample data are inherently uncertain. This implies that instruction aimed at developing informal inference needs to foster basic probabilistic reasoning. In this article, we analyze and critique the now-common practice of introducing students to both…
Bayesian Inference for Functional Dynamics Exploring in fMRI Data.
Guo, Xuan; Liu, Bing; Chen, Le; Chen, Guantao; Pan, Yi; Zhang, Jing
2016-01-01
This paper aims to review state-of-the-art Bayesian-inference-based methods applied to functional magnetic resonance imaging (fMRI) data. Particularly, we focus on one specific long-standing challenge in the computational modeling of fMRI datasets: how to effectively explore typical functional interactions from fMRI time series and the corresponding boundaries of temporal segments. Bayesian inference is a method of statistical inference which has been shown to be a powerful tool to encode dependence relationships among the variables with uncertainty. Here we provide an introduction to a group of Bayesian-inference-based methods for fMRI data analysis, which were designed to detect magnitude or functional connectivity change points and to infer their functional interaction patterns based on corresponding temporal boundaries. We also provide a comparison of three popular Bayesian models, that is, Bayesian Magnitude Change Point Model (BMCPM), Bayesian Connectivity Change Point Model (BCCPM), and Dynamic Bayesian Variable Partition Model (DBVPM), and give a summary of their applications. We envision that more delicate Bayesian inference models will be emerging and play increasingly important roles in modeling brain functions in the years to come.
de la Rúa, Nicholas M.; Bustamante, Dulce M.; Menes, Marianela; Stevens, Lori; Monroy, Carlota; Kilpatrick, William; Rizzo, Donna; Klotz, Stephen A.; Schmidt, Justin; Axen, Heather J.; Dorn, Patricia L.
2014-01-01
Phylogenetic relationships of insect vectors of parasitic diseases are important for understanding the evolution of epidemiologically relevant traits, and may be useful in vector control. The subfamily Triatominae (Hemiptera:Reduviidae) includes ~140 extant species arranged in five tribes comprised of 15 genera. The genus Triatoma is the most species-rich and contains important vectors of Trypanosoma cruzi, the causative agent of Chagas disease. Triatoma species were grouped into complexes originally by morphology and more recently with the addition of information from molecular phylogenetics (the four-complex hypothesis); however, without a strict adherence to monophyly. To date, the validity of proposed species complexes has not been tested by statistical tests of topology. The goal of this study was to clarify the systematics of 19 Triatoma species from North and Central America. We inferred their evolutionary relatedness using two independent data sets: the complete nuclear Internal Transcribed Spacer-2 ribosomal DNA (ITS-2 rDNA) and head morphometrics. In addition, we used the Shimodaira-Hasegawa statistical test of topology to assess the fit of the data to a set of competing systematic hypotheses (topologies). An unconstrained topology inferred from the ITS-2 data was compared to topologies constrained based on the four-complex hypothesis or one inferred from our morphometry results. The unconstrained topology represents a statistically significant better fit of the molecular data than either the four-complex or the morphometric topology. We propose an update to the composition of species complexes in the North and Central American Triatoma, based on a phylogeny inferred from ITS-2 as a first step towards updating the phylogeny of the complexes based on monophyly and statistical tests of topologies. PMID:24681261
Learning Quantitative Sequence-Function Relationships from Massively Parallel Experiments
NASA Astrophysics Data System (ADS)
Atwal, Gurinder S.; Kinney, Justin B.
2016-03-01
A fundamental aspect of biological information processing is the ubiquity of sequence-function relationships—functions that map the sequence of DNA, RNA, or protein to a biochemically relevant activity. Most sequence-function relationships in biology are quantitative, but only recently have experimental techniques for effectively measuring these relationships been developed. The advent of such "massively parallel" experiments presents an exciting opportunity for the concepts and methods of statistical physics to inform the study of biological systems. After reviewing these recent experimental advances, we focus on the problem of how to infer parametric models of sequence-function relationships from the data produced by these experiments. Specifically, we retrace and extend recent theoretical work showing that inference based on mutual information, not the standard likelihood-based approach, is often necessary for accurately learning the parameters of these models. Closely connected with this result is the emergence of "diffeomorphic modes"—directions in parameter space that are far less constrained by data than likelihood-based inference would suggest. Analogous to Goldstone modes in physics, diffeomorphic modes arise from an arbitrarily broken symmetry of the inference problem. An analytically tractable model of a massively parallel experiment is then described, providing an explicit demonstration of these fundamental aspects of statistical inference. This paper concludes with an outlook on the theoretical and computational challenges currently facing studies of quantitative sequence-function relationships.
Using Action Research to Develop a Course in Statistical Inference for Workplace-Based Adults
ERIC Educational Resources Information Center
Forbes, Sharleen
2014-01-01
Many adults who need an understanding of statistical concepts have limited mathematical skills. They need a teaching approach that includes as little mathematical context as possible. Iterative participatory qualitative research (action research) was used to develop a statistical literacy course for adult learners informed by teaching in…
Inferring epidemiological parameters from phylogenies using regression-ABC: A comparative study
Gascuel, Olivier
2017-01-01
Inferring epidemiological parameters such as the R0 from time-scaled phylogenies is a timely challenge. Most current approaches rely on likelihood functions, which raise specific issues that range from computing these functions to finding their maxima numerically. Here, we present a new regression-based Approximate Bayesian Computation (ABC) approach, which we base on a large variety of summary statistics intended to capture the information contained in the phylogeny and its corresponding lineage-through-time plot. The regression step involves the Least Absolute Shrinkage and Selection Operator (LASSO) method, which is a robust machine learning technique. It allows us to readily deal with the large number of summary statistics, while avoiding resorting to Markov Chain Monte Carlo (MCMC) techniques. To compare our approach to existing ones, we simulated target trees under a variety of epidemiological models and settings, and inferred parameters of interest using the same priors. We found that, for large phylogenies, the accuracy of our regression-ABC is comparable to that of likelihood-based approaches involving birth-death processes implemented in BEAST2. Our approach even outperformed these when inferring the host population size with a Susceptible-Infected-Removed epidemiological model. It also clearly outperformed a recent kernel-ABC approach when assuming a Susceptible-Infected epidemiological model with two host types. Lastly, by re-analyzing data from the early stages of the recent Ebola epidemic in Sierra Leone, we showed that regression-ABC provides more realistic estimates for the duration parameters (latency and infectiousness) than the likelihood-based method. Overall, ABC based on a large variety of summary statistics and a regression method able to perform variable selection and avoid overfitting is a promising approach to analyze large phylogenies. PMID:28263987
Schmidt, Paul; Schmid, Volker J; Gaser, Christian; Buck, Dorothea; Bührlen, Susanne; Förschler, Annette; Mühlau, Mark
2013-01-01
Aiming at iron-related T2-hypointensity, which is related to normal aging and neurodegenerative processes, we here present two practicable approaches, based on Bayesian inference, for preprocessing and statistical analysis of a complex set of structural MRI data. In particular, Markov Chain Monte Carlo methods were used to simulate posterior distributions. First, we rendered a segmentation algorithm that uses outlier detection based on model checking techniques within a Bayesian mixture model. Second, we rendered an analytical tool comprising a Bayesian regression model with smoothness priors (in the form of Gaussian Markov random fields) mitigating the necessity to smooth data prior to statistical analysis. For validation, we used simulated data and MRI data of 27 healthy controls (age: [Formula: see text]; range, [Formula: see text]). We first observed robust segmentation of both simulated T2-hypointensities and gray-matter regions known to be T2-hypointense. Second, simulated data and images of segmented T2-hypointensity were analyzed. We found not only robust identification of simulated effects but also a biologically plausible age-related increase of T2-hypointensity primarily within the dentate nucleus but also within the globus pallidus, substantia nigra, and red nucleus. Our results indicate that fully Bayesian inference can successfully be applied for preprocessing and statistical analysis of structural MRI data.
Prikken, Merel; van der Weiden, Anouk; Renes, Robert A; Koevoets, Martijn G J C; Heering, Henriette D; Kahn, René S; Aarts, Henk; van Haren, Neeltje E M
2017-04-01
Experiencing self-agency over one's own action outcomes is essential for social functioning. Recent research revealed that patients with schizophrenia do not use implicitly available information about their action-outcomes (i.e., prime-based agency inference) to arrive at self-agency experiences. Here, we examined whether this is related to symptoms and/or familial risk to develop the disease. Fifty-four patients, 54 controls, and 19 unaffected (and unrelated) siblings performed an agency inference task, in which experienced agency was measured over action-outcomes that matched or mismatched outcome-primes that were presented before action performance. The Positive and Negative Syndrome Scale (PANSS) and Comprehensive Assessment of Symptoms and History (CASH) were administered to assess psychopathology. Impairments in prime-based inferences did not differ between patients with symptoms of over- and underattribution. However, patients with agency underattribution symptoms reported significantly lower overall self-agency experiences. Siblings displayed stronger prime-based agency inferences than patients, but weaker prime-based inferences than healthy controls. However, these differences were not statistically significant. Findings suggest that impairments in prime-based agency inferences may be a trait characteristic of schizophrenia. Moreover, this study may stimulate further research on the familial basis and the clinical relevance of impairments in implicit agency inferences. Copyright © 2017 Elsevier Ireland Ltd. All rights reserved.
White, H; Racine, J
2001-01-01
We propose tests for individual and joint irrelevance of network inputs. Such tests can be used to determine whether an input or group of inputs "belong" in a particular model, thus permitting valid statistical inference based on estimated feedforward neural-network models. The approaches employ well-known statistical resampling techniques. We conduct a small Monte Carlo experiment showing that our tests have reasonable level and power behavior, and we apply our methods to examine whether there are predictable regularities in foreign exchange rates. We find that exchange rates do appear to contain information that is exploitable for enhanced point prediction, but the nature of the predictive relations evolves through time.
Drug target inference through pathway analysis of genomics data
Ma, Haisu; Zhao, Hongyu
2013-01-01
Statistical modeling coupled with bioinformatics is commonly used for drug discovery. Although there exist many approaches for single target based drug design and target inference, recent years have seen a paradigm shift to system-level pharmacological research. Pathway analysis of genomics data represents one promising direction for computational inference of drug targets. This article aims at providing a comprehensive review on the evolving issues is this field, covering methodological developments, their pros and cons, as well as future research directions. PMID:23369829
P values are only an index to evidence: 20th- vs. 21st-century statistical science.
Burnham, K P; Anderson, D R
2014-03-01
Early statistical methods focused on pre-data probability statements (i.e., data as random variables) such as P values; these are not really inferences nor are P values evidential. Statistical science clung to these principles throughout much of the 20th century as a wide variety of methods were developed for special cases. Looking back, it is clear that the underlying paradigm (i.e., testing and P values) was weak. As Kuhn (1970) suggests, new paradigms have taken the place of earlier ones: this is a goal of good science. New methods have been developed and older methods extended and these allow proper measures of strength of evidence and multimodel inference. It is time to move forward with sound theory and practice for the difficult practical problems that lie ahead. Given data the useful foundation shifts to post-data probability statements such as model probabilities (Akaike weights) or related quantities such as odds ratios and likelihood intervals. These new methods allow formal inference from multiple models in the a prior set. These quantities are properly evidential. The past century was aimed at finding the "best" model and making inferences from it. The goal in the 21st century is to base inference on all the models weighted by their model probabilities (model averaging). Estimates of precision can include model selection uncertainty leading to variances conditional on the model set. The 21st century will be about the quantification of information, proper measures of evidence, and multi-model inference. Nelder (1999:261) concludes, "The most important task before us in developing statistical science is to demolish the P-value culture, which has taken root to a frightening extent in many areas of both pure and applied science and technology".
Truth, models, model sets, AIC, and multimodel inference: a Bayesian perspective
Barker, Richard J.; Link, William A.
2015-01-01
Statistical inference begins with viewing data as realizations of stochastic processes. Mathematical models provide partial descriptions of these processes; inference is the process of using the data to obtain a more complete description of the stochastic processes. Wildlife and ecological scientists have become increasingly concerned with the conditional nature of model-based inference: what if the model is wrong? Over the last 2 decades, Akaike's Information Criterion (AIC) has been widely and increasingly used in wildlife statistics for 2 related purposes, first for model choice and second to quantify model uncertainty. We argue that for the second of these purposes, the Bayesian paradigm provides the natural framework for describing uncertainty associated with model choice and provides the most easily communicated basis for model weighting. Moreover, Bayesian arguments provide the sole justification for interpreting model weights (including AIC weights) as coherent (mathematically self consistent) model probabilities. This interpretation requires treating the model as an exact description of the data-generating mechanism. We discuss the implications of this assumption, and conclude that more emphasis is needed on model checking to provide confidence in the quality of inference.
Econophysical visualization of Adam Smith’s invisible hand
NASA Astrophysics Data System (ADS)
Cohen, Morrel H.; Eliazar, Iddo I.
2013-02-01
Consider a complex system whose macrostate is statistically observable, but yet whose operating mechanism is an unknown black-box. In this paper we address the problem of inferring, from the system’s macrostate statistics, the system’s intrinsic force yielding the observed statistics. The inference is established via two diametrically opposite approaches which result in the very same intrinsic force: a top-down approach based on the notion of entropy, and a bottom-up approach based on the notion of Langevin dynamics. The general results established are applied to the problem of visualizing the intrinsic socioeconomic force-Adam Smith’s invisible hand-shaping the distribution of wealth in human societies. Our analysis yields quantitative econophysical representations of figurative socioeconomic forces, quantitative definitions of “poor” and “rich”, and a quantitative characterization of the “poor-get-poorer” and the “rich-get-richer” phenomena.
When mechanism matters: Bayesian forecasting using models of ecological diffusion
Hefley, Trevor J.; Hooten, Mevin B.; Russell, Robin E.; Walsh, Daniel P.; Powell, James A.
2017-01-01
Ecological diffusion is a theory that can be used to understand and forecast spatio-temporal processes such as dispersal, invasion, and the spread of disease. Hierarchical Bayesian modelling provides a framework to make statistical inference and probabilistic forecasts, using mechanistic ecological models. To illustrate, we show how hierarchical Bayesian models of ecological diffusion can be implemented for large data sets that are distributed densely across space and time. The hierarchical Bayesian approach is used to understand and forecast the growth and geographic spread in the prevalence of chronic wasting disease in white-tailed deer (Odocoileus virginianus). We compare statistical inference and forecasts from our hierarchical Bayesian model to phenomenological regression-based methods that are commonly used to analyse spatial occurrence data. The mechanistic statistical model based on ecological diffusion led to important ecological insights, obviated a commonly ignored type of collinearity, and was the most accurate method for forecasting.
Teaching Statistical Inference for Causal Effects in Experiments and Observational Studies
ERIC Educational Resources Information Center
Rubin, Donald B.
2004-01-01
Inference for causal effects is a critical activity in many branches of science and public policy. The field of statistics is the one field most suited to address such problems, whether from designed experiments or observational studies. Consequently, it is arguably essential that departments of statistics teach courses in causal inference to both…
ERIC Educational Resources Information Center
Schochet, Peter Z.
2015-01-01
This report presents the statistical theory underlying the "RCT-YES" software that estimates and reports impacts for RCTs for a wide range of designs used in social policy research. The report discusses a unified, non-parametric design-based approach for impact estimation using the building blocks of the Neyman-Rubin-Holland causal…
Probabilistic Graphical Model Representation in Phylogenetics
Höhna, Sebastian; Heath, Tracy A.; Boussau, Bastien; Landis, Michael J.; Ronquist, Fredrik; Huelsenbeck, John P.
2014-01-01
Recent years have seen a rapid expansion of the model space explored in statistical phylogenetics, emphasizing the need for new approaches to statistical model representation and software development. Clear communication and representation of the chosen model is crucial for: (i) reproducibility of an analysis, (ii) model development, and (iii) software design. Moreover, a unified, clear and understandable framework for model representation lowers the barrier for beginners and nonspecialists to grasp complex phylogenetic models, including their assumptions and parameter/variable dependencies. Graphical modeling is a unifying framework that has gained in popularity in the statistical literature in recent years. The core idea is to break complex models into conditionally independent distributions. The strength lies in the comprehensibility, flexibility, and adaptability of this formalism, and the large body of computational work based on it. Graphical models are well-suited to teach statistical models, to facilitate communication among phylogeneticists and in the development of generic software for simulation and statistical inference. Here, we provide an introduction to graphical models for phylogeneticists and extend the standard graphical model representation to the realm of phylogenetics. We introduce a new graphical model component, tree plates, to capture the changing structure of the subgraph corresponding to a phylogenetic tree. We describe a range of phylogenetic models using the graphical model framework and introduce modules to simplify the representation of standard components in large and complex models. Phylogenetic model graphs can be readily used in simulation, maximum likelihood inference, and Bayesian inference using, for example, Metropolis–Hastings or Gibbs sampling of the posterior distribution. [Computation; graphical models; inference; modularization; statistical phylogenetics; tree plate.] PMID:24951559
Conditional statistical inference with multistage testing designs.
Zwitser, Robert J; Maris, Gunter
2015-03-01
In this paper it is demonstrated how statistical inference from multistage test designs can be made based on the conditional likelihood. Special attention is given to parameter estimation, as well as the evaluation of model fit. Two reasons are provided why the fit of simple measurement models is expected to be better in adaptive designs, compared to linear designs: more parameters are available for the same number of observations; and undesirable response behavior, like slipping and guessing, might be avoided owing to a better match between item difficulty and examinee proficiency. The results are illustrated with simulated data, as well as with real data.
Statistical inference for noisy nonlinear ecological dynamic systems.
Wood, Simon N
2010-08-26
Chaotic ecological dynamic systems defy conventional statistical analysis. Systems with near-chaotic dynamics are little better. Such systems are almost invariably driven by endogenous dynamic processes plus demographic and environmental process noise, and are only observable with error. Their sensitivity to history means that minute changes in the driving noise realization, or the system parameters, will cause drastic changes in the system trajectory. This sensitivity is inherited and amplified by the joint probability density of the observable data and the process noise, rendering it useless as the basis for obtaining measures of statistical fit. Because the joint density is the basis for the fit measures used by all conventional statistical methods, this is a major theoretical shortcoming. The inability to make well-founded statistical inferences about biological dynamic models in the chaotic and near-chaotic regimes, other than on an ad hoc basis, leaves dynamic theory without the methods of quantitative validation that are essential tools in the rest of biological science. Here I show that this impasse can be resolved in a simple and general manner, using a method that requires only the ability to simulate the observed data on a system from the dynamic model about which inferences are required. The raw data series are reduced to phase-insensitive summary statistics, quantifying local dynamic structure and the distribution of observations. Simulation is used to obtain the mean and the covariance matrix of the statistics, given model parameters, allowing the construction of a 'synthetic likelihood' that assesses model fit. This likelihood can be explored using a straightforward Markov chain Monte Carlo sampler, but one further post-processing step returns pure likelihood-based inference. I apply the method to establish the dynamic nature of the fluctuations in Nicholson's classic blowfly experiments.
Reasoning about Informal Statistical Inference: One Statistician's View
ERIC Educational Resources Information Center
Rossman, Allan J.
2008-01-01
This paper identifies key concepts and issues associated with the reasoning of informal statistical inference. I focus on key ideas of inference that I think all students should learn, including at secondary level as well as tertiary. I argue that a fundamental component of inference is to go beyond the data at hand, and I propose that statistical…
SIG-VISA: Signal-based Vertically Integrated Seismic Monitoring
NASA Astrophysics Data System (ADS)
Moore, D.; Mayeda, K. M.; Myers, S. C.; Russell, S.
2013-12-01
Traditional seismic monitoring systems rely on discrete detections produced by station processing software; however, while such detections may constitute a useful summary of station activity, they discard large amounts of information present in the original recorded signal. We present SIG-VISA (Signal-based Vertically Integrated Seismic Analysis), a system for seismic monitoring through Bayesian inference on seismic signals. By directly modeling the recorded signal, our approach incorporates additional information unavailable to detection-based methods, enabling higher sensitivity and more accurate localization using techniques such as waveform matching. SIG-VISA's Bayesian forward model of seismic signal envelopes includes physically-derived models of travel times and source characteristics as well as Gaussian process (kriging) statistical models of signal properties that combine interpolation of historical data with extrapolation of learned physical trends. Applying Bayesian inference, we evaluate the model on earthquakes as well as the 2009 DPRK test event, demonstrating a waveform matching effect as part of the probabilistic inference, along with results on event localization and sensitivity. In particular, we demonstrate increased sensitivity from signal-based modeling, in which the SIGVISA signal model finds statistical evidence for arrivals even at stations for which the IMS station processing failed to register any detection.
Unraveling multiple changes in complex climate time series using Bayesian inference
NASA Astrophysics Data System (ADS)
Berner, Nadine; Trauth, Martin H.; Holschneider, Matthias
2016-04-01
Change points in time series are perceived as heterogeneities in the statistical or dynamical characteristics of observations. Unraveling such transitions yields essential information for the understanding of the observed system. The precise detection and basic characterization of underlying changes is therefore of particular importance in environmental sciences. We present a kernel-based Bayesian inference approach to investigate direct as well as indirect climate observations for multiple generic transition events. In order to develop a diagnostic approach designed to capture a variety of natural processes, the basic statistical features of central tendency and dispersion are used to locally approximate a complex time series by a generic transition model. A Bayesian inversion approach is developed to robustly infer on the location and the generic patterns of such a transition. To systematically investigate time series for multiple changes occurring at different temporal scales, the Bayesian inversion is extended to a kernel-based inference approach. By introducing basic kernel measures, the kernel inference results are composed into a proxy probability to a posterior distribution of multiple transitions. Thus, based on a generic transition model a probability expression is derived that is capable to indicate multiple changes within a complex time series. We discuss the method's performance by investigating direct and indirect climate observations. The approach is applied to environmental time series (about 100 a), from the weather station in Tuscaloosa, Alabama, and confirms documented instrumentation changes. Moreover, the approach is used to investigate a set of complex terrigenous dust records from the ODP sites 659, 721/722 and 967 interpreted as climate indicators of the African region of the Plio-Pleistocene period (about 5 Ma). The detailed inference unravels multiple transitions underlying the indirect climate observations coinciding with established global climate events.
Cocco, Simona; Leibler, Stanislas; Monasson, Rémi
2009-01-01
Complexity of neural systems often makes impracticable explicit measurements of all interactions between their constituents. Inverse statistical physics approaches, which infer effective couplings between neurons from their spiking activity, have been so far hindered by their computational complexity. Here, we present 2 complementary, computationally efficient inverse algorithms based on the Ising and “leaky integrate-and-fire” models. We apply those algorithms to reanalyze multielectrode recordings in the salamander retina in darkness and under random visual stimulus. We find strong positive couplings between nearby ganglion cells common to both stimuli, whereas long-range couplings appear under random stimulus only. The uncertainty on the inferred couplings due to limitations in the recordings (duration, small area covered on the retina) is discussed. Our methods will allow real-time evaluation of couplings for large assemblies of neurons. PMID:19666487
A Test by Any Other Name: P Values, Bayes Factors, and Statistical Inference.
Stern, Hal S
2016-01-01
Procedures used for statistical inference are receiving increased scrutiny as the scientific community studies the factors associated with insuring reproducible research. This note addresses recent negative attention directed at p values, the relationship of confidence intervals and tests, and the role of Bayesian inference and Bayes factors, with an eye toward better understanding these different strategies for statistical inference. We argue that researchers and data analysts too often resort to binary decisions (e.g., whether to reject or accept the null hypothesis) in settings where this may not be required.
Hasegawa, Takanori; Yamaguchi, Rui; Nagasaki, Masao; Miyano, Satoru; Imoto, Seiya
2014-01-01
Comprehensive understanding of gene regulatory networks (GRNs) is a major challenge in the field of systems biology. Currently, there are two main approaches in GRN analysis using time-course observation data, namely an ordinary differential equation (ODE)-based approach and a statistical model-based approach. The ODE-based approach can generate complex dynamics of GRNs according to biologically validated nonlinear models. However, it cannot be applied to ten or more genes to simultaneously estimate system dynamics and regulatory relationships due to the computational difficulties. The statistical model-based approach uses highly abstract models to simply describe biological systems and to infer relationships among several hundreds of genes from the data. However, the high abstraction generates false regulations that are not permitted biologically. Thus, when dealing with several tens of genes of which the relationships are partially known, a method that can infer regulatory relationships based on a model with low abstraction and that can emulate the dynamics of ODE-based models while incorporating prior knowledge is urgently required. To accomplish this, we propose a method for inference of GRNs using a state space representation of a vector auto-regressive (VAR) model with L1 regularization. This method can estimate the dynamic behavior of genes based on linear time-series modeling constructed from an ODE-based model and can infer the regulatory structure among several tens of genes maximizing prediction ability for the observational data. Furthermore, the method is capable of incorporating various types of existing biological knowledge, e.g., drug kinetics and literature-recorded pathways. The effectiveness of the proposed method is shown through a comparison of simulation studies with several previous methods. For an application example, we evaluated mRNA expression profiles over time upon corticosteroid stimulation in rats, thus incorporating corticosteroid kinetics/dynamics, literature-recorded pathways and transcription factor (TF) information. PMID:25162401
Augmenting Latent Dirichlet Allocation and Rank Threshold Detection with Ontologies
2010-03-01
Probabilistic Latent Semantic Indexing (PLSI) is an automated indexing information retrieval model [20]. It is based on a statistical latent class model which is...uses a statistical foundation that is more accurate in finding hidden semantic relationships [20]. The model uses factor analysis of count data, number...principle of statistical infer- ence which asserts that all of the information in a sample is contained in the likelihood function [20]. The statistical
NASA Astrophysics Data System (ADS)
Luzzi, R.; Vasconcellos, A. R.; Ramos, J. G.; Rodrigues, C. G.
2018-01-01
We describe the formalism of statistical irreversible thermodynamics constructed based on Zubarev's nonequilibrium statistical operator (NSO) method, which is a powerful and universal tool for investigating the most varied physical phenomena. We present brief overviews of the statistical ensemble formalism and statistical irreversible thermodynamics. The first can be constructed either based on a heuristic approach or in the framework of information theory in the Jeffreys-Jaynes scheme of scientific inference; Zubarev and his school used both approaches in formulating the NSO method. We describe the main characteristics of statistical irreversible thermodynamics and discuss some particular considerations of several authors. We briefly describe how Rosenfeld, Bohr, and Prigogine proposed to derive a thermodynamic uncertainty principle.
The Use of a Context-Based Information Retrieval Technique
2009-07-01
provided in context. Latent Semantic Analysis (LSA) is a statistical technique for inferring contextual and structural information, and previous studies...WAIS). 10 DSTO-TR-2322 1.4.4 Latent Semantic Analysis LSA, which is also known as latent semantic indexing (LSI), uses a statistical and...1.4.6 Language Models In contrast, natural language models apply algorithms that combine statistical information with semantic information. Semantic
Using Guided Reinvention to Develop Teachers' Understanding of Hypothesis Testing Concepts
ERIC Educational Resources Information Center
Dolor, Jason; Noll, Jennifer
2015-01-01
Statistics education reform efforts emphasize the importance of informal inference in the learning of statistics. Research suggests statistics teachers experience similar difficulties understanding statistical inference concepts as students and how teacher knowledge can impact student learning. This study investigates how teachers reinvented an…
PREFACE: ELC International Meeting on Inference, Computation, and Spin Glasses (ICSG2013)
NASA Astrophysics Data System (ADS)
Kabashima, Yoshiyuki; Hukushima, Koji; Inoue, Jun-ichi; Tanaka, Toshiyuki; Watanabe, Osamu
2013-12-01
The close relationship between probability-based inference and statistical mechanics of disordered systems has been noted for some time. This relationship has provided researchers with a theoretical foundation in various fields of information processing for analytical performance evaluation and construction of efficient algorithms based on message-passing or Monte Carlo sampling schemes. The ELC International Meeting on 'Inference, Computation, and Spin Glasses (ICSG2013)', was held in Sapporo 28-30 July 2013. The meeting was organized as a satellite meeting of STATPHYS25 in order to offer a forum where concerned researchers can assemble and exchange information on the latest results and newly established methodologies, and discuss future directions of the interdisciplinary studies between statistical mechanics and information sciences. Financial support from Grant-in-Aid for Scientific Research on Innovative Areas, MEXT, Japan 'Exploring the Limits of Computation (ELC)' is gratefully acknowledged. We are pleased to publish 23 papers contributed by invited speakers of ICSG2013 in this volume of Journal of Physics: Conference Series. We hope that this volume will promote further development of this highly vigorous interdisciplinary field between statistical mechanics and information/computer science. Editors and ICSG2013 Organizing Committee: Koji Hukushima Jun-ichi Inoue (Local Chair of ICSG2013) Yoshiyuki Kabashima (Editor-in-Chief) Toshiyuki Tanaka Osamu Watanabe (General Chair of ICSG2013)
Computer-Based Instruction in Statistical Inference; Final Report. Technical Memorandum (TM Series).
ERIC Educational Resources Information Center
Rosenbaum, J.; And Others
A two-year investigation into the development of computer-assisted instruction (CAI) for the improvement of undergraduate training in statistics was undertaken. The first year was largely devoted to designing PLANIT (Programming LANguage for Interactive Teaching) which reduces, or completely eliminates, the need an author of CAI lessons would…
A Modeling Approach to the Development of Students' Informal Inferential Reasoning
ERIC Educational Resources Information Center
Doerr, Helen M.; Delmas, Robert; Makar, Katie
2017-01-01
Teaching from an informal statistical inference perspective can address the challenge of teaching statistics in a coherent way. We argue that activities that promote model-based reasoning address two additional challenges: providing a coherent sequence of topics and promoting the application of knowledge to novel situations. We take a models and…
ERIC Educational Resources Information Center
Larwin, Karen H.; Larwin, David A.
2011-01-01
Bootstrapping methods and random distribution methods are increasingly recommended as better approaches for teaching students about statistical inference in introductory-level statistics courses. The authors examined the effect of teaching undergraduate business statistics students using random distribution and bootstrapping simulations. It is the…
Bayesian multimodel inference for dose-response studies
Link, W.A.; Albers, P.H.
2007-01-01
Statistical inference in dose?response studies is model-based: The analyst posits a mathematical model of the relation between exposure and response, estimates parameters of the model, and reports conclusions conditional on the model. Such analyses rarely include any accounting for the uncertainties associated with model selection. The Bayesian inferential system provides a convenient framework for model selection and multimodel inference. In this paper we briefly describe the Bayesian paradigm and Bayesian multimodel inference. We then present a family of models for multinomial dose?response data and apply Bayesian multimodel inferential methods to the analysis of data on the reproductive success of American kestrels (Falco sparveriuss) exposed to various sublethal dietary concentrations of methylmercury.
Application of Transformations in Parametric Inference
ERIC Educational Resources Information Center
Brownstein, Naomi; Pensky, Marianna
2008-01-01
The objective of the present paper is to provide a simple approach to statistical inference using the method of transformations of variables. We demonstrate performance of this powerful tool on examples of constructions of various estimation procedures, hypothesis testing, Bayes analysis and statistical inference for the stress-strength systems.…
IMNN: Information Maximizing Neural Networks
NASA Astrophysics Data System (ADS)
Charnock, Tom; Lavaux, Guilhem; Wandelt, Benjamin D.
2018-04-01
This software trains artificial neural networks to find non-linear functionals of data that maximize Fisher information: information maximizing neural networks (IMNNs). As compressing large data sets vastly simplifies both frequentist and Bayesian inference, important information may be inadvertently missed. Likelihood-free inference based on automatically derived IMNN summaries produces summaries that are good approximations to sufficient statistics. IMNNs are robustly capable of automatically finding optimal, non-linear summaries of the data even in cases where linear compression fails: inferring the variance of Gaussian signal in the presence of noise, inferring cosmological parameters from mock simulations of the Lyman-α forest in quasar spectra, and inferring frequency-domain parameters from LISA-like detections of gravitational waveforms. In this final case, the IMNN summary outperforms linear data compression by avoiding the introduction of spurious likelihood maxima.
Robust Strategy for Rocket Engine Health Monitoring
NASA Technical Reports Server (NTRS)
Santi, L. Michael
2001-01-01
Monitoring the health of rocket engine systems is essentially a two-phase process. The acquisition phase involves sensing physical conditions at selected locations, converting physical inputs to electrical signals, conditioning the signals as appropriate to establish scale or filter interference, and recording results in a form that is easy to interpret. The inference phase involves analysis of results from the acquisition phase, comparison of analysis results to established health measures, and assessment of health indications. A variety of analytical tools may be employed in the inference phase of health monitoring. These tools can be separated into three broad categories: statistical, rule based, and model based. Statistical methods can provide excellent comparative measures of engine operating health. They require well-characterized data from an ensemble of "typical" engines, or "golden" data from a specific test assumed to define the operating norm in order to establish reliable comparative measures. Statistical methods are generally suitable for real-time health monitoring because they do not deal with the physical complexities of engine operation. The utility of statistical methods in rocket engine health monitoring is hindered by practical limits on the quantity and quality of available data. This is due to the difficulty and high cost of data acquisition, the limited number of available test engines, and the problem of simulating flight conditions in ground test facilities. In addition, statistical methods incur a penalty for disregarding flow complexity and are therefore limited in their ability to define performance shift causality. Rule based methods infer the health state of the engine system based on comparison of individual measurements or combinations of measurements with defined health norms or rules. This does not mean that rule based methods are necessarily simple. Although binary yes-no health assessment can sometimes be established by relatively simple rules, the causality assignment needed for refined health monitoring often requires an exceptionally complex rule base involving complicated logical maps. Structuring the rule system to be clear and unambiguous can be difficult, and the expert input required to maintain a large logic network and associated rule base can be prohibitive.
Standard deviation and standard error of the mean.
Lee, Dong Kyu; In, Junyong; Lee, Sangseok
2015-06-01
In most clinical and experimental studies, the standard deviation (SD) and the estimated standard error of the mean (SEM) are used to present the characteristics of sample data and to explain statistical analysis results. However, some authors occasionally muddle the distinctive usage between the SD and SEM in medical literature. Because the process of calculating the SD and SEM includes different statistical inferences, each of them has its own meaning. SD is the dispersion of data in a normal distribution. In other words, SD indicates how accurately the mean represents sample data. However the meaning of SEM includes statistical inference based on the sampling distribution. SEM is the SD of the theoretical distribution of the sample means (the sampling distribution). While either SD or SEM can be applied to describe data and statistical results, one should be aware of reasonable methods with which to use SD and SEM. We aim to elucidate the distinctions between SD and SEM and to provide proper usage guidelines for both, which summarize data and describe statistical results.
Standard deviation and standard error of the mean
In, Junyong; Lee, Sangseok
2015-01-01
In most clinical and experimental studies, the standard deviation (SD) and the estimated standard error of the mean (SEM) are used to present the characteristics of sample data and to explain statistical analysis results. However, some authors occasionally muddle the distinctive usage between the SD and SEM in medical literature. Because the process of calculating the SD and SEM includes different statistical inferences, each of them has its own meaning. SD is the dispersion of data in a normal distribution. In other words, SD indicates how accurately the mean represents sample data. However the meaning of SEM includes statistical inference based on the sampling distribution. SEM is the SD of the theoretical distribution of the sample means (the sampling distribution). While either SD or SEM can be applied to describe data and statistical results, one should be aware of reasonable methods with which to use SD and SEM. We aim to elucidate the distinctions between SD and SEM and to provide proper usage guidelines for both, which summarize data and describe statistical results. PMID:26045923
Estimation versus falsification approaches in sport and exercise science.
Wilkinson, Michael; Winter, Edward M
2018-05-22
There has been a recent resurgence in debate about methods for statistical inference in science. The debate addresses statistical concepts and their impact on the value and meaning of analyses' outcomes. In contrast, philosophical underpinnings of approaches and the extent to which analytical tools match philosophical goals of the scientific method have received less attention. This short piece considers application of the scientific method to "what-is-the-influence-of x-on-y" type questions characteristic of sport and exercise science. We consider applications and interpretations of estimation versus falsification based statistical approaches and their value in addressing how much x influences y, and in measurement error and method agreement settings. We compare estimation using magnitude based inference (MBI) with falsification using null hypothesis significance testing (NHST), and highlight the limited value both of falsification and NHST to address problems in sport and exercise science. We recommend adopting an estimation approach, expressing the uncertainty of effects of x on y, and their practical/clinical value against pre-determined effect magnitudes using MBI.
Reliability of a Measure of Institutional Discrimination against Minorities
1979-12-01
samples are presented. The first is based upon classical statistical theory and the second derives from a series of computer-generated Monte Carlo...Institutional racism and sexism . Englewood Cliffs, N. J.: Prentice-Hall, Inc., 1978. Hays, W. L. and Winkler, R. L. Statistics : probability, inference... statistical measure of the e of institutional discrimination are discussed. Two methods of dealing with the problem of reliability of the measure in small
Statistical inference for tumor growth inhibition T/C ratio.
Wu, Jianrong
2010-09-01
The tumor growth inhibition T/C ratio is commonly used to quantify treatment effects in drug screening tumor xenograft experiments. The T/C ratio is converted to an antitumor activity rating using an arbitrary cutoff point and often without any formal statistical inference. Here, we applied a nonparametric bootstrap method and a small sample likelihood ratio statistic to make a statistical inference of the T/C ratio, including both hypothesis testing and a confidence interval estimate. Furthermore, sample size and power are also discussed for statistical design of tumor xenograft experiments. Tumor xenograft data from an actual experiment were analyzed to illustrate the application.
Investigating Mathematics Teachers' Thoughts of Statistical Inference
ERIC Educational Resources Information Center
Yang, Kai-Lin
2012-01-01
Research on statistical cognition and application suggests that statistical inference concepts are commonly misunderstood by students and even misinterpreted by researchers. Although some research has been done on students' misunderstanding or misconceptions of confidence intervals (CIs), few studies explore either students' or mathematics…
Lessons from Inferentialism for Statistics Education
ERIC Educational Resources Information Center
Bakker, Arthur; Derry, Jan
2011-01-01
This theoretical paper relates recent interest in informal statistical inference (ISI) to the semantic theory termed inferentialism, a significant development in contemporary philosophy, which places inference at the heart of human knowing. This theory assists epistemological reflection on challenges in statistics education encountered when…
Statistical Inference and Patterns of Inequality in the Global North
ERIC Educational Resources Information Center
Moran, Timothy Patrick
2006-01-01
Cross-national inequality trends have historically been a crucial field of inquiry across the social sciences, and new methodological techniques of statistical inference have recently improved the ability to analyze these trends over time. This paper applies Monte Carlo, bootstrap inference methods to the income surveys of the Luxembourg Income…
Representing Micro-Macro Linkages by Actor-Based Dynamic Network Models
ERIC Educational Resources Information Center
Snijders, Tom A. B.; Steglich, Christian E. G.
2015-01-01
Stochastic actor-based models for network dynamics have the primary aim of statistical inference about processes of network change, but may be regarded as a kind of agent-based models. Similar to many other agent-based models, they are based on local rules for actor behavior. Different from many other agent-based models, by including elements of…
Statistical Inference and Simulation with StatKey
ERIC Educational Resources Information Center
Quinn, Anne
2016-01-01
While looking for an inexpensive technology package to help students in statistics classes, the author found StatKey, a free Web-based app. Not only is StatKey useful for students' year-end projects, but it is also valuable for helping students learn fundamental content such as the central limit theorem. Using StatKey, students can engage in…
Statistics, Computation, and Modeling in Cosmology
NASA Astrophysics Data System (ADS)
Jewell, Jeff; Guiness, Joe; SAMSI 2016 Working Group in Cosmology
2017-01-01
Current and future ground and space based missions are designed to not only detect, but map out with increasing precision, details of the universe in its infancy to the present-day. As a result we are faced with the challenge of analyzing and interpreting observations from a wide variety of instruments to form a coherent view of the universe. Finding solutions to a broad range of challenging inference problems in cosmology is one of the goals of the “Statistics, Computation, and Modeling in Cosmology” workings groups, formed as part of the year long program on ‘Statistical, Mathematical, and Computational Methods for Astronomy’, hosted by the Statistical and Applied Mathematical Sciences Institute (SAMSI), a National Science Foundation funded institute. Two application areas have emerged for focused development in the cosmology working group involving advanced algorithmic implementations of exact Bayesian inference for the Cosmic Microwave Background, and statistical modeling of galaxy formation. The former includes study and development of advanced Markov Chain Monte Carlo algorithms designed to confront challenging inference problems including inference for spatial Gaussian random fields in the presence of sources of galactic emission (an example of a source separation problem). Extending these methods to future redshift survey data probing the nonlinear regime of large scale structure formation is also included in the working group activities. In addition, the working group is also focused on the study of ‘Galacticus’, a galaxy formation model applied to dark matter-only cosmological N-body simulations operating on time-dependent halo merger trees. The working group is interested in calibrating the Galacticus model to match statistics of galaxy survey observations; specifically stellar mass functions, luminosity functions, and color-color diagrams. The group will use subsampling approaches and fractional factorial designs to statistically and computationally efficiently explore the Galacticus parameter space. The group will also use the Galacticus simulations to study the relationship between the topological and physical structure of the halo merger trees and the properties of the resulting galaxies.
Dorazio, R.M.; Johnson, F.A.
2003-01-01
Bayesian inference and decision theory may be used in the solution of relatively complex problems of natural resource management, owing to recent advances in statistical theory and computing. In particular, Markov chain Monte Carlo algorithms provide a computational framework for fitting models of adequate complexity and for evaluating the expected consequences of alternative management actions. We illustrate these features using an example based on management of waterfowl habitat.
Statistical analysis of fNIRS data: a comprehensive review.
Tak, Sungho; Ye, Jong Chul
2014-01-15
Functional near-infrared spectroscopy (fNIRS) is a non-invasive method to measure brain activities using the changes of optical absorption in the brain through the intact skull. fNIRS has many advantages over other neuroimaging modalities such as positron emission tomography (PET), functional magnetic resonance imaging (fMRI), or magnetoencephalography (MEG), since it can directly measure blood oxygenation level changes related to neural activation with high temporal resolution. However, fNIRS signals are highly corrupted by measurement noises and physiology-based systemic interference. Careful statistical analyses are therefore required to extract neuronal activity-related signals from fNIRS data. In this paper, we provide an extensive review of historical developments of statistical analyses of fNIRS signal, which include motion artifact correction, short source-detector separation correction, principal component analysis (PCA)/independent component analysis (ICA), false discovery rate (FDR), serially-correlated errors, as well as inference techniques such as the standard t-test, F-test, analysis of variance (ANOVA), and statistical parameter mapping (SPM) framework. In addition, to provide a unified view of various existing inference techniques, we explain a linear mixed effect model with restricted maximum likelihood (ReML) variance estimation, and show that most of the existing inference methods for fNIRS analysis can be derived as special cases. Some of the open issues in statistical analysis are also described. Copyright © 2013 Elsevier Inc. All rights reserved.
Permutation-based inference for the AUC: A unified approach for continuous and discontinuous data.
Pauly, Markus; Asendorf, Thomas; Konietschke, Frank
2016-11-01
We investigate rank-based studentized permutation methods for the nonparametric Behrens-Fisher problem, that is, inference methods for the area under the ROC curve. We hereby prove that the studentized permutation distribution of the Brunner-Munzel rank statistic is asymptotically standard normal, even under the alternative. Thus, incidentally providing the hitherto missing theoretical foundation for the Neubert and Brunner studentized permutation test. In particular, we do not only show its consistency, but also that confidence intervals for the underlying treatment effects can be computed by inverting this permutation test. In addition, we derive permutation-based range-preserving confidence intervals. Extensive simulation studies show that the permutation-based confidence intervals appear to maintain the preassigned coverage probability quite accurately (even for rather small sample sizes). For a convenient application of the proposed methods, a freely available software package for the statistical software R has been developed. A real data example illustrates the application. © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
CADDIS Volume 4. Data Analysis: Biological and Environmental Data Requirements
Overview of PECBO Module, using scripts to infer environmental conditions from biological observations, statistically estimating species-environment relationships, methods for inferring environmental conditions, statistical scripts in module.
Statistical methods for the beta-binomial model in teratology.
Yamamoto, E; Yanagimoto, T
1994-01-01
The beta-binomial model is widely used for analyzing teratological data involving littermates. Recent developments in statistical analyses of teratological data are briefly reviewed with emphasis on the model. For statistical inference of the parameters in the beta-binomial distribution, separation of the likelihood introduces an likelihood inference. This leads to reducing biases of estimators and also to improving accuracy of empirical significance levels of tests. Separate inference of the parameters can be conducted in a unified way. PMID:8187716
Liu, Fang; Eugenio, Evercita C
2018-04-01
Beta regression is an increasingly popular statistical technique in medical research for modeling of outcomes that assume values in (0, 1), such as proportions and patient reported outcomes. When outcomes take values in the intervals [0,1), (0,1], or [0,1], zero-or-one-inflated beta (zoib) regression can be used. We provide a thorough review on beta regression and zoib regression in the modeling, inferential, and computational aspects via the likelihood-based and Bayesian approaches. We demonstrate the statistical and practical importance of correctly modeling the inflation at zero/one rather than ad hoc replacing them with values close to zero/one via simulation studies; the latter approach can lead to biased estimates and invalid inferences. We show via simulation studies that the likelihood-based approach is computationally faster in general than MCMC algorithms used in the Bayesian inferences, but runs the risk of non-convergence, large biases, and sensitivity to starting values in the optimization algorithm especially with clustered/correlated data, data with sparse inflation at zero and one, and data that warrant regularization of the likelihood. The disadvantages of the regular likelihood-based approach make the Bayesian approach an attractive alternative in these cases. Software packages and tools for fitting beta and zoib regressions in both the likelihood-based and Bayesian frameworks are also reviewed.
Interpretation of correlations in clinical research.
Hung, Man; Bounsanga, Jerry; Voss, Maren Wright
2017-11-01
Critically analyzing research is a key skill in evidence-based practice and requires knowledge of research methods, results interpretation, and applications, all of which rely on a foundation based in statistics. Evidence-based practice makes high demands on trained medical professionals to interpret an ever-expanding array of research evidence. As clinical training emphasizes medical care rather than statistics, it is useful to review the basics of statistical methods and what they mean for interpreting clinical studies. We reviewed the basic concepts of correlational associations, violations of normality, unobserved variable bias, sample size, and alpha inflation. The foundations of causal inference were discussed and sound statistical analyses were examined. We discuss four ways in which correlational analysis is misused, including causal inference overreach, over-reliance on significance, alpha inflation, and sample size bias. Recent published studies in the medical field provide evidence of causal assertion overreach drawn from correlational findings. The findings present a primer on the assumptions and nature of correlational methods of analysis and urge clinicians to exercise appropriate caution as they critically analyze the evidence before them and evaluate evidence that supports practice. Critically analyzing new evidence requires statistical knowledge in addition to clinical knowledge. Studies can overstate relationships, expressing causal assertions when only correlational evidence is available. Failure to account for the effect of sample size in the analyses tends to overstate the importance of predictive variables. It is important not to overemphasize the statistical significance without consideration of effect size and whether differences could be considered clinically meaningful.
Direct evidence for a dual process model of deductive inference.
Markovits, Henry; Brunet, Marie-Laurence; Thompson, Valerie; Brisson, Janie
2013-07-01
In 2 experiments, we tested a strong version of a dual process theory of conditional inference (cf. Verschueren et al., 2005a, 2005b) that assumes that most reasoners have 2 strategies available, the choice of which is determined by situational variables, cognitive capacity, and metacognitive control. The statistical strategy evaluates inferences probabilistically, accepting those with high conditional probability. The counterexample strategy rejects inferences when a counterexample shows the inference to be invalid. To discriminate strategy use, we presented reasoners with conditional statements (if p, then q) and explicit statistical information about the relative frequency of the probability of p/q (50% vs. 90%). A statistical strategy would accept the more probable inferences more frequently, whereas the counterexample one would reject both. In Experiment 1, reasoners under time pressure used the statistical strategy more, but switched to the counterexample strategy when time constraints were removed; the former took less time than the latter. These data are consistent with the hypothesis that the statistical strategy is the default heuristic. Under a free-time condition, reasoners preferred the counterexample strategy and kept it when put under time pressure. Thus, it is not simply a lack of capacity that produces a statistical strategy; instead, it seems that time pressure disrupts the ability to make good metacognitive choices. In line with this conclusion, in a 2nd experiment, we measured reasoners' confidence in their performance; those under time pressure were less confident in the statistical than the counterexample strategy and more likely to switch strategies under free-time conditions. PsycINFO Database Record (c) 2013 APA, all rights reserved.
Visual shape perception as Bayesian inference of 3D object-centered shape representations.
Erdogan, Goker; Jacobs, Robert A
2017-11-01
Despite decades of research, little is known about how people visually perceive object shape. We hypothesize that a promising approach to shape perception is provided by a "visual perception as Bayesian inference" framework which augments an emphasis on visual representation with an emphasis on the idea that shape perception is a form of statistical inference. Our hypothesis claims that shape perception of unfamiliar objects can be characterized as statistical inference of 3D shape in an object-centered coordinate system. We describe a computational model based on our theoretical framework, and provide evidence for the model along two lines. First, we show that, counterintuitively, the model accounts for viewpoint-dependency of object recognition, traditionally regarded as evidence against people's use of 3D object-centered shape representations. Second, we report the results of an experiment using a shape similarity task, and present an extensive evaluation of existing models' abilities to account for the experimental data. We find that our shape inference model captures subjects' behaviors better than competing models. Taken as a whole, our experimental and computational results illustrate the promise of our approach and suggest that people's shape representations of unfamiliar objects are probabilistic, 3D, and object-centered. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
Nonparametric predictive inference for combining diagnostic tests with parametric copula
NASA Astrophysics Data System (ADS)
Muhammad, Noryanti; Coolen, F. P. A.; Coolen-Maturi, T.
2017-09-01
Measuring the accuracy of diagnostic tests is crucial in many application areas including medicine and health care. The Receiver Operating Characteristic (ROC) curve is a popular statistical tool for describing the performance of diagnostic tests. The area under the ROC curve (AUC) is often used as a measure of the overall performance of the diagnostic test. In this paper, we interest in developing strategies for combining test results in order to increase the diagnostic accuracy. We introduce nonparametric predictive inference (NPI) for combining two diagnostic test results with considering dependence structure using parametric copula. NPI is a frequentist statistical framework for inference on a future observation based on past data observations. NPI uses lower and upper probabilities to quantify uncertainty and is based on only a few modelling assumptions. While copula is a well-known statistical concept for modelling dependence of random variables. A copula is a joint distribution function whose marginals are all uniformly distributed and it can be used to model the dependence separately from the marginal distributions. In this research, we estimate the copula density using a parametric method which is maximum likelihood estimator (MLE). We investigate the performance of this proposed method via data sets from the literature and discuss results to show how our method performs for different family of copulas. Finally, we briefly outline related challenges and opportunities for future research.
Data Analysis Techniques for Physical Scientists
NASA Astrophysics Data System (ADS)
Pruneau, Claude A.
2017-10-01
Preface; How to read this book; 1. The scientific method; Part I. Foundation in Probability and Statistics: 2. Probability; 3. Probability models; 4. Classical inference I: estimators; 5. Classical inference II: optimization; 6. Classical inference III: confidence intervals and statistical tests; 7. Bayesian inference; Part II. Measurement Techniques: 8. Basic measurements; 9. Event reconstruction; 10. Correlation functions; 11. The multiple facets of correlation functions; 12. Data correction methods; Part III. Simulation Techniques: 13. Monte Carlo methods; 14. Collision and detector modeling; List of references; Index.
Overview of PECBO Module, using scripts to infer environmental conditions from biological observations, statistically estimating species-environment relationships, methods for inferring environmental conditions, statistical scripts in module.
Imputation approaches for animal movement modeling
Scharf, Henry; Hooten, Mevin B.; Johnson, Devin S.
2017-01-01
The analysis of telemetry data is common in animal ecological studies. While the collection of telemetry data for individual animals has improved dramatically, the methods to properly account for inherent uncertainties (e.g., measurement error, dependence, barriers to movement) have lagged behind. Still, many new statistical approaches have been developed to infer unknown quantities affecting animal movement or predict movement based on telemetry data. Hierarchical statistical models are useful to account for some of the aforementioned uncertainties, as well as provide population-level inference, but they often come with an increased computational burden. For certain types of statistical models, it is straightforward to provide inference if the latent true animal trajectory is known, but challenging otherwise. In these cases, approaches related to multiple imputation have been employed to account for the uncertainty associated with our knowledge of the latent trajectory. Despite the increasing use of imputation approaches for modeling animal movement, the general sensitivity and accuracy of these methods have not been explored in detail. We provide an introduction to animal movement modeling and describe how imputation approaches may be helpful for certain types of models. We also assess the performance of imputation approaches in two simulation studies. Our simulation studies suggests that inference for model parameters directly related to the location of an individual may be more accurate than inference for parameters associated with higher-order processes such as velocity or acceleration. Finally, we apply these methods to analyze a telemetry data set involving northern fur seals (Callorhinus ursinus) in the Bering Sea. Supplementary materials accompanying this paper appear online.
IMAGINE: Interstellar MAGnetic field INference Engine
NASA Astrophysics Data System (ADS)
Steininger, Theo
2018-03-01
IMAGINE (Interstellar MAGnetic field INference Engine) performs inference on generic parametric models of the Galaxy. The modular open source framework uses highly optimized tools and technology such as the MultiNest sampler (ascl:1109.006) and the information field theory framework NIFTy (ascl:1302.013) to create an instance of the Milky Way based on a set of parameters for physical observables, using Bayesian statistics to judge the mismatch between measured data and model prediction. The flexibility of the IMAGINE framework allows for simple refitting for newly available data sets and makes state-of-the-art Bayesian methods easily accessible particularly for random components of the Galactic magnetic field.
Statistical Inference for Porous Materials using Persistent Homology.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Moon, Chul; Heath, Jason E.; Mitchell, Scott A.
2017-12-01
We propose a porous materials analysis pipeline using persistent homology. We rst compute persistent homology of binarized 3D images of sampled material subvolumes. For each image we compute sets of homology intervals, which are represented as summary graphics called persistence diagrams. We convert persistence diagrams into image vectors in order to analyze the similarity of the homology of the material images using the mature tools for image analysis. Each image is treated as a vector and we compute its principal components to extract features. We t a statistical model using the loadings of principal components to estimate material porosity, permeability,more » anisotropy, and tortuosity. We also propose an adaptive version of the structural similarity index (SSIM), a similarity metric for images, as a measure to determine the statistical representative elementary volumes (sREV) for persistence homology. Thus we provide a capability for making a statistical inference of the uid ow and transport properties of porous materials based on their geometry and connectivity.« less
Statistics for nuclear engineers and scientists. Part 1. Basic statistical inference
DOE Office of Scientific and Technical Information (OSTI.GOV)
Beggs, W.J.
1981-02-01
This report is intended for the use of engineers and scientists working in the nuclear industry, especially at the Bettis Atomic Power Laboratory. It serves as the basis for several Bettis in-house statistics courses. The objectives of the report are to introduce the reader to the language and concepts of statistics and to provide a basic set of techniques to apply to problems of the collection and analysis of data. Part 1 covers subjects of basic inference. The subjects include: descriptive statistics; probability; simple inference for normally distributed populations, and for non-normal populations as well; comparison of two populations; themore » analysis of variance; quality control procedures; and linear regression analysis.« less
Stochastic inference with spiking neurons in the high-conductance state
NASA Astrophysics Data System (ADS)
Petrovici, Mihai A.; Bill, Johannes; Bytschok, Ilja; Schemmel, Johannes; Meier, Karlheinz
2016-10-01
The highly variable dynamics of neocortical circuits observed in vivo have been hypothesized to represent a signature of ongoing stochastic inference but stand in apparent contrast to the deterministic response of neurons measured in vitro. Based on a propagation of the membrane autocorrelation across spike bursts, we provide an analytical derivation of the neural activation function that holds for a large parameter space, including the high-conductance state. On this basis, we show how an ensemble of leaky integrate-and-fire neurons with conductance-based synapses embedded in a spiking environment can attain the correct firing statistics for sampling from a well-defined target distribution. For recurrent networks, we examine convergence toward stationarity in computer simulations and demonstrate sample-based Bayesian inference in a mixed graphical model. This points to a new computational role of high-conductance states and establishes a rigorous link between deterministic neuron models and functional stochastic dynamics on the network level.
Developing Young Children's Emergent Inferential Practices in Statistics
ERIC Educational Resources Information Center
Makar, Katie
2016-01-01
Informal statistical inference has now been researched at all levels of schooling and initial tertiary study. Work in informal statistical inference is least understood in the early years, where children have had little if any exposure to data handling. A qualitative study in Australia was carried out through a series of teaching experiments with…
A Story-Based Simulation for Teaching Sampling Distributions
ERIC Educational Resources Information Center
Turner, Stephen; Dabney, Alan R.
2015-01-01
Statistical inference relies heavily on the concept of sampling distributions. However, sampling distributions are difficult to teach. We present a series of short animations that are story-based, with associated assessments. We hope that our contribution can be useful as a tool to teach sampling distributions in the introductory statistics…
Cloern, James E.; Jassby, Alan D.; Carstensen, Jacob; Bennett, William A.; Kimmerer, Wim; Mac Nally, Ralph; Schoellhamer, David H.; Winder, Monika
2012-01-01
We comment on a nonstandard statistical treatment of time-series data first published by Breton et al. (2006) in Limnology and Oceanography and, more recently, used by Glibert (2010) in Reviews in Fisheries Science. In both papers, the authors make strong inferences about the underlying causes of population variability based on correlations between cumulative sum (CUSUM) transformations of organism abundances and environmental variables. Breton et al. (2006) reported correlations between CUSUM-transformed values of diatom biomass in Belgian coastal waters and the North Atlantic Oscillation, and between meteorological and hydrological variables. Each correlation of CUSUM-transformed variables was judged to be statistically significant. On the basis of these correlations, Breton et al. (2006) developed "the first evidence of synergy between climate and human-induced river-based nitrate inputs with respect to their effects on the magnitude of spring Phaeocystis colony blooms and their dominance over diatoms."
Experimental and environmental factors affect spurious detection of ecological thresholds
Daily, Jonathan P.; Hitt, Nathaniel P.; Smith, David; Snyder, Craig D.
2012-01-01
Threshold detection methods are increasingly popular for assessing nonlinear responses to environmental change, but their statistical performance remains poorly understood. We simulated linear change in stream benthic macroinvertebrate communities and evaluated the performance of commonly used threshold detection methods based on model fitting (piecewise quantile regression [PQR]), data partitioning (nonparametric change point analysis [NCPA]), and a hybrid approach (significant zero crossings [SiZer]). We demonstrated that false detection of ecological thresholds (type I errors) and inferences on threshold locations are influenced by sample size, rate of linear change, and frequency of observations across the environmental gradient (i.e., sample-environment distribution, SED). However, the relative importance of these factors varied among statistical methods and between inference types. False detection rates were influenced primarily by user-selected parameters for PQR (τ) and SiZer (bandwidth) and secondarily by sample size (for PQR) and SED (for SiZer). In contrast, the location of reported thresholds was influenced primarily by SED. Bootstrapped confidence intervals for NCPA threshold locations revealed strong correspondence to SED. We conclude that the choice of statistical methods for threshold detection should be matched to experimental and environmental constraints to minimize false detection rates and avoid spurious inferences regarding threshold location.
duVerle, David A; Yotsukura, Sohiya; Nomura, Seitaro; Aburatani, Hiroyuki; Tsuda, Koji
2016-09-13
Single-cell RNA sequencing is fast becoming one the standard method for gene expression measurement, providing unique insights into cellular processes. A number of methods, based on general dimensionality reduction techniques, have been suggested to help infer and visualise the underlying structure of cell populations from single-cell expression levels, yet their models generally lack proper biological grounding and struggle at identifying complex differentiation paths. Here we introduce cellTree: an R/Bioconductor package that uses a novel statistical approach, based on document analysis techniques, to produce tree structures outlining the hierarchical relationship between single-cell samples, while identifying latent groups of genes that can provide biological insights. With cellTree, we provide experimentalists with an easy-to-use tool, based on statistically and biologically-sound algorithms, to efficiently explore and visualise single-cell RNA data. The cellTree package is publicly available in the online Bionconductor repository at: http://bioconductor.org/packages/cellTree/ .
Using GIS to generate spatially balanced random survey designs for natural resource applications.
Theobald, David M; Stevens, Don L; White, Denis; Urquhart, N Scott; Olsen, Anthony R; Norman, John B
2007-07-01
Sampling of a population is frequently required to understand trends and patterns in natural resource management because financial and time constraints preclude a complete census. A rigorous probability-based survey design specifies where to sample so that inferences from the sample apply to the entire population. Probability survey designs should be used in natural resource and environmental management situations because they provide the mathematical foundation for statistical inference. Development of long-term monitoring designs demand survey designs that achieve statistical rigor and are efficient but remain flexible to inevitable logistical or practical constraints during field data collection. Here we describe an approach to probability-based survey design, called the Reversed Randomized Quadrant-Recursive Raster, based on the concept of spatially balanced sampling and implemented in a geographic information system. This provides environmental managers a practical tool to generate flexible and efficient survey designs for natural resource applications. Factors commonly used to modify sampling intensity, such as categories, gradients, or accessibility, can be readily incorporated into the spatially balanced sample design.
Basis function models for animal movement
Hooten, Mevin B.; Johnson, Devin S.
2017-01-01
Advances in satellite-based data collection techniques have served as a catalyst for new statistical methodology to analyze these data. In wildlife ecological studies, satellite-based data and methodology have provided a wealth of information about animal space use and the investigation of individual-based animal–environment relationships. With the technology for data collection improving dramatically over time, we are left with massive archives of historical animal telemetry data of varying quality. While many contemporary statistical approaches for inferring movement behavior are specified in discrete time, we develop a flexible continuous-time stochastic integral equation framework that is amenable to reduced-rank second-order covariance parameterizations. We demonstrate how the associated first-order basis functions can be constructed to mimic behavioral characteristics in realistic trajectory processes using telemetry data from mule deer and mountain lion individuals in western North America. Our approach is parallelizable and provides inference for heterogenous trajectories using nonstationary spatial modeling techniques that are feasible for large telemetry datasets. Supplementary materials for this article are available online.
Ionospheric and Birkeland current distributions inferred from the MAGSAT magnetometer data
NASA Technical Reports Server (NTRS)
Zanetti, L. J.; Potemra, T. A.; Baumjohann, W.
1983-01-01
Ionospheric and field-aligned sheet current density distributions are presently inferred by means of MAGSAT vector magnetometer data, together with an accurate magnetic field model. By comparing Hall current densities inferred from the MAGSAT data and those inferred from simultaneously recorded ground based data acquired by the Scandinavian magnetometer array, it is determined that the former have previously been underestimated due to high damping of magnetic variations with high spatial wave numbers between the ionosphere and the MAGSAT orbit. Among important results of this study is noted the fact that the Birkeland and electrojet current systems are colocated. The analyses have shown a tendency for triangular rather than constant electrojet current distributions as a function of latitude, consistent with the statistical, uniform regions 1 and 2 Birkeland current patterns.
Notes on power of normality tests of error terms in regression models
DOE Office of Scientific and Technical Information (OSTI.GOV)
Střelec, Luboš
2015-03-10
Normality is one of the basic assumptions in applying statistical procedures. For example in linear regression most of the inferential procedures are based on the assumption of normality, i.e. the disturbance vector is assumed to be normally distributed. Failure to assess non-normality of the error terms may lead to incorrect results of usual statistical inference techniques such as t-test or F-test. Thus, error terms should be normally distributed in order to allow us to make exact inferences. As a consequence, normally distributed stochastic errors are necessary in order to make a not misleading inferences which explains a necessity and importancemore » of robust tests of normality. Therefore, the aim of this contribution is to discuss normality testing of error terms in regression models. In this contribution, we introduce the general RT class of robust tests for normality, and present and discuss the trade-off between power and robustness of selected classical and robust normality tests of error terms in regression models.« less
Johnson, Jason K.; Oyen, Diane Adele; Chertkov, Michael; ...
2016-12-01
Inference and learning of graphical models are both well-studied problems in statistics and machine learning that have found many applications in science and engineering. However, exact inference is intractable in general graphical models, which suggests the problem of seeking the best approximation to a collection of random variables within some tractable family of graphical models. In this paper, we focus on the class of planar Ising models, for which exact inference is tractable using techniques of statistical physics. Based on these techniques and recent methods for planarity testing and planar embedding, we propose a greedy algorithm for learning the bestmore » planar Ising model to approximate an arbitrary collection of binary random variables (possibly from sample data). Given the set of all pairwise correlations among variables, we select a planar graph and optimal planar Ising model defined on this graph to best approximate that set of correlations. Finally, we demonstrate our method in simulations and for two applications: modeling senate voting records and identifying geo-chemical depth trends from Mars rover data.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Johnson, Jason K.; Oyen, Diane Adele; Chertkov, Michael
Inference and learning of graphical models are both well-studied problems in statistics and machine learning that have found many applications in science and engineering. However, exact inference is intractable in general graphical models, which suggests the problem of seeking the best approximation to a collection of random variables within some tractable family of graphical models. In this paper, we focus on the class of planar Ising models, for which exact inference is tractable using techniques of statistical physics. Based on these techniques and recent methods for planarity testing and planar embedding, we propose a greedy algorithm for learning the bestmore » planar Ising model to approximate an arbitrary collection of binary random variables (possibly from sample data). Given the set of all pairwise correlations among variables, we select a planar graph and optimal planar Ising model defined on this graph to best approximate that set of correlations. Finally, we demonstrate our method in simulations and for two applications: modeling senate voting records and identifying geo-chemical depth trends from Mars rover data.« less
An argument for mechanism-based statistical inference in cancer
Ochs, Michael; Price, Nathan D.; Tomasetti, Cristian; Younes, Laurent
2015-01-01
Cancer is perhaps the prototypical systems disease, and as such has been the focus of extensive study in quantitative systems biology. However, translating these programs into personalized clinical care remains elusive and incomplete. In this perspective, we argue that realizing this agenda—in particular, predicting disease phenotypes, progression and treatment response for individuals—requires going well beyond standard computational and bioinformatics tools and algorithms. It entails designing global mathematical models over network-scale configurations of genomic states and molecular concentrations, and learning the model parameters from limited available samples of high-dimensional and integrative omics data. As such, any plausible design should accommodate: biological mechanism, necessary for both feasible learning and interpretable decision making; stochasticity, to deal with uncertainty and observed variation at many scales; and a capacity for statistical inference at the patient level. This program, which requires a close, sustained collaboration between mathematicians and biologists, is illustrated in several contexts, including learning bio-markers, metabolism, cell signaling, network inference and tumorigenesis. PMID:25381197
Comparative analysis on the selection of number of clusters in community detection
NASA Astrophysics Data System (ADS)
Kawamoto, Tatsuro; Kabashima, Yoshiyuki
2018-02-01
We conduct a comparative analysis on various estimates of the number of clusters in community detection. An exhaustive comparison requires testing of all possible combinations of frameworks, algorithms, and assessment criteria. In this paper we focus on the framework based on a stochastic block model, and investigate the performance of greedy algorithms, statistical inference, and spectral methods. For the assessment criteria, we consider modularity, map equation, Bethe free energy, prediction errors, and isolated eigenvalues. From the analysis, the tendency of overfit and underfit that the assessment criteria and algorithms have becomes apparent. In addition, we propose that the alluvial diagram is a suitable tool to visualize statistical inference results and can be useful to determine the number of clusters.
BCM: toolkit for Bayesian analysis of Computational Models using samplers.
Thijssen, Bram; Dijkstra, Tjeerd M H; Heskes, Tom; Wessels, Lodewyk F A
2016-10-21
Computational models in biology are characterized by a large degree of uncertainty. This uncertainty can be analyzed with Bayesian statistics, however, the sampling algorithms that are frequently used for calculating Bayesian statistical estimates are computationally demanding, and each algorithm has unique advantages and disadvantages. It is typically unclear, before starting an analysis, which algorithm will perform well on a given computational model. We present BCM, a toolkit for the Bayesian analysis of Computational Models using samplers. It provides efficient, multithreaded implementations of eleven algorithms for sampling from posterior probability distributions and for calculating marginal likelihoods. BCM includes tools to simplify the process of model specification and scripts for visualizing the results. The flexible architecture allows it to be used on diverse types of biological computational models. In an example inference task using a model of the cell cycle based on ordinary differential equations, BCM is significantly more efficient than existing software packages, allowing more challenging inference problems to be solved. BCM represents an efficient one-stop-shop for computational modelers wishing to use sampler-based Bayesian statistics.
Using Alien Coins to Test Whether Simple Inference Is Bayesian
ERIC Educational Resources Information Center
Cassey, Peter; Hawkins, Guy E.; Donkin, Chris; Brown, Scott D.
2016-01-01
Reasoning and inference are well-studied aspects of basic cognition that have been explained as statistically optimal Bayesian inference. Using a simplified experimental design, we conducted quantitative comparisons between Bayesian inference and human inference at the level of individuals. In 3 experiments, with more than 13,000 participants, we…
ERIC Educational Resources Information Center
Henriques, Ana; Oliveira, Hélia
2016-01-01
This paper reports on the results of a study investigating the potential to embed Informal Statistical Inference in statistical investigations, using TinkerPlots, for assisting 8th grade students' informal inferential reasoning to emerge, particularly their articulations of uncertainty. Data collection included students' written work on a…
Mutual Information in Frequency and Its Application to Measure Cross-Frequency Coupling in Epilepsy
NASA Astrophysics Data System (ADS)
Malladi, Rakesh; Johnson, Don H.; Kalamangalam, Giridhar P.; Tandon, Nitin; Aazhang, Behnaam
2018-06-01
We define a metric, mutual information in frequency (MI-in-frequency), to detect and quantify the statistical dependence between different frequency components in the data, referred to as cross-frequency coupling and apply it to electrophysiological recordings from the brain to infer cross-frequency coupling. The current metrics used to quantify the cross-frequency coupling in neuroscience cannot detect if two frequency components in non-Gaussian brain recordings are statistically independent or not. Our MI-in-frequency metric, based on Shannon's mutual information between the Cramer's representation of stochastic processes, overcomes this shortcoming and can detect statistical dependence in frequency between non-Gaussian signals. We then describe two data-driven estimators of MI-in-frequency: one based on kernel density estimation and the other based on the nearest neighbor algorithm and validate their performance on simulated data. We then use MI-in-frequency to estimate mutual information between two data streams that are dependent across time, without making any parametric model assumptions. Finally, we use the MI-in- frequency metric to investigate the cross-frequency coupling in seizure onset zone from electrocorticographic recordings during seizures. The inferred cross-frequency coupling characteristics are essential to optimize the spatial and spectral parameters of electrical stimulation based treatments of epilepsy.
Robust inference for group sequential trials.
Ganju, Jitendra; Lin, Yunzhi; Zhou, Kefei
2017-03-01
For ethical reasons, group sequential trials were introduced to allow trials to stop early in the event of extreme results. Endpoints in such trials are usually mortality or irreversible morbidity. For a given endpoint, the norm is to use a single test statistic and to use that same statistic for each analysis. This approach is risky because the test statistic has to be specified before the study is unblinded, and there is loss in power if the assumptions that ensure optimality for each analysis are not met. To minimize the risk of moderate to substantial loss in power due to a suboptimal choice of a statistic, a robust method was developed for nonsequential trials. The concept is analogous to diversification of financial investments to minimize risk. The method is based on combining P values from multiple test statistics for formal inference while controlling the type I error rate at its designated value.This article evaluates the performance of 2 P value combining methods for group sequential trials. The emphasis is on time to event trials although results from less complex trials are also included. The gain or loss in power with the combination method relative to a single statistic is asymmetric in its favor. Depending on the power of each individual test, the combination method can give more power than any single test or give power that is closer to the test with the most power. The versatility of the method is that it can combine P values from different test statistics for analysis at different times. The robustness of results suggests that inference from group sequential trials can be strengthened with the use of combined tests. Copyright © 2017 John Wiley & Sons, Ltd.
Estimating the probability of rare events: addressing zero failure data.
Quigley, John; Revie, Matthew
2011-07-01
Traditional statistical procedures for estimating the probability of an event result in an estimate of zero when no events are realized. Alternative inferential procedures have been proposed for the situation where zero events have been realized but often these are ad hoc, relying on selecting methods dependent on the data that have been realized. Such data-dependent inference decisions violate fundamental statistical principles, resulting in estimation procedures whose benefits are difficult to assess. In this article, we propose estimating the probability of an event occurring through minimax inference on the probability that future samples of equal size realize no more events than that in the data on which the inference is based. Although motivated by inference on rare events, the method is not restricted to zero event data and closely approximates the maximum likelihood estimate (MLE) for nonzero data. The use of the minimax procedure provides a risk adverse inferential procedure where there are no events realized. A comparison is made with the MLE and regions of the underlying probability are identified where this approach is superior. Moreover, a comparison is made with three standard approaches to supporting inference where no event data are realized, which we argue are unduly pessimistic. We show that for situations of zero events the estimator can be simply approximated with 1/2.5n, where n is the number of trials. © 2011 Society for Risk Analysis.
Wilkinson, Michael
2014-03-01
Decisions about support for predictions of theories in light of data are made using statistical inference. The dominant approach in sport and exercise science is the Neyman-Pearson (N-P) significance-testing approach. When applied correctly it provides a reliable procedure for making dichotomous decisions for accepting or rejecting zero-effect null hypotheses with known and controlled long-run error rates. Type I and type II error rates must be specified in advance and the latter controlled by conducting an a priori sample size calculation. The N-P approach does not provide the probability of hypotheses or indicate the strength of support for hypotheses in light of data, yet many scientists believe it does. Outcomes of analyses allow conclusions only about the existence of non-zero effects, and provide no information about the likely size of true effects or their practical/clinical value. Bayesian inference can show how much support data provide for different hypotheses, and how personal convictions should be altered in light of data, but the approach is complicated by formulating probability distributions about prior subjective estimates of population effects. A pragmatic solution is magnitude-based inference, which allows scientists to estimate the true magnitude of population effects and how likely they are to exceed an effect magnitude of practical/clinical importance, thereby integrating elements of subjective Bayesian-style thinking. While this approach is gaining acceptance, progress might be hastened if scientists appreciate the shortcomings of traditional N-P null hypothesis significance testing.
Inference for lidar-assisted estimation of forest growing stock volume
Ronald E. McRoberts; Erik Næsset; Terje Gobakken
2013-01-01
Estimates of growing stock volume are reported by the national forest inventories (NFI) of most countries and may serve as the basis for aboveground biomass and carbon estimates as required by an increasing number of international agreements. The probability-based (design-based) statistical estimators traditionally used by NFIs to calculate estimates are generally...
Using genetic data to strengthen causal inference in observational research.
Pingault, Jean-Baptiste; O'Reilly, Paul F; Schoeler, Tabea; Ploubidis, George B; Rijsdijk, Frühling; Dudbridge, Frank
2018-06-05
Causal inference is essential across the biomedical, behavioural and social sciences.By progressing from confounded statistical associations to evidence of causal relationships, causal inference can reveal complex pathways underlying traits and diseases and help to prioritize targets for intervention. Recent progress in genetic epidemiology - including statistical innovation, massive genotyped data sets and novel computational tools for deep data mining - has fostered the intense development of methods exploiting genetic data and relatedness to strengthen causal inference in observational research. In this Review, we describe how such genetically informed methods differ in their rationale, applicability and inherent limitations and outline how they should be integrated in the future to offer a rich causal inference toolbox.
Brannigan, V M; Bier, V M; Berg, C
1992-09-01
Toxic torts are product liability cases dealing with alleged injuries due to chemical or biological hazards such as radiation, thalidomide, or Agent Orange. Toxic tort cases typically rely more heavily than other product liability cases on indirect or statistical proof of injury. There have been numerous theoretical analyses of statistical proof of injury in toxic tort cases. However, there have been only a handful of actual legal decisions regarding the use of such statistical evidence, and most of those decisions have been inconclusive. Recently, a major case from the Fifth Circuit, involving allegations that Benedectin (a morning sickness drug) caused birth defects, was decided entirely on the basis of statistical inference. This paper examines both the conceptual basis of that decision, and also the relationships among statistical inference, scientific evidence, and the rules of product liability in general.
Refining the Use of Linkage Disequilibrium as a Robust Signature of Selective Sweeps.
Jacobs, Guy S; Sluckin, Tim J; Kivisild, Toomas
2016-08-01
During a selective sweep, characteristic patterns of linkage disequilibrium can arise in the genomic region surrounding a selected locus. These have been used to infer past selective sweeps. However, the recombination rate is known to vary substantially along the genome for many species. We here investigate the effectiveness of current (Kelly's [Formula: see text] and [Formula: see text]) and novel statistics at inferring hard selective sweeps based on linkage disequilibrium distortions under different conditions, including a human-realistic demographic model and recombination rate variation. When the recombination rate is constant, Kelly's [Formula: see text] offers high power, but is outperformed by a novel statistic that we test, which we call [Formula: see text] We also find this statistic to be effective at detecting sweeps from standing variation. When recombination rate fluctuations are included, there is a considerable reduction in power for all linkage disequilibrium-based statistics. However, this can largely be reversed by appropriately controlling for expected linkage disequilibrium using a genetic map. To further test these different methods, we perform selection scans on well-characterized HapMap data, finding that all three statistics-[Formula: see text] Kelly's [Formula: see text] and [Formula: see text]-are able to replicate signals at regions previously identified as selection candidates based on population differentiation or the site frequency spectrum. While [Formula: see text] replicates most candidates when recombination map data are not available, the [Formula: see text] and [Formula: see text] statistics are more successful when recombination rate variation is controlled for. Given both this and their higher power in simulations of selective sweeps, these statistics are preferred when information on local recombination rate variation is available. Copyright © 2016 by the Genetics Society of America.
Assessing dynamics, spatial scale, and uncertainty in task-related brain network analyses
Stephen, Emily P.; Lepage, Kyle Q.; Eden, Uri T.; Brunner, Peter; Schalk, Gerwin; Brumberg, Jonathan S.; Guenther, Frank H.; Kramer, Mark A.
2014-01-01
The brain is a complex network of interconnected elements, whose interactions evolve dynamically in time to cooperatively perform specific functions. A common technique to probe these interactions involves multi-sensor recordings of brain activity during a repeated task. Many techniques exist to characterize the resulting task-related activity, including establishing functional networks, which represent the statistical associations between brain areas. Although functional network inference is commonly employed to analyze neural time series data, techniques to assess the uncertainty—both in the functional network edges and the corresponding aggregate measures of network topology—are lacking. To address this, we describe a statistically principled approach for computing uncertainty in functional networks and aggregate network measures in task-related data. The approach is based on a resampling procedure that utilizes the trial structure common in experimental recordings. We show in simulations that this approach successfully identifies functional networks and associated measures of confidence emergent during a task in a variety of scenarios, including dynamically evolving networks. In addition, we describe a principled technique for establishing functional networks based on predetermined regions of interest using canonical correlation. Doing so provides additional robustness to the functional network inference. Finally, we illustrate the use of these methods on example invasive brain voltage recordings collected during an overt speech task. The general strategy described here—appropriate for static and dynamic network inference and different statistical measures of coupling—permits the evaluation of confidence in network measures in a variety of settings common to neuroscience. PMID:24678295
Assessing dynamics, spatial scale, and uncertainty in task-related brain network analyses.
Stephen, Emily P; Lepage, Kyle Q; Eden, Uri T; Brunner, Peter; Schalk, Gerwin; Brumberg, Jonathan S; Guenther, Frank H; Kramer, Mark A
2014-01-01
The brain is a complex network of interconnected elements, whose interactions evolve dynamically in time to cooperatively perform specific functions. A common technique to probe these interactions involves multi-sensor recordings of brain activity during a repeated task. Many techniques exist to characterize the resulting task-related activity, including establishing functional networks, which represent the statistical associations between brain areas. Although functional network inference is commonly employed to analyze neural time series data, techniques to assess the uncertainty-both in the functional network edges and the corresponding aggregate measures of network topology-are lacking. To address this, we describe a statistically principled approach for computing uncertainty in functional networks and aggregate network measures in task-related data. The approach is based on a resampling procedure that utilizes the trial structure common in experimental recordings. We show in simulations that this approach successfully identifies functional networks and associated measures of confidence emergent during a task in a variety of scenarios, including dynamically evolving networks. In addition, we describe a principled technique for establishing functional networks based on predetermined regions of interest using canonical correlation. Doing so provides additional robustness to the functional network inference. Finally, we illustrate the use of these methods on example invasive brain voltage recordings collected during an overt speech task. The general strategy described here-appropriate for static and dynamic network inference and different statistical measures of coupling-permits the evaluation of confidence in network measures in a variety of settings common to neuroscience.
ERIC Educational Resources Information Center
Frees, Edward W.; Kim, Jee-Seon
2006-01-01
Multilevel models are proven tools in social research for modeling complex, hierarchical systems. In multilevel modeling, statistical inference is based largely on quantification of random variables. This paper distinguishes among three types of random variables in multilevel modeling--model disturbances, random coefficients, and future response…
Making statistical inferences about software reliability
NASA Technical Reports Server (NTRS)
Miller, Douglas R.
1988-01-01
Failure times of software undergoing random debugging can be modelled as order statistics of independent but nonidentically distributed exponential random variables. Using this model inferences can be made about current reliability and, if debugging continues, future reliability. This model also shows the difficulty inherent in statistical verification of very highly reliable software such as that used by digital avionics in commercial aircraft.
Automatic physical inference with information maximizing neural networks
NASA Astrophysics Data System (ADS)
Charnock, Tom; Lavaux, Guilhem; Wandelt, Benjamin D.
2018-04-01
Compressing large data sets to a manageable number of summaries that are informative about the underlying parameters vastly simplifies both frequentist and Bayesian inference. When only simulations are available, these summaries are typically chosen heuristically, so they may inadvertently miss important information. We introduce a simulation-based machine learning technique that trains artificial neural networks to find nonlinear functionals of data that maximize Fisher information: information maximizing neural networks (IMNNs). In test cases where the posterior can be derived exactly, likelihood-free inference based on automatically derived IMNN summaries produces nearly exact posteriors, showing that these summaries are good approximations to sufficient statistics. In a series of numerical examples of increasing complexity and astrophysical relevance we show that IMNNs are robustly capable of automatically finding optimal, nonlinear summaries of the data even in cases where linear compression fails: inferring the variance of Gaussian signal in the presence of noise, inferring cosmological parameters from mock simulations of the Lyman-α forest in quasar spectra, and inferring frequency-domain parameters from LISA-like detections of gravitational waveforms. In this final case, the IMNN summary outperforms linear data compression by avoiding the introduction of spurious likelihood maxima. We anticipate that the automatic physical inference method described in this paper will be essential to obtain both accurate and precise cosmological parameter estimates from complex and large astronomical data sets, including those from LSST and Euclid.
NASA Astrophysics Data System (ADS)
Haneberg, W. C.
2017-12-01
Remote characterization of new landslides or areas of ongoing movement using differences in high resolution digital elevation models (DEMs) created through time, for example before and after major rains or earthquakes, is an attractive proposition. In the case of large catastrophic landslides, changes may be apparent enough that simple subtraction suffices. In other cases, statistical noise can obscure landslide signatures and place practical limits on detection. In ideal cases on land, GPS surveys of representative areas at the time of DEM creation can quantify the inherent errors. In less-than-ideal terrestrial cases and virtually all submarine cases, it may be impractical or impossible to independently estimate the DEM errors. Examining DEM difference statistics for areas reasonably inferred to have no change, however, can provide insight into the limits of detectability. Data from inferred no-change areas of airborne LiDAR DEM difference maps of the 2014 Oso, Washington landslide and landslide-prone colluvium slopes along the Ohio River valley in northern Kentucky, show that DEM difference maps can have non-zero mean and slope dependent error components consistent with published studies of DEM errors. Statistical thresholds derived from DEM difference error and slope data can help to distinguish between DEM differences that are likely real—and which may indicate landsliding—from those that are likely spurious or irrelevant. This presentation describes and compares two different approaches, one based upon a heuristic assumption about the proportion of the study area likely covered by new landslides and another based upon the amount of change necessary to ensure difference at a specified level of probability.
Approximation of epidemic models by diffusion processes and their statistical inference.
Guy, Romain; Larédo, Catherine; Vergu, Elisabeta
2015-02-01
Multidimensional continuous-time Markov jump processes [Formula: see text] on [Formula: see text] form a usual set-up for modeling [Formula: see text]-like epidemics. However, when facing incomplete epidemic data, inference based on [Formula: see text] is not easy to be achieved. Here, we start building a new framework for the estimation of key parameters of epidemic models based on statistics of diffusion processes approximating [Formula: see text]. First, previous results on the approximation of density-dependent [Formula: see text]-like models by diffusion processes with small diffusion coefficient [Formula: see text], where [Formula: see text] is the population size, are generalized to non-autonomous systems. Second, our previous inference results on discretely observed diffusion processes with small diffusion coefficient are extended to time-dependent diffusions. Consistent and asymptotically Gaussian estimates are obtained for a fixed number [Formula: see text] of observations, which corresponds to the epidemic context, and for [Formula: see text]. A correction term, which yields better estimates non asymptotically, is also included. Finally, performances and robustness of our estimators with respect to various parameters such as [Formula: see text] (the basic reproduction number), [Formula: see text], [Formula: see text] are investigated on simulations. Two models, [Formula: see text] and [Formula: see text], corresponding to single and recurrent outbreaks, respectively, are used to simulate data. The findings indicate that our estimators have good asymptotic properties and behave noticeably well for realistic numbers of observations and population sizes. This study lays the foundations of a generic inference method currently under extension to incompletely observed epidemic data. Indeed, contrary to the majority of current inference techniques for partially observed processes, which necessitates computer intensive simulations, our method being mostly an analytical approach requires only the classical optimization steps.
Mugzach, Omri; Peleg, Mor; Bagley, Steven C; Guter, Stephen J; Cook, Edwin H; Altman, Russ B
2015-08-01
Our goal is to create an ontology that will allow data integration and reasoning with subject data to classify subjects, and based on this classification, to infer new knowledge on Autism Spectrum Disorder (ASD) and related neurodevelopmental disorders (NDD). We take a first step toward this goal by extending an existing autism ontology to allow automatic inference of ASD phenotypes and Diagnostic & Statistical Manual of Mental Disorders (DSM) criteria based on subjects' Autism Diagnostic Interview-Revised (ADI-R) assessment data. Knowledge regarding diagnostic instruments, ASD phenotypes and risk factors was added to augment an existing autism ontology via Ontology Web Language class definitions and semantic web rules. We developed a custom Protégé plugin for enumerating combinatorial OWL axioms to support the many-to-many relations of ADI-R items to diagnostic categories in the DSM. We utilized a reasoner to infer whether 2642 subjects, whose data was obtained from the Simons Foundation Autism Research Initiative, meet DSM-IV-TR (DSM-IV) and DSM-5 diagnostic criteria based on their ADI-R data. We extended the ontology by adding 443 classes and 632 rules that represent phenotypes, along with their synonyms, environmental risk factors, and frequency of comorbidities. Applying the rules on the data set showed that the method produced accurate results: the true positive and true negative rates for inferring autistic disorder diagnosis according to DSM-IV criteria were 1 and 0.065, respectively; the true positive rate for inferring ASD based on DSM-5 criteria was 0.94. The ontology allows automatic inference of subjects' disease phenotypes and diagnosis with high accuracy. The ontology may benefit future studies by serving as a knowledge base for ASD. In addition, by adding knowledge of related NDDs, commonalities and differences in manifestations and risk factors could be automatically inferred, contributing to the understanding of ASD pathophysiology. Copyright © 2015 Elsevier Inc. All rights reserved.
Inference and Analysis of Population Structure Using Genetic Data and Network Theory
Greenbaum, Gili; Templeton, Alan R.; Bar-David, Shirli
2016-01-01
Clustering individuals to subpopulations based on genetic data has become commonplace in many genetic studies. Inference about population structure is most often done by applying model-based approaches, aided by visualization using distance-based approaches such as multidimensional scaling. While existing distance-based approaches suffer from a lack of statistical rigor, model-based approaches entail assumptions of prior conditions such as that the subpopulations are at Hardy-Weinberg equilibria. Here we present a distance-based approach for inference about population structure using genetic data by defining population structure using network theory terminology and methods. A network is constructed from a pairwise genetic-similarity matrix of all sampled individuals. The community partition, a partition of a network to dense subgraphs, is equated with population structure, a partition of the population to genetically related groups. Community-detection algorithms are used to partition the network into communities, interpreted as a partition of the population to subpopulations. The statistical significance of the structure can be estimated by using permutation tests to evaluate the significance of the partition’s modularity, a network theory measure indicating the quality of community partitions. To further characterize population structure, a new measure of the strength of association (SA) for an individual to its assigned community is presented. The strength of association distribution (SAD) of the communities is analyzed to provide additional population structure characteristics, such as the relative amount of gene flow experienced by the different subpopulations and identification of hybrid individuals. Human genetic data and simulations are used to demonstrate the applicability of the analyses. The approach presented here provides a novel, computationally efficient model-free method for inference about population structure that does not entail assumption of prior conditions. The method is implemented in the software NetStruct (available at https://giligreenbaum.wordpress.com/software/). PMID:26888080
Inference and Analysis of Population Structure Using Genetic Data and Network Theory.
Greenbaum, Gili; Templeton, Alan R; Bar-David, Shirli
2016-04-01
Clustering individuals to subpopulations based on genetic data has become commonplace in many genetic studies. Inference about population structure is most often done by applying model-based approaches, aided by visualization using distance-based approaches such as multidimensional scaling. While existing distance-based approaches suffer from a lack of statistical rigor, model-based approaches entail assumptions of prior conditions such as that the subpopulations are at Hardy-Weinberg equilibria. Here we present a distance-based approach for inference about population structure using genetic data by defining population structure using network theory terminology and methods. A network is constructed from a pairwise genetic-similarity matrix of all sampled individuals. The community partition, a partition of a network to dense subgraphs, is equated with population structure, a partition of the population to genetically related groups. Community-detection algorithms are used to partition the network into communities, interpreted as a partition of the population to subpopulations. The statistical significance of the structure can be estimated by using permutation tests to evaluate the significance of the partition's modularity, a network theory measure indicating the quality of community partitions. To further characterize population structure, a new measure of the strength of association (SA) for an individual to its assigned community is presented. The strength of association distribution (SAD) of the communities is analyzed to provide additional population structure characteristics, such as the relative amount of gene flow experienced by the different subpopulations and identification of hybrid individuals. Human genetic data and simulations are used to demonstrate the applicability of the analyses. The approach presented here provides a novel, computationally efficient model-free method for inference about population structure that does not entail assumption of prior conditions. The method is implemented in the software NetStruct (available at https://giligreenbaum.wordpress.com/software/). Copyright © 2016 by the Genetics Society of America.
Goldberg, Tony L; Gillespie, Thomas R; Singer, Randall S
2006-09-01
Repetitive-element PCR (rep-PCR) is a method for genotyping bacteria based on the selective amplification of repetitive genetic elements dispersed throughout bacterial chromosomes. The method has great potential for large-scale epidemiological studies because of its speed and simplicity; however, objective guidelines for inferring relationships among bacterial isolates from rep-PCR data are lacking. We used multilocus sequence typing (MLST) as a "gold standard" to optimize the analytical parameters for inferring relationships among Escherichia coli isolates from rep-PCR data. We chose 12 isolates from a large database to represent a wide range of pairwise genetic distances, based on the initial evaluation of their rep-PCR fingerprints. We conducted MLST with these same isolates and systematically varied the analytical parameters to maximize the correspondence between the relationships inferred from rep-PCR and those inferred from MLST. Methods that compared the shapes of densitometric profiles ("curve-based" methods) yielded consistently higher correspondence values between data types than did methods that calculated indices of similarity based on shared and different bands (maximum correspondences of 84.5% and 80.3%, respectively). Curve-based methods were also markedly more robust in accommodating variations in user-specified analytical parameter values than were "band-sharing coefficient" methods, and they enhanced the reproducibility of rep-PCR. Phylogenetic analyses of rep-PCR data yielded trees with high topological correspondence to trees based on MLST and high statistical support for major clades. These results indicate that rep-PCR yields accurate information for inferring relationships among E. coli isolates and that accuracy can be enhanced with the use of analytical methods that consider the shapes of densitometric profiles.
ERIC Educational Resources Information Center
Watson, Jane
2007-01-01
Inference, or decision making, is seen in curriculum documents as the final step in a statistical investigation. For a formal statistical enquiry this may be associated with sophisticated tests involving probability distributions. For young students without the mathematical background to perform such tests, it is still possible to draw informal…
A Framework for Thinking about Informal Statistical Inference
ERIC Educational Resources Information Center
Makar, Katie; Rubin, Andee
2009-01-01
Informal inferential reasoning has shown some promise in developing students' deeper understanding of statistical processes. This paper presents a framework to think about three key principles of informal inference--generalizations "beyond the data," probabilistic language, and data as evidence. The authors use primary school classroom…
Anderson, Eric C
2012-11-08
Advances in genotyping that allow tens of thousands of individuals to be genotyped at a moderate number of single nucleotide polymorphisms (SNPs) permit parentage inference to be pursued on a very large scale. The intergenerational tagging this capacity allows is revolutionizing the management of cultured organisms (cows, salmon, etc.) and is poised to do the same for scientific studies of natural populations. Currently, however, there are no likelihood-based methods of parentage inference which are implemented in a manner that allows them to quickly handle a very large number of potential parents or parent pairs. Here we introduce an efficient likelihood-based method applicable to the specialized case of cultured organisms in which both parents can be reliably sampled. We develop a Markov chain representation for the cumulative number of Mendelian incompatibilities between an offspring and its putative parents and we exploit it to develop a fast algorithm for simulation-based estimates of statistical confidence in SNP-based assignments of offspring to pairs of parents. The method is implemented in the freely available software SNPPIT. We describe the method in detail, then assess its performance in a large simulation study using known allele frequencies at 96 SNPs from ten hatchery salmon populations. The simulations verify that the method is fast and accurate and that 96 well-chosen SNPs can provide sufficient power to identify the correct pair of parents from amongst millions of candidate pairs.
NASA Astrophysics Data System (ADS)
Alsing, Justin; Wandelt, Benjamin; Feeney, Stephen
2018-07-01
Many statistical models in cosmology can be simulated forwards but have intractable likelihood functions. Likelihood-free inference methods allow us to perform Bayesian inference from these models using only forward simulations, free from any likelihood assumptions or approximations. Likelihood-free inference generically involves simulating mock data and comparing to the observed data; this comparison in data space suffers from the curse of dimensionality and requires compression of the data to a small number of summary statistics to be tractable. In this paper, we use massive asymptotically optimal data compression to reduce the dimensionality of the data space to just one number per parameter, providing a natural and optimal framework for summary statistic choice for likelihood-free inference. Secondly, we present the first cosmological application of Density Estimation Likelihood-Free Inference (DELFI), which learns a parametrized model for joint distribution of data and parameters, yielding both the parameter posterior and the model evidence. This approach is conceptually simple, requires less tuning than traditional Approximate Bayesian Computation approaches to likelihood-free inference and can give high-fidelity posteriors from orders of magnitude fewer forward simulations. As an additional bonus, it enables parameter inference and Bayesian model comparison simultaneously. We demonstrate DELFI with massive data compression on an analysis of the joint light-curve analysis supernova data, as a simple validation case study. We show that high-fidelity posterior inference is possible for full-scale cosmological data analyses with as few as ˜104 simulations, with substantial scope for further improvement, demonstrating the scalability of likelihood-free inference to large and complex cosmological data sets.
A statistical approach for inferring the 3D structure of the genome.
Varoquaux, Nelle; Ay, Ferhat; Noble, William Stafford; Vert, Jean-Philippe
2014-06-15
Recent technological advances allow the measurement, in a single Hi-C experiment, of the frequencies of physical contacts among pairs of genomic loci at a genome-wide scale. The next challenge is to infer, from the resulting DNA-DNA contact maps, accurate 3D models of how chromosomes fold and fit into the nucleus. Many existing inference methods rely on multidimensional scaling (MDS), in which the pairwise distances of the inferred model are optimized to resemble pairwise distances derived directly from the contact counts. These approaches, however, often optimize a heuristic objective function and require strong assumptions about the biophysics of DNA to transform interaction frequencies to spatial distance, and thereby may lead to incorrect structure reconstruction. We propose a novel approach to infer a consensus 3D structure of a genome from Hi-C data. The method incorporates a statistical model of the contact counts, assuming that the counts between two loci follow a Poisson distribution whose intensity decreases with the physical distances between the loci. The method can automatically adjust the transfer function relating the spatial distance to the Poisson intensity and infer a genome structure that best explains the observed data. We compare two variants of our Poisson method, with or without optimization of the transfer function, to four different MDS-based algorithms-two metric MDS methods using different stress functions, a non-metric version of MDS and ChromSDE, a recently described, advanced MDS method-on a wide range of simulated datasets. We demonstrate that the Poisson models reconstruct better structures than all MDS-based methods, particularly at low coverage and high resolution, and we highlight the importance of optimizing the transfer function. On publicly available Hi-C data from mouse embryonic stem cells, we show that the Poisson methods lead to more reproducible structures than MDS-based methods when we use data generated using different restriction enzymes, and when we reconstruct structures at different resolutions. A Python implementation of the proposed method is available at http://cbio.ensmp.fr/pastis. © The Author 2014. Published by Oxford University Press.
Statistical inference of protein structural alignments using information and compression.
Collier, James H; Allison, Lloyd; Lesk, Arthur M; Stuckey, Peter J; Garcia de la Banda, Maria; Konagurthu, Arun S
2017-04-01
Structural molecular biology depends crucially on computational techniques that compare protein three-dimensional structures and generate structural alignments (the assignment of one-to-one correspondences between subsets of amino acids based on atomic coordinates). Despite its importance, the structural alignment problem has not been formulated, much less solved, in a consistent and reliable way. To overcome these difficulties, we present here a statistical framework for the precise inference of structural alignments, built on the Bayesian and information-theoretic principle of Minimum Message Length (MML). The quality of any alignment is measured by its explanatory power-the amount of lossless compression achieved to explain the protein coordinates using that alignment. We have implemented this approach in MMLigner , the first program able to infer statistically significant structural alignments. We also demonstrate the reliability of MMLigner 's alignment results when compared with the state of the art. Importantly, MMLigner can also discover different structural alignments of comparable quality, a challenging problem for oligomers and protein complexes. Source code, binaries and an interactive web version are available at http://lcb.infotech.monash.edu.au/mmligner . arun.konagurthu@monash.edu. Supplementary data are available at Bioinformatics online. © The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lee, J.; Elmore, R.; Kennedy, C.
This research is to illustrate the use of statistical inference techniques in order to quantify the uncertainty surrounding reliability estimates in a step-stress accelerated degradation testing (SSADT) scenario. SSADT can be used when a researcher is faced with a resource-constrained environment, e.g., limits on chamber time or on the number of units to test. We apply the SSADT methodology to a degradation experiment involving concentrated solar power (CSP) mirrors and compare the results to a more traditional multiple accelerated testing paradigm. Specifically, our work includes: (1) designing a durability testing plan for solar mirrors (3M's new improved silvered acrylic "Solarmore » Reflector Film (SFM) 1100") through the ultra-accelerated weathering system (UAWS), (2) defining degradation paths of optical performance based on the SSADT model which is accelerated by high UV-radiant exposure, and (3) developing service lifetime prediction models for solar mirrors using advanced statistical inference. We use the method of least squares to estimate the model parameters and this serves as the basis for the statistical inference in SSADT. Several quantities of interest can be estimated from this procedure, e.g., mean-time-to-failure (MTTF) and warranty time. The methods allow for the estimation of quantities that may be of interest to the domain scientists.« less
Bayesian inference based on dual generalized order statistics from the exponentiated Weibull model
NASA Astrophysics Data System (ADS)
Al Sobhi, Mashail M.
2015-02-01
Bayesian estimation for the two parameters and the reliability function of the exponentiated Weibull model are obtained based on dual generalized order statistics (DGOS). Also, Bayesian prediction bounds for future DGOS from exponentiated Weibull model are obtained. The symmetric and asymmetric loss functions are considered for Bayesian computations. The Markov chain Monte Carlo (MCMC) methods are used for computing the Bayes estimates and prediction bounds. The results have been specialized to the lower record values. Comparisons are made between Bayesian and maximum likelihood estimators via Monte Carlo simulation.
What Is Design-Based Causal Inference for RCTs and Why Should I Use It? NCEE 2017-4025
ERIC Educational Resources Information Center
Schochet, Peter Z.
2017-01-01
Design-based methods have recently been developed as a way to analyze data from impact evaluations of interventions, programs, and policies. The impact estimators are derived using the building blocks of experimental designs with minimal assumptions, and have good statistical properties. The methods apply to randomized controlled trials (RCTs) and…
Effect of Belief Bias on the Development of Undergraduate Students' Reasoning about Inference
ERIC Educational Resources Information Center
Kaplan, Jennifer K.
2009-01-01
Psychologists have discovered a phenomenon called "Belief Bias" in which subjects rate the strength of arguments based on the believability of the conclusions. This paper reports the results of a small qualitative pilot study of undergraduate students who had previously taken an algebra-based introduction to statistics class. The subjects in this…
Inference of Transmission Network Structure from HIV Phylogenetic Trees
Giardina, Federica; Romero-Severson, Ethan Obie; Albert, Jan; ...
2017-01-13
Phylogenetic inference is an attractive means to reconstruct transmission histories and epidemics. However, there is not a perfect correspondence between transmission history and virus phylogeny. Both node height and topological differences may occur, depending on the interaction between within-host evolutionary dynamics and between-host transmission patterns. To investigate these interactions, we added a within-host evolutionary model in epidemiological simulations and examined if the resulting phylogeny could recover different types of contact networks. To further improve realism, we also introduced patient-specific differences in infectivity across disease stages, and on the epidemic level we considered incomplete sampling and the age of the epidemic.more » Second, we implemented an inference method based on approximate Bayesian computation (ABC) to discriminate among three well-studied network models and jointly estimate both network parameters and key epidemiological quantities such as the infection rate. Our ABC framework used both topological and distance-based tree statistics for comparison between simulated and observed trees. Overall, our simulations showed that a virus time-scaled phylogeny (genealogy) may be substantially different from the between-host transmission tree. This has important implications for the interpretation of what a phylogeny reveals about the underlying epidemic contact network. In particular, we found that while the within-host evolutionary process obscures the transmission tree, the diversification process and infectivity dynamics also add discriminatory power to differentiate between different types of contact networks. We also found that the possibility to differentiate contact networks depends on how far an epidemic has progressed, where distance-based tree statistics have more power early in an epidemic. Finally, we applied our ABC inference on two different outbreaks from the Swedish HIV-1 epidemic.« less
Inference of Transmission Network Structure from HIV Phylogenetic Trees
DOE Office of Scientific and Technical Information (OSTI.GOV)
Giardina, Federica; Romero-Severson, Ethan Obie; Albert, Jan
Phylogenetic inference is an attractive means to reconstruct transmission histories and epidemics. However, there is not a perfect correspondence between transmission history and virus phylogeny. Both node height and topological differences may occur, depending on the interaction between within-host evolutionary dynamics and between-host transmission patterns. To investigate these interactions, we added a within-host evolutionary model in epidemiological simulations and examined if the resulting phylogeny could recover different types of contact networks. To further improve realism, we also introduced patient-specific differences in infectivity across disease stages, and on the epidemic level we considered incomplete sampling and the age of the epidemic.more » Second, we implemented an inference method based on approximate Bayesian computation (ABC) to discriminate among three well-studied network models and jointly estimate both network parameters and key epidemiological quantities such as the infection rate. Our ABC framework used both topological and distance-based tree statistics for comparison between simulated and observed trees. Overall, our simulations showed that a virus time-scaled phylogeny (genealogy) may be substantially different from the between-host transmission tree. This has important implications for the interpretation of what a phylogeny reveals about the underlying epidemic contact network. In particular, we found that while the within-host evolutionary process obscures the transmission tree, the diversification process and infectivity dynamics also add discriminatory power to differentiate between different types of contact networks. We also found that the possibility to differentiate contact networks depends on how far an epidemic has progressed, where distance-based tree statistics have more power early in an epidemic. Finally, we applied our ABC inference on two different outbreaks from the Swedish HIV-1 epidemic.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Morton, April M; Piburn, Jesse O; McManamay, Ryan A
2017-01-01
Monte Carlo simulation is a popular numerical experimentation technique used in a range of scientific fields to obtain the statistics of unknown random output variables. Despite its widespread applicability, it can be difficult to infer required input probability distributions when they are related to population counts unknown at desired spatial resolutions. To overcome this challenge, we propose a framework that uses a dasymetric model to infer the probability distributions needed for a specific class of Monte Carlo simulations which depend on population counts.
Inferring gene regression networks with model trees
2010-01-01
Background Novel strategies are required in order to handle the huge amount of data produced by microarray technologies. To infer gene regulatory networks, the first step is to find direct regulatory relationships between genes building the so-called gene co-expression networks. They are typically generated using correlation statistics as pairwise similarity measures. Correlation-based methods are very useful in order to determine whether two genes have a strong global similarity but do not detect local similarities. Results We propose model trees as a method to identify gene interaction networks. While correlation-based methods analyze each pair of genes, in our approach we generate a single regression tree for each gene from the remaining genes. Finally, a graph from all the relationships among output and input genes is built taking into account whether the pair of genes is statistically significant. For this reason we apply a statistical procedure to control the false discovery rate. The performance of our approach, named REGNET, is experimentally tested on two well-known data sets: Saccharomyces Cerevisiae and E.coli data set. First, the biological coherence of the results are tested. Second the E.coli transcriptional network (in the Regulon database) is used as control to compare the results to that of a correlation-based method. This experiment shows that REGNET performs more accurately at detecting true gene associations than the Pearson and Spearman zeroth and first-order correlation-based methods. Conclusions REGNET generates gene association networks from gene expression data, and differs from correlation-based methods in that the relationship between one gene and others is calculated simultaneously. Model trees are very useful techniques to estimate the numerical values for the target genes by linear regression functions. They are very often more precise than linear regression models because they can add just different linear regressions to separate areas of the search space favoring to infer localized similarities over a more global similarity. Furthermore, experimental results show the good performance of REGNET. PMID:20950452
Refining the Use of Linkage Disequilibrium as a Robust Signature of Selective Sweeps
Jacobs, Guy S.; Sluckin, Timothy J.; Kivisild, Toomas
2016-01-01
During a selective sweep, characteristic patterns of linkage disequilibrium can arise in the genomic region surrounding a selected locus. These have been used to infer past selective sweeps. However, the recombination rate is known to vary substantially along the genome for many species. We here investigate the effectiveness of current (Kelly’s ZnS and ωmax) and novel statistics at inferring hard selective sweeps based on linkage disequilibrium distortions under different conditions, including a human-realistic demographic model and recombination rate variation. When the recombination rate is constant, Kelly’s ZnS offers high power, but is outperformed by a novel statistic that we test, which we call Zα. We also find this statistic to be effective at detecting sweeps from standing variation. When recombination rate fluctuations are included, there is a considerable reduction in power for all linkage disequilibrium-based statistics. However, this can largely be reversed by appropriately controlling for expected linkage disequilibrium using a genetic map. To further test these different methods, we perform selection scans on well-characterized HapMap data, finding that all three statistics—ωmax, Kelly’s ZnS, and Zα—are able to replicate signals at regions previously identified as selection candidates based on population differentiation or the site frequency spectrum. While ωmax replicates most candidates when recombination map data are not available, the ZnS and Zα statistics are more successful when recombination rate variation is controlled for. Given both this and their higher power in simulations of selective sweeps, these statistics are preferred when information on local recombination rate variation is available. PMID:27516617
Balkanization and Unification of Probabilistic Inferences
ERIC Educational Resources Information Center
Yu, Chong-Ho
2005-01-01
Many research-related classes in social sciences present probability as a unified approach based upon mathematical axioms, but neglect the diversity of various probability theories and their associated philosophical assumptions. Although currently the dominant statistical and probabilistic approach is the Fisherian tradition, the use of Fisherian…
The Philosophical Foundations of Prescriptive Statements and Statistical Inference
ERIC Educational Resources Information Center
Sun, Shuyan; Pan, Wei
2011-01-01
From the perspectives of the philosophy of science and statistical inference, we discuss the challenges of making prescriptive statements in quantitative research articles. We first consider the prescriptive nature of educational research and argue that prescriptive statements are a necessity in educational research. The logic of deduction,…
The potential for increased power from combining P-values testing the same hypothesis.
Ganju, Jitendra; Julie Ma, Guoguang
2017-02-01
The conventional approach to hypothesis testing for formal inference is to prespecify a single test statistic thought to be optimal. However, we usually have more than one test statistic in mind for testing the null hypothesis of no treatment effect but we do not know which one is the most powerful. Rather than relying on a single p-value, combining p-values from prespecified multiple test statistics can be used for inference. Combining functions include Fisher's combination test and the minimum p-value. Using randomization-based tests, the increase in power can be remarkable when compared with a single test and Simes's method. The versatility of the method is that it also applies when the number of covariates exceeds the number of observations. The increase in power is large enough to prefer combined p-values over a single p-value. The limitation is that the method does not provide an unbiased estimator of the treatment effect and does not apply to situations when the model includes treatment by covariate interaction.
Improving stochastic estimates with inference methods: calculating matrix diagonals.
Selig, Marco; Oppermann, Niels; Ensslin, Torsten A
2012-02-01
Estimating the diagonal entries of a matrix, that is not directly accessible but only available as a linear operator in the form of a computer routine, is a common necessity in many computational applications, especially in image reconstruction and statistical inference. Here, methods of statistical inference are used to improve the accuracy or the computational costs of matrix probing methods to estimate matrix diagonals. In particular, the generalized Wiener filter methodology, as developed within information field theory, is shown to significantly improve estimates based on only a few sampling probes, in cases in which some form of continuity of the solution can be assumed. The strength, length scale, and precise functional form of the exploited autocorrelation function of the matrix diagonal is determined from the probes themselves. The developed algorithm is successfully applied to mock and real world problems. These performance tests show that, in situations where a matrix diagonal has to be calculated from only a small number of computationally expensive probes, a speedup by a factor of 2 to 10 is possible with the proposed method. © 2012 American Physical Society
Cocco, S; Monasson, R; Sessak, V
2011-05-01
We consider the problem of inferring the interactions between a set of N binary variables from the knowledge of their frequencies and pairwise correlations. The inference framework is based on the Hopfield model, a special case of the Ising model where the interaction matrix is defined through a set of patterns in the variable space, and is of rank much smaller than N. We show that maximum likelihood inference is deeply related to principal component analysis when the amplitude of the pattern components ξ is negligible compared to √N. Using techniques from statistical mechanics, we calculate the corrections to the patterns to the first order in ξ/√N. We stress the need to generalize the Hopfield model and include both attractive and repulsive patterns in order to correctly infer networks with sparse and strong interactions. We present a simple geometrical criterion to decide how many attractive and repulsive patterns should be considered as a function of the sampling noise. We moreover discuss how many sampled configurations are required for a good inference, as a function of the system size N and of the amplitude ξ. The inference approach is illustrated on synthetic and biological data.
Representing Micro-Macro Linkages by Actor-Based Dynamic Network Models
Snijders, Tom A.B.; Steglich, Christian E.G.
2014-01-01
Stochastic actor-based models for network dynamics have the primary aim of statistical inference about processes of network change, but may be regarded as a kind of agent-based models. Similar to many other agent-based models, they are based on local rules for actor behavior. Different from many other agent-based models, by including elements of generalized linear statistical models they aim to be realistic detailed representations of network dynamics in empirical data sets. Statistical parallels to micro-macro considerations can be found in the estimation of parameters determining local actor behavior from empirical data, and the assessment of goodness of fit from the correspondence with network-level descriptives. This article studies several network-level consequences of dynamic actor-based models applied to represent cross-sectional network data. Two examples illustrate how network-level characteristics can be obtained as emergent features implied by micro-specifications of actor-based models. PMID:25960578
Inferring time derivatives including cell growth rates using Gaussian processes
NASA Astrophysics Data System (ADS)
Swain, Peter S.; Stevenson, Keiran; Leary, Allen; Montano-Gutierrez, Luis F.; Clark, Ivan B. N.; Vogel, Jackie; Pilizota, Teuta
2016-12-01
Often the time derivative of a measured variable is of as much interest as the variable itself. For a growing population of biological cells, for example, the population's growth rate is typically more important than its size. Here we introduce a non-parametric method to infer first and second time derivatives as a function of time from time-series data. Our approach is based on Gaussian processes and applies to a wide range of data. In tests, the method is at least as accurate as others, but has several advantages: it estimates errors both in the inference and in any summary statistics, such as lag times, and allows interpolation with the corresponding error estimation. As illustrations, we infer growth rates of microbial cells, the rate of assembly of an amyloid fibril and both the speed and acceleration of two separating spindle pole bodies. Our algorithm should thus be broadly applicable.
Multi-Agent Inference in Social Networks: A Finite Population Learning Approach.
Fan, Jianqing; Tong, Xin; Zeng, Yao
When people in a society want to make inference about some parameter, each person may want to use data collected by other people. Information (data) exchange in social networks is usually costly, so to make reliable statistical decisions, people need to trade off the benefits and costs of information acquisition. Conflicts of interests and coordination problems will arise in the process. Classical statistics does not consider people's incentives and interactions in the data collection process. To address this imperfection, this work explores multi-agent Bayesian inference problems with a game theoretic social network model. Motivated by our interest in aggregate inference at the societal level, we propose a new concept, finite population learning , to address whether with high probability, a large fraction of people in a given finite population network can make "good" inference. Serving as a foundation, this concept enables us to study the long run trend of aggregate inference quality as population grows.
NASA Astrophysics Data System (ADS)
Mandel, Kaisey; Kirshner, R. P.; Narayan, G.; Wood-Vasey, W. M.; Friedman, A. S.; Hicken, M.
2010-01-01
I have constructed a comprehensive statistical model for Type Ia supernova light curves spanning optical through near infrared data simultaneously. The near infrared light curves are found to be excellent standard candles (sigma(MH) = 0.11 +/- 0.03 mag) that are less vulnerable to systematic error from dust extinction, a major confounding factor for cosmological studies. A hierarchical statistical framework incorporates coherently multiple sources of randomness and uncertainty, including photometric error, intrinsic supernova light curve variations and correlations, dust extinction and reddening, peculiar velocity dispersion and distances, for probabilistic inference with Type Ia SN light curves. Inferences are drawn from the full probability density over individual supernovae and the SN Ia and dust populations, conditioned on a dataset of SN Ia light curves and redshifts. To compute probabilistic inferences with hierarchical models, I have developed BayeSN, a Markov Chain Monte Carlo algorithm based on Gibbs sampling. This code explores and samples the global probability density of parameters describing individual supernovae and the population. I have applied this hierarchical model to optical and near infrared data of over 100 nearby Type Ia SN from PAIRITEL, the CfA3 sample, and the literature. Using this statistical model, I find that SN with optical and NIR data have a smaller residual scatter in the Hubble diagram than SN with only optical data. The continued study of Type Ia SN in the near infrared will be important for improving their utility as precise and accurate cosmological distance indicators.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Marzouk, Youssef
Predictive simulation of complex physical systems increasingly rests on the interplay of experimental observations with computational models. Key inputs, parameters, or structural aspects of models may be incomplete or unknown, and must be developed from indirect and limited observations. At the same time, quantified uncertainties are needed to qualify computational predictions in the support of design and decision-making. In this context, Bayesian statistics provides a foundation for inference from noisy and limited data, but at prohibitive computional expense. This project intends to make rigorous predictive modeling *feasible* in complex physical systems, via accelerated and scalable tools for uncertainty quantification, Bayesianmore » inference, and experimental design. Specific objectives are as follows: 1. Develop adaptive posterior approximations and dimensionality reduction approaches for Bayesian inference in high-dimensional nonlinear systems. 2. Extend accelerated Bayesian methodologies to large-scale {\\em sequential} data assimilation, fully treating nonlinear models and non-Gaussian state and parameter distributions. 3. Devise efficient surrogate-based methods for Bayesian model selection and the learning of model structure. 4. Develop scalable simulation/optimization approaches to nonlinear Bayesian experimental design, for both parameter inference and model selection. 5. Demonstrate these inferential tools on chemical kinetic models in reacting flow, constructing and refining thermochemical and electrochemical models from limited data. Demonstrate Bayesian filtering on canonical stochastic PDEs and in the dynamic estimation of inhomogeneous subsurface properties and flow fields.« less
Statistical learning and selective inference.
Taylor, Jonathan; Tibshirani, Robert J
2015-06-23
We describe the problem of "selective inference." This addresses the following challenge: Having mined a set of data to find potential associations, how do we properly assess the strength of these associations? The fact that we have "cherry-picked"--searched for the strongest associations--means that we must set a higher bar for declaring significant the associations that we see. This challenge becomes more important in the era of big data and complex statistical modeling. The cherry tree (dataset) can be very large and the tools for cherry picking (statistical learning methods) are now very sophisticated. We describe some recent new developments in selective inference and illustrate their use in forward stepwise regression, the lasso, and principal components analysis.
Variations on Bayesian Prediction and Inference
2016-05-09
inference 2.2.1 Background There are a number of statistical inference problems that are not generally formulated via a full probability model...problem of inference about an unknown parameter, the Bayesian approach requires a full probability 1. REPORT DATE (DD-MM-YYYY) 4. TITLE AND...the problem of inference about an unknown parameter, the Bayesian approach requires a full probability model/likelihood which can be an obstacle
ERIC Educational Resources Information Center
Griffiths, Thomas L.; Tenenbaum, Joshua B.
2009-01-01
Inducing causal relationships from observations is a classic problem in scientific inference, statistics, and machine learning. It is also a central part of human learning, and a task that people perform remarkably well given its notorious difficulties. People can learn causal structure in various settings, from diverse forms of data: observations…
Meng, Xiang-He; Shen, Hui; Chen, Xiang-Ding; Xiao, Hong-Mei; Deng, Hong-Wen
2018-03-01
Genome-wide association studies (GWAS) have successfully identified numerous genetic variants associated with diverse complex phenotypes and diseases, and provided tremendous opportunities for further analyses using summary association statistics. Recently, Pickrell et al. developed a robust method for causal inference using independent putative causal SNPs. However, this method may fail to infer the causal relationship between two phenotypes when only a limited number of independent putative causal SNPs identified. Here, we extended Pickrell's method to make it more applicable for the general situations. We extended the causal inference method by replacing the putative causal SNPs with the lead SNPs (the set of the most significant SNPs in each independent locus) and tested the performance of our extended method using both simulation and empirical data. Simulations suggested that when the same number of genetic variants is used, our extended method had similar distribution of test statistic under the null model as well as comparable power under the causal model compared with the original method by Pickrell et al. But in practice, our extended method would generally be more powerful because the number of independent lead SNPs was often larger than the number of independent putative causal SNPs. And including more SNPs, on the other hand, would not cause more false positives. By applying our extended method to summary statistics from GWAS for blood metabolites and femoral neck bone mineral density (FN-BMD), we successfully identified ten blood metabolites that may causally influence FN-BMD. We extended a causal inference method for inferring putative causal relationship between two phenotypes using summary statistics from GWAS, and identified a number of potential causal metabolites for FN-BMD, which may provide novel insights into the pathophysiological mechanisms underlying osteoporosis.
Khan, Asaduzzaman; Chien, Chi-Wen; Bagraith, Karl S
2015-04-01
To investigate whether using a parametric statistic in comparing groups leads to different conclusions when using summative scores from rating scales compared with using their corresponding Rasch-based measures. A Monte Carlo simulation study was designed to examine between-group differences in the change scores derived from summative scores from rating scales, and those derived from their corresponding Rasch-based measures, using 1-way analysis of variance. The degree of inconsistency between the 2 scoring approaches (i.e. summative and Rasch-based) was examined, using varying sample sizes, scale difficulties and person ability conditions. This simulation study revealed scaling artefacts that could arise from using summative scores rather than Rasch-based measures for determining the changes between groups. The group differences in the change scores were statistically significant for summative scores under all test conditions and sample size scenarios. However, none of the group differences in the change scores were significant when using the corresponding Rasch-based measures. This study raises questions about the validity of the inference on group differences of summative score changes in parametric analyses. Moreover, it provides a rationale for the use of Rasch-based measures, which can allow valid parametric analyses of rating scale data.
Bootstrapping Methods Applied for Simulating Laboratory Works
ERIC Educational Resources Information Center
Prodan, Augustin; Campean, Remus
2005-01-01
Purpose: The aim of this work is to implement bootstrapping methods into software tools, based on Java. Design/methodology/approach: This paper presents a category of software e-tools aimed at simulating laboratory works and experiments. Findings: Both students and teaching staff use traditional statistical methods to infer the truth from sample…
ERIC Educational Resources Information Center
Hourigan, Mairéad; Leavy, Aisling
2016-01-01
As part of Japanese Lesson study research focusing on "comparing and describing likelihoods", fifth grade elementary students used real-world data in decision-making. Sporting statistics facilitated opportunities for informal inference, where data were used to make and justify predictions.
Optimal Predictions in Everyday Cognition: The Wisdom of Individuals or Crowds?
ERIC Educational Resources Information Center
Mozer, Michael C.; Pashler, Harold; Homaei, Hadjar
2008-01-01
Griffiths and Tenenbaum (2006) asked individuals to make predictions about the duration or extent of everyday events (e.g., cake baking times), and reported that predictions were optimal, employing Bayesian inference based on veridical prior distributions. Although the predictions conformed strikingly to statistics of the world, they reflect…
From Mere Coincidences to Meaningful Discoveries
ERIC Educational Resources Information Center
Griffiths, Thomas L.; Tenenbaum, Joshua B.
2007-01-01
People's reactions to coincidences are often cited as an illustration of the irrationality of human reasoning about chance. We argue that coincidences may be better understood in terms of rational statistical inference, based on their functional role in processes of causal discovery and theory revision. We present a formal definition of…
NASA Astrophysics Data System (ADS)
Goodman, Steven N.
1989-11-01
This dissertation explores the use of a mathematical measure of statistical evidence, the log likelihood ratio, in clinical trials. The methods and thinking behind the use of an evidential measure are contrasted with traditional methods of analyzing data, which depend primarily on a p-value as an estimate of the statistical strength of an observed data pattern. It is contended that neither the behavioral dictates of Neyman-Pearson hypothesis testing methods, nor the coherency dictates of Bayesian methods are realistic models on which to base inference. The use of the likelihood alone is applied to four aspects of trial design or conduct: the calculation of sample size, the monitoring of data, testing for the equivalence of two treatments, and meta-analysis--the combining of results from different trials. Finally, a more general model of statistical inference, using belief functions, is used to see if it is possible to separate the assessment of evidence from our background knowledge. It is shown that traditional and Bayesian methods can be modeled as two ends of a continuum of structured background knowledge, methods which summarize evidence at the point of maximum likelihood assuming no structure, and Bayesian methods assuming complete knowledge. Both schools are seen to be missing a concept of ignorance- -uncommitted belief. This concept provides the key to understanding the problem of sampling to a foregone conclusion and the role of frequency properties in statistical inference. The conclusion is that statistical evidence cannot be defined independently of background knowledge, and that frequency properties of an estimator are an indirect measure of uncommitted belief. Several likelihood summaries need to be used in clinical trials, with the quantitative disparity between summaries being an indirect measure of our ignorance. This conclusion is linked with parallel ideas in the philosophy of science and cognitive psychology.
Thermodynamics of statistical inference by cells.
Lang, Alex H; Fisher, Charles K; Mora, Thierry; Mehta, Pankaj
2014-10-03
The deep connection between thermodynamics, computation, and information is now well established both theoretically and experimentally. Here, we extend these ideas to show that thermodynamics also places fundamental constraints on statistical estimation and learning. To do so, we investigate the constraints placed by (nonequilibrium) thermodynamics on the ability of biochemical signaling networks to estimate the concentration of an external signal. We show that accuracy is limited by energy consumption, suggesting that there are fundamental thermodynamic constraints on statistical inference.
Analysis of a genetically structured variance heterogeneity model using the Box-Cox transformation.
Yang, Ye; Christensen, Ole F; Sorensen, Daniel
2011-02-01
Over recent years, statistical support for the presence of genetic factors operating at the level of the environmental variance has come from fitting a genetically structured heterogeneous variance model to field or experimental data in various species. Misleading results may arise due to skewness of the marginal distribution of the data. To investigate how the scale of measurement affects inferences, the genetically structured heterogeneous variance model is extended to accommodate the family of Box-Cox transformations. Litter size data in rabbits and pigs that had previously been analysed in the untransformed scale were reanalysed in a scale equal to the mode of the marginal posterior distribution of the Box-Cox parameter. In the rabbit data, the statistical evidence for a genetic component at the level of the environmental variance is considerably weaker than that resulting from an analysis in the original metric. In the pig data, the statistical evidence is stronger, but the coefficient of correlation between additive genetic effects affecting mean and variance changes sign, compared to the results in the untransformed scale. The study confirms that inferences on variances can be strongly affected by the presence of asymmetry in the distribution of data. We recommend that to avoid one important source of spurious inferences, future work seeking support for a genetic component acting on environmental variation using a parametric approach based on normality assumptions confirms that these are met.
Raffelt, David A.; Smith, Robert E.; Ridgway, Gerard R.; Tournier, J-Donald; Vaughan, David N.; Rose, Stephen; Henderson, Robert; Connelly, Alan
2015-01-01
In brain regions containing crossing fibre bundles, voxel-average diffusion MRI measures such as fractional anisotropy (FA) are difficult to interpret, and lack within-voxel single fibre population specificity. Recent work has focused on the development of more interpretable quantitative measures that can be associated with a specific fibre population within a voxel containing crossing fibres (herein we use fixel to refer to a specific fibre population within a single voxel). Unfortunately, traditional 3D methods for smoothing and cluster-based statistical inference cannot be used for voxel-based analysis of these measures, since the local neighbourhood for smoothing and cluster formation can be ambiguous when adjacent voxels may have different numbers of fixels, or ill-defined when they belong to different tracts. Here we introduce a novel statistical method to perform whole-brain fixel-based analysis called connectivity-based fixel enhancement (CFE). CFE uses probabilistic tractography to identify structurally connected fixels that are likely to share underlying anatomy and pathology. Probabilistic connectivity information is then used for tract-specific smoothing (prior to the statistical analysis) and enhancement of the statistical map (using a threshold-free cluster enhancement-like approach). To investigate the characteristics of the CFE method, we assessed sensitivity and specificity using a large number of combinations of CFE enhancement parameters and smoothing extents, using simulated pathology generated with a range of test-statistic signal-to-noise ratios in five different white matter regions (chosen to cover a broad range of fibre bundle features). The results suggest that CFE input parameters are relatively insensitive to the characteristics of the simulated pathology. We therefore recommend a single set of CFE parameters that should give near optimal results in future studies where the group effect is unknown. We then demonstrate the proposed method by comparing apparent fibre density between motor neurone disease (MND) patients with control subjects. The MND results illustrate the benefit of fixel-specific statistical inference in white matter regions that contain crossing fibres. PMID:26004503
ERIC Educational Resources Information Center
Noll, Jennifer; Hancock, Stacey
2015-01-01
This research investigates what students' use of statistical language can tell us about their conceptions of distribution and sampling in relation to informal inference. Prior research documents students' challenges in understanding ideas of distribution and sampling as tools for making informal statistical inferences. We know that these…
Arenas, Miguel
2015-04-01
NGS technologies present a fast and cheap generation of genomic data. Nevertheless, ancestral genome inference is not so straightforward due to complex evolutionary processes acting on this material such as inversions, translocations, and other genome rearrangements that, in addition to their implicit complexity, can co-occur and confound ancestral inferences. Recently, models of genome evolution that accommodate such complex genomic events are emerging. This letter explores these novel evolutionary models and proposes their incorporation into robust statistical approaches based on computer simulations, such as approximate Bayesian computation, that may produce a more realistic evolutionary analysis of genomic data. Advantages and pitfalls in using these analytical methods are discussed. Potential applications of these ancestral genomic inferences are also pointed out.
The visual system’s internal model of the world
Lee, Tai Sing
2015-01-01
The Bayesian paradigm has provided a useful conceptual theory for understanding perceptual computation in the brain. While the detailed neural mechanisms of Bayesian inference are not fully understood, recent computational and neurophysiological works have illuminated the underlying computational principles and representational architecture. The fundamental insights are that the visual system is organized as a modular hierarchy to encode an internal model of the world, and that perception is realized by statistical inference based on such internal model. In this paper, I will discuss and analyze the varieties of representational schemes of these internal models and how they might be used to perform learning and inference. I will argue for a unified theoretical framework for relating the internal models to the observed neural phenomena and mechanisms in the visual cortex. PMID:26566294
Mass Conservation and Inference of Metabolic Networks from High-Throughput Mass Spectrometry Data
Bandaru, Pradeep; Bansal, Mukesh
2011-01-01
Abstract We present a step towards the metabolome-wide computational inference of cellular metabolic reaction networks from metabolic profiling data, such as mass spectrometry. The reconstruction is based on identification of irreducible statistical interactions among the metabolite activities using the ARACNE reverse-engineering algorithm and on constraining possible metabolic transformations to satisfy the conservation of mass. The resulting algorithms are validated on synthetic data from an abridged computational model of Escherichia coli metabolism. Precision rates upwards of 50% are routinely observed for identification of full metabolic reactions, and recalls upwards of 20% are also seen. PMID:21314454
Overholser, Brian R; Sowinski, Kevin M
2007-12-01
Biostatistics is the application of statistics to biologic data. The field of statistics can be broken down into 2 fundamental parts: descriptive and inferential. Descriptive statistics are commonly used to categorize, display, and summarize data. Inferential statistics can be used to make predictions based on a sample obtained from a population or some large body of information. It is these inferences that are used to test specific research hypotheses. This 2-part review will outline important features of descriptive and inferential statistics as they apply to commonly conducted research studies in the biomedical literature. Part 1 in this issue will discuss fundamental topics of statistics and data analysis. Additionally, some of the most commonly used statistical tests found in the biomedical literature will be reviewed in Part 2 in the February 2008 issue.
Robust regression for large-scale neuroimaging studies.
Fritsch, Virgile; Da Mota, Benoit; Loth, Eva; Varoquaux, Gaël; Banaschewski, Tobias; Barker, Gareth J; Bokde, Arun L W; Brühl, Rüdiger; Butzek, Brigitte; Conrod, Patricia; Flor, Herta; Garavan, Hugh; Lemaitre, Hervé; Mann, Karl; Nees, Frauke; Paus, Tomas; Schad, Daniel J; Schümann, Gunter; Frouin, Vincent; Poline, Jean-Baptiste; Thirion, Bertrand
2015-05-01
Multi-subject datasets used in neuroimaging group studies have a complex structure, as they exhibit non-stationary statistical properties across regions and display various artifacts. While studies with small sample sizes can rarely be shown to deviate from standard hypotheses (such as the normality of the residuals) due to the poor sensitivity of normality tests with low degrees of freedom, large-scale studies (e.g. >100 subjects) exhibit more obvious deviations from these hypotheses and call for more refined models for statistical inference. Here, we demonstrate the benefits of robust regression as a tool for analyzing large neuroimaging cohorts. First, we use an analytic test based on robust parameter estimates; based on simulations, this procedure is shown to provide an accurate statistical control without resorting to permutations. Second, we show that robust regression yields more detections than standard algorithms using as an example an imaging genetics study with 392 subjects. Third, we show that robust regression can avoid false positives in a large-scale analysis of brain-behavior relationships with over 1500 subjects. Finally we embed robust regression in the Randomized Parcellation Based Inference (RPBI) method and demonstrate that this combination further improves the sensitivity of tests carried out across the whole brain. Altogether, our results show that robust procedures provide important advantages in large-scale neuroimaging group studies. Copyright © 2015 Elsevier Inc. All rights reserved.
Statistical fluctuations of an ocean surface inferred from shoes and ships
NASA Astrophysics Data System (ADS)
Lerche, Ian; Maubeuge, Frédéric
1995-12-01
This paper shows that it is possible to roughly estimate some ocean properties using simple time-dependent statistical models of ocean fluctuations. Based on a real incident, the loss by a vessel of a Nike shoes container in the North Pacific Ocean, a statistical model was tested on data sets consisting of the Nike shoes found by beachcombers a few months later. This statistical treatment of the shoes' motion allows one to infer velocity trends of the Pacific Ocean, together with their fluctuation strengths. The idea is to suppose that there is a mean bulk flow speed that can depend on location on the ocean surface and time. The fluctuations of the surface flow speed are then treated as statistically random. The distribution of shoes is described in space and time using Markov probability processes related to the mean and fluctuating ocean properties. The aim of the exercise is to provide some of the properties of the Pacific Ocean that are otherwise calculated using a sophisticated numerical model, OSCURS, where numerous data are needed. Relevant quantities are sharply estimated, which can be useful to (1) constrain output results from OSCURS computations, and (2) elucidate the behavior patterns of ocean flow characteristics on long time scales.
Managing heteroscedasticity in general linear models.
Rosopa, Patrick J; Schaffer, Meline M; Schroeder, Amber N
2013-09-01
Heteroscedasticity refers to a phenomenon where data violate a statistical assumption. This assumption is known as homoscedasticity. When the homoscedasticity assumption is violated, this can lead to increased Type I error rates or decreased statistical power. Because this can adversely affect substantive conclusions, the failure to detect and manage heteroscedasticity could have serious implications for theory, research, and practice. In addition, heteroscedasticity is not uncommon in the behavioral and social sciences. Thus, in the current article, we synthesize extant literature in applied psychology, econometrics, quantitative psychology, and statistics, and we offer recommendations for researchers and practitioners regarding available procedures for detecting heteroscedasticity and mitigating its effects. In addition to discussing the strengths and weaknesses of various procedures and comparing them in terms of existing simulation results, we describe a 3-step data-analytic process for detecting and managing heteroscedasticity: (a) fitting a model based on theory and saving residuals, (b) the analysis of residuals, and (c) statistical inferences (e.g., hypothesis tests and confidence intervals) involving parameter estimates. We also demonstrate this data-analytic process using an illustrative example. Overall, detecting violations of the homoscedasticity assumption and mitigating its biasing effects can strengthen the validity of inferences from behavioral and social science data.
Biological Parametric Mapping: A Statistical Toolbox for Multi-Modality Brain Image Analysis
Casanova, Ramon; Ryali, Srikanth; Baer, Aaron; Laurienti, Paul J.; Burdette, Jonathan H.; Hayasaka, Satoru; Flowers, Lynn; Wood, Frank; Maldjian, Joseph A.
2006-01-01
In recent years multiple brain MR imaging modalities have emerged; however, analysis methodologies have mainly remained modality specific. In addition, when comparing across imaging modalities, most researchers have been forced to rely on simple region-of-interest type analyses, which do not allow the voxel-by-voxel comparisons necessary to answer more sophisticated neuroscience questions. To overcome these limitations, we developed a toolbox for multimodal image analysis called biological parametric mapping (BPM), based on a voxel-wise use of the general linear model. The BPM toolbox incorporates information obtained from other modalities as regressors in a voxel-wise analysis, thereby permitting investigation of more sophisticated hypotheses. The BPM toolbox has been developed in MATLAB with a user friendly interface for performing analyses, including voxel-wise multimodal correlation, ANCOVA, and multiple regression. It has a high degree of integration with the SPM (statistical parametric mapping) software relying on it for visualization and statistical inference. Furthermore, statistical inference for a correlation field, rather than a widely-used T-field, has been implemented in the correlation analysis for more accurate results. An example with in-vivo data is presented demonstrating the potential of the BPM methodology as a tool for multimodal image analysis. PMID:17070709
Redshift data and statistical inference
NASA Technical Reports Server (NTRS)
Newman, William I.; Haynes, Martha P.; Terzian, Yervant
1994-01-01
Frequency histograms and the 'power spectrum analysis' (PSA) method, the latter developed by Yu & Peebles (1969), have been widely employed as techniques for establishing the existence of periodicities. We provide a formal analysis of these two classes of methods, including controlled numerical experiments, to better understand their proper use and application. In particular, we note that typical published applications of frequency histograms commonly employ far greater numbers of class intervals or bins than is advisable by statistical theory sometimes giving rise to the appearance of spurious patterns. The PSA method generates a sequence of random numbers from observational data which, it is claimed, is exponentially distributed with unit mean and variance, essentially independent of the distribution of the original data. We show that the derived random processes is nonstationary and produces a small but systematic bias in the usual estimate of the mean and variance. Although the derived variable may be reasonably described by an exponential distribution, the tail of the distribution is far removed from that of an exponential, thereby rendering statistical inference and confidence testing based on the tail of the distribution completely unreliable. Finally, we examine a number of astronomical examples wherein these methods have been used giving rise to widespread acceptance of statistically unconfirmed conclusions.
Wang, Yi; Wan, Jianwu; Guo, Jun; Cheung, Yiu-Ming; Yuen, Pong C; Yi Wang; Jianwu Wan; Jun Guo; Yiu-Ming Cheung; Yuen, Pong C; Cheung, Yiu-Ming; Guo, Jun; Yuen, Pong C; Wan, Jianwu; Wang, Yi
2018-07-01
Similarity search is essential to many important applications and often involves searching at scale on high-dimensional data based on their similarity to a query. In biometric applications, recent vulnerability studies have shown that adversarial machine learning can compromise biometric recognition systems by exploiting the biometric similarity information. Existing methods for biometric privacy protection are in general based on pairwise matching of secured biometric templates and have inherent limitations in search efficiency and scalability. In this paper, we propose an inference-based framework for privacy-preserving similarity search in Hamming space. Our approach builds on an obfuscated distance measure that can conceal Hamming distance in a dynamic interval. Such a mechanism enables us to systematically design statistically reliable methods for retrieving most likely candidates without knowing the exact distance values. We further propose to apply Montgomery multiplication for generating search indexes that can withstand adversarial similarity analysis, and show that information leakage in randomized Montgomery domains can be made negligibly small. Our experiments on public biometric datasets demonstrate that the inference-based approach can achieve a search accuracy close to the best performance possible with secure computation methods, but the associated cost is reduced by orders of magnitude compared to cryptographic primitives.
Models for inference in dynamic metacommunity systems
Dorazio, Robert M.; Kery, Marc; Royle, J. Andrew; Plattner, Matthias
2010-01-01
A variety of processes are thought to be involved in the formation and dynamics of species assemblages. For example, various metacommunity theories are based on differences in the relative contributions of dispersal of species among local communities and interactions of species within local communities. Interestingly, metacommunity theories continue to be advanced without much empirical validation. Part of the problem is that statistical models used to analyze typical survey data either fail to specify ecological processes with sufficient complexity or they fail to account for errors in detection of species during sampling. In this paper, we describe a statistical modeling framework for the analysis of metacommunity dynamics that is based on the idea of adopting a unified approach, multispecies occupancy modeling, for computing inferences about individual species, local communities of species, or the entire metacommunity of species. This approach accounts for errors in detection of species during sampling and also allows different metacommunity paradigms to be specified in terms of species- and location-specific probabilities of occurrence, extinction, and colonization: all of which are estimable. In addition, this approach can be used to address inference problems that arise in conservation ecology, such as predicting temporal and spatial changes in biodiversity for use in making conservation decisions. To illustrate, we estimate changes in species composition associated with the species-specific phenologies of flight patterns of butterflies in Switzerland for the purpose of estimating regional differences in biodiversity.
Multi-Agent Inference in Social Networks: A Finite Population Learning Approach
Tong, Xin; Zeng, Yao
2016-01-01
When people in a society want to make inference about some parameter, each person may want to use data collected by other people. Information (data) exchange in social networks is usually costly, so to make reliable statistical decisions, people need to trade off the benefits and costs of information acquisition. Conflicts of interests and coordination problems will arise in the process. Classical statistics does not consider people’s incentives and interactions in the data collection process. To address this imperfection, this work explores multi-agent Bayesian inference problems with a game theoretic social network model. Motivated by our interest in aggregate inference at the societal level, we propose a new concept, finite population learning, to address whether with high probability, a large fraction of people in a given finite population network can make “good” inference. Serving as a foundation, this concept enables us to study the long run trend of aggregate inference quality as population grows. PMID:27076691
An Assessment of the Impact of the Department of Defense Very-High-Speed Integrated Circuit Program.
1982-01-01
analysis, statistical inference, device physics and other such products of basic research. Examples of such information would be: analyses of properties of...TB , for a n-p-n silicon transitor with 1018 cm- 3 base-doping, TB = Wb 2/2Dw becomes 0.4 ps in this limit so that the base contributes little to delay
Discovering Shifts to Suicidal Ideation from Mental Health Content in Social Media
De Choudhury, Munmun; Kiciman, Emre; Dredze, Mark; Coppersmith, Glen; Kumar, Mrinal
2017-01-01
History of mental illness is a major factor behind suicide risk and ideation. However research efforts toward characterizing and forecasting this risk is limited due to the paucity of information regarding suicide ideation, exacerbated by the stigma of mental illness. This paper fills gaps in the literature by developing a statistical methodology to infer which individuals could undergo transitions from mental health discourse to suicidal ideation. We utilize semi-anonymous support communities on Reddit as unobtrusive data sources to infer the likelihood of these shifts. We develop language and interactional measures for this purpose, as well as a propensity score matching based statistical approach. Our approach allows us to derive distinct markers of shifts to suicidal ideation. These markers can be modeled in a prediction framework to identify individuals likely to engage in suicidal ideation in the future. We discuss societal and ethical implications of this research. PMID:29082385
NASA Astrophysics Data System (ADS)
Schmidt, S.; Heyns, P. S.; de Villiers, J. P.
2018-02-01
In this paper, a fault diagnostic methodology is developed which is able to detect, locate and trend gear faults under fluctuating operating conditions when only vibration data from a single transducer, measured on a healthy gearbox are available. A two-phase feature extraction and modelling process is proposed to infer the operating condition and based on the operating condition, to detect changes in the machine condition. Information from optimised machine and operating condition hidden Markov models are statistically combined to generate a discrepancy signal which is post-processed to infer the condition of the gearbox. The discrepancy signal is processed and combined with statistical methods for automatic fault detection and localisation and to perform fault trending over time. The proposed methodology is validated on experimental data and a tacholess order tracking methodology is used to enhance the cost-effectiveness of the diagnostic methodology.
Discovering Shifts to Suicidal Ideation from Mental Health Content in Social Media.
De Choudhury, Munmun; Kiciman, Emre; Dredze, Mark; Coppersmith, Glen; Kumar, Mrinal
2016-05-01
History of mental illness is a major factor behind suicide risk and ideation. However research efforts toward characterizing and forecasting this risk is limited due to the paucity of information regarding suicide ideation, exacerbated by the stigma of mental illness. This paper fills gaps in the literature by developing a statistical methodology to infer which individuals could undergo transitions from mental health discourse to suicidal ideation. We utilize semi-anonymous support communities on Reddit as unobtrusive data sources to infer the likelihood of these shifts. We develop language and interactional measures for this purpose, as well as a propensity score matching based statistical approach. Our approach allows us to derive distinct markers of shifts to suicidal ideation. These markers can be modeled in a prediction framework to identify individuals likely to engage in suicidal ideation in the future. We discuss societal and ethical implications of this research.
Linear models: permutation methods
Cade, B.S.; Everitt, B.S.; Howell, D.C.
2005-01-01
Permutation tests (see Permutation Based Inference) for the linear model have applications in behavioral studies when traditional parametric assumptions about the error term in a linear model are not tenable. Improved validity of Type I error rates can be achieved with properly constructed permutation tests. Perhaps more importantly, increased statistical power, improved robustness to effects of outliers, and detection of alternative distributional differences can be achieved by coupling permutation inference with alternative linear model estimators. For example, it is well-known that estimates of the mean in linear model are extremely sensitive to even a single outlying value of the dependent variable compared to estimates of the median [7, 19]. Traditionally, linear modeling focused on estimating changes in the center of distributions (means or medians). However, quantile regression allows distributional changes to be estimated in all or any selected part of a distribution or responses, providing a more complete statistical picture that has relevance to many biological questions [6]...
Bayesian inference of physiologically meaningful parameters from body sway measurements.
Tietäväinen, A; Gutmann, M U; Keski-Vakkuri, E; Corander, J; Hæggström, E
2017-06-19
The control of the human body sway by the central nervous system, muscles, and conscious brain is of interest since body sway carries information about the physiological status of a person. Several models have been proposed to describe body sway in an upright standing position, however, due to the statistical intractability of the more realistic models, no formal parameter inference has previously been conducted and the expressive power of such models for real human subjects remains unknown. Using the latest advances in Bayesian statistical inference for intractable models, we fitted a nonlinear control model to posturographic measurements, and we showed that it can accurately predict the sway characteristics of both simulated and real subjects. Our method provides a full statistical characterization of the uncertainty related to all model parameters as quantified by posterior probability density functions, which is useful for comparisons across subjects and test settings. The ability to infer intractable control models from sensor data opens new possibilities for monitoring and predicting body status in health applications.
NASA Technical Reports Server (NTRS)
Krantz, Timothy L.
2002-01-01
The Weibull distribution has been widely adopted for the statistical description and inference of fatigue data. This document provides user instructions, examples, and verification for software to analyze gear fatigue test data. The software was developed presuming the data are adequately modeled using a two-parameter Weibull distribution. The calculations are based on likelihood methods, and the approach taken is valid for data that include type 1 censoring. The software was verified by reproducing results published by others.
NASA Technical Reports Server (NTRS)
Kranz, Timothy L.
2002-01-01
The Weibull distribution has been widely adopted for the statistical description and inference of fatigue data. This document provides user instructions, examples, and verification for software to analyze gear fatigue test data. The software was developed presuming the data are adequately modeled using a two-parameter Weibull distribution. The calculations are based on likelihood methods, and the approach taken is valid for data that include type I censoring. The software was verified by reproducing results published by others.
Takemoto, Kazuhiro; Aie, Kazuki
2017-05-25
Host-pathogen interactions are important in a wide range of research fields. Given the importance of metabolic crosstalk between hosts and pathogens, a metabolic network-based reverse ecology method was proposed to infer these interactions. However, the validity of this method remains unclear because of the various explanations presented and the influence of potentially confounding factors that have thus far been neglected. We re-evaluated the importance of the reverse ecology method for evaluating host-pathogen interactions while statistically controlling for confounding effects using oxygen requirement, genome, metabolic network, and phylogeny data. Our data analyses showed that host-pathogen interactions were more strongly influenced by genome size, primary network parameters (e.g., number of edges), oxygen requirement, and phylogeny than the reserve ecology-based measures. These results indicate the limitations of the reverse ecology method; however, they do not discount the importance of adopting reverse ecology approaches altogether. Rather, we highlight the need for developing more suitable methods for inferring host-pathogen interactions and conducting more careful examinations of the relationships between metabolic networks and host-pathogen interactions.
Research participant compensation: A matter of statistical inference as well as ethics.
Swanson, David M; Betensky, Rebecca A
2015-11-01
The ethics of compensation of research subjects for participation in clinical trials has been debated for years. One ethical issue of concern is variation among subjects in the level of compensation for identical treatments. Surprisingly, the impact of variation on the statistical inferences made from trial results has not been examined. We seek to identify how variation in compensation may influence any existing dependent censoring in clinical trials, thereby also influencing inference about the survival curve, hazard ratio, or other measures of treatment efficacy. In simulation studies, we consider a model for how compensation structure may influence the censoring model. Under existing dependent censoring, we estimate survival curves under different compensation structures and observe how these structures induce variability in the estimates. We show through this model that if the compensation structure affects the censoring model and dependent censoring is present, then variation in that structure induces variation in the estimates and affects the accuracy of estimation and inference on treatment efficacy. From the perspectives of both ethics and statistical inference, standardization and transparency in the compensation of participants in clinical trials are warranted. Copyright © 2015 Elsevier Inc. All rights reserved.
ERIC Educational Resources Information Center
Kazak, Sibel; Pratt, Dave
2017-01-01
This study considers probability models as tools for both making informal statistical inferences and building stronger conceptual connections between data and chance topics in teaching statistics. In this paper, we aim to explore pre-service mathematics teachers' use of probability models for a chance game, where the sum of two dice matters in…
Statistical atlas based extrapolation of CT data
NASA Astrophysics Data System (ADS)
Chintalapani, Gouthami; Murphy, Ryan; Armiger, Robert S.; Lepisto, Jyri; Otake, Yoshito; Sugano, Nobuhiko; Taylor, Russell H.; Armand, Mehran
2010-02-01
We present a framework to estimate the missing anatomical details from a partial CT scan with the help of statistical shape models. The motivating application is periacetabular osteotomy (PAO), a technique for treating developmental hip dysplasia, an abnormal condition of the hip socket that, if untreated, may lead to osteoarthritis. The common goals of PAO are to reduce pain, joint subluxation and improve contact pressure distribution by increasing the coverage of the femoral head by the hip socket. While current diagnosis and planning is based on radiological measurements, because of significant structural variations in dysplastic hips, a computer-assisted geometrical and biomechanical planning based on CT data is desirable to help the surgeon achieve optimal joint realignments. Most of the patients undergoing PAO are young females, hence it is usually desirable to minimize the radiation dose by scanning only the joint portion of the hip anatomy. These partial scans, however, do not provide enough information for biomechanical analysis due to missing iliac region. A statistical shape model of full pelvis anatomy is constructed from a database of CT scans. The partial volume is first aligned with the statistical atlas using an iterative affine registration, followed by a deformable registration step and the missing information is inferred from the atlas. The atlas inferences are further enhanced by the use of X-ray images of the patient, which are very common in an osteotomy procedure. The proposed method is validated with a leave-one-out analysis method. Osteotomy cuts are simulated and the effect of atlas predicted models on the actual procedure is evaluated.
Phylogeography Takes a Relaxed Random Walk in Continuous Space and Time
Lemey, Philippe; Rambaut, Andrew; Welch, John J.; Suchard, Marc A.
2010-01-01
Research aimed at understanding the geographic context of evolutionary histories is burgeoning across biological disciplines. Recent endeavors attempt to interpret contemporaneous genetic variation in the light of increasingly detailed geographical and environmental observations. Such interest has promoted the development of phylogeographic inference techniques that explicitly aim to integrate such heterogeneous data. One promising development involves reconstructing phylogeographic history on a continuous landscape. Here, we present a Bayesian statistical approach to infer continuous phylogeographic diffusion using random walk models while simultaneously reconstructing the evolutionary history in time from molecular sequence data. Moreover, by accommodating branch-specific variation in dispersal rates, we relax the most restrictive assumption of the standard Brownian diffusion process and demonstrate increased statistical efficiency in spatial reconstructions of overdispersed random walks by analyzing both simulated and real viral genetic data. We further illustrate how drawing inference about summary statistics from a fully specified stochastic process over both sequence evolution and spatial movement reveals important characteristics of a rabies epidemic. Together with recent advances in discrete phylogeographic inference, the continuous model developments furnish a flexible statistical framework for biogeographical reconstructions that is easily expanded upon to accommodate various landscape genetic features. PMID:20203288
New robust statistical procedures for the polytomous logistic regression models.
Castilla, Elena; Ghosh, Abhik; Martin, Nirian; Pardo, Leandro
2018-05-17
This article derives a new family of estimators, namely the minimum density power divergence estimators, as a robust generalization of the maximum likelihood estimator for the polytomous logistic regression model. Based on these estimators, a family of Wald-type test statistics for linear hypotheses is introduced. Robustness properties of both the proposed estimators and the test statistics are theoretically studied through the classical influence function analysis. Appropriate real life examples are presented to justify the requirement of suitable robust statistical procedures in place of the likelihood based inference for the polytomous logistic regression model. The validity of the theoretical results established in the article are further confirmed empirically through suitable simulation studies. Finally, an approach for the data-driven selection of the robustness tuning parameter is proposed with empirical justifications. © 2018, The International Biometric Society.
Bal, Mert; Amasyali, M Fatih; Sever, Hayri; Kose, Guven; Demirhan, Ayse
2014-01-01
The importance of the decision support systems is increasingly supporting the decision making process in cases of uncertainty and the lack of information and they are widely used in various fields like engineering, finance, medicine, and so forth, Medical decision support systems help the healthcare personnel to select optimal method during the treatment of the patients. Decision support systems are intelligent software systems that support decision makers on their decisions. The design of decision support systems consists of four main subjects called inference mechanism, knowledge-base, explanation module, and active memory. Inference mechanism constitutes the basis of decision support systems. There are various methods that can be used in these mechanisms approaches. Some of these methods are decision trees, artificial neural networks, statistical methods, rule-based methods, and so forth. In decision support systems, those methods can be used separately or a hybrid system, and also combination of those methods. In this study, synthetic data with 10, 100, 1000, and 2000 records have been produced to reflect the probabilities on the ALARM network. The accuracy of 11 machine learning methods for the inference mechanism of medical decision support system is compared on various data sets.
Carvalho, Pedro; Marques, Rui Cunha
2016-02-15
This study aims to search for economies of size and scope in the Portuguese water sector applying Bayesian and classical statistics to make inference in stochastic frontier analysis (SFA). This study proves the usefulness and advantages of the application of Bayesian statistics for making inference in SFA over traditional SFA which just uses classical statistics. The resulting Bayesian methods allow overcoming some problems that arise in the application of the traditional SFA, such as the bias in small samples and skewness of residuals. In the present case study of the water sector in Portugal, these Bayesian methods provide more plausible and acceptable results. Based on the results obtained we found that there are important economies of output density, economies of size, economies of vertical integration and economies of scope in the Portuguese water sector, pointing out to the huge advantages in undertaking mergers by joining the retail and wholesale components and by joining the drinking water and wastewater services. Copyright © 2015 Elsevier B.V. All rights reserved.
Inference of median difference based on the Box-Cox model in randomized clinical trials.
Maruo, K; Isogawa, N; Gosho, M
2015-05-10
In randomized clinical trials, many medical and biological measurements are not normally distributed and are often skewed. The Box-Cox transformation is a powerful procedure for comparing two treatment groups for skewed continuous variables in terms of a statistical test. However, it is difficult to directly estimate and interpret the location difference between the two groups on the original scale of the measurement. We propose a helpful method that infers the difference of the treatment effect on the original scale in a more easily interpretable form. We also provide statistical analysis packages that consistently include an estimate of the treatment effect, covariance adjustments, standard errors, and statistical hypothesis tests. The simulation study that focuses on randomized parallel group clinical trials with two treatment groups indicates that the performance of the proposed method is equivalent to or better than that of the existing non-parametric approaches in terms of the type-I error rate and power. We illustrate our method with cluster of differentiation 4 data in an acquired immune deficiency syndrome clinical trial. Copyright © 2015 John Wiley & Sons, Ltd.
Statistical Inference and Spatial Patterns in Correlates of IQ
ERIC Educational Resources Information Center
Hassall, Christopher; Sherratt, Thomas N.
2011-01-01
Cross-national comparisons of IQ have become common since the release of a large dataset of international IQ scores. However, these studies have consistently failed to consider the potential lack of independence of these scores based on spatial proximity. To demonstrate the importance of this omission, we present a re-evaluation of several…
A Framework to Support Research on Informal Inferential Reasoning
ERIC Educational Resources Information Center
Zieffler, Andrew; Garfield, Joan; delMas, Robert; Reading, Chris
2008-01-01
Informal inferential reasoning is a relatively recent concept in the research literature. Several research studies have defined this type of cognitive process in slightly different ways. In this paper, a working definition of informal inferential reasoning based on an analysis of the key aspects of statistical inference, and on research from…
What's in a Name? Probabilistic Inference of Religious Community from South Asian Names
ERIC Educational Resources Information Center
Susewind, Raphael
2015-01-01
Fine-grained data on religious communities are often considered sensitive in South Asia and consequently remain inaccessible. Yet without such data, statistical research on communal relations and group-based inequality remains superficial, hampering the development of appropriate policy measures to prevent further social exclusion on the basis of…
Demographic inference under the coalescent in a spatial continuum.
Guindon, Stéphane; Guo, Hongbin; Welch, David
2016-10-01
Understanding population dynamics from the analysis of molecular and spatial data requires sound statistical modeling. Current approaches assume that populations are naturally partitioned into discrete demes, thereby failing to be relevant in cases where individuals are scattered on a spatial continuum. Other models predict the formation of increasingly tight clusters of individuals in space, which, again, conflicts with biological evidence. Building on recent theoretical work, we introduce a new genealogy-based inference framework that alleviates these issues. This approach effectively implements a stochastic model in which the distribution of individuals is homogeneous and stationary, thereby providing a relevant null model for the fluctuation of genetic diversity in time and space. Importantly, the spatial density of individuals in a population and their range of dispersal during the course of evolution are two parameters that can be inferred separately with this method. The validity of the new inference framework is confirmed with extensive simulations and the analysis of influenza sequences collected over five seasons in the USA. Copyright © 2016 Elsevier Inc. All rights reserved.
NASA Astrophysics Data System (ADS)
Albert, Carlo; Ulzega, Simone; Stoop, Ruedi
2016-04-01
Measured time-series of both precipitation and runoff are known to exhibit highly non-trivial statistical properties. For making reliable probabilistic predictions in hydrology, it is therefore desirable to have stochastic models with output distributions that share these properties. When parameters of such models have to be inferred from data, we also need to quantify the associated parametric uncertainty. For non-trivial stochastic models, however, this latter step is typically very demanding, both conceptually and numerically, and always never done in hydrology. Here, we demonstrate that methods developed in statistical physics make a large class of stochastic differential equation (SDE) models amenable to a full-fledged Bayesian parameter inference. For concreteness we demonstrate these methods by means of a simple yet non-trivial toy SDE model. We consider a natural catchment that can be described by a linear reservoir, at the scale of observation. All the neglected processes are assumed to happen at much shorter time-scales and are therefore modeled with a Gaussian white noise term, the standard deviation of which is assumed to scale linearly with the system state (water volume in the catchment). Even for constant input, the outputs of this simple non-linear SDE model show a wealth of desirable statistical properties, such as fat-tailed distributions and long-range correlations. Standard algorithms for Bayesian inference fail, for models of this kind, because their likelihood functions are extremely high-dimensional intractable integrals over all possible model realizations. The use of Kalman filters is illegitimate due to the non-linearity of the model. Particle filters could be used but become increasingly inefficient with growing number of data points. Hamiltonian Monte Carlo algorithms allow us to translate this inference problem to the problem of simulating the dynamics of a statistical mechanics system and give us access to most sophisticated methods that have been developed in the statistical physics community over the last few decades. We demonstrate that such methods, along with automated differentiation algorithms, allow us to perform a full-fledged Bayesian inference, for a large class of SDE models, in a highly efficient and largely automatized manner. Furthermore, our algorithm is highly parallelizable. For our toy model, discretized with a few hundred points, a full Bayesian inference can be performed in a matter of seconds on a standard PC.
NASA Astrophysics Data System (ADS)
Bakker, Arthur; Ben-Zvi, Dani; Makar, Katie
2017-12-01
To understand how statistical and other types of reasoning are coordinated with actions to reduce uncertainty, we conducted a case study in vocational education that involved statistical hypothesis testing. We analyzed an intern's research project in a hospital laboratory in which reducing uncertainties was crucial to make a valid statistical inference. In his project, the intern, Sam, investigated whether patients' blood could be sent through pneumatic post without influencing the measurement of particular blood components. We asked, in the process of making a statistical inference, how are reasons and actions coordinated to reduce uncertainty? For the analysis, we used the semantic theory of inferentialism, specifically, the concept of webs of reasons and actions—complexes of interconnected reasons for facts and actions; these reasons include premises and conclusions, inferential relations, implications, motives for action, and utility of tools for specific purposes in a particular context. Analysis of interviews with Sam, his supervisor and teacher as well as video data of Sam in the classroom showed that many of Sam's actions aimed to reduce variability, rule out errors, and thus reduce uncertainties so as to arrive at a valid inference. Interestingly, the decisive factor was not the outcome of a t test but of the reference change value, a clinical chemical measure of analytic and biological variability. With insights from this case study, we expect that students can be better supported in connecting statistics with context and in dealing with uncertainty.
Web-based tools for modelling and analysis of multivariate data: California ozone pollution activity
Dinov, Ivo D.; Christou, Nicolas
2014-01-01
This article presents a hands-on web-based activity motivated by the relation between human health and ozone pollution in California. This case study is based on multivariate data collected monthly at 20 locations in California between 1980 and 2006. Several strategies and tools for data interrogation and exploratory data analysis, model fitting and statistical inference on these data are presented. All components of this case study (data, tools, activity) are freely available online at: http://wiki.stat.ucla.edu/socr/index.php/SOCR_MotionCharts_CAOzoneData. Several types of exploratory (motion charts, box-and-whisker plots, spider charts) and quantitative (inference, regression, analysis of variance (ANOVA)) data analyses tools are demonstrated. Two specific human health related questions (temporal and geographic effects of ozone pollution) are discussed as motivational challenges. PMID:24465054
NASA Astrophysics Data System (ADS)
Moura, Ricardo; Sinha, Bimal; Coelho, Carlos A.
2017-06-01
The recent popularity of the use of synthetic data as a Statistical Disclosure Control technique has enabled the development of several methods of generating and analyzing such data, but almost always relying in asymptotic distributions and in consequence being not adequate for small sample datasets. Thus, a likelihood-based exact inference procedure is derived for the matrix of regression coefficients of the multivariate regression model, for multiply imputed synthetic data generated via Posterior Predictive Sampling. Since it is based in exact distributions this procedure may even be used in small sample datasets. Simulation studies compare the results obtained from the proposed exact inferential procedure with the results obtained from an adaptation of Reiters combination rule to multiply imputed synthetic datasets and an application to the 2000 Current Population Survey is discussed.
Dinov, Ivo D; Christou, Nicolas
2011-09-01
This article presents a hands-on web-based activity motivated by the relation between human health and ozone pollution in California. This case study is based on multivariate data collected monthly at 20 locations in California between 1980 and 2006. Several strategies and tools for data interrogation and exploratory data analysis, model fitting and statistical inference on these data are presented. All components of this case study (data, tools, activity) are freely available online at: http://wiki.stat.ucla.edu/socr/index.php/SOCR_MotionCharts_CAOzoneData. Several types of exploratory (motion charts, box-and-whisker plots, spider charts) and quantitative (inference, regression, analysis of variance (ANOVA)) data analyses tools are demonstrated. Two specific human health related questions (temporal and geographic effects of ozone pollution) are discussed as motivational challenges.
Some challenges with statistical inference in adaptive designs.
Hung, H M James; Wang, Sue-Jane; Yang, Peiling
2014-01-01
Adaptive designs have generated a great deal of attention to clinical trial communities. The literature contains many statistical methods to deal with added statistical uncertainties concerning the adaptations. Increasingly encountered in regulatory applications are adaptive statistical information designs that allow modification of sample size or related statistical information and adaptive selection designs that allow selection of doses or patient populations during the course of a clinical trial. For adaptive statistical information designs, a few statistical testing methods are mathematically equivalent, as a number of articles have stipulated, but arguably there are large differences in their practical ramifications. We pinpoint some undesirable features of these methods in this work. For adaptive selection designs, the selection based on biomarker data for testing the correlated clinical endpoints may increase statistical uncertainty in terms of type I error probability, and most importantly the increased statistical uncertainty may be impossible to assess.
Grover, Sandeep; Del Greco M, Fabiola; Stein, Catherine M; Ziegler, Andreas
2017-01-01
Confounding and reverse causality have prevented us from drawing meaningful clinical interpretation even in well-powered observational studies. Confounding may be attributed to our inability to randomize the exposure variable in observational studies. Mendelian randomization (MR) is one approach to overcome confounding. It utilizes one or more genetic polymorphisms as a proxy for the exposure variable of interest. Polymorphisms are randomly distributed in a population, they are static throughout an individual's lifetime, and may thus help in inferring directionality in exposure-outcome associations. Genome-wide association studies (GWAS) or meta-analyses of GWAS are characterized by large sample sizes and the availability of many single nucleotide polymorphisms (SNPs), making GWAS-based MR an attractive approach. GWAS-based MR comes with specific challenges, including multiple causality. Despite shortcomings, it still remains one of the most powerful techniques for inferring causality.With MR still an evolving concept with complex statistical challenges, the literature is relatively scarce in terms of providing working examples incorporating real datasets. In this chapter, we provide a step-by-step guide for causal inference based on the principles of MR with a real dataset using both individual and summary data from unrelated individuals. We suggest best possible practices and give recommendations based on the current literature.
INfORM: Inference of NetwOrk Response Modules.
Marwah, Veer Singh; Kinaret, Pia Anneli Sofia; Serra, Angela; Scala, Giovanni; Lauerma, Antti; Fortino, Vittorio; Greco, Dario
2018-06-15
Detecting and interpreting responsive modules from gene expression data by using network-based approaches is a common but laborious task. It often requires the application of several computational methods implemented in different software packages, forcing biologists to compile complex analytical pipelines. Here we introduce INfORM (Inference of NetwOrk Response Modules), an R shiny application that enables non-expert users to detect, evaluate and select gene modules with high statistical and biological significance. INfORM is a comprehensive tool for the identification of biologically meaningful response modules from consensus gene networks inferred by using multiple algorithms. It is accessible through an intuitive graphical user interface allowing for a level of abstraction from the computational steps. INfORM is freely available for academic use at https://github.com/Greco-Lab/INfORM. Supplementary data are available at Bioinformatics online.
PyClone: statistical inference of clonal population structure in cancer.
Roth, Andrew; Khattra, Jaswinder; Yap, Damian; Wan, Adrian; Laks, Emma; Biele, Justina; Ha, Gavin; Aparicio, Samuel; Bouchard-Côté, Alexandre; Shah, Sohrab P
2014-04-01
We introduce PyClone, a statistical model for inference of clonal population structures in cancers. PyClone is a Bayesian clustering method for grouping sets of deeply sequenced somatic mutations into putative clonal clusters while estimating their cellular prevalences and accounting for allelic imbalances introduced by segmental copy-number changes and normal-cell contamination. Single-cell sequencing validation demonstrates PyClone's accuracy.
Statistical Signal Models and Algorithms for Image Analysis
1984-10-25
In this report, two-dimensional stochastic linear models are used in developing algorithms for image analysis such as classification, segmentation, and object detection in images characterized by textured backgrounds. These models generate two-dimensional random processes as outputs to which statistical inference procedures can naturally be applied. A common thread throughout our algorithms is the interpretation of the inference procedures in terms of linear prediction
Statistical inference of the generation probability of T-cell receptors from sequence repertoires.
Murugan, Anand; Mora, Thierry; Walczak, Aleksandra M; Callan, Curtis G
2012-10-02
Stochastic rearrangement of germline V-, D-, and J-genes to create variable coding sequence for certain cell surface receptors is at the origin of immune system diversity. This process, known as "VDJ recombination", is implemented via a series of stochastic molecular events involving gene choices and random nucleotide insertions between, and deletions from, genes. We use large sequence repertoires of the variable CDR3 region of human CD4+ T-cell receptor beta chains to infer the statistical properties of these basic biochemical events. Because any given CDR3 sequence can be produced in multiple ways, the probability distribution of hidden recombination events cannot be inferred directly from the observed sequences; we therefore develop a maximum likelihood inference method to achieve this end. To separate the properties of the molecular rearrangement mechanism from the effects of selection, we focus on nonproductive CDR3 sequences in T-cell DNA. We infer the joint distribution of the various generative events that occur when a new T-cell receptor gene is created. We find a rich picture of correlation (and absence thereof), providing insight into the molecular mechanisms involved. The generative event statistics are consistent between individuals, suggesting a universal biochemical process. Our probabilistic model predicts the generation probability of any specific CDR3 sequence by the primitive recombination process, allowing us to quantify the potential diversity of the T-cell repertoire and to understand why some sequences are shared between individuals. We argue that the use of formal statistical inference methods, of the kind presented in this paper, will be essential for quantitative understanding of the generation and evolution of diversity in the adaptive immune system.
NASA Astrophysics Data System (ADS)
Shafii, M.; Tolson, B.; Matott, L. S.
2012-04-01
Hydrologic modeling has benefited from significant developments over the past two decades. This has resulted in building of higher levels of complexity into hydrologic models, which eventually makes the model evaluation process (parameter estimation via calibration and uncertainty analysis) more challenging. In order to avoid unreasonable parameter estimates, many researchers have suggested implementation of multi-criteria calibration schemes. Furthermore, for predictive hydrologic models to be useful, proper consideration of uncertainty is essential. Consequently, recent research has emphasized comprehensive model assessment procedures in which multi-criteria parameter estimation is combined with statistically-based uncertainty analysis routines such as Bayesian inference using Markov Chain Monte Carlo (MCMC) sampling. Such a procedure relies on the use of formal likelihood functions based on statistical assumptions, and moreover, the Bayesian inference structured on MCMC samplers requires a considerably large number of simulations. Due to these issues, especially in complex non-linear hydrological models, a variety of alternative informal approaches have been proposed for uncertainty analysis in the multi-criteria context. This study aims at exploring a number of such informal uncertainty analysis techniques in multi-criteria calibration of hydrological models. The informal methods addressed in this study are (i) Pareto optimality which quantifies the parameter uncertainty using the Pareto solutions, (ii) DDS-AU which uses the weighted sum of objective functions to derive the prediction limits, and (iii) GLUE which describes the total uncertainty through identification of behavioral solutions. The main objective is to compare such methods with MCMC-based Bayesian inference with respect to factors such as computational burden, and predictive capacity, which are evaluated based on multiple comparative measures. The measures for comparison are calculated both for calibration and evaluation periods. The uncertainty analysis methodologies are applied to a simple 5-parameter rainfall-runoff model, called HYMOD.
Emmert-Streib, Frank; Glazko, Galina V.; Altay, Gökmen; de Matos Simoes, Ricardo
2012-01-01
In this paper, we present a systematic and conceptual overview of methods for inferring gene regulatory networks from observational gene expression data. Further, we discuss two classic approaches to infer causal structures and compare them with contemporary methods by providing a conceptual categorization thereof. We complement the above by surveying global and local evaluation measures for assessing the performance of inference algorithms. PMID:22408642
Information Entropy Production of Maximum Entropy Markov Chains from Spike Trains
NASA Astrophysics Data System (ADS)
Cofré, Rodrigo; Maldonado, Cesar
2018-01-01
We consider the maximum entropy Markov chain inference approach to characterize the collective statistics of neuronal spike trains, focusing on the statistical properties of the inferred model. We review large deviations techniques useful in this context to describe properties of accuracy and convergence in terms of sampling size. We use these results to study the statistical fluctuation of correlations, distinguishability and irreversibility of maximum entropy Markov chains. We illustrate these applications using simple examples where the large deviation rate function is explicitly obtained for maximum entropy models of relevance in this field.
Statistical Inference for Data Adaptive Target Parameters.
Hubbard, Alan E; Kherad-Pajouh, Sara; van der Laan, Mark J
2016-05-01
Consider one observes n i.i.d. copies of a random variable with a probability distribution that is known to be an element of a particular statistical model. In order to define our statistical target we partition the sample in V equal size sub-samples, and use this partitioning to define V splits in an estimation sample (one of the V subsamples) and corresponding complementary parameter-generating sample. For each of the V parameter-generating samples, we apply an algorithm that maps the sample to a statistical target parameter. We define our sample-split data adaptive statistical target parameter as the average of these V-sample specific target parameters. We present an estimator (and corresponding central limit theorem) of this type of data adaptive target parameter. This general methodology for generating data adaptive target parameters is demonstrated with a number of practical examples that highlight new opportunities for statistical learning from data. This new framework provides a rigorous statistical methodology for both exploratory and confirmatory analysis within the same data. Given that more research is becoming "data-driven", the theory developed within this paper provides a new impetus for a greater involvement of statistical inference into problems that are being increasingly addressed by clever, yet ad hoc pattern finding methods. To suggest such potential, and to verify the predictions of the theory, extensive simulation studies, along with a data analysis based on adaptively determined intervention rules are shown and give insight into how to structure such an approach. The results show that the data adaptive target parameter approach provides a general framework and resulting methodology for data-driven science.
On System Engineering a Barter-Based Re-allocation of Space System Key Development Resources
NASA Astrophysics Data System (ADS)
Kosmann, William J.
NASA has had a decades-long problem with cost growth during the development of space science missions. Numerous agency-sponsored studies have produced average mission level development cost growths ranging from 23 to 77%. A new study of 26 historical NASA science instrument set developments using expert judgment to re-allocate key development resources has an average cost growth of 73.77%. Twice in history, during the Cassini and EOS-Terra science instrument developments, a barter-based mechanism has been used to re-allocate key development resources. The mean instrument set development cost growth was -1.55%. Performing a bivariate inference on the means of these two distributions, there is statistical evidence to support the claim that using a barter-based mechanism to re-allocate key instrument development resources will result in a lower expected cost growth than using the expert judgment approach. Agent-based discrete event simulation is the natural way to model a trade environment. A NetLogo agent-based barter-based simulation of science instrument development was created. The agent-based model was validated against the Cassini historical example, as the starting and ending instrument development conditions are available. The resulting validated agent-based barter-based science instrument resource re-allocation simulation was used to perform 300 instrument development simulations, using barter to re-allocate development resources. The mean cost growth was -3.365%. A bivariate inference on the means was performed to determine that additional significant statistical evidence exists to support a claim that using barter-based resource re-allocation will result in lower expected cost growth, with respect to the historical expert judgment approach. Barter-based key development resource re-allocation should work on science spacecraft development as well as it has worked on science instrument development. A new study of 28 historical NASA science spacecraft developments has an average cost growth of 46.04%. As barter-based key development resource re-allocation has never been tried in a spacecraft development, no historical results exist, and an inference on the means test is not possible. A simulation of using barter-based resource re-allocation should be developed. The NetLogo instrument development simulation should be modified to account for spacecraft development market participant differences. The resulting agent-based barter-based spacecraft resource re-allocation simulation would then be used to determine if significant statistical evidence exists to prove a claim that using barter-based resource re-allocation will result in lower expected cost growth.
Pilot points method for conditioning multiple-point statistical facies simulation on flow data
NASA Astrophysics Data System (ADS)
Ma, Wei; Jafarpour, Behnam
2018-05-01
We propose a new pilot points method for conditioning discrete multiple-point statistical (MPS) facies simulation on dynamic flow data. While conditioning MPS simulation on static hard data is straightforward, their calibration against nonlinear flow data is nontrivial. The proposed method generates conditional models from a conceptual model of geologic connectivity, known as a training image (TI), by strategically placing and estimating pilot points. To place pilot points, a score map is generated based on three sources of information: (i) the uncertainty in facies distribution, (ii) the model response sensitivity information, and (iii) the observed flow data. Once the pilot points are placed, the facies values at these points are inferred from production data and then are used, along with available hard data at well locations, to simulate a new set of conditional facies realizations. While facies estimation at the pilot points can be performed using different inversion algorithms, in this study the ensemble smoother (ES) is adopted to update permeability maps from production data, which are then used to statistically infer facies types at the pilot point locations. The developed method combines the information in the flow data and the TI by using the former to infer facies values at selected locations away from the wells and the latter to ensure consistent facies structure and connectivity where away from measurement locations. Several numerical experiments are used to evaluate the performance of the developed method and to discuss its important properties.
NASA Astrophysics Data System (ADS)
Ma, W.; Jafarpour, B.
2017-12-01
We develop a new pilot points method for conditioning discrete multiple-point statistical (MPS) facies simulation on dynamic flow data. While conditioning MPS simulation on static hard data is straightforward, their calibration against nonlinear flow data is nontrivial. The proposed method generates conditional models from a conceptual model of geologic connectivity, known as a training image (TI), by strategically placing and estimating pilot points. To place pilot points, a score map is generated based on three sources of information:: (i) the uncertainty in facies distribution, (ii) the model response sensitivity information, and (iii) the observed flow data. Once the pilot points are placed, the facies values at these points are inferred from production data and are used, along with available hard data at well locations, to simulate a new set of conditional facies realizations. While facies estimation at the pilot points can be performed using different inversion algorithms, in this study the ensemble smoother (ES) and its multiple data assimilation variant (ES-MDA) are adopted to update permeability maps from production data, which are then used to statistically infer facies types at the pilot point locations. The developed method combines the information in the flow data and the TI by using the former to infer facies values at select locations away from the wells and the latter to ensure consistent facies structure and connectivity where away from measurement locations. Several numerical experiments are used to evaluate the performance of the developed method and to discuss its important properties.
Structured statistical models of inductive reasoning.
Kemp, Charles; Tenenbaum, Joshua B
2009-01-01
Everyday inductive inferences are often guided by rich background knowledge. Formal models of induction should aim to incorporate this knowledge and should explain how different kinds of knowledge lead to the distinctive patterns of reasoning found in different inductive contexts. This article presents a Bayesian framework that attempts to meet both goals and describes [corrected] 4 applications of the framework: a taxonomic model, a spatial model, a threshold model, and a causal model. Each model makes probabilistic inferences about the extensions of novel properties, but the priors for the 4 models are defined over different kinds of structures that capture different relationships between the categories in a domain. The framework therefore shows how statistical inference can operate over structured background knowledge, and the authors argue that this interaction between structure and statistics is critical for explaining the power and flexibility of human reasoning.
Goyal, Ravi; De Gruttola, Victor
2018-01-30
Analysis of sexual history data intended to describe sexual networks presents many challenges arising from the fact that most surveys collect information on only a very small fraction of the population of interest. In addition, partners are rarely identified and responses are subject to reporting biases. Typically, each network statistic of interest, such as mean number of sexual partners for men or women, is estimated independently of other network statistics. There is, however, a complex relationship among networks statistics; and knowledge of these relationships can aid in addressing concerns mentioned earlier. We develop a novel method that constrains a posterior predictive distribution of a collection of network statistics in order to leverage the relationships among network statistics in making inference about network properties of interest. The method ensures that inference on network properties is compatible with an actual network. Through extensive simulation studies, we also demonstrate that use of this method can improve estimates in settings where there is uncertainty that arises both from sampling and from systematic reporting bias compared with currently available approaches to estimation. To illustrate the method, we apply it to estimate network statistics using data from the Chicago Health and Social Life Survey. Copyright © 2017 John Wiley & Sons, Ltd. Copyright © 2017 John Wiley & Sons, Ltd.
Sukumaran, Jeet; Knowles, L Lacey
2018-06-01
The development of process-based probabilistic models for historical biogeography has transformed the field by grounding it in modern statistical hypothesis testing. However, most of these models abstract away biological differences, reducing species to interchangeable lineages. We present here the case for reintegration of biology into probabilistic historical biogeographical models, allowing a broader range of questions about biogeographical processes beyond ancestral range estimation or simple correlation between a trait and a distribution pattern, as well as allowing us to assess how inferences about ancestral ranges themselves might be impacted by differential biological traits. We show how new approaches to inference might cope with the computational challenges resulting from the increased complexity of these trait-based historical biogeographical models. Copyright © 2018 Elsevier Ltd. All rights reserved.
Fast maximum likelihood estimation using continuous-time neural point process models.
Lepage, Kyle Q; MacDonald, Christopher J
2015-06-01
A recent report estimates that the number of simultaneously recorded neurons is growing exponentially. A commonly employed statistical paradigm using discrete-time point process models of neural activity involves the computation of a maximum-likelihood estimate. The time to computate this estimate, per neuron, is proportional to the number of bins in a finely spaced discretization of time. By using continuous-time models of neural activity and the optimally efficient Gaussian quadrature, memory requirements and computation times are dramatically decreased in the commonly encountered situation where the number of parameters p is much less than the number of time-bins n. In this regime, with q equal to the quadrature order, memory requirements are decreased from O(np) to O(qp), and the number of floating-point operations are decreased from O(np(2)) to O(qp(2)). Accuracy of the proposed estimates is assessed based upon physiological consideration, error bounds, and mathematical results describing the relation between numerical integration error and numerical error affecting both parameter estimates and the observed Fisher information. A check is provided which is used to adapt the order of numerical integration. The procedure is verified in simulation and for hippocampal recordings. It is found that in 95 % of hippocampal recordings a q of 60 yields numerical error negligible with respect to parameter estimate standard error. Statistical inference using the proposed methodology is a fast and convenient alternative to statistical inference performed using a discrete-time point process model of neural activity. It enables the employment of the statistical methodology available with discrete-time inference, but is faster, uses less memory, and avoids any error due to discretization.
NASA Astrophysics Data System (ADS)
Lee, K. David; Wiesenfeld, Eric; Gelfand, Andrew
2007-04-01
One of the greatest challenges in modern combat is maintaining a high level of timely Situational Awareness (SA). In many situations, computational complexity and accuracy considerations make the development and deployment of real-time, high-level inference tools very difficult. An innovative hybrid framework that combines Bayesian inference, in the form of Bayesian Networks, and Possibility Theory, in the form of Fuzzy Logic systems, has recently been introduced to provide a rigorous framework for high-level inference. In previous research, the theoretical basis and benefits of the hybrid approach have been developed. However, lacking is a concrete experimental comparison of the hybrid framework with traditional fusion methods, to demonstrate and quantify this benefit. The goal of this research, therefore, is to provide a statistical analysis on the comparison of the accuracy and performance of hybrid network theory, with pure Bayesian and Fuzzy systems and an inexact Bayesian system approximated using Particle Filtering. To accomplish this task, domain specific models will be developed under these different theoretical approaches and then evaluated, via Monte Carlo Simulation, in comparison to situational ground truth to measure accuracy and fidelity. Following this, a rigorous statistical analysis of the performance results will be performed, to quantify the benefit of hybrid inference to other fusion tools.
How to infer relative fitness from a sample of genomic sequences.
Dayarian, Adel; Shraiman, Boris I
2014-07-01
Mounting evidence suggests that natural populations can harbor extensive fitness diversity with numerous genomic loci under selection. It is also known that genealogical trees for populations under selection are quantifiably different from those expected under neutral evolution and described statistically by Kingman's coalescent. While differences in the statistical structure of genealogies have long been used as a test for the presence of selection, the full extent of the information that they contain has not been exploited. Here we demonstrate that the shape of the reconstructed genealogical tree for a moderately large number of random genomic samples taken from a fitness diverse, but otherwise unstructured, asexual population can be used to predict the relative fitness of individuals within the sample. To achieve this we define a heuristic algorithm, which we test in silico, using simulations of a Wright-Fisher model for a realistic range of mutation rates and selection strength. Our inferred fitness ranking is based on a linear discriminator that identifies rapidly coalescing lineages in the reconstructed tree. Inferred fitness ranking correlates strongly with actual fitness, with a genome in the top 10% ranked being in the top 20% fittest with false discovery rate of 0.1-0.3, depending on the mutation/selection parameters. The ranking also enables us to predict the genotypes that future populations inherit from the present one. While the inference accuracy increases monotonically with sample size, samples of 200 nearly saturate the performance. We propose that our approach can be used for inferring relative fitness of genomes obtained in single-cell sequencing of tumors and in monitoring viral outbreaks. Copyright © 2014 by the Genetics Society of America.
Gerhard, Felipe; Kispersky, Tilman; Gutierrez, Gabrielle J.; Marder, Eve; Kramer, Mark; Eden, Uri
2013-01-01
Identifying the structure and dynamics of synaptic interactions between neurons is the first step to understanding neural network dynamics. The presence of synaptic connections is traditionally inferred through the use of targeted stimulation and paired recordings or by post-hoc histology. More recently, causal network inference algorithms have been proposed to deduce connectivity directly from electrophysiological signals, such as extracellularly recorded spiking activity. Usually, these algorithms have not been validated on a neurophysiological data set for which the actual circuitry is known. Recent work has shown that traditional network inference algorithms based on linear models typically fail to identify the correct coupling of a small central pattern generating circuit in the stomatogastric ganglion of the crab Cancer borealis. In this work, we show that point process models of observed spike trains can guide inference of relative connectivity estimates that match the known physiological connectivity of the central pattern generator up to a choice of threshold. We elucidate the necessary steps to derive faithful connectivity estimates from a model that incorporates the spike train nature of the data. We then apply the model to measure changes in the effective connectivity pattern in response to two pharmacological interventions, which affect both intrinsic neural dynamics and synaptic transmission. Our results provide the first successful application of a network inference algorithm to a circuit for which the actual physiological synapses between neurons are known. The point process methodology presented here generalizes well to larger networks and can describe the statistics of neural populations. In general we show that advanced statistical models allow for the characterization of effective network structure, deciphering underlying network dynamics and estimating information-processing capabilities. PMID:23874181
A simulation-based evaluation of methods for inferring linear barriers to gene flow
Christopher Blair; Dana E. Weigel; Matthew Balazik; Annika T. H. Keeley; Faith M. Walker; Erin Landguth; Sam Cushman; Melanie Murphy; Lisette Waits; Niko Balkenhol
2012-01-01
Different analytical techniques used on the same data set may lead to different conclusions about the existence and strength of genetic structure. Therefore, reliable interpretation of the results from different methods depends on the efficacy and reliability of different statistical methods. In this paper, we evaluated the performance of multiple analytical methods to...
Statistical Inference-Based Cache Management for Mobile Learning
ERIC Educational Resources Information Center
Li, Qing; Zhao, Jianmin; Zhu, Xinzhong
2009-01-01
Supporting efficient data access in the mobile learning environment is becoming a hot research problem in recent years, and the problem becomes tougher when the clients are using light-weight mobile devices such as cell phones whose limited storage space prevents the clients from holding a large cache. A practical solution is to store the cache…
Preschool Center Care Quality Effects on Academic Achievement: An Instrumental Variables Analysis
ERIC Educational Resources Information Center
Auger, Anamarie; Farkas, George; Burchinal, Margaret R.; Duncan, Greg J.; Vandell, Deborah Lowe
2014-01-01
Much of child care research has focused on the effects of the quality of care in early childhood settings on children's school readiness skills. Although researchers increased the statistical rigor of their approaches over the past 15 years, researchers' ability to draw causal inferences has been limited because the studies are based on…
Slow Thinking and Deep Learning: Tversky and Kahneman's Taxi Cabs
ERIC Educational Resources Information Center
Bedwell, Mike
2015-01-01
This article is based on classroom application of a problem story constructed by Amos Tversky in the 1970s. His intention was to evaluate human beings' intuitions about statistical inference. The problem was revisited by his colleague, the Nobel Prize winner Daniel Kahneman. The aim of this article is to show how popular science textbooks can…
On the analysis of very small samples of Gaussian repeated measurements: an alternative approach.
Westgate, Philip M; Burchett, Woodrow W
2017-03-15
The analysis of very small samples of Gaussian repeated measurements can be challenging. First, due to a very small number of independent subjects contributing outcomes over time, statistical power can be quite small. Second, nuisance covariance parameters must be appropriately accounted for in the analysis in order to maintain the nominal test size. However, available statistical strategies that ensure valid statistical inference may lack power, whereas more powerful methods may have the potential for inflated test sizes. Therefore, we explore an alternative approach to the analysis of very small samples of Gaussian repeated measurements, with the goal of maintaining valid inference while also improving statistical power relative to other valid methods. This approach uses generalized estimating equations with a bias-corrected empirical covariance matrix that accounts for all small-sample aspects of nuisance correlation parameter estimation in order to maintain valid inference. Furthermore, the approach utilizes correlation selection strategies with the goal of choosing the working structure that will result in the greatest power. In our study, we show that when accurate modeling of the nuisance correlation structure impacts the efficiency of regression parameter estimation, this method can improve power relative to existing methods that yield valid inference. Copyright © 2017 John Wiley & Sons, Ltd. Copyright © 2017 John Wiley & Sons, Ltd.
Models for inference in dynamic metacommunity systems
Dorazio, R.M.; Kery, M.; Royle, J. Andrew; Plattner, M.
2010-01-01
A variety of processes are thought to be involved in the formation and dynamics of species assemblages. For example, various metacommunity theories are based on differences in the relative contributions of dispersal of species among local communities and interactions of species within local communities. Interestingly, metacommunity theories continue to be advanced without much empirical validation. Part of the problem is that statistical models used to analyze typical survey data either fail to specify ecological processes with sufficient complexity or they fail to account for errors in detection of species during sampling. In this paper, we describe a statistical modeling framework for the analysis of metacommunity dynamics that is based on the idea of adopting a unified approach, multispecies occupancy modeling, for computing inferences about individual species, local communities of species, or the entire metacommunity of species. This approach accounts for errors in detection of species during sampling and also allows different metacommunity paradigms to be specified in terms of species-and location-specific probabilities of occurrence, extinction, and colonization: all of which are estimable. In addition, this approach can be used to address inference problems that arise in conservation ecology, such as predicting temporal and spatial changes in biodiversity for use in making conservation decisions. To illustrate, we estimate changes in species composition associated with the species-specific phenologies of flight patterns of butterflies in Switzerland for the purpose of estimating regional differences in biodiversity. ?? 2010 by the Ecological Society of America.
Meyer, Patrick E; Lafitte, Frédéric; Bontempi, Gianluca
2008-10-29
This paper presents the R/Bioconductor package minet (version 1.1.6) which provides a set of functions to infer mutual information networks from a dataset. Once fed with a microarray dataset, the package returns a network where nodes denote genes, edges model statistical dependencies between genes and the weight of an edge quantifies the statistical evidence of a specific (e.g transcriptional) gene-to-gene interaction. Four different entropy estimators are made available in the package minet (empirical, Miller-Madow, Schurmann-Grassberger and shrink) as well as four different inference methods, namely relevance networks, ARACNE, CLR and MRNET. Also, the package integrates accuracy assessment tools, like F-scores, PR-curves and ROC-curves in order to compare the inferred network with a reference one. The package minet provides a series of tools for inferring transcriptional networks from microarray data. It is freely available from the Comprehensive R Archive Network (CRAN) as well as from the Bioconductor website.
Equivalent statistics and data interpretation.
Francis, Gregory
2017-08-01
Recent reform efforts in psychological science have led to a plethora of choices for scientists to analyze their data. A scientist making an inference about their data must now decide whether to report a p value, summarize the data with a standardized effect size and its confidence interval, report a Bayes Factor, or use other model comparison methods. To make good choices among these options, it is necessary for researchers to understand the characteristics of the various statistics used by the different analysis frameworks. Toward that end, this paper makes two contributions. First, it shows that for the case of a two-sample t test with known sample sizes, many different summary statistics are mathematically equivalent in the sense that they are based on the very same information in the data set. When the sample sizes are known, the p value provides as much information about a data set as the confidence interval of Cohen's d or a JZS Bayes factor. Second, this equivalence means that different analysis methods differ only in their interpretation of the empirical data. At first glance, it might seem that mathematical equivalence of the statistics suggests that it does not matter much which statistic is reported, but the opposite is true because the appropriateness of a reported statistic is relative to the inference it promotes. Accordingly, scientists should choose an analysis method appropriate for their scientific investigation. A direct comparison of the different inferential frameworks provides some guidance for scientists to make good choices and improve scientific practice.
Functional imaging with low-resolution brain electromagnetic tomography (LORETA): a review.
Pascual-Marqui, R D; Esslen, M; Kochi, K; Lehmann, D
2002-01-01
This paper reviews several recent publications that have successfully used the functional brain imaging method known as LORETA. Emphasis is placed on the electrophysiological and neuroanatomical basis of the method, on the localization properties of the method, and on the validation of the method in real experimental human data. Papers that criticize LORETA are briefly discussed. LORETA publications in the 1994-1997 period based localization inference on images of raw electric neuronal activity. In 1998, a series of papers appeared that based localization inference on the statistical parametric mapping methodology applied to high-time resolution LORETA images. Starting in 1999, quantitative neuroanatomy was added to the methodology, based on the digitized Talairach atlas provided by the Brain Imaging Centre, Montreal Neurological Institute. The combination of these methodological developments has placed LORETA at a level that compares favorably to the more classical functional imaging methods, such as PET and fMRI.
Teach a Confidence Interval for the Median in the First Statistics Course
ERIC Educational Resources Information Center
Howington, Eric B.
2017-01-01
Few introductory statistics courses consider statistical inference for the median. This article argues in favour of adding a confidence interval for the median to the first statistics course. Several methods suitable for introductory statistics students are identified and briefly reviewed.
Scandol, James P; Moore, Helen A
2012-01-01
Health Statistics NSW is a new web-based application developed by the Centre for Epidemiology and Research at the NSW Ministry of Health. The application is designed to be an efficient vehicle for the timely delivery of health statistics to a diverse audience including the general public, health planners, researchers, students and policy analysts. The development and implementation of this web application required the consideration of a series of competing demands such as: the public interest in providing health data while maintaining the privacy interests of the individuals whose health is being reported; reporting data at spatial scales of relevance to health planners while maintaining the statistical integrity of any inferences drawn; the use of hardware and software systems which are publicly accessible, scalable and robust, while ensuring high levels of security. These three competing demands and the relationships between them are discussed in the context of Health Statistics NSW.
Friction Laws Derived From the Acoustic Emissions of a Laboratory Fault by Machine Learning
NASA Astrophysics Data System (ADS)
Rouet-Leduc, B.; Hulbert, C.; Ren, C. X.; Bolton, D. C.; Marone, C.; Johnson, P. A.
2017-12-01
Fault friction controls nearly all aspects of fault rupture, yet it is only possible to measure in the laboratory. Here we describe laboratory experiments where acoustic emissions are recorded from the fault. We find that by applying a machine learning approach known as "extreme gradient boosting trees" to the continuous acoustical signal, the fault friction can be directly inferred, showing that instantaneous characteristics of the acoustic signal are a fingerprint of the frictional state. This machine learning-based inference leads to a simple law that links the acoustic signal to the friction state, and holds for every stress cycle the laboratory fault goes through. The approach does not use any other measured parameter than instantaneous statistics of the acoustic signal. This finding may have importance for inferring frictional characteristics from seismic waves in Earth where fault friction cannot be measured.
The influence of cognitive ability and instructional set on causal conditional inference.
Evans, Jonathan St B T; Handley, Simon J; Neilens, Helen; Over, David
2010-05-01
We report a large study in which participants are invited to draw inferences from causal conditional sentences with varying degrees of believability. General intelligence was measured, and participants were split into groups of high and low ability. Under strict deductive-reasoning instructions, it was observed that higher ability participants were significantly less influenced by prior belief than were those of lower ability. This effect disappeared, however, when pragmatic reasoning instructions were employed in a separate group. These findings are in accord with dual-process theories of reasoning. We also took detailed measures of beliefs in the conditional sentences used for the reasoning tasks. Statistical modelling showed that it is not belief in the conditional statement per se that is the causal factor, but rather correlates of it. Two different models of belief-based reasoning were found to fit the data according to the kind of instructions and the type of inference under consideration.
Progressive statistics for studies in sports medicine and exercise science.
Hopkins, William G; Marshall, Stephen W; Batterham, Alan M; Hanin, Juri
2009-01-01
Statistical guidelines and expert statements are now available to assist in the analysis and reporting of studies in some biomedical disciplines. We present here a more progressive resource for sample-based studies, meta-analyses, and case studies in sports medicine and exercise science. We offer forthright advice on the following controversial or novel issues: using precision of estimation for inferences about population effects in preference to null-hypothesis testing, which is inadequate for assessing clinical or practical importance; justifying sample size via acceptable precision or confidence for clinical decisions rather than via adequate power for statistical significance; showing SD rather than SEM, to better communicate the magnitude of differences in means and nonuniformity of error; avoiding purely nonparametric analyses, which cannot provide inferences about magnitude and are unnecessary; using regression statistics in validity studies, in preference to the impractical and biased limits of agreement; making greater use of qualitative methods to enrich sample-based quantitative projects; and seeking ethics approval for public access to the depersonalized raw data of a study, to address the need for more scrutiny of research and better meta-analyses. Advice on less contentious issues includes the following: using covariates in linear models to adjust for confounders, to account for individual differences, and to identify potential mechanisms of an effect; using log transformation to deal with nonuniformity of effects and error; identifying and deleting outliers; presenting descriptive, effect, and inferential statistics in appropriate formats; and contending with bias arising from problems with sampling, assignment, blinding, measurement error, and researchers' prejudices. This article should advance the field by stimulating debate, promoting innovative approaches, and serving as a useful checklist for authors, reviewers, and editors.
Using Semantic Association to Extend and Infer Literature-Oriented Relativity Between Terms.
Cheng, Liang; Li, Jie; Hu, Yang; Jiang, Yue; Liu, Yongzhuang; Chu, Yanshuo; Wang, Zhenxing; Wang, Yadong
2015-01-01
Relative terms often appear together in the literature. Methods have been presented for weighting relativity of pairwise terms by their co-occurring literature and inferring new relationship. Terms in the literature are also in the directed acyclic graph of ontologies, such as Gene Ontology and Disease Ontology. Therefore, semantic association between terms may help for establishing relativities between terms in literature. However, current methods do not use these associations. In this paper, an adjusted R-scaled score (ARSS) based on information content (ARSSIC) method is introduced to infer new relationship between terms. First, set inclusion relationship between terms of ontology was exploited to extend relationships between these terms and literature. Next, the ARSS method was presented to measure relativity between terms across ontologies according to these extensional relationships. Then, the ARSSIC method using ratios of information shared of term's ancestors was designed to infer new relationship between terms across ontologies. The result of the experiment shows that ARSS identified more pairs of statistically significant terms based on corresponding gene sets than other methods. And the high average area under the receiver operating characteristic curve (0.9293) shows that ARSSIC achieved a high true positive rate and a low false positive rate. Data is available at http://mlg.hit.edu.cn/ARSSIC/.
Pointwise probability reinforcements for robust statistical inference.
Frénay, Benoît; Verleysen, Michel
2014-02-01
Statistical inference using machine learning techniques may be difficult with small datasets because of abnormally frequent data (AFDs). AFDs are observations that are much more frequent in the training sample that they should be, with respect to their theoretical probability, and include e.g. outliers. Estimates of parameters tend to be biased towards models which support such data. This paper proposes to introduce pointwise probability reinforcements (PPRs): the probability of each observation is reinforced by a PPR and a regularisation allows controlling the amount of reinforcement which compensates for AFDs. The proposed solution is very generic, since it can be used to robustify any statistical inference method which can be formulated as a likelihood maximisation. Experiments show that PPRs can be easily used to tackle regression, classification and projection: models are freed from the influence of outliers. Moreover, outliers can be filtered manually since an abnormality degree is obtained for each observation. Copyright © 2013 Elsevier Ltd. All rights reserved.
Probability, statistics, and computational science.
Beerenwinkel, Niko; Siebourg, Juliane
2012-01-01
In this chapter, we review basic concepts from probability theory and computational statistics that are fundamental to evolutionary genomics. We provide a very basic introduction to statistical modeling and discuss general principles, including maximum likelihood and Bayesian inference. Markov chains, hidden Markov models, and Bayesian network models are introduced in more detail as they occur frequently and in many variations in genomics applications. In particular, we discuss efficient inference algorithms and methods for learning these models from partially observed data. Several simple examples are given throughout the text, some of which point to models that are discussed in more detail in subsequent chapters.
A Not-So-Fundamental Limitation on Studying Complex Systems with Statistics: Comment on Rabin (2011)
NASA Astrophysics Data System (ADS)
Thomas, Drew M.
2012-12-01
Although living organisms are affected by many interrelated and unidentified variables, this complexity does not automatically impose a fundamental limitation on statistical inference. Nor need one invoke such complexity as an explanation of the "Truth Wears Off" or "decline" effect; similar "decline" effects occur with far simpler systems studied in physics. Selective reporting and publication bias, and scientists' biases in favor of reporting eye-catching results (in general) or conforming to others' results (in physics) better explain this feature of the "Truth Wears Off" effect than Rabin's suggested limitation on statistical inference.
Influence function for robust phylogenetic reconstructions.
Bar-Hen, Avner; Mariadassou, Mahendra; Poursat, Marie-Anne; Vandenkoornhuyse, Philippe
2008-05-01
Based on the computation of the influence function, a tool to measure the impact of each piece of sampled data on the statistical inference of a parameter, we propose to analyze the support of the maximum-likelihood (ML) tree for each site. We provide a new tool for filtering data sets (nucleotides, amino acids, and others) in the context of ML phylogenetic reconstructions. Because different sites support different phylogenic topologies in different ways, outlier sites, that is, sites with a very negative influence value, are important: they can drastically change the topology resulting from the statistical inference. Therefore, these outlier sites must be clearly identified and their effects accounted for before drawing biological conclusions from the inferred tree. A matrix containing 158 fungal terminals all belonging to Chytridiomycota, Zygomycota, and Glomeromycota is analyzed. We show that removing the strongest outlier from the analysis strikingly modifies the ML topology, with a loss of as many as 20% of the internal nodes. As a result, estimating the topology on the filtered data set results in a topology with enhanced bootstrap support. From this analysis, the polyphyletic status of the fungal phyla Chytridiomycota and Zygomycota is reinforced, suggesting the necessity of revisiting the systematics of these fungal groups. We show the ability of influence function to produce new evolution hypotheses.
Efficient Moment-Based Inference of Admixture Parameters and Sources of Gene Flow
Levin, Alex; Reich, David; Patterson, Nick; Berger, Bonnie
2013-01-01
The recent explosion in available genetic data has led to significant advances in understanding the demographic histories of and relationships among human populations. It is still a challenge, however, to infer reliable parameter values for complicated models involving many populations. Here, we present MixMapper, an efficient, interactive method for constructing phylogenetic trees including admixture events using single nucleotide polymorphism (SNP) genotype data. MixMapper implements a novel two-phase approach to admixture inference using moment statistics, first building an unadmixed scaffold tree and then adding admixed populations by solving systems of equations that express allele frequency divergences in terms of mixture parameters. Importantly, all features of the model, including topology, sources of gene flow, branch lengths, and mixture proportions, are optimized automatically from the data and include estimates of statistical uncertainty. MixMapper also uses a new method to express branch lengths in easily interpretable drift units. We apply MixMapper to recently published data for Human Genome Diversity Cell Line Panel individuals genotyped on a SNP array designed especially for use in population genetics studies, obtaining confident results for 30 populations, 20 of them admixed. Notably, we confirm a signal of ancient admixture in European populations—including previously undetected admixture in Sardinians and Basques—involving a proportion of 20–40% ancient northern Eurasian ancestry. PMID:23709261
Hozo, Iztok; Schell, Michael J; Djulbegovic, Benjamin
2008-07-01
The absolute truth in research is unobtainable, as no evidence or research hypothesis is ever 100% conclusive. Therefore, all data and inferences can in principle be considered as "inconclusive." Scientific inference and decision-making need to take into account errors, which are unavoidable in the research enterprise. The errors can occur at the level of conclusions that aim to discern the truthfulness of research hypothesis based on the accuracy of research evidence and hypothesis, and decisions, the goal of which is to enable optimal decision-making under present and specific circumstances. To optimize the chance of both correct conclusions and correct decisions, the synthesis of all major statistical approaches to clinical research is needed. The integration of these approaches (frequentist, Bayesian, and decision-analytic) can be accomplished through formal risk:benefit (R:B) analysis. This chapter illustrates the rational choice of a research hypothesis using R:B analysis based on decision-theoretic expected utility theory framework and the concept of "acceptable regret" to calculate the threshold probability of the "truth" above which the benefit of accepting a research hypothesis outweighs its risks.
Gopinath, Kaundinya; Krishnamurthy, Venkatagiri; Sathian, K
2018-02-01
In a recent study, Eklund et al. employed resting-state functional magnetic resonance imaging data as a surrogate for null functional magnetic resonance imaging (fMRI) datasets and posited that cluster-wise family-wise error (FWE) rate-corrected inferences made by using parametric statistical methods in fMRI studies over the past two decades may have been invalid, particularly for cluster defining thresholds less stringent than p < 0.001; this was principally because the spatial autocorrelation functions (sACF) of fMRI data had been modeled incorrectly to follow a Gaussian form, whereas empirical data suggested otherwise. Here, we show that accounting for non-Gaussian signal components such as those arising from resting-state neural activity as well as physiological responses and motion artifacts in the null fMRI datasets yields first- and second-level general linear model analysis residuals with nearly uniform and Gaussian sACF. Further comparison with nonparametric permutation tests indicates that cluster-based FWE corrected inferences made with Gaussian spatial noise approximations are valid.
Interpretable inference on the mixed effect model with the Box-Cox transformation.
Maruo, K; Yamaguchi, Y; Noma, H; Gosho, M
2017-07-10
We derived results for inference on parameters of the marginal model of the mixed effect model with the Box-Cox transformation based on the asymptotic theory approach. We also provided a robust variance estimator of the maximum likelihood estimator of the parameters of this model in consideration of the model misspecifications. Using these results, we developed an inference procedure for the difference of the model median between treatment groups at the specified occasion in the context of mixed effects models for repeated measures analysis for randomized clinical trials, which provided interpretable estimates of the treatment effect. From simulation studies, it was shown that our proposed method controlled type I error of the statistical test for the model median difference in almost all the situations and had moderate or high performance for power compared with the existing methods. We illustrated our method with cluster of differentiation 4 (CD4) data in an AIDS clinical trial, where the interpretability of the analysis results based on our proposed method is demonstrated. Copyright © 2017 John Wiley & Sons, Ltd. Copyright © 2017 John Wiley & Sons, Ltd.
NASA Astrophysics Data System (ADS)
Noori, Roohollah; Safavi, Salman; Nateghi Shahrokni, Seyyed Afshin
2013-07-01
The five-day biochemical oxygen demand (BOD5) is one of the key parameters in water quality management. In this study, a novel approach, i.e., reduced-order adaptive neuro-fuzzy inference system (ROANFIS) model was developed for rapid estimation of BOD5. In addition, an uncertainty analysis of adaptive neuro-fuzzy inference system (ANFIS) and ROANFIS models was carried out based on Monte-Carlo simulation. Accuracy analysis of ANFIS and ROANFIS models based on both developed discrepancy ratio and threshold statistics revealed that the selected ROANFIS model was superior. Pearson correlation coefficient (R) and root mean square error for the best fitted ROANFIS model were 0.96 and 7.12, respectively. Furthermore, uncertainty analysis of the developed models indicated that the selected ROANFIS had less uncertainty than the ANFIS model and accurately forecasted BOD5 in the Sefidrood River Basin. Besides, the uncertainty analysis also showed that bracketed predictions by 95% confidence bound and d-factor in the testing steps for the selected ROANFIS model were 94% and 0.83, respectively.
Controlling false-negative errors in microarray differential expression analysis: a PRIM approach.
Cole, Steve W; Galic, Zoran; Zack, Jerome A
2003-09-22
Theoretical considerations suggest that current microarray screening algorithms may fail to detect many true differences in gene expression (Type II analytic errors). We assessed 'false negative' error rates in differential expression analyses by conventional linear statistical models (e.g. t-test), microarray-adapted variants (e.g. SAM, Cyber-T), and a novel strategy based on hold-out cross-validation. The latter approach employs the machine-learning algorithm Patient Rule Induction Method (PRIM) to infer minimum thresholds for reliable change in gene expression from Boolean conjunctions of fold-induction and raw fluorescence measurements. Monte Carlo analyses based on four empirical data sets show that conventional statistical models and their microarray-adapted variants overlook more than 50% of genes showing significant up-regulation. Conjoint PRIM prediction rules recover approximately twice as many differentially expressed transcripts while maintaining strong control over false-positive (Type I) errors. As a result, experimental replication rates increase and total analytic error rates decline. RT-PCR studies confirm that gene inductions detected by PRIM but overlooked by other methods represent true changes in mRNA levels. PRIM-based conjoint inference rules thus represent an improved strategy for high-sensitivity screening of DNA microarrays. Freestanding JAVA application at http://microarray.crump.ucla.edu/focus
Zhang, Qin; Yao, Quanying
2018-05-01
The dynamic uncertain causality graph (DUCG) is a newly presented framework for uncertain causality representation and probabilistic reasoning. It has been successfully applied to online fault diagnoses of large, complex industrial systems, and decease diagnoses. This paper extends the DUCG to model more complex cases than what could be previously modeled, e.g., the case in which statistical data are in different groups with or without overlap, and some domain knowledge and actions (new variables with uncertain causalities) are introduced. In other words, this paper proposes to use -mode, -mode, and -mode of the DUCG to model such complex cases and then transform them into either the standard -mode or the standard -mode. In the former situation, if no directed cyclic graph is involved, the transformed result is simply a Bayesian network (BN), and existing inference methods for BNs can be applied. In the latter situation, an inference method based on the DUCG is proposed. Examples are provided to illustrate the methodology.
Reconstruction of a Real World Social Network using the Potts Model and Loopy Belief Propagation.
Bisconti, Cristian; Corallo, Angelo; Fortunato, Laura; Gentile, Antonio A; Massafra, Andrea; Pellè, Piergiuseppe
2015-01-01
The scope of this paper is to test the adoption of a statistical model derived from Condensed Matter Physics, for the reconstruction of the structure of a social network. The inverse Potts model, traditionally applied to recursive observations of quantum states in an ensemble of particles, is here addressed to observations of the members' states in an organization and their (anti)correlations, thus inferring interactions as links among the members. Adopting proper (Bethe) approximations, such an inverse problem is showed to be tractable. Within an operational framework, this network-reconstruction method is tested for a small real-world social network, the Italian parliament. In this study case, it is easy to track statuses of the parliament members, using (co)sponsorships of law proposals as the initial dataset. In previous studies of similar activity-based networks, the graph structure was inferred directly from activity co-occurrences: here we compare our statistical reconstruction with such standard methods, outlining discrepancies and advantages.
Reconstruction of a Real World Social Network using the Potts Model and Loopy Belief Propagation
Bisconti, Cristian; Corallo, Angelo; Fortunato, Laura; Gentile, Antonio A.; Massafra, Andrea; Pellè, Piergiuseppe
2015-01-01
The scope of this paper is to test the adoption of a statistical model derived from Condensed Matter Physics, for the reconstruction of the structure of a social network. The inverse Potts model, traditionally applied to recursive observations of quantum states in an ensemble of particles, is here addressed to observations of the members' states in an organization and their (anti)correlations, thus inferring interactions as links among the members. Adopting proper (Bethe) approximations, such an inverse problem is showed to be tractable. Within an operational framework, this network-reconstruction method is tested for a small real-world social network, the Italian parliament. In this study case, it is easy to track statuses of the parliament members, using (co)sponsorships of law proposals as the initial dataset. In previous studies of similar activity-based networks, the graph structure was inferred directly from activity co-occurrences: here we compare our statistical reconstruction with such standard methods, outlining discrepancies and advantages. PMID:26617539
Jager, Tjalling
2013-02-05
The individuals of a species are not equal. These differences frustrate experimental biologists and ecotoxicologists who wish to study the response of a species (in general) to a treatment. In the analysis of data, differences between model predictions and observations on individual animals are usually treated as random measurement error around the true response. These deviations, however, are mainly caused by real differences between the individuals (e.g., differences in physiology and in initial conditions). Understanding these intraspecies differences, and accounting for them in the data analysis, will improve our understanding of the response to the treatment we are investigating and allow for a more powerful, less biased, statistical analysis. Here, I explore a basic scheme for statistical inference to estimate parameters governing stress that allows individuals to differ in their basic physiology. This scheme is illustrated using a simple toxicokinetic-toxicodynamic model and a data set for growth of the springtail Folsomia candida exposed to cadmium in food. This article should be seen as proof of concept; a first step in bringing more realism into the statistical inference for process-based models in ecotoxicology.
BiomeNet: A Bayesian Model for Inference of Metabolic Divergence among Microbial Communities
Chipman, Hugh; Gu, Hong; Bielawski, Joseph P.
2014-01-01
Metagenomics yields enormous numbers of microbial sequences that can be assigned a metabolic function. Using such data to infer community-level metabolic divergence is hindered by the lack of a suitable statistical framework. Here, we describe a novel hierarchical Bayesian model, called BiomeNet (Bayesian inference of metabolic networks), for inferring differential prevalence of metabolic subnetworks among microbial communities. To infer the structure of community-level metabolic interactions, BiomeNet applies a mixed-membership modelling framework to enzyme abundance information. The basic idea is that the mixture components of the model (metabolic reactions, subnetworks, and networks) are shared across all groups (microbiome samples), but the mixture proportions vary from group to group. Through this framework, the model can capture nested structures within the data. BiomeNet is unique in modeling each metagenome sample as a mixture of complex metabolic systems (metabosystems). The metabosystems are composed of mixtures of tightly connected metabolic subnetworks. BiomeNet differs from other unsupervised methods by allowing researchers to discriminate groups of samples through the metabolic patterns it discovers in the data, and by providing a framework for interpreting them. We describe a collapsed Gibbs sampler for inference of the mixture weights under BiomeNet, and we use simulation to validate the inference algorithm. Application of BiomeNet to human gut metagenomes revealed a metabosystem with greater prevalence among inflammatory bowel disease (IBD) patients. Based on the discriminatory subnetworks for this metabosystem, we inferred that the community is likely to be closely associated with the human gut epithelium, resistant to dietary interventions, and interfere with human uptake of an antioxidant connected to IBD. Because this metabosystem has a greater capacity to exploit host-associated glycans, we speculate that IBD-associated communities might arise from opportunist growth of bacteria that can circumvent the host's nutrient-based mechanism for bacterial partner selection. PMID:25412107
ERIC Educational Resources Information Center
Zhang, Jinming; Lu, Ting
2007-01-01
In practical applications of item response theory (IRT), item parameters are usually estimated first from a calibration sample. After treating these estimates as fixed and known, ability parameters are then estimated. However, the statistical inferences based on the estimated abilities can be misleading if the uncertainty of the item parameter…
Bayesian Inference: with ecological applications
Link, William A.; Barker, Richard J.
2010-01-01
This text provides a mathematically rigorous yet accessible and engaging introduction to Bayesian inference with relevant examples that will be of interest to biologists working in the fields of ecology, wildlife management and environmental studies as well as students in advanced undergraduate statistics.. This text opens the door to Bayesian inference, taking advantage of modern computational efficiencies and easily accessible software to evaluate complex hierarchical models.
Differences in Performance Among Test Statistics for Assessing Phylogenomic Model Adequacy.
Duchêne, David A; Duchêne, Sebastian; Ho, Simon Y W
2018-05-18
Statistical phylogenetic analyses of genomic data depend on models of nucleotide or amino acid substitution. The adequacy of these substitution models can be assessed using a number of test statistics, allowing the model to be rejected when it is found to provide a poor description of the evolutionary process. A potentially valuable use of model-adequacy test statistics is to identify when data sets are likely to produce unreliable phylogenetic estimates, but their differences in performance are rarely explored. We performed a comprehensive simulation study to identify test statistics that are sensitive to some of the most commonly cited sources of phylogenetic estimation error. Our results show that, for many test statistics, traditional thresholds for assessing model adequacy can fail to reject the model when the phylogenetic inferences are inaccurate and imprecise. This is particularly problematic when analysing loci that have few variable informative sites. We propose new thresholds for assessing substitution model adequacy and demonstrate their effectiveness in analyses of three phylogenomic data sets. These thresholds lead to frequent rejection of the model for loci that yield topological inferences that are imprecise and are likely to be inaccurate. We also propose the use of a summary statistic that provides a practical assessment of overall model adequacy. Our approach offers a promising means of enhancing model choice in genome-scale data sets, potentially leading to improvements in the reliability of phylogenomic inference.
Computational approaches to protein inference in shotgun proteomics
2012-01-01
Shotgun proteomics has recently emerged as a powerful approach to characterizing proteomes in biological samples. Its overall objective is to identify the form and quantity of each protein in a high-throughput manner by coupling liquid chromatography with tandem mass spectrometry. As a consequence of its high throughput nature, shotgun proteomics faces challenges with respect to the analysis and interpretation of experimental data. Among such challenges, the identification of proteins present in a sample has been recognized as an important computational task. This task generally consists of (1) assigning experimental tandem mass spectra to peptides derived from a protein database, and (2) mapping assigned peptides to proteins and quantifying the confidence of identified proteins. Protein identification is fundamentally a statistical inference problem with a number of methods proposed to address its challenges. In this review we categorize current approaches into rule-based, combinatorial optimization and probabilistic inference techniques, and present them using integer programing and Bayesian inference frameworks. We also discuss the main challenges of protein identification and propose potential solutions with the goal of spurring innovative research in this area. PMID:23176300
Inferring action structure and causal relationships in continuous sequences of human action.
Buchsbaum, Daphna; Griffiths, Thomas L; Plunkett, Dillon; Gopnik, Alison; Baldwin, Dare
2015-02-01
In the real world, causal variables do not come pre-identified or occur in isolation, but instead are embedded within a continuous temporal stream of events. A challenge faced by both human learners and machine learning algorithms is identifying subsequences that correspond to the appropriate variables for causal inference. A specific instance of this problem is action segmentation: dividing a sequence of observed behavior into meaningful actions, and determining which of those actions lead to effects in the world. Here we present a Bayesian analysis of how statistical and causal cues to segmentation should optimally be combined, as well as four experiments investigating human action segmentation and causal inference. We find that both people and our model are sensitive to statistical regularities and causal structure in continuous action, and are able to combine these sources of information in order to correctly infer both causal relationships and segmentation boundaries. Copyright © 2014. Published by Elsevier Inc.
Applications of statistics to medical science (1) Fundamental concepts.
Watanabe, Hiroshi
2011-01-01
The conceptual framework of statistical tests and statistical inferences are discussed, and the epidemiological background of statistics is briefly reviewed. This study is one of a series in which we survey the basics of statistics and practical methods used in medical statistics. Arguments related to actual statistical analysis procedures will be made in subsequent papers.
Analytic continuation of quantum Monte Carlo data by stochastic analytical inference.
Fuchs, Sebastian; Pruschke, Thomas; Jarrell, Mark
2010-05-01
We present an algorithm for the analytic continuation of imaginary-time quantum Monte Carlo data which is strictly based on principles of Bayesian statistical inference. Within this framework we are able to obtain an explicit expression for the calculation of a weighted average over possible energy spectra, which can be evaluated by standard Monte Carlo simulations, yielding as by-product also the distribution function as function of the regularization parameter. Our algorithm thus avoids the usual ad hoc assumptions introduced in similar algorithms to fix the regularization parameter. We apply the algorithm to imaginary-time quantum Monte Carlo data and compare the resulting energy spectra with those from a standard maximum-entropy calculation.
Astrophysical data analysis with information field theory
NASA Astrophysics Data System (ADS)
Enßlin, Torsten
2014-12-01
Non-parametric imaging and data analysis in astrophysics and cosmology can be addressed by information field theory (IFT), a means of Bayesian, data based inference on spatially distributed signal fields. IFT is a statistical field theory, which permits the construction of optimal signal recovery algorithms. It exploits spatial correlations of the signal fields even for nonlinear and non-Gaussian signal inference problems. The alleviation of a perception threshold for recovering signals of unknown correlation structure by using IFT will be discussed in particular as well as a novel improvement on instrumental self-calibration schemes. IFT can be applied to many areas. Here, applications in in cosmology (cosmic microwave background, large-scale structure) and astrophysics (galactic magnetism, radio interferometry) are presented.
Jacquin, Hugo; Gilson, Amy; Shakhnovich, Eugene; Cocco, Simona; Monasson, Rémi
2016-05-01
Inverse statistical approaches to determine protein structure and function from Multiple Sequence Alignments (MSA) are emerging as powerful tools in computational biology. However the underlying assumptions of the relationship between the inferred effective Potts Hamiltonian and real protein structure and energetics remain untested so far. Here we use lattice protein model (LP) to benchmark those inverse statistical approaches. We build MSA of highly stable sequences in target LP structures, and infer the effective pairwise Potts Hamiltonians from those MSA. We find that inferred Potts Hamiltonians reproduce many important aspects of 'true' LP structures and energetics. Careful analysis reveals that effective pairwise couplings in inferred Potts Hamiltonians depend not only on the energetics of the native structure but also on competing folds; in particular, the coupling values reflect both positive design (stabilization of native conformation) and negative design (destabilization of competing folds). In addition to providing detailed structural information, the inferred Potts models used as protein Hamiltonian for design of new sequences are able to generate with high probability completely new sequences with the desired folds, which is not possible using independent-site models. Those are remarkable results as the effective LP Hamiltonians used to generate MSA are not simple pairwise models due to the competition between the folds. Our findings elucidate the reasons for the success of inverse approaches to the modelling of proteins from sequence data, and their limitations.
Using Historical Data to Automatically Identify Air-Traffic Control Behavior
NASA Technical Reports Server (NTRS)
Lauderdale, Todd A.; Wu, Yuefeng; Tretto, Celeste
2014-01-01
This project seeks to develop statistical-based machine learning models to characterize the types of errors present when using current systems to predict future aircraft states. These models will be data-driven - based on large quantities of historical data. Once these models are developed, they will be used to infer situations in the historical data where an air-traffic controller intervened on an aircraft's route, even when there is no direct recording of this action.
Graffelman, Jan; Sánchez, Milagros; Cook, Samantha; Moreno, Victor
2013-01-01
In genetic association studies, tests for Hardy-Weinberg proportions are often employed as a quality control checking procedure. Missing genotypes are typically discarded prior to testing. In this paper we show that inference for Hardy-Weinberg proportions can be biased when missing values are discarded. We propose to use multiple imputation of missing values in order to improve inference for Hardy-Weinberg proportions. For imputation we employ a multinomial logit model that uses information from allele intensities and/or neighbouring markers. Analysis of an empirical data set of single nucleotide polymorphisms possibly related to colon cancer reveals that missing genotypes are not missing completely at random. Deviation from Hardy-Weinberg proportions is mostly due to a lack of heterozygotes. Inbreeding coefficients estimated by multiple imputation of the missings are typically lowered with respect to inbreeding coefficients estimated by discarding the missings. Accounting for missings by multiple imputation qualitatively changed the results of 10 to 17% of the statistical tests performed. Estimates of inbreeding coefficients obtained by multiple imputation showed high correlation with estimates obtained by single imputation using an external reference panel. Our conclusion is that imputation of missing data leads to improved statistical inference for Hardy-Weinberg proportions.
[Application of statistics on chronic-diseases-relating observational research papers].
Hong, Zhi-heng; Wang, Ping; Cao, Wei-hua
2012-09-01
To study the application of statistics on Chronic-diseases-relating observational research papers which were recently published in the Chinese Medical Association Magazines, with influential index above 0.5. Using a self-developed criterion, two investigators individually participated in assessing the application of statistics on Chinese Medical Association Magazines, with influential index above 0.5. Different opinions reached an agreement through discussion. A total number of 352 papers from 6 magazines, including the Chinese Journal of Epidemiology, Chinese Journal of Oncology, Chinese Journal of Preventive Medicine, Chinese Journal of Cardiology, Chinese Journal of Internal Medicine and Chinese Journal of Endocrinology and Metabolism, were reviewed. The rate of clear statement on the following contents as: research objectives, t target audience, sample issues, objective inclusion criteria and variable definitions were 99.43%, 98.57%, 95.43%, 92.86% and 96.87%. The correct rates of description on quantitative and qualitative data were 90.94% and 91.46%, respectively. The rates on correctly expressing the results, on statistical inference methods related to quantitative, qualitative data and modeling were 100%, 95.32% and 87.19%, respectively. 89.49% of the conclusions could directly response to the research objectives. However, 69.60% of the papers did not mention the exact names of the study design, statistically, that the papers were using. 11.14% of the papers were in lack of further statement on the exclusion criteria. Percentage of the papers that could clearly explain the sample size estimation only taking up as 5.16%. Only 24.21% of the papers clearly described the variable value assignment. Regarding the introduction on statistical conduction and on database methods, the rate was only 24.15%. 18.75% of the papers did not express the statistical inference methods sufficiently. A quarter of the papers did not use 'standardization' appropriately. As for the aspect of statistical inference, the rate of description on statistical testing prerequisite was only 24.12% while 9.94% papers did not even employ the statistical inferential method that should be used. The main deficiencies on the application of Statistics used in papers related to Chronic-diseases-related observational research were as follows: lack of sample-size determination, variable value assignment description not sufficient, methods on statistics were not introduced clearly or properly, lack of consideration for pre-requisition regarding the use of statistical inferences.
NASA Astrophysics Data System (ADS)
Cunningham, Laura; Holmes, Naomi; Bigler, Christian; Dadal, Anna; Bergman, Jonas; Eriksson, Lars; Brooks, Stephen; Langdon, Pete; Caseldine, Chris
2010-05-01
Over the past two decades considerable effort has been devoted to quantitatively reconstructing temperatures from biological proxies preserved in lake sediments, via transfer functions. Such transfer functions typically consist of modern sediment samples, collected over a broad environmental gradient. Correlations between the biological communities and environmental parameters observed over these broad gradients are assumed to be equally valid temporally. The predictive ability of such spatially based transfer functions has traditionally been assessed by comparisons of measured and inferred temperatures within the calibration sets, with little validation against historical data. Although statistical techniques such as bootstrapping may improve error estimation, this approach remains partly a circular argument. This raises the question of how reliable such reconstructions are for inferring past changes in temperature? In order to address this question, we used transfer functions to reconstruct July temperatures from diatoms and chironomids from several locations across northern Europe. The transfer functions used showed good internal calibration statistics (r2 = 0.66 - 0.91). The diatom and chironomid inferred July air temperatures were compared to local observational records. As the sediment records were non-annual, all data were first smoothed using a 15 yr moving average filter. None of the five biologically-inferred temperature records were correlated with the local meteorological records. Furthermore, diatom inferred temperatures did not agree with chironomid inferred temperatures from the same cores from the same sites. In an attempt to understand this poor performance the biological proxy data was compressed using principal component analysis (PCA), and the PCA axes compared to the local meteorological data. These analyses clearly demonstrated that July temperatures were not correlated with the biological data at these locations. Some correlations were observed between the biological proxies and autumn and spring temperatures, although this varied slightly between sites and proxies. For example, chironomid data from Iceland was most strongly correlated with temperatures in February, March and April whilst in northern Sweden, the chironomid data was most strongly correlated with temperatures in March, April and May. It is suggested that the biological data at these sites may be responding to changes in the length of the ice-free period or hydrological regimes (including snow melt), rather than temperature per se. Our findings demonstrate the need to validate inferred temperatures against local meteorological data. Where such validation cannot be undertaken, inferred temperature reconstructions should be treated cautiously.
Sileshi, G
2006-10-01
Researchers and regulatory agencies often make statistical inferences from insect count data using modelling approaches that assume homogeneous variance. Such models do not allow for formal appraisal of variability which in its different forms is the subject of interest in ecology. Therefore, the objectives of this paper were to (i) compare models suitable for handling variance heterogeneity and (ii) select optimal models to ensure valid statistical inferences from insect count data. The log-normal, standard Poisson, Poisson corrected for overdispersion, zero-inflated Poisson, the negative binomial distribution and zero-inflated negative binomial models were compared using six count datasets on foliage-dwelling insects and five families of soil-dwelling insects. Akaike's and Schwarz Bayesian information criteria were used for comparing the various models. Over 50% of the counts were zeros even in locally abundant species such as Ootheca bennigseni Weise, Mesoplatys ochroptera Stål and Diaecoderus spp. The Poisson model after correction for overdispersion and the standard negative binomial distribution model provided better description of the probability distribution of seven out of the 11 insects than the log-normal, standard Poisson, zero-inflated Poisson or zero-inflated negative binomial models. It is concluded that excess zeros and variance heterogeneity are common data phenomena in insect counts. If not properly modelled, these properties can invalidate the normal distribution assumptions resulting in biased estimation of ecological effects and jeopardizing the integrity of the scientific inferences. Therefore, it is recommended that statistical models appropriate for handling these data properties be selected using objective criteria to ensure efficient statistical inference.
Uncertainty and inference in the world of paleoecological data
NASA Astrophysics Data System (ADS)
McLachlan, J. S.; Dawson, A.; Dietze, M.; Finley, M.; Hooten, M.; Itter, M.; Jackson, S. T.; Marlon, J. R.; Raiho, A.; Tipton, J.; Williams, J.
2017-12-01
Proxy data in paleoecology and paleoclimatology share a common set of biases and uncertainties: spatiotemporal error associated with the taphonomic processes of deposition, preservation, and dating; calibration error between proxy data and the ecosystem states of interest; and error in the interpolation of calibrated estimates across space and time. Researchers often account for this daunting suite of challenges by applying qualitave expert judgment: inferring the past states of ecosystems and assessing the level of uncertainty in those states subjectively. The effectiveness of this approach can be seen by the extent to which future observations confirm previous assertions. Hierarchical Bayesian (HB) statistical approaches allow an alternative approach to accounting for multiple uncertainties in paleo data. HB estimates of ecosystem state formally account for each of the common uncertainties listed above. HB approaches can readily incorporate additional data, and data of different types into estimates of ecosystem state. And HB estimates of ecosystem state, with associated uncertainty, can be used to constrain forecasts of ecosystem dynamics based on mechanistic ecosystem models using data assimilation. Decisions about how to structure an HB model are also subjective, which creates a parallel framework for deciding how to interpret data from the deep past.Our group, the Paleoecological Observatory Network (PalEON), has applied hierarchical Bayesian statistics to formally account for uncertainties in proxy based estimates of past climate, fire, primary productivity, biomass, and vegetation composition. Our estimates often reveal new patterns of past ecosystem change, which is an unambiguously good thing, but we also often estimate a level of uncertainty that is uncomfortably high for many researchers. High levels of uncertainty are due to several features of the HB approach: spatiotemporal smoothing, the formal aggregation of multiple types of uncertainty, and a coarseness in statistical models of taphonomic process. Each of these features provides useful opportunities for statisticians and data-generating researchers to assess what we know about the signal and the noise in paleo data and to improve inference about past changes in ecosystem state.
Wright, Marvin N; Dankowski, Theresa; Ziegler, Andreas
2017-04-15
The most popular approach for analyzing survival data is the Cox regression model. The Cox model may, however, be misspecified, and its proportionality assumption may not always be fulfilled. An alternative approach for survival prediction is random forests for survival outcomes. The standard split criterion for random survival forests is the log-rank test statistic, which favors splitting variables with many possible split points. Conditional inference forests avoid this split variable selection bias. However, linear rank statistics are utilized by default in conditional inference forests to select the optimal splitting variable, which cannot detect non-linear effects in the independent variables. An alternative is to use maximally selected rank statistics for the split point selection. As in conditional inference forests, splitting variables are compared on the p-value scale. However, instead of the conditional Monte-Carlo approach used in conditional inference forests, p-value approximations are employed. We describe several p-value approximations and the implementation of the proposed random forest approach. A simulation study demonstrates that unbiased split variable selection is possible. However, there is a trade-off between unbiased split variable selection and runtime. In benchmark studies of prediction performance on simulated and real datasets, the new method performs better than random survival forests if informative dichotomous variables are combined with uninformative variables with more categories and better than conditional inference forests if non-linear covariate effects are included. In a runtime comparison, the method proves to be computationally faster than both alternatives, if a simple p-value approximation is used. Copyright © 2017 John Wiley & Sons, Ltd. Copyright © 2017 John Wiley & Sons, Ltd.
Model-based Bayesian inference for ROC data analysis
NASA Astrophysics Data System (ADS)
Lei, Tianhu; Bae, K. Ty
2013-03-01
This paper presents a study of model-based Bayesian inference to Receiver Operating Characteristics (ROC) data. The model is a simple version of general non-linear regression model. Different from Dorfman model, it uses a probit link function with a covariate variable having zero-one two values to express binormal distributions in a single formula. Model also includes a scale parameter. Bayesian inference is implemented by Markov Chain Monte Carlo (MCMC) method carried out by Bayesian analysis Using Gibbs Sampling (BUGS). Contrast to the classical statistical theory, Bayesian approach considers model parameters as random variables characterized by prior distributions. With substantial amount of simulated samples generated by sampling algorithm, posterior distributions of parameters as well as parameters themselves can be accurately estimated. MCMC-based BUGS adopts Adaptive Rejection Sampling (ARS) protocol which requires the probability density function (pdf) which samples are drawing from be log concave with respect to the targeted parameters. Our study corrects a common misconception and proves that pdf of this regression model is log concave with respect to its scale parameter. Therefore, ARS's requirement is satisfied and a Gaussian prior which is conjugate and possesses many analytic and computational advantages is assigned to the scale parameter. A cohort of 20 simulated data sets and 20 simulations from each data set are used in our study. Output analysis and convergence diagnostics for MCMC method are assessed by CODA package. Models and methods by using continuous Gaussian prior and discrete categorical prior are compared. Intensive simulations and performance measures are given to illustrate our practice in the framework of model-based Bayesian inference using MCMC method.
Geoscience in the Big Data Era: Are models obsolete?
NASA Astrophysics Data System (ADS)
Yuen, D. A.; Zheng, L.; Stark, P. B.; Morra, G.; Knepley, M.; Wang, X.
2016-12-01
In last few decades, the velocity, volume, and variety of geophysical data have increased, while the development of the Internet and distributed computing has led to the emergence of "data science." Fitting and running numerical models, especially based on PDEs, is the main consumer of flops in geoscience. Can large amounts of diverse data supplant modeling? Without the ability to conduct randomized, controlled experiments, causal inference requires understanding the physics. It is sometimes possible to predict well without understanding the system—if (1) the system is predictable, (2) data on "important" variables are available, and (3) the system changes slowly enough. And sometimes even a crude model can help the data "speak for themselves" much more clearly. For example, Shearer (1991) used a 1-dimensional velocity model to stack long-period seismograms, revealing upper mantle discontinuities. This was a "big data" approach: the main use of computing was in the data processing, rather than in modeling, yet the "signal" became clear. In contrast, modelers tend to use all available computing power to fit even more complex models, resulting in a cycle where uncertainty quantification (UQ) is never possible: even if realistic UQ required only 1,000 model evaluations, it is never in reach. To make more reliable inferences requires better data analysis and statistics, not more complex models. Geoscientists need to learn new skills and tools: sound software engineering practices; open programming languages suitable for big data; parallel and distributed computing; data visualization; and basic nonparametric, computationally based statistical inference, such as permutation tests. They should work reproducibly, scripting all analyses and avoiding point-and-click tools.
Interpreting support vector machine models for multivariate group wise analysis in neuroimaging
Gaonkar, Bilwaj; Shinohara, Russell T; Davatzikos, Christos
2015-01-01
Machine learning based classification algorithms like support vector machines (SVMs) have shown great promise for turning a high dimensional neuroimaging data into clinically useful decision criteria. However, tracing imaging based patterns that contribute significantly to classifier decisions remains an open problem. This is an issue of critical importance in imaging studies seeking to determine which anatomical or physiological imaging features contribute to the classifier’s decision, thereby allowing users to critically evaluate the findings of such machine learning methods and to understand disease mechanisms. The majority of published work addresses the question of statistical inference for support vector classification using permutation tests based on SVM weight vectors. Such permutation testing ignores the SVM margin, which is critical in SVM theory. In this work we emphasize the use of a statistic that explicitly accounts for the SVM margin and show that the null distributions associated with this statistic are asymptotically normal. Further, our experiments show that this statistic is a lot less conservative as compared to weight based permutation tests and yet specific enough to tease out multivariate patterns in the data. Thus, we can better understand the multivariate patterns that the SVM uses for neuroimaging based classification. PMID:26210913
Incorporating Biological Knowledge into Evaluation of Casual Regulatory Hypothesis
NASA Technical Reports Server (NTRS)
Chrisman, Lonnie; Langley, Pat; Bay, Stephen; Pohorille, Andrew; DeVincenzi, D. (Technical Monitor)
2002-01-01
Biological data can be scarce and costly to obtain. The small number of samples available typically limits statistical power and makes reliable inference of causal relations extremely difficult. However, we argue that statistical power can be increased substantially by incorporating prior knowledge and data from diverse sources. We present a Bayesian framework that combines information from different sources and we show empirically that this lets one make correct causal inferences with small sample sizes that otherwise would be impossible.
A Review of Some Aspects of Robust Inference for Time Series.
1984-09-01
REVIEW OF SOME ASPECTSOF ROBUST INFERNCE FOR TIME SERIES by Ad . Dougla Main TE "iAL REPOW No. 63 Septermber 1984 Department of Statistics University of ...clear. One cannot hope to have a good method for dealing with outliers in time series by using only an instantaneous nonlinear transformation of the data...AI.49 716 A REVIEWd OF SOME ASPECTS OF ROBUST INFERENCE FOR TIME 1/1 SERIES(U) WASHINGTON UNIV SEATTLE DEPT OF STATISTICS R D MARTIN SEP 84 TR-53
The researcher and the consultant: from testing to probability statements.
Hamra, Ghassan B; Stang, Andreas; Poole, Charles
2015-09-01
In the first instalment of this series, Stang and Poole provided an overview of Fisher significance testing (ST), Neyman-Pearson null hypothesis testing (NHT), and their unfortunate and unintended offspring, null hypothesis significance testing. In addition to elucidating the distinction between the first two and the evolution of the third, the authors alluded to alternative models of statistical inference; namely, Bayesian statistics. Bayesian inference has experienced a revival in recent decades, with many researchers advocating for its use as both a complement and an alternative to NHT and ST. This article will continue in the direction of the first instalment, providing practicing researchers with an introduction to Bayesian inference. Our work will draw on the examples and discussion of the previous dialogue.
Spectral likelihood expansions for Bayesian inference
NASA Astrophysics Data System (ADS)
Nagel, Joseph B.; Sudret, Bruno
2016-03-01
A spectral approach to Bayesian inference is presented. It pursues the emulation of the posterior probability density. The starting point is a series expansion of the likelihood function in terms of orthogonal polynomials. From this spectral likelihood expansion all statistical quantities of interest can be calculated semi-analytically. The posterior is formally represented as the product of a reference density and a linear combination of polynomial basis functions. Both the model evidence and the posterior moments are related to the expansion coefficients. This formulation avoids Markov chain Monte Carlo simulation and allows one to make use of linear least squares instead. The pros and cons of spectral Bayesian inference are discussed and demonstrated on the basis of simple applications from classical statistics and inverse modeling.
Brenton J. Dickinson; Brett J. Butler
2013-01-01
The USDA Forest Service's National Woodland Owner Survey (NWOS) is conducted to better understand the attitudes and behaviors of private forest ownerships, which control more than half of US forestland. Inferences about the populations of interest should be based on theoretically sound estimation procedures. A recent review of the procedures disclosed an error in...
Tan, Ming T; Liu, Jian-ping; Lao, Lixing
2012-08-01
Recently, proper use of the statistical methods in traditional Chinese medicine (TCM) randomized controlled trials (RCTs) has received increased attention. Statistical inference based on hypothesis testing is the foundation of clinical trials and evidence-based medicine. In this article, the authors described the methodological differences between literature published in Chinese and Western journals in the design and analysis of acupuncture RCTs and the application of basic statistical principles. In China, qualitative analysis method has been widely used in acupuncture and TCM clinical trials, while the between-group quantitative analysis methods on clinical symptom scores are commonly used in the West. The evidence for and against these analytical differences were discussed based on the data of RCTs assessing acupuncture for pain relief. The authors concluded that although both methods have their unique advantages, quantitative analysis should be used as the primary analysis while qualitative analysis can be a secondary criterion for analysis. The purpose of this paper is to inspire further discussion of such special issues in clinical research design and thus contribute to the increased scientific rigor of TCM research.
2006-09-01
STELLA and PowerLoomn. These modules comunicate with a knowledge basec using KIF and stan(lardl relational database systelnis using either standard...groups ontology as well as a rule that infers additional seed members based on joint participation in a terrorism event. EDB schema files are a special... terrorism links from the Ali Baba EDB. Our interpretation of such links is that they KOJAK Manual E-42 encode that two people committed an act of
NASA Astrophysics Data System (ADS)
Torres Irribarra, D.; Freund, R.; Fisher, W.; Wilson, M.
2015-02-01
Computer-based, online assessments modelled, designed, and evaluated for adaptively administered invariant measurement are uniquely suited to defining and maintaining traceability to standardized units in education. An assessment of this kind is embedded in the Assessing Data Modeling and Statistical Reasoning (ADM) middle school mathematics curriculum. Diagnostic information about middle school students' learning of statistics and modeling is provided via computer-based formative assessments for seven constructs that comprise a learning progression for statistics and modeling from late elementary through the middle school grades. The seven constructs are: Data Display, Meta-Representational Competence, Conceptions of Statistics, Chance, Modeling Variability, Theory of Measurement, and Informal Inference. The end product is a web-delivered system built with Ruby on Rails for use by curriculum development teams working with classroom teachers in designing, developing, and delivering formative assessments. The online accessible system allows teachers to accurately diagnose students' unique comprehension and learning needs in a common language of real-time assessment, logging, analysis, feedback, and reporting.
Hybrid regulatory models: a statistically tractable approach to model regulatory network dynamics.
Ocone, Andrea; Millar, Andrew J; Sanguinetti, Guido
2013-04-01
Computational modelling of the dynamics of gene regulatory networks is a central task of systems biology. For networks of small/medium scale, the dominant paradigm is represented by systems of coupled non-linear ordinary differential equations (ODEs). ODEs afford great mechanistic detail and flexibility, but calibrating these models to data is often an extremely difficult statistical problem. Here, we develop a general statistical inference framework for stochastic transcription-translation networks. We use a coarse-grained approach, which represents the system as a network of stochastic (binary) promoter and (continuous) protein variables. We derive an exact inference algorithm and an efficient variational approximation that allows scalable inference and learning of the model parameters. We demonstrate the power of the approach on two biological case studies, showing that the method allows a high degree of flexibility and is capable of testable novel biological predictions. http://homepages.inf.ed.ac.uk/gsanguin/software.html. Supplementary data are available at Bioinformatics online.
Statistical Estimation of Orbital Debris Populations with a Spectrum of Object Size
NASA Technical Reports Server (NTRS)
Xu, Y. -l; Horstman, M.; Krisko, P. H.; Liou, J. -C; Matney, M.; Stansbery, E. G.; Stokely, C. L.; Whitlock, D.
2008-01-01
Orbital debris is a real concern for the safe operations of satellites. In general, the hazard of debris impact is a function of the size and spatial distributions of the debris populations. To describe and characterize the debris environment as reliably as possible, the current NASA Orbital Debris Engineering Model (ORDEM2000) is being upgraded to a new version based on new and better quality data. The data-driven ORDEM model covers a wide range of object sizes from 10 microns to greater than 1 meter. This paper reviews the statistical process for the estimation of the debris populations in the new ORDEM upgrade, and discusses the representation of large-size (greater than or equal to 1 m and greater than or equal to 10 cm) populations by SSN catalog objects and the validation of the statistical approach. Also, it presents results for the populations with sizes of greater than or equal to 3.3 cm, greater than or equal to 1 cm, greater than or equal to 100 micrometers, and greater than or equal to 10 micrometers. The orbital debris populations used in the new version of ORDEM are inferred from data based upon appropriate reference (or benchmark) populations instead of the binning of the multi-dimensional orbital-element space. This paper describes all of the major steps used in the population-inference procedure for each size-range. Detailed discussions on data analysis, parameter definition, the correlation between parameters and data, and uncertainty assessment are included.
Reading biological processes from nucleotide sequences
NASA Astrophysics Data System (ADS)
Murugan, Anand
Cellular processes have traditionally been investigated by techniques of imaging and biochemical analysis of the molecules involved. The recent rapid progress in our ability to manipulate and read nucleic acid sequences gives us direct access to the genetic information that directs and constrains biological processes. While sequence data is being used widely to investigate genotype-phenotype relationships and population structure, here we use sequencing to understand biophysical mechanisms. We present work on two different systems. First, in chapter 2, we characterize the stochastic genetic editing mechanism that produces diverse T-cell receptors in the human immune system. We do this by inferring statistical distributions of the underlying biochemical events that generate T-cell receptor coding sequences from the statistics of the observed sequences. This inferred model quantitatively describes the potential repertoire of T-cell receptors that can be produced by an individual, providing insight into its potential diversity and the probability of generation of any specific T-cell receptor. Then in chapter 3, we present work on understanding the functioning of regulatory DNA sequences in both prokaryotes and eukaryotes. Here we use experiments that measure the transcriptional activity of large libraries of mutagenized promoters and enhancers and infer models of the sequence-function relationship from this data. For the bacterial promoter, we infer a physically motivated 'thermodynamic' model of the interaction of DNA-binding proteins and RNA polymerase determining the transcription rate of the downstream gene. For the eukaryotic enhancers, we infer heuristic models of the sequence-function relationship and use these models to find synthetic enhancer sequences that optimize inducibility of expression. Both projects demonstrate the utility of sequence information in conjunction with sophisticated statistical inference techniques for dissecting underlying biophysical mechanisms.
Suner, Aslı; Karakülah, Gökhan; Dicle, Oğuz
2014-01-01
Statistical hypothesis testing is an essential component of biological and medical studies for making inferences and estimations from the collected data in the study; however, the misuse of statistical tests is widely common. In order to prevent possible errors in convenient statistical test selection, it is currently possible to consult available test selection algorithms developed for various purposes. However, the lack of an algorithm presenting the most common statistical tests used in biomedical research in a single flowchart causes several problems such as shifting users among the algorithms, poor decision support in test selection and lack of satisfaction of potential users. Herein, we demonstrated a unified flowchart; covers mostly used statistical tests in biomedical domain, to provide decision aid to non-statistician users while choosing the appropriate statistical test for testing their hypothesis. We also discuss some of the findings while we are integrating the flowcharts into each other to develop a single but more comprehensive decision algorithm.
Thompson, Steven K
2006-12-01
A flexible class of adaptive sampling designs is introduced for sampling in network and spatial settings. In the designs, selections are made sequentially with a mixture distribution based on an active set that changes as the sampling progresses, using network or spatial relationships as well as sample values. The new designs have certain advantages compared with previously existing adaptive and link-tracing designs, including control over sample sizes and of the proportion of effort allocated to adaptive selections. Efficient inference involves averaging over sample paths consistent with the minimal sufficient statistic. A Markov chain resampling method makes the inference computationally feasible. The designs are evaluated in network and spatial settings using two empirical populations: a hidden human population at high risk for HIV/AIDS and an unevenly distributed bird population.
Schloss, Patrick D; Handelsman, Jo
2006-10-01
The recent advent of tools enabling statistical inferences to be drawn from comparisons of microbial communities has enabled the focus of microbial ecology to move from characterizing biodiversity to describing the distribution of that biodiversity. Although statistical tools have been developed to compare community structures across a phylogenetic tree, we lack tools to compare the memberships and structures of two communities at a particular operational taxonomic unit (OTU) definition. Furthermore, current tests of community structure do not indicate the similarity of the communities but only report the probability of a statistical hypothesis. Here we present a computer program, SONS, which implements nonparametric estimators for the fraction and richness of OTUs shared between two communities.
CADDIS Volume 4. Data Analysis: PECBO Appendix - R Scripts for Non-Parametric Regressions
Script for computing nonparametric regression analysis. Overview of using scripts to infer environmental conditions from biological observations, statistically estimating species-environment relationships, statistical scripts.
ERIC Educational Resources Information Center
Mislevy, Robert J.
Educational test theory consists of statistical and methodological tools to support inferences about examinees' knowledge, skills, and accomplishments. The evolution of test theory has been shaped by the nature of users' inferences which, until recently, have been framed almost exclusively in terms of trait and behavioral psychology. Progress in…
Accurate continuous geographic assignment from low- to high-density SNP data.
Guillot, Gilles; Jónsson, Hákon; Hinge, Antoine; Manchih, Nabil; Orlando, Ludovic
2016-04-01
Large-scale genotype datasets can help track the dispersal patterns of epidemiological outbreaks and predict the geographic origins of individuals. Such genetically-based geographic assignments also show a range of possible applications in forensics for profiling both victims and criminals, and in wildlife management, where poaching hotspot areas can be located. They, however, require fast and accurate statistical methods to handle the growing amount of genetic information made available from genotype arrays and next-generation sequencing technologies. We introduce a novel statistical method for geopositioning individuals of unknown origin from genotypes. Our method is based on a geostatistical model trained with a dataset of georeferenced genotypes. Statistical inference under this model can be implemented within the theoretical framework of Integrated Nested Laplace Approximation, which represents one of the major recent breakthroughs in statistics, as it does not require Monte Carlo simulations. We compare the performance of our method and an alternative method for geospatial inference, SPA in a simulation framework. We highlight the accuracy and limits of continuous spatial assignment methods at various scales by analyzing genotype datasets from a diversity of species, including Florida Scrub-jay birds Aphelocoma coerulescens, Arabidopsis thaliana and humans, representing 41-197,146 SNPs. Our method appears to be best suited for the analysis of medium-sized datasets (a few tens of thousands of loci), such as reduced-representation sequencing data that become increasingly available in ecology. http://www2.imm.dtu.dk/∼gigu/Spasiba/ gilles.b.guillot@gmail.com Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Automating approximate Bayesian computation by local linear regression.
Thornton, Kevin R
2009-07-07
In several biological contexts, parameter inference often relies on computationally-intensive techniques. "Approximate Bayesian Computation", or ABC, methods based on summary statistics have become increasingly popular. A particular flavor of ABC based on using a linear regression to approximate the posterior distribution of the parameters, conditional on the summary statistics, is computationally appealing, yet no standalone tool exists to automate the procedure. Here, I describe a program to implement the method. The software package ABCreg implements the local linear-regression approach to ABC. The advantages are: 1. The code is standalone, and fully-documented. 2. The program will automatically process multiple data sets, and create unique output files for each (which may be processed immediately in R), facilitating the testing of inference procedures on simulated data, or the analysis of multiple data sets. 3. The program implements two different transformation methods for the regression step. 4. Analysis options are controlled on the command line by the user, and the program is designed to output warnings for cases where the regression fails. 5. The program does not depend on any particular simulation machinery (coalescent, forward-time, etc.), and therefore is a general tool for processing the results from any simulation. 6. The code is open-source, and modular.Examples of applying the software to empirical data from Drosophila melanogaster, and testing the procedure on simulated data, are shown. In practice, the ABCreg simplifies implementing ABC based on local-linear regression.
Optimal moment determination in POME-copula based hydrometeorological dependence modelling
NASA Astrophysics Data System (ADS)
Liu, Dengfeng; Wang, Dong; Singh, Vijay P.; Wang, Yuankun; Wu, Jichun; Wang, Lachun; Zou, Xinqing; Chen, Yuanfang; Chen, Xi
2017-07-01
Copula has been commonly applied in multivariate modelling in various fields where marginal distribution inference is a key element. To develop a flexible, unbiased mathematical inference framework in hydrometeorological multivariate applications, the principle of maximum entropy (POME) is being increasingly coupled with copula. However, in previous POME-based studies, determination of optimal moment constraints has generally not been considered. The main contribution of this study is the determination of optimal moments for POME for developing a coupled optimal moment-POME-copula framework to model hydrometeorological multivariate events. In this framework, margins (marginals, or marginal distributions) are derived with the use of POME, subject to optimal moment constraints. Then, various candidate copulas are constructed according to the derived margins, and finally the most probable one is determined, based on goodness-of-fit statistics. This optimal moment-POME-copula framework is applied to model the dependence patterns of three types of hydrometeorological events: (i) single-site streamflow-water level; (ii) multi-site streamflow; and (iii) multi-site precipitation, with data collected from Yichang and Hankou in the Yangtze River basin, China. Results indicate that the optimal-moment POME is more accurate in margin fitting and the corresponding copulas reflect a good statistical performance in correlation simulation. Also, the derived copulas, capturing more patterns which traditional correlation coefficients cannot reflect, provide an efficient way in other applied scenarios concerning hydrometeorological multivariate modelling.
Spectral Entropies as Information-Theoretic Tools for Complex Network Comparison
NASA Astrophysics Data System (ADS)
De Domenico, Manlio; Biamonte, Jacob
2016-10-01
Any physical system can be viewed from the perspective that information is implicitly represented in its state. However, the quantification of this information when it comes to complex networks has remained largely elusive. In this work, we use techniques inspired by quantum statistical mechanics to define an entropy measure for complex networks and to develop a set of information-theoretic tools, based on network spectral properties, such as Rényi q entropy, generalized Kullback-Leibler and Jensen-Shannon divergences, the latter allowing us to define a natural distance measure between complex networks. First, we show that by minimizing the Kullback-Leibler divergence between an observed network and a parametric network model, inference of model parameter(s) by means of maximum-likelihood estimation can be achieved and model selection can be performed with appropriate information criteria. Second, we show that the information-theoretic metric quantifies the distance between pairs of networks and we can use it, for instance, to cluster the layers of a multilayer system. By applying this framework to networks corresponding to sites of the human microbiome, we perform hierarchical cluster analysis and recover with high accuracy existing community-based associations. Our results imply that spectral-based statistical inference in complex networks results in demonstrably superior performance as well as a conceptual backbone, filling a gap towards a network information theory.
Local dependence in random graph models: characterization, properties and statistical inference
Schweinberger, Michael; Handcock, Mark S.
2015-01-01
Summary Dependent phenomena, such as relational, spatial and temporal phenomena, tend to be characterized by local dependence in the sense that units which are close in a well-defined sense are dependent. In contrast with spatial and temporal phenomena, though, relational phenomena tend to lack a natural neighbourhood structure in the sense that it is unknown which units are close and thus dependent. Owing to the challenge of characterizing local dependence and constructing random graph models with local dependence, many conventional exponential family random graph models induce strong dependence and are not amenable to statistical inference. We take first steps to characterize local dependence in random graph models, inspired by the notion of finite neighbourhoods in spatial statistics and M-dependence in time series, and we show that local dependence endows random graph models with desirable properties which make them amenable to statistical inference. We show that random graph models with local dependence satisfy a natural domain consistency condition which every model should satisfy, but conventional exponential family random graph models do not satisfy. In addition, we establish a central limit theorem for random graph models with local dependence, which suggests that random graph models with local dependence are amenable to statistical inference. We discuss how random graph models with local dependence can be constructed by exploiting either observed or unobserved neighbourhood structure. In the absence of observed neighbourhood structure, we take a Bayesian view and express the uncertainty about the neighbourhood structure by specifying a prior on a set of suitable neighbourhood structures. We present simulation results and applications to two real world networks with ‘ground truth’. PMID:26560142
NASA Technical Reports Server (NTRS)
Wilson, Robert M.
2014-01-01
Examined are sunspot cycle- (SC-) length averages of the annual January-December values of the Global Land-Ocean Temperature Index (
Sigma: Strain-level inference of genomes from metagenomic analysis for biosurveillance
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ahn, Tae-Hyuk; Chai, Juanjuan; Pan, Chongle
Motivation: Metagenomic sequencing of clinical samples provides a promising technique for direct pathogen detection and characterization in biosurveillance. Taxonomic analysis at the strain level can be used to resolve serotypes of a pathogen in biosurveillance. Sigma was developed for strain-level identification and quantification of pathogens using their reference genomes based on metagenomic analysis. Results: Sigma provides not only accurate strain-level inferences, but also three unique capabilities: (i) Sigma quantifies the statistical uncertainty of its inferences, which includes hypothesis testing of identified genomes and confidence interval estimation of their relative abundances; (ii) Sigma enables strain variant calling by assigning metagenomic readsmore » to their most likely reference genomes; and (iii) Sigma supports parallel computing for fast analysis of large datasets. In conclusion, the algorithm performance was evaluated using simulated mock communities and fecal samples with spike-in pathogen strains. Availability and Implementation: Sigma was implemented in C++ with source codes and binaries freely available at http://sigma.omicsbio.org.« less
Sigma: Strain-level inference of genomes from metagenomic analysis for biosurveillance
Ahn, Tae-Hyuk; Chai, Juanjuan; Pan, Chongle
2014-09-29
Motivation: Metagenomic sequencing of clinical samples provides a promising technique for direct pathogen detection and characterization in biosurveillance. Taxonomic analysis at the strain level can be used to resolve serotypes of a pathogen in biosurveillance. Sigma was developed for strain-level identification and quantification of pathogens using their reference genomes based on metagenomic analysis. Results: Sigma provides not only accurate strain-level inferences, but also three unique capabilities: (i) Sigma quantifies the statistical uncertainty of its inferences, which includes hypothesis testing of identified genomes and confidence interval estimation of their relative abundances; (ii) Sigma enables strain variant calling by assigning metagenomic readsmore » to their most likely reference genomes; and (iii) Sigma supports parallel computing for fast analysis of large datasets. In conclusion, the algorithm performance was evaluated using simulated mock communities and fecal samples with spike-in pathogen strains. Availability and Implementation: Sigma was implemented in C++ with source codes and binaries freely available at http://sigma.omicsbio.org.« less
Robust inference in discrete hazard models for randomized clinical trials.
Nguyen, Vinh Q; Gillen, Daniel L
2012-10-01
Time-to-event data in which failures are only assessed at discrete time points are common in many clinical trials. Examples include oncology studies where events are observed through periodic screenings such as radiographic scans. When the survival endpoint is acknowledged to be discrete, common methods for the analysis of observed failure times include the discrete hazard models (e.g., the discrete-time proportional hazards and the continuation ratio model) and the proportional odds model. In this manuscript, we consider estimation of a marginal treatment effect in discrete hazard models where the constant treatment effect assumption is violated. We demonstrate that the estimator resulting from these discrete hazard models is consistent for a parameter that depends on the underlying censoring distribution. An estimator that removes the dependence on the censoring mechanism is proposed and its asymptotic distribution is derived. Basing inference on the proposed estimator allows for statistical inference that is scientifically meaningful and reproducible. Simulation is used to assess the performance of the presented methodology in finite samples.
A Probabilistic Model of Meter Perception: Simulating Enculturation.
van der Weij, Bastiaan; Pearce, Marcus T; Honing, Henkjan
2017-01-01
Enculturation is known to shape the perception of meter in music but this is not explicitly accounted for by current cognitive models of meter perception. We hypothesize that the induction of meter is a result of predictive coding: interpreting onsets in a rhythm relative to a periodic meter facilitates prediction of future onsets. Such prediction, we hypothesize, is based on previous exposure to rhythms. As such, predictive coding provides a possible explanation for the way meter perception is shaped by the cultural environment. Based on this hypothesis, we present a probabilistic model of meter perception that uses statistical properties of the relation between rhythm and meter to infer meter from quantized rhythms. We show that our model can successfully predict annotated time signatures from quantized rhythmic patterns derived from folk melodies. Furthermore, we show that by inferring meter, our model improves prediction of the onsets of future events compared to a similar probabilistic model that does not infer meter. Finally, as a proof of concept, we demonstrate how our model can be used in a simulation of enculturation. From the results of this simulation, we derive a class of rhythms that are likely to be interpreted differently by enculturated listeners with different histories of exposure to rhythms.
Discrete Inverse and State Estimation Problems
NASA Astrophysics Data System (ADS)
Wunsch, Carl
2006-06-01
The problems of making inferences about the natural world from noisy observations and imperfect theories occur in almost all scientific disciplines. This book addresses these problems using examples taken from geophysical fluid dynamics. It focuses on discrete formulations, both static and time-varying, known variously as inverse, state estimation or data assimilation problems. Starting with fundamental algebraic and statistical ideas, the book guides the reader through a range of inference tools including the singular value decomposition, Gauss-Markov and minimum variance estimates, Kalman filters and related smoothers, and adjoint (Lagrange multiplier) methods. The final chapters discuss a variety of practical applications to geophysical flow problems. Discrete Inverse and State Estimation Problems is an ideal introduction to the topic for graduate students and researchers in oceanography, meteorology, climate dynamics, and geophysical fluid dynamics. It is also accessible to a wider scientific audience; the only prerequisite is an understanding of linear algebra. Provides a comprehensive introduction to discrete methods of inference from incomplete information Based upon 25 years of practical experience using real data and models Develops sequential and whole-domain analysis methods from simple least-squares Contains many examples and problems, and web-based support through MIT opencourseware
Inference of reaction rate parameters based on summary statistics from experiments
DOE Office of Scientific and Technical Information (OSTI.GOV)
Khalil, Mohammad; Chowdhary, Kamaljit Singh; Safta, Cosmin
Here, we present the results of an application of Bayesian inference and maximum entropy methods for the estimation of the joint probability density for the Arrhenius rate para meters of the rate coefficient of the H 2/O 2-mechanism chain branching reaction H + O 2 → OH + O. Available published data is in the form of summary statistics in terms of nominal values and error bars of the rate coefficient of this reaction at a number of temperature values obtained from shock-tube experiments. Our approach relies on generating data, in this case OH concentration profiles, consistent with the givenmore » summary statistics, using Approximate Bayesian Computation methods and a Markov Chain Monte Carlo procedure. The approach permits the forward propagation of parametric uncertainty through the computational model in a manner that is consistent with the published statistics. A consensus joint posterior on the parameters is obtained by pooling the posterior parameter densities given each consistent data set. To expedite this process, we construct efficient surrogates for the OH concentration using a combination of Pad'e and polynomial approximants. These surrogate models adequately represent forward model observables and their dependence on input parameters and are computationally efficient to allow their use in the Bayesian inference procedure. We also utilize Gauss-Hermite quadrature with Gaussian proposal probability density functions for moment computation resulting in orders of magnitude speedup in data likelihood evaluation. Despite the strong non-linearity in the model, the consistent data sets all res ult in nearly Gaussian conditional parameter probability density functions. The technique also accounts for nuisance parameters in the form of Arrhenius parameters of other rate coefficients with prescribed uncertainty. The resulting pooled parameter probability density function is propagated through stoichiometric hydrogen-air auto-ignition computations to illustrate the need to account for correlation among the Arrhenius rate parameters of one reaction and across rate parameters of different reactions.« less
Inference of reaction rate parameters based on summary statistics from experiments
Khalil, Mohammad; Chowdhary, Kamaljit Singh; Safta, Cosmin; ...
2016-10-15
Here, we present the results of an application of Bayesian inference and maximum entropy methods for the estimation of the joint probability density for the Arrhenius rate para meters of the rate coefficient of the H 2/O 2-mechanism chain branching reaction H + O 2 → OH + O. Available published data is in the form of summary statistics in terms of nominal values and error bars of the rate coefficient of this reaction at a number of temperature values obtained from shock-tube experiments. Our approach relies on generating data, in this case OH concentration profiles, consistent with the givenmore » summary statistics, using Approximate Bayesian Computation methods and a Markov Chain Monte Carlo procedure. The approach permits the forward propagation of parametric uncertainty through the computational model in a manner that is consistent with the published statistics. A consensus joint posterior on the parameters is obtained by pooling the posterior parameter densities given each consistent data set. To expedite this process, we construct efficient surrogates for the OH concentration using a combination of Pad'e and polynomial approximants. These surrogate models adequately represent forward model observables and their dependence on input parameters and are computationally efficient to allow their use in the Bayesian inference procedure. We also utilize Gauss-Hermite quadrature with Gaussian proposal probability density functions for moment computation resulting in orders of magnitude speedup in data likelihood evaluation. Despite the strong non-linearity in the model, the consistent data sets all res ult in nearly Gaussian conditional parameter probability density functions. The technique also accounts for nuisance parameters in the form of Arrhenius parameters of other rate coefficients with prescribed uncertainty. The resulting pooled parameter probability density function is propagated through stoichiometric hydrogen-air auto-ignition computations to illustrate the need to account for correlation among the Arrhenius rate parameters of one reaction and across rate parameters of different reactions.« less
Austin, Peter C; Goldwasser, Meredith A
2008-03-01
We examined the impact on statistical inference when a chi(2) test is used to compare the proportion of successes in the level of a categorical variable that has the highest observed proportion of successes with the proportion of successes in all other levels of the categorical variable combined. Monte Carlo simulations and a case study examining the association between astrological sign and hospitalization for heart failure. A standard chi(2) test results in an inflation of the type I error rate, with the type I error rate increasing as the number of levels of the categorical variable increases. Using a standard chi(2) test, the hospitalization rate for Pisces was statistically significantly different from that of the other 11 astrological signs combined (P=0.026). After accounting for the fact that the selection of Pisces was based on it having the highest observed proportion of heart failure hospitalizations, subjects born under the sign of Pisces no longer had a significantly higher rate of heart failure hospitalization compared to the other residents of Ontario (P=0.152). Post hoc comparisons of the proportions of successes across different levels of a categorical variable can result in incorrect inferences.
Temporal variation and scale in movement-based resource selection functions
Hooten, M.B.; Hanks, E.M.; Johnson, D.S.; Alldredge, M.W.
2013-01-01
A common population characteristic of interest in animal ecology studies pertains to the selection of resources. That is, given the resources available to animals, what do they ultimately choose to use? A variety of statistical approaches have been employed to examine this question and each has advantages and disadvantages with respect to the form of available data and the properties of estimators given model assumptions. A wealth of high resolution telemetry data are now being collected to study animal population movement and space use and these data present both challenges and opportunities for statistical inference. We summarize traditional methods for resource selection and then describe several extensions to deal with measurement uncertainty and an explicit movement process that exists in studies involving high-resolution telemetry data. Our approach uses a correlated random walk movement model to obtain temporally varying use and availability distributions that are employed in a weighted distribution context to estimate selection coefficients. The temporally varying coefficients are then weighted by their contribution to selection and combined to provide inference at the population level. The result is an intuitive and accessible statistical procedure that uses readily available software and is computationally feasible for large datasets. These methods are demonstrated using data collected as part of a large-scale mountain lion monitoring study in Colorado, USA.
TIME-DOMAIN METHODS FOR DIFFUSIVE TRANSPORT IN SOFT MATTER
Fricks, John; Yao, Lingxing; Elston, Timothy C.; Gregory Forest, And M.
2015-01-01
Passive microrheology [12] utilizes measurements of noisy, entropic fluctuations (i.e., diffusive properties) of micron-scale spheres in soft matter to infer bulk frequency-dependent loss and storage moduli. Here, we are concerned exclusively with diffusion of Brownian particles in viscoelastic media, for which the Mason-Weitz theoretical-experimental protocol is ideal, and the more challenging inference of bulk viscoelastic moduli is decoupled. The diffusive theory begins with a generalized Langevin equation (GLE) with a memory drag law specified by a kernel [7, 16, 22, 23]. We start with a discrete formulation of the GLE as an autoregressive stochastic process governing microbead paths measured by particle tracking. For the inverse problem (recovery of the memory kernel from experimental data) we apply time series analysis (maximum likelihood estimators via the Kalman filter) directly to bead position data, an alternative to formulas based on mean-squared displacement statistics in frequency space. For direct modeling, we present statistically exact GLE algorithms for individual particle paths as well as statistical correlations for displacement and velocity. Our time-domain methods rest upon a generalization of well-known results for a single-mode exponential kernel [1, 7, 22, 23] to an arbitrary M-mode exponential series, for which the GLE is transformed to a vector Ornstein-Uhlenbeck process. PMID:26412904
FUNSTAT and statistical image representations
NASA Technical Reports Server (NTRS)
Parzen, E.
1983-01-01
General ideas of functional statistical inference analysis of one sample and two samples, univariate and bivariate are outlined. ONESAM program is applied to analyze the univariate probability distributions of multi-spectral image data.
Localized Smart-Interpretation
NASA Astrophysics Data System (ADS)
Lundh Gulbrandsen, Mats; Mejer Hansen, Thomas; Bach, Torben; Pallesen, Tom
2014-05-01
The complex task of setting up a geological model consists not only of combining available geological information into a conceptual plausible model, but also requires consistency with availably data, e.g. geophysical data. However, in many cases the direct geological information, e.g borehole samples, are very sparse, so in order to create a geological model, the geologist needs to rely on the geophysical data. The problem is however, that the amount of geophysical data in many cases are so vast that it is practically impossible to integrate all of them in the manual interpretation process. This means that a lot of the information available from the geophysical surveys are unexploited, which is a problem, due to the fact that the resulting geological model does not fulfill its full potential and hence are less trustworthy. We suggest an approach to geological modeling that 1. allow all geophysical data to be considered when building the geological model 2. is fast 3. allow quantification of geological modeling. The method is constructed to build a statistical model, f(d,m), describing the relation between what the geologists interpret, d, and what the geologist knows, m. The para- meter m reflects any available information that can be quantified, such as geophysical data, the result of a geophysical inversion, elevation maps, etc... The parameter d reflects an actual interpretation, such as for example the depth to the base of a ground water reservoir. First we infer a statistical model f(d,m), by examining sets of actual interpretations made by a geological expert, [d1, d2, ...], and the information used to perform the interpretation; [m1, m2, ...]. This makes it possible to quantify how the geological expert performs interpolation through f(d,m). As the geological expert proceeds interpreting, the number of interpreted datapoints from which the statistical model is inferred increases, and therefore the accuracy of the statistical model increases. When a model f(d,m) successfully has been inferred, we are able to simulate how the geological expert would perform an interpretation given some external information m, through f(d|m). We will demonstrate this method applied on geological interpretation and densely sampled airborne electromagnetic data. In short, our goal is to build a statistical model describing how a geological expert performs geological interpretation given some geophysical data. We then wish to use this statistical model to perform semi automatic interpretation, everywhere where such geophysical data exist, in a manner consistent with the choices made by a geological expert. Benefits of such a statistical model are that 1. it provides a quantification of how a geological expert performs interpretation based on available diverse data 2. all available geophysical information can be used 3. it allows much faster interpretation of large data sets.
NASA Astrophysics Data System (ADS)
Rajabi, Mohammad Mahdi; Ataie-Ashtiani, Behzad
2016-05-01
Bayesian inference has traditionally been conceived as the proper framework for the formal incorporation of expert knowledge in parameter estimation of groundwater models. However, conventional Bayesian inference is incapable of taking into account the imprecision essentially embedded in expert provided information. In order to solve this problem, a number of extensions to conventional Bayesian inference have been introduced in recent years. One of these extensions is 'fuzzy Bayesian inference' which is the result of integrating fuzzy techniques into Bayesian statistics. Fuzzy Bayesian inference has a number of desirable features which makes it an attractive approach for incorporating expert knowledge in the parameter estimation process of groundwater models: (1) it is well adapted to the nature of expert provided information, (2) it allows to distinguishably model both uncertainty and imprecision, and (3) it presents a framework for fusing expert provided information regarding the various inputs of the Bayesian inference algorithm. However an important obstacle in employing fuzzy Bayesian inference in groundwater numerical modeling applications is the computational burden, as the required number of numerical model simulations often becomes extremely exhaustive and often computationally infeasible. In this paper, a novel approach of accelerating the fuzzy Bayesian inference algorithm is proposed which is based on using approximate posterior distributions derived from surrogate modeling, as a screening tool in the computations. The proposed approach is first applied to a synthetic test case of seawater intrusion (SWI) in a coastal aquifer. It is shown that for this synthetic test case, the proposed approach decreases the number of required numerical simulations by an order of magnitude. Then the proposed approach is applied to a real-world test case involving three-dimensional numerical modeling of SWI in Kish Island, located in the Persian Gulf. An expert elicitation methodology is developed and applied to the real-world test case in order to provide a road map for the use of fuzzy Bayesian inference in groundwater modeling applications.
A Quantum Probability Model of Causal Reasoning
Trueblood, Jennifer S.; Busemeyer, Jerome R.
2012-01-01
People can often outperform statistical methods and machine learning algorithms in situations that involve making inferences about the relationship between causes and effects. While people are remarkably good at causal reasoning in many situations, there are several instances where they deviate from expected responses. This paper examines three situations where judgments related to causal inference problems produce unexpected results and describes a quantum inference model based on the axiomatic principles of quantum probability theory that can explain these effects. Two of the three phenomena arise from the comparison of predictive judgments (i.e., the conditional probability of an effect given a cause) with diagnostic judgments (i.e., the conditional probability of a cause given an effect). The third phenomenon is a new finding examining order effects in predictive causal judgments. The quantum inference model uses the notion of incompatibility among different causes to account for all three phenomena. Psychologically, the model assumes that individuals adopt different points of view when thinking about different causes. The model provides good fits to the data and offers a coherent account for all three causal reasoning effects thus proving to be a viable new candidate for modeling human judgment. PMID:22593747
Alonso, Ariel; Van der Elst, Wim; Molenberghs, Geert; Buyse, Marc; Burzykowski, Tomasz
2015-03-01
The increasing cost of drug development has raised the demand for surrogate endpoints when evaluating new drugs in clinical trials. However, over the years, it has become clear that surrogate endpoints need to be statistically evaluated and deemed valid, before they can be used as substitutes of "true" endpoints in clinical studies. Nowadays, two paradigms, based on causal-inference and meta-analysis, dominate the scene. Nonetheless, although the literature emanating from these paradigms is wide, till now the relationship between them has largely been left unexplored. In the present work, we discuss the conceptual framework underlying both approaches and study the relationship between them using theoretical elements and the analysis of a real case study. Furthermore, we show that the meta-analytic approach can be embedded within a causal-inference framework on the one hand and that it can be heuristically justified why surrogate endpoints successfully evaluated using this approach will often be appealing from a causal-inference perspective as well, on the other. A newly developed and user friendly R package Surrogate is provided to carry out the evaluation exercise. © 2014, The International Biometric Society.
Moving beyond qualitative evaluations of Bayesian models of cognition.
Hemmer, Pernille; Tauber, Sean; Steyvers, Mark
2015-06-01
Bayesian models of cognition provide a powerful way to understand the behavior and goals of individuals from a computational point of view. Much of the focus in the Bayesian cognitive modeling approach has been on qualitative model evaluations, where predictions from the models are compared to data that is often averaged over individuals. In many cognitive tasks, however, there are pervasive individual differences. We introduce an approach to directly infer individual differences related to subjective mental representations within the framework of Bayesian models of cognition. In this approach, Bayesian data analysis methods are used to estimate cognitive parameters and motivate the inference process within a Bayesian cognitive model. We illustrate this integrative Bayesian approach on a model of memory. We apply the model to behavioral data from a memory experiment involving the recall of heights of people. A cross-validation analysis shows that the Bayesian memory model with inferred subjective priors predicts withheld data better than a Bayesian model where the priors are based on environmental statistics. In addition, the model with inferred priors at the individual subject level led to the best overall generalization performance, suggesting that individual differences are important to consider in Bayesian models of cognition.
Deep Learning for Population Genetic Inference.
Sheehan, Sara; Song, Yun S
2016-03-01
Given genomic variation data from multiple individuals, computing the likelihood of complex population genetic models is often infeasible. To circumvent this problem, we introduce a novel likelihood-free inference framework by applying deep learning, a powerful modern technique in machine learning. Deep learning makes use of multilayer neural networks to learn a feature-based function from the input (e.g., hundreds of correlated summary statistics of data) to the output (e.g., population genetic parameters of interest). We demonstrate that deep learning can be effectively employed for population genetic inference and learning informative features of data. As a concrete application, we focus on the challenging problem of jointly inferring natural selection and demography (in the form of a population size change history). Our method is able to separate the global nature of demography from the local nature of selection, without sequential steps for these two factors. Studying demography and selection jointly is motivated by Drosophila, where pervasive selection confounds demographic analysis. We apply our method to 197 African Drosophila melanogaster genomes from Zambia to infer both their overall demography, and regions of their genome under selection. We find many regions of the genome that have experienced hard sweeps, and fewer under selection on standing variation (soft sweep) or balancing selection. Interestingly, we find that soft sweeps and balancing selection occur more frequently closer to the centromere of each chromosome. In addition, our demographic inference suggests that previously estimated bottlenecks for African Drosophila melanogaster are too extreme.
Deep Learning for Population Genetic Inference
Sheehan, Sara; Song, Yun S.
2016-01-01
Given genomic variation data from multiple individuals, computing the likelihood of complex population genetic models is often infeasible. To circumvent this problem, we introduce a novel likelihood-free inference framework by applying deep learning, a powerful modern technique in machine learning. Deep learning makes use of multilayer neural networks to learn a feature-based function from the input (e.g., hundreds of correlated summary statistics of data) to the output (e.g., population genetic parameters of interest). We demonstrate that deep learning can be effectively employed for population genetic inference and learning informative features of data. As a concrete application, we focus on the challenging problem of jointly inferring natural selection and demography (in the form of a population size change history). Our method is able to separate the global nature of demography from the local nature of selection, without sequential steps for these two factors. Studying demography and selection jointly is motivated by Drosophila, where pervasive selection confounds demographic analysis. We apply our method to 197 African Drosophila melanogaster genomes from Zambia to infer both their overall demography, and regions of their genome under selection. We find many regions of the genome that have experienced hard sweeps, and fewer under selection on standing variation (soft sweep) or balancing selection. Interestingly, we find that soft sweeps and balancing selection occur more frequently closer to the centromere of each chromosome. In addition, our demographic inference suggests that previously estimated bottlenecks for African Drosophila melanogaster are too extreme. PMID:27018908
Certified Reduced Basis Model Characterization: a Frequentistic Uncertainty Framework
2011-01-11
14) It then follows that the Legendre coefficient random vector, (Z [0], Z [1], . . . , Z [I])(ω), is (I+1)– variate normally distributed with mean (δ...I. Note each two-sided inequality represents two constraints. 3. PDE-Based Statistical Inference We now proceed to the parametrized partial...appearance of defects or geometric variations relative to an initial baseline, or perhaps manufacturing departures from nominal specifications; if our
Prediction system of hydroponic plant growth and development using algorithm Fuzzy Mamdani method
NASA Astrophysics Data System (ADS)
Sudana, I. Made; Purnawirawan, Okta; Arief, Ulfa Mediaty
2017-03-01
Hydroponics is a method of farming without soil. One of the Hydroponic plants is Watercress (Nasturtium Officinale). The development and growth process of hydroponic Watercress was influenced by levels of nutrients, acidity and temperature. The independent variables can be used as input variable system to predict the value level of plants growth and development. The prediction system is using Fuzzy Algorithm Mamdani method. This system was built to implement the function of Fuzzy Inference System (Fuzzy Inference System/FIS) as a part of the Fuzzy Logic Toolbox (FLT) by using MATLAB R2007b. FIS is a computing system that works on the principle of fuzzy reasoning which is similar to humans' reasoning. Basically FIS consists of four units which are fuzzification unit, fuzzy logic reasoning unit, base knowledge unit and defuzzification unit. In addition to know the effect of independent variables on the plants growth and development that can be visualized with the function diagram of FIS output surface that is shaped three-dimensional, and statistical tests based on the data from the prediction system using multiple linear regression method, which includes multiple linear regression analysis, T test, F test, the coefficient of determination and donations predictor that are calculated using SPSS (Statistical Product and Service Solutions) software applications.
Norris, Peter M; da Silva, Arlindo M
2016-07-01
A method is presented to constrain a statistical model of sub-gridcolumn moisture variability using high-resolution satellite cloud data. The method can be used for large-scale model parameter estimation or cloud data assimilation. The gridcolumn model includes assumed probability density function (PDF) intra-layer horizontal variability and a copula-based inter-layer correlation model. The observables used in the current study are Moderate Resolution Imaging Spectroradiometer (MODIS) cloud-top pressure, brightness temperature and cloud optical thickness, but the method should be extensible to direct cloudy radiance assimilation for a small number of channels. The algorithm is a form of Bayesian inference with a Markov chain Monte Carlo (MCMC) approach to characterizing the posterior distribution. This approach is especially useful in cases where the background state is clear but cloudy observations exist. In traditional linearized data assimilation methods, a subsaturated background cannot produce clouds via any infinitesimal equilibrium perturbation, but the Monte Carlo approach is not gradient-based and allows jumps into regions of non-zero cloud probability. The current study uses a skewed-triangle distribution for layer moisture. The article also includes a discussion of the Metropolis and multiple-try Metropolis versions of MCMC.
NASA Technical Reports Server (NTRS)
Norris, Peter M.; Da Silva, Arlindo M.
2016-01-01
A method is presented to constrain a statistical model of sub-gridcolumn moisture variability using high-resolution satellite cloud data. The method can be used for large-scale model parameter estimation or cloud data assimilation. The gridcolumn model includes assumed probability density function (PDF) intra-layer horizontal variability and a copula-based inter-layer correlation model. The observables used in the current study are Moderate Resolution Imaging Spectroradiometer (MODIS) cloud-top pressure, brightness temperature and cloud optical thickness, but the method should be extensible to direct cloudy radiance assimilation for a small number of channels. The algorithm is a form of Bayesian inference with a Markov chain Monte Carlo (MCMC) approach to characterizing the posterior distribution. This approach is especially useful in cases where the background state is clear but cloudy observations exist. In traditional linearized data assimilation methods, a subsaturated background cannot produce clouds via any infinitesimal equilibrium perturbation, but the Monte Carlo approach is not gradient-based and allows jumps into regions of non-zero cloud probability. The current study uses a skewed-triangle distribution for layer moisture. The article also includes a discussion of the Metropolis and multiple-try Metropolis versions of MCMC.
Norris, Peter M.; da Silva, Arlindo M.
2018-01-01
A method is presented to constrain a statistical model of sub-gridcolumn moisture variability using high-resolution satellite cloud data. The method can be used for large-scale model parameter estimation or cloud data assimilation. The gridcolumn model includes assumed probability density function (PDF) intra-layer horizontal variability and a copula-based inter-layer correlation model. The observables used in the current study are Moderate Resolution Imaging Spectroradiometer (MODIS) cloud-top pressure, brightness temperature and cloud optical thickness, but the method should be extensible to direct cloudy radiance assimilation for a small number of channels. The algorithm is a form of Bayesian inference with a Markov chain Monte Carlo (MCMC) approach to characterizing the posterior distribution. This approach is especially useful in cases where the background state is clear but cloudy observations exist. In traditional linearized data assimilation methods, a subsaturated background cannot produce clouds via any infinitesimal equilibrium perturbation, but the Monte Carlo approach is not gradient-based and allows jumps into regions of non-zero cloud probability. The current study uses a skewed-triangle distribution for layer moisture. The article also includes a discussion of the Metropolis and multiple-try Metropolis versions of MCMC. PMID:29618847
Dazard, Jean-Eudes; Ishwaran, Hemant; Mehlotra, Rajeev; Weinberg, Aaron; Zimmerman, Peter
2018-01-01
Unraveling interactions among variables such as genetic, clinical, demographic and environmental factors is essential to understand the development of common and complex diseases. To increase the power to detect such variables interactions associated with clinical time-to-events outcomes, we borrowed established concepts from random survival forest (RSF) models. We introduce a novel RSF-based pairwise interaction estimator and derive a randomization method with bootstrap confidence intervals for inferring interaction significance. Using various linear and nonlinear time-to-events survival models in simulation studies, we first show the efficiency of our approach: true pairwise interaction-effects between variables are uncovered, while they may not be accompanied with their corresponding main-effects, and may not be detected by standard semi-parametric regression modeling and test statistics used in survival analysis. Moreover, using a RSF-based cross-validation scheme for generating prediction estimators, we show that informative predictors may be inferred. We applied our approach to an HIV cohort study recording key host gene polymorphisms and their association with HIV change of tropism or AIDS progression. Altogether, this shows how linear or nonlinear pairwise statistical interactions of variables may be efficiently detected with a predictive value in observational studies with time-to-event outcomes. PMID:29453930
Dazard, Jean-Eudes; Ishwaran, Hemant; Mehlotra, Rajeev; Weinberg, Aaron; Zimmerman, Peter
2018-02-17
Unraveling interactions among variables such as genetic, clinical, demographic and environmental factors is essential to understand the development of common and complex diseases. To increase the power to detect such variables interactions associated with clinical time-to-events outcomes, we borrowed established concepts from random survival forest (RSF) models. We introduce a novel RSF-based pairwise interaction estimator and derive a randomization method with bootstrap confidence intervals for inferring interaction significance. Using various linear and nonlinear time-to-events survival models in simulation studies, we first show the efficiency of our approach: true pairwise interaction-effects between variables are uncovered, while they may not be accompanied with their corresponding main-effects, and may not be detected by standard semi-parametric regression modeling and test statistics used in survival analysis. Moreover, using a RSF-based cross-validation scheme for generating prediction estimators, we show that informative predictors may be inferred. We applied our approach to an HIV cohort study recording key host gene polymorphisms and their association with HIV change of tropism or AIDS progression. Altogether, this shows how linear or nonlinear pairwise statistical interactions of variables may be efficiently detected with a predictive value in observational studies with time-to-event outcomes.
Data free inference with processed data products
Chowdhary, K.; Najm, H. N.
2014-07-12
Here, we consider the context of probabilistic inference of model parameters given error bars or confidence intervals on model output values, when the data is unavailable. We introduce a class of algorithms in a Bayesian framework, relying on maximum entropy arguments and approximate Bayesian computation methods, to generate consistent data with the given summary statistics. Once we obtain consistent data sets, we pool the respective posteriors, to arrive at a single, averaged density on the parameters. This approach allows us to perform accurate forward uncertainty propagation consistent with the reported statistics.
Design of experiments and data analysis challenges in calibration for forensics applications
Anderson-Cook, Christine M.; Burr, Thomas L.; Hamada, Michael S.; ...
2015-07-15
Forensic science aims to infer characteristics of source terms using measured observables. Our focus is on statistical design of experiments and data analysis challenges arising in nuclear forensics. More specifically, we focus on inferring aspects of experimental conditions (of a process to produce product Pu oxide powder), such as temperature, nitric acid concentration, and Pu concentration, using measured features of the product Pu oxide powder. The measured features, Y, include trace chemical concentrations and particle morphology such as particle size and shape of the produced Pu oxide power particles. Making inferences about the nature of inputs X that were usedmore » to create nuclear materials having particular characteristics, Y, is an inverse problem. Therefore, statistical analysis can be used to identify the best set (or sets) of Xs for a new set of observed responses Y. One can fit a model (or models) such as Υ = f(Χ) + error, for each of the responses, based on a calibration experiment and then “invert” to solve for the best set of Xs for a new set of Ys. This perspectives paper uses archived experimental data to consider aspects of data collection and experiment design for the calibration data to maximize the quality of the predicted Ys in the forward models; that is, we assume that well-estimated forward models are effective in the inverse problem. In addition, we consider how to identify a best solution for the inferred X, and evaluate the quality of the result and its robustness to a variety of initial assumptions, and different correlation structures between the responses. In addition, we also briefly review recent advances in metrology issues related to characterizing particle morphology measurements used in the response vector, Y.« less
Design of experiments and data analysis challenges in calibration for forensics applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Anderson-Cook, Christine M.; Burr, Thomas L.; Hamada, Michael S.
Forensic science aims to infer characteristics of source terms using measured observables. Our focus is on statistical design of experiments and data analysis challenges arising in nuclear forensics. More specifically, we focus on inferring aspects of experimental conditions (of a process to produce product Pu oxide powder), such as temperature, nitric acid concentration, and Pu concentration, using measured features of the product Pu oxide powder. The measured features, Y, include trace chemical concentrations and particle morphology such as particle size and shape of the produced Pu oxide power particles. Making inferences about the nature of inputs X that were usedmore » to create nuclear materials having particular characteristics, Y, is an inverse problem. Therefore, statistical analysis can be used to identify the best set (or sets) of Xs for a new set of observed responses Y. One can fit a model (or models) such as Υ = f(Χ) + error, for each of the responses, based on a calibration experiment and then “invert” to solve for the best set of Xs for a new set of Ys. This perspectives paper uses archived experimental data to consider aspects of data collection and experiment design for the calibration data to maximize the quality of the predicted Ys in the forward models; that is, we assume that well-estimated forward models are effective in the inverse problem. In addition, we consider how to identify a best solution for the inferred X, and evaluate the quality of the result and its robustness to a variety of initial assumptions, and different correlation structures between the responses. In addition, we also briefly review recent advances in metrology issues related to characterizing particle morphology measurements used in the response vector, Y.« less
Are great apes able to reason from multi-item samples to populations of food items?
Eckert, Johanna; Rakoczy, Hannes; Call, Josep
2017-10-01
Inductive learning from limited observations is a cognitive capacity of fundamental importance. In humans, it is underwritten by our intuitive statistics, the ability to draw systematic inferences from populations to randomly drawn samples and vice versa. According to recent research in cognitive development, human intuitive statistics develops early in infancy. Recent work in comparative psychology has produced first evidence for analogous cognitive capacities in great apes who flexibly drew inferences from populations to samples. In the present study, we investigated whether great apes (Pongo abelii, Pan troglodytes, Pan paniscus, Gorilla gorilla) also draw inductive inferences in the opposite direction, from samples to populations. In two experiments, apes saw an experimenter randomly drawing one multi-item sample from each of two populations of food items. The populations differed in their proportion of preferred to neutral items (24:6 vs. 6:24) but apes saw only the distribution of food items in the samples that reflected the distribution of the respective populations (e.g., 4:1 vs. 1:4). Based on this observation they were then allowed to choose between the two populations. Results show that apes seemed to make inferences from samples to populations and thus chose the population from which the more favorable (4:1) sample was drawn in Experiment 1. In this experiment, the more attractive sample not only contained proportionally but also absolutely more preferred food items than the less attractive sample. Experiment 2, however, revealed that when absolute and relative frequencies were disentangled, apes performed at chance level. Whether these limitations in apes' performance reflect true limits of cognitive competence or merely performance limitations due to accessory task demands is still an open question. © 2017 Wiley Periodicals, Inc.
Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining.
Hero, Alfred O; Rajaratnam, Bala
2016-01-01
When can reliable inference be drawn in fue "Big Data" context? This paper presents a framework for answering this fundamental question in the context of correlation mining, wifu implications for general large scale inference. In large scale data applications like genomics, connectomics, and eco-informatics fue dataset is often variable-rich but sample-starved: a regime where the number n of acquired samples (statistical replicates) is far fewer than fue number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of recent work has focused on understanding the computational complexity of proposed methods for "Big Data". Sample complexity however has received relatively less attention, especially in the setting when the sample size n is fixed, and the dimension p grows without bound. To address fuis gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where fue variable dimension is fixed and fue sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche but only the latter regime applies to exa cale data dimension. We illustrate this high dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables fua t are of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. we demonstrate various regimes of correlation mining based on the unifying perspective of high dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.
UniEnt: uniform entropy model for the dynamics of a neuronal population
NASA Astrophysics Data System (ADS)
Hernandez Lahme, Damian; Nemenman, Ilya
Sensory information and motor responses are encoded in the brain in a collective spiking activity of a large number of neurons. Understanding the neural code requires inferring statistical properties of such collective dynamics from multicellular neurophysiological recordings. Questions of whether synchronous activity or silence of multiple neurons carries information about the stimuli or the motor responses are especially interesting. Unfortunately, detection of such high order statistical interactions from data is especially challenging due to the exponentially large dimensionality of the state space of neural collectives. Here we present UniEnt, a method for the inference of strengths of multivariate neural interaction patterns. The method is based on the Bayesian prior that makes no assumptions (uniform a priori expectations) about the value of the entropy of the observed multivariate neural activity, in contrast to popular approaches that maximize this entropy. We then study previously published multi-electrode recordings data from salamander retina, exposing the relevance of higher order neural interaction patterns for information encoding in this system. This work was supported in part by Grants JSMF/220020321 and NSF/IOS/1208126.
Kirkpatrick, Robert M; McGue, Matt; Iacono, William G
2015-03-01
The present study of general cognitive ability attempts to replicate and extend previous investigations of a biometric moderator, family-of-origin socioeconomic status (SES), in a sample of 2,494 pairs of adolescent twins, non-twin biological siblings, and adoptive siblings assessed with individually administered IQ tests. We hypothesized that SES would covary positively with additive-genetic variance and negatively with shared-environmental variance. Important potential confounds unaddressed in some past studies, such as twin-specific effects, assortative mating, and differential heritability by trait level, were found to be negligible. In our main analysis, we compared models by their sample-size corrected AIC, and base our statistical inference on model-averaged point estimates and standard errors. Additive-genetic variance increased with SES-an effect that was statistically significant and robust to model specification. We found no evidence that SES moderated shared-environmental influence. We attempt to explain the inconsistent replication record of these effects, and provide suggestions for future research.
Kirkpatrick, Robert M.; McGue, Matt; Iacono, William G.
2015-01-01
The present study of general cognitive ability attempts to replicate and extend previous investigations of a biometric moderator, family-of-origin socioeconomic status (SES), in a sample of 2,494 pairs of adolescent twins, non-twin biological siblings, and adoptive siblings assessed with individually administered IQ tests. We hypothesized that SES would covary positively with additive-genetic variance and negatively with shared-environmental variance. Important potential confounds unaddressed in some past studies, such as twin-specific effects, assortative mating, and differential heritability by trait level, were found to be negligible. In our main analysis, we compared models by their sample-size corrected AIC, and base our statistical inference on model-averaged point estimates and standard errors. Additive-genetic variance increased with SES—an effect that was statistically significant and robust to model specification. We found no evidence that SES moderated shared-environmental influence. We attempt to explain the inconsistent replication record of these effects, and provide suggestions for future research. PMID:25539975
Statistics at the Chinese Universities.
1981-09-01
education in China in the postwar years is pro- vided to give some perspective. My observa- tions on statistics at the Chinese universities are necessarily...has been accepted as a member society of ISI. 3. Education in China Understanding of statistics in universities in China will be enhanced through some...programaming), Statistical Mathematics (infer- ence, data analysis, industrial statistics , information theory), tiathematical Physics (dif- ferential
Honda, Hidehito; Matsuka, Toshihiko; Ueda, Kazuhiro
2017-05-01
Some researchers on binary choice inference have argued that people make inferences based on simple heuristics, such as recognition, fluency, or familiarity. Others have argued that people make inferences based on available knowledge. To examine the boundary between heuristic and knowledge usage, we examine binary choice inference processes in terms of attribute substitution in heuristic use (Kahneman & Frederick, 2005). In this framework, it is predicted that people will rely on heuristic or knowledge-based inference depending on the subjective difficulty of the inference task. We conducted competitive tests of binary choice inference models representing simple heuristics (fluency and familiarity heuristics) and knowledge-based inference models. We found that a simple heuristic model (especially a familiarity heuristic model) explained inference patterns for subjectively difficult inference tasks, and that a knowledge-based inference model explained subjectively easy inference tasks. These results were consistent with the predictions of the attribute substitution framework. Issues on usage of simple heuristics and psychological processes are discussed. Copyright © 2016 Cognitive Science Society, Inc.
Walker, Martin; Basáñez, María-Gloria; Ouédraogo, André Lin; Hermsen, Cornelus; Bousema, Teun; Churcher, Thomas S
2015-01-16
Quantitative molecular methods (QMMs) such as quantitative real-time polymerase chain reaction (q-PCR), reverse-transcriptase PCR (qRT-PCR) and quantitative nucleic acid sequence-based amplification (QT-NASBA) are increasingly used to estimate pathogen density in a variety of clinical and epidemiological contexts. These methods are often classified as semi-quantitative, yet estimates of reliability or sensitivity are seldom reported. Here, a statistical framework is developed for assessing the reliability (uncertainty) of pathogen densities estimated using QMMs and the associated diagnostic sensitivity. The method is illustrated with quantification of Plasmodium falciparum gametocytaemia by QT-NASBA. The reliability of pathogen (e.g. gametocyte) densities, and the accompanying diagnostic sensitivity, estimated by two contrasting statistical calibration techniques, are compared; a traditional method and a mixed model Bayesian approach. The latter accounts for statistical dependence of QMM assays run under identical laboratory protocols and permits structural modelling of experimental measurements, allowing precision to vary with pathogen density. Traditional calibration cannot account for inter-assay variability arising from imperfect QMMs and generates estimates of pathogen density that have poor reliability, are variable among assays and inaccurately reflect diagnostic sensitivity. The Bayesian mixed model approach assimilates information from replica QMM assays, improving reliability and inter-assay homogeneity, providing an accurate appraisal of quantitative and diagnostic performance. Bayesian mixed model statistical calibration supersedes traditional techniques in the context of QMM-derived estimates of pathogen density, offering the potential to improve substantially the depth and quality of clinical and epidemiological inference for a wide variety of pathogens.
Efficient statistical tests to compare Youden index: accounting for contingency correlation.
Chen, Fangyao; Xue, Yuqiang; Tan, Ming T; Chen, Pingyan
2015-04-30
Youden index is widely utilized in studies evaluating accuracy of diagnostic tests and performance of predictive, prognostic, or risk models. However, both one and two independent sample tests on Youden index have been derived ignoring the dependence (association) between sensitivity and specificity, resulting in potentially misleading findings. Besides, paired sample test on Youden index is currently unavailable. This article develops efficient statistical inference procedures for one sample, independent, and paired sample tests on Youden index by accounting for contingency correlation, namely associations between sensitivity and specificity and paired samples typically represented in contingency tables. For one and two independent sample tests, the variances are estimated by Delta method, and the statistical inference is based on the central limit theory, which are then verified by bootstrap estimates. For paired samples test, we show that the estimated covariance of the two sensitivities and specificities can be represented as a function of kappa statistic so the test can be readily carried out. We then show the remarkable accuracy of the estimated variance using a constrained optimization approach. Simulation is performed to evaluate the statistical properties of the derived tests. The proposed approaches yield more stable type I errors at the nominal level and substantially higher power (efficiency) than does the original Youden's approach. Therefore, the simple explicit large sample solution performs very well. Because we can readily implement the asymptotic and exact bootstrap computation with common software like R, the method is broadly applicable to the evaluation of diagnostic tests and model performance. Copyright © 2015 John Wiley & Sons, Ltd.
Targeted estimation of nuisance parameters to obtain valid statistical inference.
van der Laan, Mark J
2014-01-01
In order to obtain concrete results, we focus on estimation of the treatment specific mean, controlling for all measured baseline covariates, based on observing independent and identically distributed copies of a random variable consisting of baseline covariates, a subsequently assigned binary treatment, and a final outcome. The statistical model only assumes possible restrictions on the conditional distribution of treatment, given the covariates, the so-called propensity score. Estimators of the treatment specific mean involve estimation of the propensity score and/or estimation of the conditional mean of the outcome, given the treatment and covariates. In order to make these estimators asymptotically unbiased at any data distribution in the statistical model, it is essential to use data-adaptive estimators of these nuisance parameters such as ensemble learning, and specifically super-learning. Because such estimators involve optimal trade-off of bias and variance w.r.t. the infinite dimensional nuisance parameter itself, they result in a sub-optimal bias/variance trade-off for the resulting real-valued estimator of the estimand. We demonstrate that additional targeting of the estimators of these nuisance parameters guarantees that this bias for the estimand is second order and thereby allows us to prove theorems that establish asymptotic linearity of the estimator of the treatment specific mean under regularity conditions. These insights result in novel targeted minimum loss-based estimators (TMLEs) that use ensemble learning with additional targeted bias reduction to construct estimators of the nuisance parameters. In particular, we construct collaborative TMLEs (C-TMLEs) with known influence curve allowing for statistical inference, even though these C-TMLEs involve variable selection for the propensity score based on a criterion that measures how effective the resulting fit of the propensity score is in removing bias for the estimand. As a particular special case, we also demonstrate the required targeting of the propensity score for the inverse probability of treatment weighted estimator using super-learning to fit the propensity score.
Frazin, Richard A
2016-04-01
A new generation of telescopes with mirror diameters of 20 m or more, called extremely large telescopes (ELTs), has the potential to provide unprecedented imaging and spectroscopy of exoplanetary systems, if the difficulties in achieving the extremely high dynamic range required to differentiate the planetary signal from the star can be overcome to a sufficient degree. Fully utilizing the potential of ELTs for exoplanet imaging will likely require simultaneous and self-consistent determination of both the planetary image and the unknown aberrations in multiple planes of the optical system, using statistical inference based on the wavefront sensor and science camera data streams. This approach promises to overcome the most important systematic errors inherent in the various schemes based on differential imaging, such as angular differential imaging and spectral differential imaging. This paper is the first in a series on this subject, in which a formalism is established for the exoplanet imaging problem, setting the stage for the statistical inference methods to follow in the future. Every effort has been made to be rigorous and complete, so that validity of approximations to be made later can be assessed. Here, the polarimetric image is expressed in terms of aberrations in the various planes of a polarizing telescope with an adaptive optics system. Further, it is shown that current methods that utilize focal plane sensing to correct the speckle field, e.g., electric field conjugation, rely on the tacit assumption that aberrations on multiple optical surfaces can be represented as aberration on a single optical surface, ultimately limiting their potential effectiveness for ground-based astronomy.
Ter Braak, Cajo J F; Peres-Neto, Pedro; Dray, Stéphane
2017-01-01
Statistical testing of trait-environment association from data is a challenge as there is no common unit of observation: the trait is observed on species, the environment on sites and the mediating abundance on species-site combinations. A number of correlation-based methods, such as the community weighted trait means method (CWM), the fourth-corner correlation method and the multivariate method RLQ, have been proposed to estimate such trait-environment associations. In these methods, valid statistical testing proceeds by performing two separate resampling tests, one site-based and the other species-based and by assessing significance by the largest of the two p -values (the p max test). Recently, regression-based methods using generalized linear models (GLM) have been proposed as a promising alternative with statistical inference via site-based resampling. We investigated the performance of this new approach along with approaches that mimicked the p max test using GLM instead of fourth-corner. By simulation using models with additional random variation in the species response to the environment, the site-based resampling tests using GLM are shown to have severely inflated type I error, of up to 90%, when the nominal level is set as 5%. In addition, predictive modelling of such data using site-based cross-validation very often identified trait-environment interactions that had no predictive value. The problem that we identify is not an "omitted variable bias" problem as it occurs even when the additional random variation is independent of the observed trait and environment data. Instead, it is a problem of ignoring a random effect. In the same simulations, the GLM-based p max test controlled the type I error in all models proposed so far in this context, but still gave slightly inflated error in more complex models that included both missing (but important) traits and missing (but important) environmental variables. For screening the importance of single trait-environment combinations, the fourth-corner test is shown to give almost the same results as the GLM-based tests in far less computing time.
Statistical modeling of software reliability
NASA Technical Reports Server (NTRS)
Miller, Douglas R.
1992-01-01
This working paper discusses the statistical simulation part of a controlled software development experiment being conducted under the direction of the System Validation Methods Branch, Information Systems Division, NASA Langley Research Center. The experiment uses guidance and control software (GCS) aboard a fictitious planetary landing spacecraft: real-time control software operating on a transient mission. Software execution is simulated to study the statistical aspects of reliability and other failure characteristics of the software during development, testing, and random usage. Quantification of software reliability is a major goal. Various reliability concepts are discussed. Experiments are described for performing simulations and collecting appropriate simulated software performance and failure data. This data is then used to make statistical inferences about the quality of the software development and verification processes as well as inferences about the reliability of software versions and reliability growth under random testing and debugging.
Quantum-Like Representation of Non-Bayesian Inference
NASA Astrophysics Data System (ADS)
Asano, M.; Basieva, I.; Khrennikov, A.; Ohya, M.; Tanaka, Y.
2013-01-01
This research is related to the problem of "irrational decision making or inference" that have been discussed in cognitive psychology. There are some experimental studies, and these statistical data cannot be described by classical probability theory. The process of decision making generating these data cannot be reduced to the classical Bayesian inference. For this problem, a number of quantum-like coginitive models of decision making was proposed. Our previous work represented in a natural way the classical Bayesian inference in the frame work of quantum mechanics. By using this representation, in this paper, we try to discuss the non-Bayesian (irrational) inference that is biased by effects like the quantum interference. Further, we describe "psychological factor" disturbing "rationality" as an "environment" correlating with the "main system" of usual Bayesian inference.
Scliar, Marilia O; Gouveia, Mateus H; Benazzo, Andrea; Ghirotto, Silvia; Fagundes, Nelson J R; Leal, Thiago P; Magalhães, Wagner C S; Pereira, Latife; Rodrigues, Maira R; Soares-Souza, Giordano B; Cabrera, Lilia; Berg, Douglas E; Gilman, Robert H; Bertorelle, Giorgio; Tarazona-Santos, Eduardo
2014-09-30
Archaeology reports millenary cultural contacts between Peruvian Coast-Andes and the Amazon Yunga, a rainforest transitional region between Andes and Lower Amazonia. To clarify the relationships between cultural and biological evolution of these populations, in particular between Amazon Yungas and Andeans, we used DNA-sequence data, a model-based Bayesian approach and several statistical validations to infer a set of demographic parameters. We found that the genetic diversity of the Shimaa (an Amazon Yunga population) is a subset of that of Quechuas from Central-Andes. Using the Isolation-with-Migration population genetics model, we inferred that the Shimaa ancestors were a small subgroup that split less than 5300 years ago (after the development of complex societies) from an ancestral Andean population. After the split, the most plausible scenario compatible with our results is that the ancestors of Shimaas moved toward the Peruvian Amazon Yunga and incorporated the culture and language of some of their neighbors, but not a substantial amount of their genes. We validated our results using Approximate Bayesian Computations, posterior predictive tests and the analysis of pseudo-observed datasets. We presented a case study in which model-based Bayesian approaches, combined with necessary statistical validations, shed light into the prehistoric demographic relationship between Andeans and a population from the Amazon Yunga. Our results offer a testable model for the peopling of this large transitional environmental region between the Andes and the Lower Amazonia. However, studies on larger samples and involving more populations of these regions are necessary to confirm if the predominant Andean biological origin of the Shimaas is the rule, and not the exception.
A simple statistical model for geomagnetic reversals
NASA Technical Reports Server (NTRS)
Constable, Catherine
1990-01-01
The diversity of paleomagnetic records of geomagnetic reversals now available indicate that the field configuration during transitions cannot be adequately described by simple zonal or standing field models. A new model described here is based on statistical properties inferred from the present field and is capable of simulating field transitions like those observed. Some insight is obtained into what one can hope to learn from paleomagnetic records. In particular, it is crucial that the effects of smoothing in the remanence acquisition process be separated from true geomagnetic field behavior. This might enable us to determine the time constants associated with the dominant field configuration during a reversal.
Why environmental scientists are becoming Bayesians
James S. Clark
2005-01-01
Advances in computational statistics provide a general framework for the high dimensional models typically needed for ecological inference and prediction. Hierarchical Bayes (HB) represents a modelling structure with capacity to exploit diverse sources of information, to accommodate influences that are unknown (or unknowable), and to draw inference on large numbers of...
Cross-Situational Learning of Minimal Word Pairs
ERIC Educational Resources Information Center
Escudero, Paola; Mulak, Karen E.; Vlach, Haley A.
2016-01-01
"Cross-situational statistical learning" of words involves tracking co-occurrences of auditory words and objects across time to infer word-referent mappings. Previous research has demonstrated that learners can infer referents across sets of very phonologically distinct words (e.g., WUG, DAX), but it remains unknown whether learners can…
Accounting for measurement error: a critical but often overlooked process.
Harris, Edward F; Smith, Richard N
2009-12-01
Due to instrument imprecision and human inconsistencies, measurements are not free of error. Technical error of measurement (TEM) is the variability encountered between dimensions when the same specimens are measured at multiple sessions. A goal of a data collection regimen is to minimise TEM. The few studies that actually quantify TEM, regardless of discipline, report that it is substantial and can affect results and inferences. This paper reviews some statistical approaches for identifying and controlling TEM. Statistically, TEM is part of the residual ('unexplained') variance in a statistical test, so accounting for TEM, which requires repeated measurements, enhances the chances of finding a statistically significant difference if one exists. The aim of this paper was to review and discuss common statistical designs relating to types of error and statistical approaches to error accountability. This paper addresses issues of landmark location, validity, technical and systematic error, analysis of variance, scaled measures and correlation coefficients in order to guide the reader towards correct identification of true experimental differences. Researchers commonly infer characteristics about populations from comparatively restricted study samples. Most inferences are statistical and, aside from concerns about adequate accounting for known sources of variation with the research design, an important source of variability is measurement error. Variability in locating landmarks that define variables is obvious in odontometrics, cephalometrics and anthropometry, but the same concerns about measurement accuracy and precision extend to all disciplines. With increasing accessibility to computer-assisted methods of data collection, the ease of incorporating repeated measures into statistical designs has improved. Accounting for this technical source of variation increases the chance of finding biologically true differences when they exist.
Using a Five-Step Procedure for Inferential Statistical Analyses
ERIC Educational Resources Information Center
Kamin, Lawrence F.
2010-01-01
Many statistics texts pose inferential statistical problems in a disjointed way. By using a simple five-step procedure as a template for statistical inference problems, the student can solve problems in an organized fashion. The problem and its solution will thus be a stand-by-itself organic whole and a single unit of thought and effort. The…
Northern Russian chironomid-based modern summer temperature data set and inference models
NASA Astrophysics Data System (ADS)
Nazarova, Larisa; Self, Angela E.; Brooks, Stephen J.; van Hardenbroek, Maarten; Herzschuh, Ulrike; Diekmann, Bernhard
2015-11-01
West and East Siberian data sets and 55 new sites were merged based on the high taxonomic similarity, and the strong relationship between mean July air temperature and the distribution of chironomid taxa in both data sets compared with other environmental parameters. Multivariate statistical analysis of chironomid and environmental data from the combined data set consisting of 268 lakes, located in northern Russia, suggests that mean July air temperature explains the greatest amount of variance in chironomid distribution compared with other measured variables (latitude, longitude, altitude, water depth, lake surface area, pH, conductivity, mean January air temperature, mean July air temperature, and continentality). We established two robust inference models to reconstruct mean summer air temperatures from subfossil chironomids based on ecological and geographical approaches. The North Russian 2-component WA-PLS model (RMSEPJack = 1.35 °C, rJack2 = 0.87) can be recommended for application in palaeoclimatic studies in northern Russia. Based on distinctive chironomid fauna and climatic regimes of Kamchatka the Far East 2-component WAPLS model (RMSEPJack = 1.3 °C, rJack2 = 0.81) has potentially better applicability in Kamchatka.
On Statistical Analysis of Neuroimages with Imperfect Registration
Kim, Won Hwa; Ravi, Sathya N.; Johnson, Sterling C.; Okonkwo, Ozioma C.; Singh, Vikas
2016-01-01
A variety of studies in neuroscience/neuroimaging seek to perform statistical inference on the acquired brain image scans for diagnosis as well as understanding the pathological manifestation of diseases. To do so, an important first step is to register (or co-register) all of the image data into a common coordinate system. This permits meaningful comparison of the intensities at each voxel across groups (e.g., diseased versus healthy) to evaluate the effects of the disease and/or use machine learning algorithms in a subsequent step. But errors in the underlying registration make this problematic, they either decrease the statistical power or make the follow-up inference tasks less effective/accurate. In this paper, we derive a novel algorithm which offers immunity to local errors in the underlying deformation field obtained from registration procedures. By deriving a deformation invariant representation of the image, the downstream analysis can be made more robust as if one had access to a (hypothetical) far superior registration procedure. Our algorithm is based on recent work on scattering transform. Using this as a starting point, we show how results from harmonic analysis (especially, non-Euclidean wavelets) yields strategies for designing deformation and additive noise invariant representations of large 3-D brain image volumes. We present a set of results on synthetic and real brain images where we achieve robust statistical analysis even in the presence of substantial deformation errors; here, standard analysis procedures significantly under-perform and fail to identify the true signal. PMID:27042168
Bayesian inference for joint modelling of longitudinal continuous, binary and ordinal events.
Li, Qiuju; Pan, Jianxin; Belcher, John
2016-12-01
In medical studies, repeated measurements of continuous, binary and ordinal outcomes are routinely collected from the same patient. Instead of modelling each outcome separately, in this study we propose to jointly model the trivariate longitudinal responses, so as to take account of the inherent association between the different outcomes and thus improve statistical inferences. This work is motivated by a large cohort study in the North West of England, involving trivariate responses from each patient: Body Mass Index, Depression (Yes/No) ascertained with cut-off score not less than 8 at the Hospital Anxiety and Depression Scale, and Pain Interference generated from the Medical Outcomes Study 36-item short-form health survey with values returned on an ordinal scale 1-5. There are some well-established methods for combined continuous and binary, or even continuous and ordinal responses, but little work was done on the joint analysis of continuous, binary and ordinal responses. We propose conditional joint random-effects models, which take into account the inherent association between the continuous, binary and ordinal outcomes. Bayesian analysis methods are used to make statistical inferences. Simulation studies show that, by jointly modelling the trivariate outcomes, standard deviations of the estimates of parameters in the models are smaller and much more stable, leading to more efficient parameter estimates and reliable statistical inferences. In the real data analysis, the proposed joint analysis yields a much smaller deviance information criterion value than the separate analysis, and shows other good statistical properties too. © The Author(s) 2014.
New methods in iris recognition.
Daugman, John
2007-10-01
This paper presents the following four advances in iris recognition: 1) more disciplined methods for detecting and faithfully modeling the iris inner and outer boundaries with active contours, leading to more flexible embedded coordinate systems; 2) Fourier-based methods for solving problems in iris trigonometry and projective geometry, allowing off-axis gaze to be handled by detecting it and "rotating" the eye into orthographic perspective; 3) statistical inference methods for detecting and excluding eyelashes; and 4) exploration of score normalizations, depending on the amount of iris data that is available in images and the required scale of database search. Statistical results are presented based on 200 billion iris cross-comparisons that were generated from 632500 irises in the United Arab Emirates database to analyze the normalization issues raised in different regions of receiver operating characteristic curves.
Confidence crisis of results in biomechanics research.
Knudson, Duane
2017-11-01
Many biomechanics studies have small sample sizes and incorrect statistical analyses, so reporting of inaccurate inferences and inflated magnitude of effects are common in the field. This review examines these issues in biomechanics research and summarises potential solutions from research in other fields to increase the confidence in the experimental effects reported in biomechanics. Authors, reviewers and editors of biomechanics research reports are encouraged to improve sample sizes and the resulting statistical power, improve reporting transparency, improve the rigour of statistical analyses used, and increase the acceptance of replication studies to improve the validity of inferences from data in biomechanics research. The application of sports biomechanics research results would also improve if a larger percentage of unbiased effects and their uncertainty were reported in the literature.
The Empirical Nature and Statistical Treatment of Missing Data
ERIC Educational Resources Information Center
Tannenbaum, Christyn E.
2009-01-01
Introduction. Missing data is a common problem in research and can produce severely misleading analyses, including biased estimates of statistical parameters, and erroneous conclusions. In its 1999 report, the APA Task Force on Statistical Inference encouraged authors to report complications such as missing data and discouraged the use of…
The Role of the Sampling Distribution in Understanding Statistical Inference
ERIC Educational Resources Information Center
Lipson, Kay
2003-01-01
Many statistics educators believe that few students develop the level of conceptual understanding essential for them to apply correctly the statistical techniques at their disposal and to interpret their outcomes appropriately. It is also commonly believed that the sampling distribution plays an important role in developing this understanding.…
Karl Pearson and eugenics: personal opinions and scientific rigor.
Delzell, Darcie A P; Poliak, Cathy D
2013-09-01
The influence of personal opinions and biases on scientific conclusions is a threat to the advancement of knowledge. Expertise and experience does not render one immune to this temptation. In this work, one of the founding fathers of statistics, Karl Pearson, is used as an illustration of how even the most talented among us can produce misleading results when inferences are made without caution or reference to potential bias and other analysis limitations. A study performed by Pearson on British Jewish schoolchildren is examined in light of ethical and professional statistical practice. The methodology used and inferences made by Pearson and his coauthor are sometimes questionable and offer insight into how Pearson's support of eugenics and his own British nationalism could have potentially influenced his often careless and far-fetched inferences. A short background into Pearson's work and beliefs is provided, along with an in-depth examination of the authors' overall experimental design and statistical practices. In addition, portions of the study regarding intelligence and tuberculosis are discussed in more detail, along with historical reactions to their work.
NASA Technical Reports Server (NTRS)
Abbey, Craig K.; Eckstein, Miguel P.
2002-01-01
We consider estimation and statistical hypothesis testing on classification images obtained from the two-alternative forced-choice experimental paradigm. We begin with a probabilistic model of task performance for simple forced-choice detection and discrimination tasks. Particular attention is paid to general linear filter models because these models lead to a direct interpretation of the classification image as an estimate of the filter weights. We then describe an estimation procedure for obtaining classification images from observer data. A number of statistical tests are presented for testing various hypotheses from classification images based on some more compact set of features derived from them. As an example of how the methods we describe can be used, we present a case study investigating detection of a Gaussian bump profile.
Experimental validation of the RATE tool for inferring HLA restrictions of T cell epitopes.
Paul, Sinu; Arlehamn, Cecilia S Lindestam; Schulten, Veronique; Westernberg, Luise; Sidney, John; Peters, Bjoern; Sette, Alessandro
2017-06-21
The RATE tool was recently developed to computationally infer the HLA restriction of given epitopes from immune response data of HLA typed subjects without additional cumbersome experimentation. Here, RATE was validated using experimentally defined restriction data from a set of 191 tuberculosis-derived epitopes and 63 healthy individuals with MTB infection from the Western Cape Region of South Africa. Using this experimental dataset, the parameters utilized by the RATE tool to infer restriction were optimized, which included relative frequency (RF) of the subjects responding to a given epitope and expressing a given allele as compared to the general test population and the associated p-value in a Fisher's exact test. We also examined the potential for further optimization based on the predicted binding affinity of epitopes to potential restricting HLA alleles, and the absolute number of individuals expressing a given allele and responding to the specific epitope. Different statistical measures, including Matthew's correlation coefficient, accuracy, sensitivity and specificity were used to evaluate performance of RATE as a function of these criteria. Based on our results we recommend selection of HLA restrictions with cutoffs of p-value < 0.01 and RF ≥ 1.3. The usefulness of the tool was demonstrated by inferring new HLA restrictions for epitope sets where restrictions could not be experimentally determined due to lack of necessary cell lines and for an additional data set related to recognition of pollen derived epitopes from allergic patients. Experimental data sets were used to validate RATE tool and the parameters used by the RATE tool to infer restriction were optimized. New HLA restrictions were identified using the optimized RATE tool.
Learning abstract visual concepts via probabilistic program induction in a Language of Thought.
Overlan, Matthew C; Jacobs, Robert A; Piantadosi, Steven T
2017-11-01
The ability to learn abstract concepts is a powerful component of human cognition. It has been argued that variable binding is the key element enabling this ability, but the computational aspects of variable binding remain poorly understood. Here, we address this shortcoming by formalizing the Hierarchical Language of Thought (HLOT) model of rule learning. Given a set of data items, the model uses Bayesian inference to infer a probability distribution over stochastic programs that implement variable binding. Because the model makes use of symbolic variables as well as Bayesian inference and programs with stochastic primitives, it combines many of the advantages of both symbolic and statistical approaches to cognitive modeling. To evaluate the model, we conducted an experiment in which human subjects viewed training items and then judged which test items belong to the same concept as the training items. We found that the HLOT model provides a close match to human generalization patterns, significantly outperforming two variants of the Generalized Context Model, one variant based on string similarity and the other based on visual similarity using features from a deep convolutional neural network. Additional results suggest that variable binding happens automatically, implying that binding operations do not add complexity to peoples' hypothesized rules. Overall, this work demonstrates that a cognitive model combining symbolic variables with Bayesian inference and stochastic program primitives provides a new perspective for understanding people's patterns of generalization. Copyright © 2017 Elsevier B.V. All rights reserved.
Yu, Jihnhee; Yang, Luge; Vexler, Albert; Hutson, Alan D
2016-06-15
The receiver operating characteristic (ROC) curve is a popular technique with applications, for example, investigating an accuracy of a biomarker to delineate between disease and non-disease groups. A common measure of accuracy of a given diagnostic marker is the area under the ROC curve (AUC). In contrast with the AUC, the partial area under the ROC curve (pAUC) looks into the area with certain specificities (i.e., true negative rate) only, and it can be often clinically more relevant than examining the entire ROC curve. The pAUC is commonly estimated based on a U-statistic with the plug-in sample quantile, making the estimator a non-traditional U-statistic. In this article, we propose an accurate and easy method to obtain the variance of the nonparametric pAUC estimator. The proposed method is easy to implement for both one biomarker test and the comparison of two correlated biomarkers because it simply adapts the existing variance estimator of U-statistics. In this article, we show accuracy and other advantages of the proposed variance estimation method by broadly comparing it with previously existing methods. Further, we develop an empirical likelihood inference method based on the proposed variance estimator through a simple implementation. In an application, we demonstrate that, depending on the inferences by either the AUC or pAUC, we can make a different decision on a prognostic ability of a same set of biomarkers. Copyright © 2016 John Wiley & Sons, Ltd. Copyright © 2016 John Wiley & Sons, Ltd.
Mai, Lan-Yin; Li, Yi-Xuan; Chen, Yong; Xie, Zhen; Li, Jie; Zhong, Ming-Yu
2014-05-01
The compatibility of traditional Chinese medicines (TCMs) formulae containing enormous information, is a complex component system. Applications of mathematical statistics methods on the compatibility researches of traditional Chinese medicines formulae have great significance for promoting the modernization of traditional Chinese medicines and improving clinical efficacies and optimizations of formulae. As a tool for quantitative analysis, data inference and exploring inherent rules of substances, the mathematical statistics method can be used to reveal the working mechanisms of the compatibility of traditional Chinese medicines formulae in qualitatively and quantitatively. By reviewing studies based on the applications of mathematical statistics methods, this paper were summarized from perspective of dosages optimization, efficacies and changes of chemical components as well as the rules of incompatibility and contraindication of formulae, will provide the references for further studying and revealing the working mechanisms and the connotations of traditional Chinese medicines.
On the Spike Train Variability Characterized by Variance-to-Mean Power Relationship.
Koyama, Shinsuke
2015-07-01
We propose a statistical method for modeling the non-Poisson variability of spike trains observed in a wide range of brain regions. Central to our approach is the assumption that the variance and the mean of interspike intervals are related by a power function characterized by two parameters: the scale factor and exponent. It is shown that this single assumption allows the variability of spike trains to have an arbitrary scale and various dependencies on the firing rate in the spike count statistics, as well as in the interval statistics, depending on the two parameters of the power function. We also propose a statistical model for spike trains that exhibits the variance-to-mean power relationship. Based on this, a maximum likelihood method is developed for inferring the parameters from rate-modulated spike trains. The proposed method is illustrated on simulated and experimental spike trains.
Spatio-temporal conditional inference and hypothesis tests for neural ensemble spiking precision
Harrison, Matthew T.; Amarasingham, Asohan; Truccolo, Wilson
2014-01-01
The collective dynamics of neural ensembles create complex spike patterns with many spatial and temporal scales. Understanding the statistical structure of these patterns can help resolve fundamental questions about neural computation and neural dynamics. Spatio-temporal conditional inference (STCI) is introduced here as a semiparametric statistical framework for investigating the nature of precise spiking patterns from collections of neurons that is robust to arbitrarily complex and nonstationary coarse spiking dynamics. The main idea is to focus statistical modeling and inference, not on the full distribution of the data, but rather on families of conditional distributions of precise spiking given different types of coarse spiking. The framework is then used to develop families of hypothesis tests for probing the spatio-temporal precision of spiking patterns. Relationships among different conditional distributions are used to improve multiple hypothesis testing adjustments and to design novel Monte Carlo spike resampling algorithms. Of special note are algorithms that can locally jitter spike times while still preserving the instantaneous peri-stimulus time histogram (PSTH) or the instantaneous total spike count from a group of recorded neurons. The framework can also be used to test whether first-order maximum entropy models with possibly random and time-varying parameters can account for observed patterns of spiking. STCI provides a detailed example of the generic principle of conditional inference, which may be applicable in other areas of neurostatistical analysis. PMID:25380339
Reliability of dose volume constraint inference from clinical data.
Lutz, C M; Møller, D S; Hoffmann, L; Knap, M M; Alber, M
2017-04-21
Dose volume histogram points (DVHPs) frequently serve as dose constraints in radiotherapy treatment planning. An experiment was designed to investigate the reliability of DVHP inference from clinical data for multiple cohort sizes and complication incidence rates. The experimental background was radiation pneumonitis in non-small cell lung cancer and the DVHP inference method was based on logistic regression. From 102 NSCLC real-life dose distributions and a postulated DVHP model, an 'ideal' cohort was generated where the most predictive model was equal to the postulated model. A bootstrap and a Cohort Replication Monte Carlo (CoRepMC) approach were applied to create 1000 equally sized populations each. The cohorts were then analyzed to establish inference frequency distributions. This was applied to nine scenarios for cohort sizes of 102 (1), 500 (2) to 2000 (3) patients (by sampling with replacement) and three postulated DVHP models. The Bootstrap was repeated for a 'non-ideal' cohort, where the most predictive model did not coincide with the postulated model. The Bootstrap produced chaotic results for all models of cohort size 1 for both the ideal and non-ideal cohorts. For cohort size 2 and 3, the distributions for all populations were more concentrated around the postulated DVHP. For the CoRepMC, the inference frequency increased with cohort size and incidence rate. Correct inference rates >[Formula: see text] were only achieved by cohorts with more than 500 patients. Both Bootstrap and CoRepMC indicate that inference of the correct or approximate DVHP for typical cohort sizes is highly uncertain. CoRepMC results were less spurious than Bootstrap results, demonstrating the large influence that randomness in dose-response has on the statistical analysis.
Reliability of dose volume constraint inference from clinical data
NASA Astrophysics Data System (ADS)
Lutz, C. M.; Møller, D. S.; Hoffmann, L.; Knap, M. M.; Alber, M.
2017-04-01
Dose volume histogram points (DVHPs) frequently serve as dose constraints in radiotherapy treatment planning. An experiment was designed to investigate the reliability of DVHP inference from clinical data for multiple cohort sizes and complication incidence rates. The experimental background was radiation pneumonitis in non-small cell lung cancer and the DVHP inference method was based on logistic regression. From 102 NSCLC real-life dose distributions and a postulated DVHP model, an ‘ideal’ cohort was generated where the most predictive model was equal to the postulated model. A bootstrap and a Cohort Replication Monte Carlo (CoRepMC) approach were applied to create 1000 equally sized populations each. The cohorts were then analyzed to establish inference frequency distributions. This was applied to nine scenarios for cohort sizes of 102 (1), 500 (2) to 2000 (3) patients (by sampling with replacement) and three postulated DVHP models. The Bootstrap was repeated for a ‘non-ideal’ cohort, where the most predictive model did not coincide with the postulated model. The Bootstrap produced chaotic results for all models of cohort size 1 for both the ideal and non-ideal cohorts. For cohort size 2 and 3, the distributions for all populations were more concentrated around the postulated DVHP. For the CoRepMC, the inference frequency increased with cohort size and incidence rate. Correct inference rates >85 % were only achieved by cohorts with more than 500 patients. Both Bootstrap and CoRepMC indicate that inference of the correct or approximate DVHP for typical cohort sizes is highly uncertain. CoRepMC results were less spurious than Bootstrap results, demonstrating the large influence that randomness in dose-response has on the statistical analysis.
Data mining and statistical inference in selective laser melting
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kamath, Chandrika
Selective laser melting (SLM) is an additive manufacturing process that builds a complex three-dimensional part, layer-by-layer, using a laser beam to fuse fine metal powder together. The design freedom afforded by SLM comes associated with complexity. As the physical phenomena occur over a broad range of length and time scales, the computational cost of modeling the process is high. At the same time, the large number of parameters that control the quality of a part make experiments expensive. In this paper, we describe ways in which we can use data mining and statistical inference techniques to intelligently combine simulations andmore » experiments to build parts with desired properties. We start with a brief summary of prior work in finding process parameters for high-density parts. We then expand on this work to show how we can improve the approach by using feature selection techniques to identify important variables, data-driven surrogate models to reduce computational costs, improved sampling techniques to cover the design space adequately, and uncertainty analysis for statistical inference. Here, our results indicate that techniques from data mining and statistics can complement those from physical modeling to provide greater insight into complex processes such as selective laser melting.« less
Data mining and statistical inference in selective laser melting
Kamath, Chandrika
2016-01-11
Selective laser melting (SLM) is an additive manufacturing process that builds a complex three-dimensional part, layer-by-layer, using a laser beam to fuse fine metal powder together. The design freedom afforded by SLM comes associated with complexity. As the physical phenomena occur over a broad range of length and time scales, the computational cost of modeling the process is high. At the same time, the large number of parameters that control the quality of a part make experiments expensive. In this paper, we describe ways in which we can use data mining and statistical inference techniques to intelligently combine simulations andmore » experiments to build parts with desired properties. We start with a brief summary of prior work in finding process parameters for high-density parts. We then expand on this work to show how we can improve the approach by using feature selection techniques to identify important variables, data-driven surrogate models to reduce computational costs, improved sampling techniques to cover the design space adequately, and uncertainty analysis for statistical inference. Here, our results indicate that techniques from data mining and statistics can complement those from physical modeling to provide greater insight into complex processes such as selective laser melting.« less
Watanabe, Hiroshi
2012-01-01
Procedures of statistical analysis are reviewed to provide an overview of applications of statistics for general use. Topics that are dealt with are inference on a population, comparison of two populations with respect to means and probabilities, and multiple comparisons. This study is the second part of series in which we survey medical statistics. Arguments related to statistical associations and regressions will be made in subsequent papers.
PROBABILITY SAMPLING AND POPULATION INFERENCE IN MONITORING PROGRAMS
A fundamental difference between probability sampling and conventional statistics is that "sampling" deals with real, tangible populations, whereas "conventional statistics" usually deals with hypothetical populations that have no real-world realization. he focus here is on real ...
Statistical Inference in the Learning of Novel Phonetic Categories
ERIC Educational Resources Information Center
Zhao, Yuan
2010-01-01
Learning a phonetic category (or any linguistic category) requires integrating different sources of information. A crucial unsolved problem for phonetic learning is how this integration occurs: how can we update our previous knowledge about a phonetic category as we hear new exemplars of the category? One model of learning is Bayesian Inference,…
Campbell's and Rubin's Perspectives on Causal Inference
ERIC Educational Resources Information Center
West, Stephen G.; Thoemmes, Felix
2010-01-01
Donald Campbell's approach to causal inference (D. T. Campbell, 1957; W. R. Shadish, T. D. Cook, & D. T. Campbell, 2002) is widely used in psychology and education, whereas Donald Rubin's causal model (P. W. Holland, 1986; D. B. Rubin, 1974, 2005) is widely used in economics, statistics, medicine, and public health. Campbell's approach focuses on…
Direct Evidence for a Dual Process Model of Deductive Inference
ERIC Educational Resources Information Center
Markovits, Henry; Brunet, Marie-Laurence; Thompson, Valerie; Brisson, Janie
2013-01-01
In 2 experiments, we tested a strong version of a dual process theory of conditional inference (cf. Verschueren et al., 2005a, 2005b) that assumes that most reasoners have 2 strategies available, the choice of which is determined by situational variables, cognitive capacity, and metacognitive control. The statistical strategy evaluates inferences…
The Role of Probability in Developing Learners' Models of Simulation Approaches to Inference
ERIC Educational Resources Information Center
Lee, Hollylynne S.; Doerr, Helen M.; Tran, Dung; Lovett, Jennifer N.
2016-01-01
Repeated sampling approaches to inference that rely on simulations have recently gained prominence in statistics education, and probabilistic concepts are at the core of this approach. In this approach, learners need to develop a mapping among the problem situation, a physical enactment, computer representations, and the underlying randomization…
From Blickets to Synapses: Inferring Temporal Causal Networks by Observation
ERIC Educational Resources Information Center
Fernando, Chrisantha
2013-01-01
How do human infants learn the causal dependencies between events? Evidence suggests that this remarkable feat can be achieved by observation of only a handful of examples. Many computational models have been produced to explain how infants perform causal inference without explicit teaching about statistics or the scientific method. Here, we…
It's a Girl! Random Numbers, Simulations, and the Law of Large Numbers
ERIC Educational Resources Information Center
Goodwin, Chris; Ortiz, Enrique
2015-01-01
Modeling using mathematics and making inferences about mathematical situations are becoming more prevalent in most fields of study. Descriptive statistics cannot be used to generalize about a population or make predictions of what can occur. Instead, inference must be used. Simulation and sampling are essential in building a foundation for…
Thou Shalt Not Bear False Witness against Null Hypothesis Significance Testing
ERIC Educational Resources Information Center
García-Pérez, Miguel A.
2017-01-01
Null hypothesis significance testing (NHST) has been the subject of debate for decades and alternative approaches to data analysis have been proposed. This article addresses this debate from the perspective of scientific inquiry and inference. Inference is an inverse problem and application of statistical methods cannot reveal whether effects…
Krefeld-Schwalb, Antonia; Witte, Erich H.; Zenker, Frank
2018-01-01
In psychology as elsewhere, the main statistical inference strategy to establish empirical effects is null-hypothesis significance testing (NHST). The recent failure to replicate allegedly well-established NHST-results, however, implies that such results lack sufficient statistical power, and thus feature unacceptably high error-rates. Using data-simulation to estimate the error-rates of NHST-results, we advocate the research program strategy (RPS) as a superior methodology. RPS integrates Frequentist with Bayesian inference elements, and leads from a preliminary discovery against a (random) H0-hypothesis to a statistical H1-verification. Not only do RPS-results feature significantly lower error-rates than NHST-results, RPS also addresses key-deficits of a “pure” Frequentist and a standard Bayesian approach. In particular, RPS aggregates underpowered results safely. RPS therefore provides a tool to regain the trust the discipline had lost during the ongoing replicability-crisis. PMID:29740363
NIRS-SPM: statistical parametric mapping for near infrared spectroscopy
NASA Astrophysics Data System (ADS)
Tak, Sungho; Jang, Kwang Eun; Jung, Jinwook; Jang, Jaeduck; Jeong, Yong; Ye, Jong Chul
2008-02-01
Even though there exists a powerful statistical parametric mapping (SPM) tool for fMRI, similar public domain tools are not available for near infrared spectroscopy (NIRS). In this paper, we describe a new public domain statistical toolbox called NIRS-SPM for quantitative analysis of NIRS signals. Specifically, NIRS-SPM statistically analyzes the NIRS data using GLM and makes inference as the excursion probability which comes from the random field that are interpolated from the sparse measurement. In order to obtain correct inference, NIRS-SPM offers the pre-coloring and pre-whitening method for temporal correlation estimation. For simultaneous recording NIRS signal with fMRI, the spatial mapping between fMRI image and real coordinate in 3-D digitizer is estimated using Horn's algorithm. These powerful tools allows us the super-resolution localization of the brain activation which is not possible using the conventional NIRS analysis tools.
Krefeld-Schwalb, Antonia; Witte, Erich H; Zenker, Frank
2018-01-01
In psychology as elsewhere, the main statistical inference strategy to establish empirical effects is null-hypothesis significance testing (NHST). The recent failure to replicate allegedly well-established NHST-results, however, implies that such results lack sufficient statistical power, and thus feature unacceptably high error-rates. Using data-simulation to estimate the error-rates of NHST-results, we advocate the research program strategy (RPS) as a superior methodology. RPS integrates Frequentist with Bayesian inference elements, and leads from a preliminary discovery against a (random) H 0 -hypothesis to a statistical H 1 -verification. Not only do RPS-results feature significantly lower error-rates than NHST-results, RPS also addresses key-deficits of a "pure" Frequentist and a standard Bayesian approach. In particular, RPS aggregates underpowered results safely. RPS therefore provides a tool to regain the trust the discipline had lost during the ongoing replicability-crisis.
Testing Genetic Pleiotropy with GWAS Summary Statistics for Marginal and Conditional Analyses.
Deng, Yangqing; Pan, Wei
2017-12-01
There is growing interest in testing genetic pleiotropy, which is when a single genetic variant influences multiple traits. Several methods have been proposed; however, these methods have some limitations. First, all the proposed methods are based on the use of individual-level genotype and phenotype data; in contrast, for logistical, and other, reasons, summary statistics of univariate SNP-trait associations are typically only available based on meta- or mega-analyzed large genome-wide association study (GWAS) data. Second, existing tests are based on marginal pleiotropy, which cannot distinguish between direct and indirect associations of a single genetic variant with multiple traits due to correlations among the traits. Hence, it is useful to consider conditional analysis, in which a subset of traits is adjusted for another subset of traits. For example, in spite of substantial lowering of low-density lipoprotein cholesterol (LDL) with statin therapy, some patients still maintain high residual cardiovascular risk, and, for these patients, it might be helpful to reduce their triglyceride (TG) level. For this purpose, in order to identify new therapeutic targets, it would be useful to identify genetic variants with pleiotropic effects on LDL and TG after adjusting the latter for LDL; otherwise, a pleiotropic effect of a genetic variant detected by a marginal model could simply be due to its association with LDL only, given the well-known correlation between the two types of lipids. Here, we develop a new pleiotropy testing procedure based only on GWAS summary statistics that can be applied for both marginal analysis and conditional analysis. Although the main technical development is based on published union-intersection testing methods, care is needed in specifying conditional models to avoid invalid statistical estimation and inference. In addition to the previously used likelihood ratio test, we also propose using generalized estimating equations under the working independence model for robust inference. We provide numerical examples based on both simulated and real data, including two large lipid GWAS summary association datasets based on ∼100,000 and ∼189,000 samples, respectively, to demonstrate the difference between marginal and conditional analyses, as well as the effectiveness of our new approach. Copyright © 2017 by the Genetics Society of America.
NASA Astrophysics Data System (ADS)
Knuth, K. H.
2001-05-01
We consider the application of Bayesian inference to the study of self-organized structures in complex adaptive systems. In particular, we examine the distribution of elements, agents, or processes in systems dominated by hierarchical structure. We demonstrate that results obtained by Caianiello [1] on Hierarchical Modular Systems (HMS) can be found by applying Jaynes' Principle of Group Invariance [2] to a few key assumptions about our knowledge of hierarchical organization. Subsequent application of the Principle of Maximum Entropy allows inferences to be made about specific systems. The utility of the Bayesian method is considered by examining both successes and failures of the hierarchical model. We discuss how Caianiello's original statements suffer from the Mind Projection Fallacy [3] and we restate his assumptions thus widening the applicability of the HMS model. The relationship between inference and statistical physics, described by Jaynes [4], is reiterated with the expectation that this realization will aid the field of complex systems research by moving away from often inappropriate direct application of statistical mechanics to a more encompassing inferential methodology.
Inferring Admixture Histories of Human Populations Using Linkage Disequilibrium
Loh, Po-Ru; Lipson, Mark; Patterson, Nick; Moorjani, Priya; Pickrell, Joseph K.; Reich, David; Berger, Bonnie
2013-01-01
Long-range migrations and the resulting admixtures between populations have been important forces shaping human genetic diversity. Most existing methods for detecting and reconstructing historical admixture events are based on allele frequency divergences or patterns of ancestry segments in chromosomes of admixed individuals. An emerging new approach harnesses the exponential decay of admixture-induced linkage disequilibrium (LD) as a function of genetic distance. Here, we comprehensively develop LD-based inference into a versatile tool for investigating admixture. We present a new weighted LD statistic that can be used to infer mixture proportions as well as dates with fewer constraints on reference populations than previous methods. We define an LD-based three-population test for admixture and identify scenarios in which it can detect admixture events that previous formal tests cannot. We further show that we can uncover phylogenetic relationships among populations by comparing weighted LD curves obtained using a suite of references. Finally, we describe several improvements to the computation and fitting of weighted LD curves that greatly increase the robustness and speed of the calculations. We implement all of these advances in a software package, ALDER, which we validate in simulations and apply to test for admixture among all populations from the Human Genome Diversity Project (HGDP), highlighting insights into the admixture history of Central African Pygmies, Sardinians, and Japanese. PMID:23410830
ERIC Educational Resources Information Center
Snyder, Patricia A.; Thompson, Bruce
The use of tests of statistical significance was explored, first by reviewing some criticisms of contemporary practice in the use of statistical tests as reflected in a series of articles in the "American Psychologist" and in the appointment of a "Task Force on Statistical Inference" by the American Psychological Association…
NASA Astrophysics Data System (ADS)
Pernot, Pascal; Savin, Andreas
2018-06-01
Benchmarking studies in computational chemistry use reference datasets to assess the accuracy of a method through error statistics. The commonly used error statistics, such as the mean signed and mean unsigned errors, do not inform end-users on the expected amplitude of prediction errors attached to these methods. We show that, the distributions of model errors being neither normal nor zero-centered, these error statistics cannot be used to infer prediction error probabilities. To overcome this limitation, we advocate for the use of more informative statistics, based on the empirical cumulative distribution function of unsigned errors, namely, (1) the probability for a new calculation to have an absolute error below a chosen threshold and (2) the maximal amplitude of errors one can expect with a chosen high confidence level. Those statistics are also shown to be well suited for benchmarking and ranking studies. Moreover, the standard error on all benchmarking statistics depends on the size of the reference dataset. Systematic publication of these standard errors would be very helpful to assess the statistical reliability of benchmarking conclusions.
Su, Chun-Lung; Gardner, Ian A; Johnson, Wesley O
2004-07-30
The two-test two-population model, originally formulated by Hui and Walter, for estimation of test accuracy and prevalence estimation assumes conditionally independent tests, constant accuracy across populations and binomial sampling. The binomial assumption is incorrect if all individuals in a population e.g. child-care centre, village in Africa, or a cattle herd are sampled or if the sample size is large relative to population size. In this paper, we develop statistical methods for evaluating diagnostic test accuracy and prevalence estimation based on finite sample data in the absence of a gold standard. Moreover, two tests are often applied simultaneously for the purpose of obtaining a 'joint' testing strategy that has either higher overall sensitivity or specificity than either of the two tests considered singly. Sequential versions of such strategies are often applied in order to reduce the cost of testing. We thus discuss joint (simultaneous and sequential) testing strategies and inference for them. Using the developed methods, we analyse two real and one simulated data sets, and we compare 'hypergeometric' and 'binomial-based' inferences. Our findings indicate that the posterior standard deviations for prevalence (but not sensitivity and specificity) based on finite population sampling tend to be smaller than their counterparts for infinite population sampling. Finally, we make recommendations about how small the sample size should be relative to the population size to warrant use of the binomial model for prevalence estimation. Copyright 2004 John Wiley & Sons, Ltd.
A Statistical Framework for Protein Quantitation in Bottom-Up MS-Based Proteomics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Karpievitch, Yuliya; Stanley, Jeffrey R.; Taverner, Thomas
2009-08-15
Motivation: Quantitative mass spectrometry-based proteomics requires protein-level estimates and associated confidence measures. Challenges include the presence of low quality or incorrectly identified peptides and informative missingness. Furthermore, models are required for rolling peptide-level information up to the protein level. Results: We present a statistical model that carefully accounts for informative missingness in peak intensities and allows unbiased, model-based, protein-level estimation and inference. The model is applicable to both label-based and label-free quantitation experiments. We also provide automated, model-based, algorithms for filtering of proteins and peptides as well as imputation of missing values. Two LC/MS datasets are used to illustrate themore » methods. In simulation studies, our methods are shown to achieve substantially more discoveries than standard alternatives. Availability: The software has been made available in the opensource proteomics platform DAnTE (http://omics.pnl.gov/software/). Contact: adabney@stat.tamu.edu Supplementary information: Supplementary data are available at Bioinformatics online.« less
Long-term strategy for the statistical design of a forest health monitoring system
Hans T. Schreuder; Raymond L. Czaplewski
1993-01-01
A conceptual framework is given for a broad-scale survey of forest health that accomplishes three objectives: generate descriptive statistics; detect changes in such statistics; and simplify analytical inferences that identify, and possibly establish cause-effect relationships. Our paper discusses the development of sampling schemes to satisfy these three objectives,...
ERIC Educational Resources Information Center
Beeman, Jennifer Leigh Sloan
2013-01-01
Research has found that students successfully complete an introductory course in statistics without fully comprehending the underlying theory or being able to exhibit statistical reasoning. This is particularly true for the understanding about the sampling distribution of the mean, a crucial concept for statistical inference. This study…
Applying Statistical Process Control to Clinical Data: An Illustration.
ERIC Educational Resources Information Center
Pfadt, Al; And Others
1992-01-01
Principles of statistical process control are applied to a clinical setting through the use of control charts to detect changes, as part of treatment planning and clinical decision-making processes. The logic of control chart analysis is derived from principles of statistical inference. Sample charts offer examples of evaluating baselines and…
ERIC Educational Resources Information Center
Madhere, Serge
An analytic procedure, efficiency analysis, is proposed for improving the utility of quantitative program evaluation for decision making. The three features of the procedure are explained: (1) for statistical control, it adopts and extends the regression-discontinuity design; (2) for statistical inferences, it de-emphasizes hypothesis testing in…
The Problem of Auto-Correlation in Parasitology
Pollitt, Laura C.; Reece, Sarah E.; Mideo, Nicole; Nussey, Daniel H.; Colegrave, Nick
2012-01-01
Explaining the contribution of host and pathogen factors in driving infection dynamics is a major ambition in parasitology. There is increasing recognition that analyses based on single summary measures of an infection (e.g., peak parasitaemia) do not adequately capture infection dynamics and so, the appropriate use of statistical techniques to analyse dynamics is necessary to understand infections and, ultimately, control parasites. However, the complexities of within-host environments mean that tracking and analysing pathogen dynamics within infections and among hosts poses considerable statistical challenges. Simple statistical models make assumptions that will rarely be satisfied in data collected on host and parasite parameters. In particular, model residuals (unexplained variance in the data) should not be correlated in time or space. Here we demonstrate how failure to account for such correlations can result in incorrect biological inference from statistical analysis. We then show how mixed effects models can be used as a powerful tool to analyse such repeated measures data in the hope that this will encourage better statistical practices in parasitology. PMID:22511865
The penumbra of learning: a statistical theory of synaptic tagging and capture.
Gershman, Samuel J
2014-01-01
Learning in humans and animals is accompanied by a penumbra: Learning one task benefits from learning an unrelated task shortly before or after. At the cellular level, the penumbra of learning appears when weak potentiation of one synapse is amplified by strong potentiation of another synapse on the same neuron during a critical time window. Weak potentiation sets a molecular tag that enables the synapse to capture plasticity-related proteins synthesized in response to strong potentiation at another synapse. This paper describes a computational model which formalizes synaptic tagging and capture in terms of statistical learning mechanisms. According to this model, synaptic strength encodes a probabilistic inference about the dynamically changing association between pre- and post-synaptic firing rates. The rate of change is itself inferred, coupling together different synapses on the same neuron. When the inputs to one synapse change rapidly, the inferred rate of change increases, amplifying learning at other synapses.
Space-Time Data fusion for Remote Sensing Applications
NASA Technical Reports Server (NTRS)
Braverman, Amy; Nguyen, H.; Cressie, N.
2011-01-01
NASA has been collecting massive amounts of remote sensing data about Earth's systems for more than a decade. Missions are selected to be complementary in quantities measured, retrieval techniques, and sampling characteristics, so these datasets are highly synergistic. To fully exploit this, a rigorous methodology for combining data with heterogeneous sampling characteristics is required. For scientific purposes, the methodology must also provide quantitative measures of uncertainty that propagate input-data uncertainty appropriately. We view this as a statistical inference problem. The true but notdirectly- observed quantities form a vector-valued field continuous in space and time. Our goal is to infer those true values or some function of them, and provide to uncertainty quantification for those inferences. We use a spatiotemporal statistical model that relates the unobserved quantities of interest at point-level to the spatially aggregated, observed data. We describe and illustrate our method using CO2 data from two NASA data sets.
Inference of missing data and chemical model parameters using experimental statistics
NASA Astrophysics Data System (ADS)
Casey, Tiernan; Najm, Habib
2017-11-01
A method for determining the joint parameter density of Arrhenius rate expressions through the inference of missing experimental data is presented. This approach proposes noisy hypothetical data sets from target experiments and accepts those which agree with the reported statistics, in the form of nominal parameter values and their associated uncertainties. The data exploration procedure is formalized using Bayesian inference, employing maximum entropy and approximate Bayesian computation methods to arrive at a joint density on data and parameters. The method is demonstrated in the context of reactions in the H2-O2 system for predictive modeling of combustion systems of interest. Work supported by the US DOE BES CSGB. Sandia National Labs is a multimission lab managed and operated by Nat. Technology and Eng'g Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell Intl, for the US DOE NCSA under contract DE-NA-0003525.
Rank-based permutation approaches for non-parametric factorial designs.
Umlauft, Maria; Konietschke, Frank; Pauly, Markus
2017-11-01
Inference methods for null hypotheses formulated in terms of distribution functions in general non-parametric factorial designs are studied. The methods can be applied to continuous, ordinal or even ordered categorical data in a unified way, and are based only on ranks. In this set-up Wald-type statistics and ANOVA-type statistics are the current state of the art. The first method is asymptotically exact but a rather liberal statistical testing procedure for small to moderate sample size, while the latter is only an approximation which does not possess the correct asymptotic α level under the null. To bridge these gaps, a novel permutation approach is proposed which can be seen as a flexible generalization of the Kruskal-Wallis test to all kinds of factorial designs with independent observations. It is proven that the permutation principle is asymptotically correct while keeping its finite exactness property when data are exchangeable. The results of extensive simulation studies foster these theoretical findings. A real data set exemplifies its applicability. © 2017 The British Psychological Society.
Bayes factors for the linear ballistic accumulator model of decision-making.
Evans, Nathan J; Brown, Scott D
2018-04-01
Evidence accumulation models of decision-making have led to advances in several different areas of psychology. These models provide a way to integrate response time and accuracy data, and to describe performance in terms of latent cognitive processes. Testing important psychological hypotheses using cognitive models requires a method to make inferences about different versions of the models which assume different parameters to cause observed effects. The task of model-based inference using noisy data is difficult, and has proven especially problematic with current model selection methods based on parameter estimation. We provide a method for computing Bayes factors through Monte-Carlo integration for the linear ballistic accumulator (LBA; Brown and Heathcote, 2008), a widely used evidence accumulation model. Bayes factors are used frequently for inference with simpler statistical models, and they do not require parameter estimation. In order to overcome the computational burden of estimating Bayes factors via brute force integration, we exploit general purpose graphical processing units; we provide free code for this. This approach allows estimation of Bayes factors via Monte-Carlo integration within a practical time frame. We demonstrate the method using both simulated and real data. We investigate the stability of the Monte-Carlo approximation, and the LBA's inferential properties, in simulation studies.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bennett, Janine Camille; Thompson, David; Pebay, Philippe Pierre
Statistical analysis is typically used to reduce the dimensionality of and infer meaning from data. A key challenge of any statistical analysis package aimed at large-scale, distributed data is to address the orthogonal issues of parallel scalability and numerical stability. Many statistical techniques, e.g., descriptive statistics or principal component analysis, are based on moments and co-moments and, using robust online update formulas, can be computed in an embarrassingly parallel manner, amenable to a map-reduce style implementation. In this paper we focus on contingency tables, through which numerous derived statistics such as joint and marginal probability, point-wise mutual information, information entropy,more » and {chi}{sup 2} independence statistics can be directly obtained. However, contingency tables can become large as data size increases, requiring a correspondingly large amount of communication between processors. This potential increase in communication prevents optimal parallel speedup and is the main difference with moment-based statistics (which we discussed in [1]) where the amount of inter-processor communication is independent of data size. Here we present the design trade-offs which we made to implement the computation of contingency tables in parallel. We also study the parallel speedup and scalability properties of our open source implementation. In particular, we observe optimal speed-up and scalability when the contingency statistics are used in their appropriate context, namely, when the data input is not quasi-diffuse.« less
Sojoudi, Alireza; Goodyear, Bradley G
2016-12-01
Spontaneous fluctuations of blood-oxygenation level-dependent functional magnetic resonance imaging (BOLD fMRI) signals are highly synchronous between brain regions that serve similar functions. This provides a means to investigate functional networks; however, most analysis techniques assume functional connections are constant over time. This may be problematic in the case of neurological disease, where functional connections may be highly variable. Recently, several methods have been proposed to determine moment-to-moment changes in the strength of functional connections over an imaging session (so called dynamic connectivity). Here a novel analysis framework based on a hierarchical observation modeling approach was proposed, to permit statistical inference of the presence of dynamic connectivity. A two-level linear model composed of overlapping sliding windows of fMRI signals, incorporating the fact that overlapping windows are not independent was described. To test this approach, datasets were synthesized whereby functional connectivity was either constant (significant or insignificant) or modulated by an external input. The method successfully determines the statistical significance of a functional connection in phase with the modulation, and it exhibits greater sensitivity and specificity in detecting regions with variable connectivity, when compared with sliding-window correlation analysis. For real data, this technique possesses greater reproducibility and provides a more discriminative estimate of dynamic connectivity than sliding-window correlation analysis. Hum Brain Mapp 37:4566-4580, 2016. © 2016 Wiley Periodicals, Inc. © 2016 Wiley Periodicals, Inc.
Estimating pseudocounts and fold changes for digital expression measurements.
Erhard, Florian
2018-06-19
Fold changes from count based high-throughput experiments such as RNA-seq suffer from a zero-frequency problem. To circumvent division by zero, so-called pseudocounts are added to make all observed counts strictly positive. The magnitude of pseudocounts for digital expression measurements and on which stage of the analysis they are introduced remained an arbitrary choice. Moreover, in the strict sense, fold changes are not quantities that can be computed. Instead, due to the stochasticity involved in the experiments, they must be estimated by statistical inference. Here, we build on a statistical framework for fold changes, where pseudocounts correspond to the parameters of the prior distribution used for Bayesian inference of the fold change. We show that arbirary and widely used choices for applying pseudocounts can lead to biased results. As a statistical rigorous alternative, we propose and test an empirical Bayes procedure to choose appropriate pseudocounts. Moreover, we introduce the novel estimator Ψ LFC for fold changes showing favorable properties with small counts and smaller deviations from the truth in simulations and real data compared to existing methods. Our results have direct implications for entities with few reads in sequencing experiments, and indirectly also affect results for entities with many reads. Ψ LFC is available as an R package under https://github.com/erhard-lab/lfc (Apache 2.0 license); R scripts to generate all figures are available at zenodo (doi:10.5281/zenodo.1163029).
Using DNA fingerprints to infer familial relationships within NHANES III households
Katki, Hormuzd A.; Sanders, Christopher L.; Graubard, Barry I.; Bergen, Andrew W.
2009-01-01
Developing, targeting, and evaluating genomic strategies for population-based disease prevention require population-based data. In response to this urgent need, genotyping has been conducted within the Third National Health and Nutrition Examination (NHANES III), the nationally-representative household-interview health survey in the U.S. However, before these genetic analyses can occur, family relationships within households must be accurately ascertained. Unfortunately, reported family relationships within NHANES III households based on questionnaire data are incomplete and inconclusive with regards to actual biological relatedness of family members. We inferred family relationships within households using DNA fingerprints (Identifiler®) that contain the DNA loci used by law enforcement agencies for forensic identification of individuals. However, performance of these loci for relationship inference is not well understood. We evaluated two competing statistical methods for relationship inference on pairs of household members: an exact likelihood ratio relying on allele frequencies to an Identical By State (IBS) likelihood ratio that only requires matching alleles. We modified these methods to account for genotyping errors and population substructure. The two methods usually agree on the rankings of the most likely relationships. However, the IBS method underestimates the likelihood ratio by not accounting for the informativeness of matching rare alleles. The likelihood ratio is sensitive to estimates of population substructure, and parent-child relationships are sensitive to the specified genotyping error rate. These loci were unable to distinguish second-degree relationships and cousins from being unrelated. The genetic data is also useful for verifying reported relationships and identifying data quality issues. An important by-product is the first explicitly nationally-representative estimates of allele frequencies at these ubiquitous forensic loci. PMID:20664713
Feinauer, Christoph; Procaccini, Andrea; Zecchina, Riccardo; Weigt, Martin; Pagnani, Andrea
2014-01-01
In the course of evolution, proteins show a remarkable conservation of their three-dimensional structure and their biological function, leading to strong evolutionary constraints on the sequence variability between homologous proteins. Our method aims at extracting such constraints from rapidly accumulating sequence data, and thereby at inferring protein structure and function from sequence information alone. Recently, global statistical inference methods (e.g. direct-coupling analysis, sparse inverse covariance estimation) have achieved a breakthrough towards this aim, and their predictions have been successfully implemented into tertiary and quaternary protein structure prediction methods. However, due to the discrete nature of the underlying variable (amino-acids), exact inference requires exponential time in the protein length, and efficient approximations are needed for practical applicability. Here we propose a very efficient multivariate Gaussian modeling approach as a variant of direct-coupling analysis: the discrete amino-acid variables are replaced by continuous Gaussian random variables. The resulting statistical inference problem is efficiently and exactly solvable. We show that the quality of inference is comparable or superior to the one achieved by mean-field approximations to inference with discrete variables, as done by direct-coupling analysis. This is true for (i) the prediction of residue-residue contacts in proteins, and (ii) the identification of protein-protein interaction partner in bacterial signal transduction. An implementation of our multivariate Gaussian approach is available at the website http://areeweb.polito.it/ricerca/cmp/code. PMID:24663061
NASA Astrophysics Data System (ADS)
Alexandridis, Konstantinos T.
This dissertation adopts a holistic and detailed approach to modeling spatially explicit agent-based artificial intelligent systems, using the Multi Agent-based Behavioral Economic Landscape (MABEL) model. The research questions that addresses stem from the need to understand and analyze the real-world patterns and dynamics of land use change from a coupled human-environmental systems perspective. Describes the systemic, mathematical, statistical, socio-economic and spatial dynamics of the MABEL modeling framework, and provides a wide array of cross-disciplinary modeling applications within the research, decision-making and policy domains. Establishes the symbolic properties of the MABEL model as a Markov decision process, analyzes the decision-theoretic utility and optimization attributes of agents towards comprising statistically and spatially optimal policies and actions, and explores the probabilogic character of the agents' decision-making and inference mechanisms via the use of Bayesian belief and decision networks. Develops and describes a Monte Carlo methodology for experimental replications of agent's decisions regarding complex spatial parcel acquisition and learning. Recognizes the gap on spatially-explicit accuracy assessment techniques for complex spatial models, and proposes an ensemble of statistical tools designed to address this problem. Advanced information assessment techniques such as the Receiver-Operator Characteristic curve, the impurity entropy and Gini functions, and the Bayesian classification functions are proposed. The theoretical foundation for modular Bayesian inference in spatially-explicit multi-agent artificial intelligent systems, and the ensembles of cognitive and scenario assessment modular tools build for the MABEL model are provided. Emphasizes the modularity and robustness as valuable qualitative modeling attributes, and examines the role of robust intelligent modeling as a tool for improving policy-decisions related to land use change. Finally, the major contributions to the science are presented along with valuable directions for future research.
Ren, Jie; Song, Kai; Deng, Minghua; Reinert, Gesine; Cannon, Charles H; Sun, Fengzhu
2016-04-01
Next-generation sequencing (NGS) technologies generate large amounts of short read data for many different organisms. The fact that NGS reads are generally short makes it challenging to assemble the reads and reconstruct the original genome sequence. For clustering genomes using such NGS data, word-count based alignment-free sequence comparison is a promising approach, but for this approach, the underlying expected word counts are essential.A plausible model for this underlying distribution of word counts is given through modeling the DNA sequence as a Markov chain (MC). For single long sequences, efficient statistics are available to estimate the order of MCs and the transition probability matrix for the sequences. As NGS data do not provide a single long sequence, inference methods on Markovian properties of sequences based on single long sequences cannot be directly used for NGS short read data. Here we derive a normal approximation for such word counts. We also show that the traditional Chi-square statistic has an approximate gamma distribution ,: using the Lander-Waterman model for physical mapping. We propose several methods to estimate the order of the MC based on NGS reads and evaluate those using simulations. We illustrate the applications of our results by clustering genomic sequences of several vertebrate and tree species based on NGS reads using alignment-free sequence dissimilarity measures. We find that the estimated order of the MC has a considerable effect on the clustering results ,: and that the clustering results that use a N: MC of the estimated order give a plausible clustering of the species. Our implementation of the statistics developed here is available as R package 'NGS.MC' at http://www-rcf.usc.edu/∼fsun/Programs/NGS-MC/NGS-MC.html fsun@usc.edu Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Du, Yuanwei; Guo, Yubin
2015-01-01
The intrinsic mechanism of multimorbidity is difficult to recognize and prediction and diagnosis are difficult to carry out accordingly. Bayesian networks can help to diagnose multimorbidity in health care, but it is difficult to obtain the conditional probability table (CPT) because of the lack of clinically statistical data. Today, expert knowledge and experience are increasingly used in training Bayesian networks in order to help predict or diagnose diseases, but the CPT in Bayesian networks is usually irrational or ineffective for ignoring realistic constraints especially in multimorbidity. In order to solve these problems, an evidence reasoning (ER) approach is employed to extract and fuse inference data from experts using a belief distribution and recursive ER algorithm, based on which evidence reasoning method for constructing conditional probability tables in Bayesian network of multimorbidity is presented step by step. A multimorbidity numerical example is used to demonstrate the method and prove its feasibility and application. Bayesian network can be determined as long as the inference assessment is inferred by each expert according to his/her knowledge or experience. Our method is more effective than existing methods for extracting expert inference data accurately and is fused effectively for constructing CPTs in a Bayesian network of multimorbidity.
Cohen, Trevor; Schvaneveldt, Roger; Widdows, Dominic
2010-04-01
The discovery of implicit connections between terms that do not occur together in any scientific document underlies the model of literature-based knowledge discovery first proposed by Swanson. Corpus-derived statistical models of semantic distance such as Latent Semantic Analysis (LSA) have been evaluated previously as methods for the discovery of such implicit connections. However, LSA in particular is dependent on a computationally demanding method of dimension reduction as a means to obtain meaningful indirect inference, limiting its ability to scale to large text corpora. In this paper, we evaluate the ability of Random Indexing (RI), a scalable distributional model of word associations, to draw meaningful implicit relationships between terms in general and biomedical language. Proponents of this method have achieved comparable performance to LSA on several cognitive tasks while using a simpler and less computationally demanding method of dimension reduction than LSA employs. In this paper, we demonstrate that the original implementation of RI is ineffective at inferring meaningful indirect connections, and evaluate Reflective Random Indexing (RRI), an iterative variant of the method that is better able to perform indirect inference. RRI is shown to lead to more clearly related indirect connections and to outperform existing RI implementations in the prediction of future direct co-occurrence in the MEDLINE corpus. 2009 Elsevier Inc. All rights reserved.
The space of ultrametric phylogenetic trees.
Gavryushkin, Alex; Drummond, Alexei J
2016-08-21
The reliability of a phylogenetic inference method from genomic sequence data is ensured by its statistical consistency. Bayesian inference methods produce a sample of phylogenetic trees from the posterior distribution given sequence data. Hence the question of statistical consistency of such methods is equivalent to the consistency of the summary of the sample. More generally, statistical consistency is ensured by the tree space used to analyse the sample. In this paper, we consider two standard parameterisations of phylogenetic time-trees used in evolutionary models: inter-coalescent interval lengths and absolute times of divergence events. For each of these parameterisations we introduce a natural metric space on ultrametric phylogenetic trees. We compare the introduced spaces with existing models of tree space and formulate several formal requirements that a metric space on phylogenetic trees must possess in order to be a satisfactory space for statistical analysis, and justify them. We show that only a few known constructions of the space of phylogenetic trees satisfy these requirements. However, our results suggest that these basic requirements are not enough to distinguish between the two metric spaces we introduce and that the choice between metric spaces requires additional properties to be considered. Particularly, that the summary tree minimising the square distance to the trees from the sample might be different for different parameterisations. This suggests that further fundamental insight is needed into the problem of statistical consistency of phylogenetic inference methods. Copyright © 2016 The Authors. Published by Elsevier Ltd.. All rights reserved.
Assessing risk factors for dental caries: a statistical modeling approach.
Trottini, Mario; Bossù, Maurizio; Corridore, Denise; Ierardo, Gaetano; Luzzi, Valeria; Saccucci, Matteo; Polimeni, Antonella
2015-01-01
The problem of identifying potential determinants and predictors of dental caries is of key importance in caries research and it has received considerable attention in the scientific literature. From the methodological side, a broad range of statistical models is currently available to analyze dental caries indices (DMFT, dmfs, etc.). These models have been applied in several studies to investigate the impact of different risk factors on the cumulative severity of dental caries experience. However, in most of the cases (i) these studies focus on a very specific subset of risk factors; and (ii) in the statistical modeling only few candidate models are considered and model selection is at best only marginally addressed. As a result, our understanding of the robustness of the statistical inferences with respect to the choice of the model is very limited; the richness of the set of statistical models available for analysis in only marginally exploited; and inferences could be biased due the omission of potentially important confounding variables in the model's specification. In this paper we argue that these limitations can be overcome considering a general class of candidate models and carefully exploring the model space using standard model selection criteria and measures of global fit and predictive performance of the candidate models. Strengths and limitations of the proposed approach are illustrated with a real data set. In our illustration the model space contains more than 2.6 million models, which require inferences to be adjusted for 'optimism'.
Organic acid modeling and model validation: Workshop summary. Final report
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sullivan, T.J.; Eilers, J.M.
1992-08-14
A workshop was held in Corvallis, Oregon on April 9--10, 1992 at the offices of E&S Environmental Chemistry, Inc. The purpose of this workshop was to initiate research efforts on the entitled ``Incorporation of an organic acid representation into MAGIC (Model of Acidification of Groundwater in Catchments) and testing of the revised model using Independent data sources.`` The workshop was attended by a team of internationally-recognized experts in the fields of surface water acid-bass chemistry, organic acids, and watershed modeling. The rationale for the proposed research is based on the recent comparison between MAGIC model hindcasts and paleolimnological inferences ofmore » historical acidification for a set of 33 statistically-selected Adirondack lakes. Agreement between diatom-inferred and MAGIC-hindcast lakewater chemistry in the earlier research had been less than satisfactory. Based on preliminary analyses, it was concluded that incorporation of a reasonable organic acid representation into the version of MAGIC used for hindcasting was the logical next step toward improving model agreement.« less
A descriptivist approach to trait conceptualization and inference.
Jonas, Katherine G; Markon, Kristian E
2016-01-01
In their recent article, How Functionalist and Process Approaches to Behavior Can Explain Trait Covariation, Wood, Gardner, and Harms (2015) underscore the need for more process-based understandings of individual differences. At the same time, the article illustrates a common error in the use and interpretation of latent variable models: namely, the misuse of models to arbitrate issues of causation and the nature of latent variables. Here, we explain how latent variables can be understood simply as parsimonious summaries of data, and how statistical inference can be based on choosing those summaries that minimize information required to represent the data using the model. Although Wood, Gardner, and Harms acknowledge this perspective, they underestimate its significance, including its importance to modeling and the conceptualization of psychological measurement. We believe this perspective has important implications for understanding individual differences in a number of domains, including current debates surrounding the role of formative versus reflective latent variables. (c) 2015 APA, all rights reserved).
A Conceptual Framework for Pharmacodynamic Genome-wide Association Studies in Pharmacogenomics
Wu, Rongling; Tong, Chunfa; Wang, Zhong; Mauger, David; Tantisira, Kelan; Szefler, Stanley J.; Chinchilli, Vernon M.; Israel, Elliot
2013-01-01
Summary Genome-wide association studies (GWAS) have emerged as a powerful tool to identify loci that affect drug response or susceptibility to adverse drug reactions. However, current GWAS based on a simple analysis of associations between genotype and phenotype ignores the biochemical reactions of drug response, thus limiting the scope of inference about its genetic architecture. To facilitate the inference of GWAS in pharmacogenomics, we sought to undertake the mathematical integration of the pharmacodynamic process of drug reactions through computational models. By estimating and testing the genetic control of pharmacodynamic and pharmacokinetic parameters, this mechanistic approach does not only enhance the biological and clinical relevance of significant genetic associations, but also improve the statistical power and robustness of gene detection. This report discusses the general principle and development of pharmacodynamics-based GWAS, highlights the practical use of this approach in addressing various pharmacogenomic problems, and suggests that this approach will be an important method to study the genetic architecture of drug responses or reactions. PMID:21920452
The MR-Base platform supports systematic causal inference across the human phenome
Wade, Kaitlin H; Haberland, Valeriia; Baird, Denis; Laurin, Charles; Burgess, Stephen; Bowden, Jack; Langdon, Ryan; Tan, Vanessa Y; Yarmolinsky, James; Shihab, Hashem A; Timpson, Nicholas J; Evans, David M; Relton, Caroline; Martin, Richard M; Davey Smith, George
2018-01-01
Results from genome-wide association studies (GWAS) can be used to infer causal relationships between phenotypes, using a strategy known as 2-sample Mendelian randomization (2SMR) and bypassing the need for individual-level data. However, 2SMR methods are evolving rapidly and GWAS results are often insufficiently curated, undermining efficient implementation of the approach. We therefore developed MR-Base (http://www.mrbase.org): a platform that integrates a curated database of complete GWAS results (no restrictions according to statistical significance) with an application programming interface, web app and R packages that automate 2SMR. The software includes several sensitivity analyses for assessing the impact of horizontal pleiotropy and other violations of assumptions. The database currently comprises 11 billion single nucleotide polymorphism-trait associations from 1673 GWAS and is updated on a regular basis. Integrating data with software ensures more rigorous application of hypothesis-driven analyses and allows millions of potential causal relationships to be efficiently evaluated in phenome-wide association studies. PMID:29846171
Multiple-Line Inference of Selection on Quantitative Traits
Riedel, Nico; Khatri, Bhavin S.; Lässig, Michael; Berg, Johannes
2015-01-01
Trait differences between species may be attributable to natural selection. However, quantifying the strength of evidence for selection acting on a particular trait is a difficult task. Here we develop a population genetics test for selection acting on a quantitative trait that is based on multiple-line crosses. We show that using multiple lines increases both the power and the scope of selection inferences. First, a test based on three or more lines detects selection with strongly increased statistical significance, and we show explicitly how the sensitivity of the test depends on the number of lines. Second, a multiple-line test can distinguish between different lineage-specific selection scenarios. Our analytical results are complemented by extensive numerical simulations. We then apply the multiple-line test to QTL data on floral character traits in plant species of the Mimulus genus and on photoperiodic traits in different maize strains, where we find a signature of lineage-specific selection not seen in two-line tests. PMID:26139839
Hosseinbor, Ameer Pasha; Kim, Won Hwa; Adluru, Nagesh; Acharya, Amit; Vorperian, Houri K; Chung, Moo K
2014-01-01
Recently, the HyperSPHARM algorithm was proposed to parameterize multiple disjoint objects in a holistic manner using the 4D hyperspherical harmonics. The HyperSPHARM coefficients are global; they cannot be used to directly infer localized variations in signal. In this paper, we present a unified wavelet framework that links Hyper-SPHARM to the diffusion wavelet transform. Specifically, we will show that the HyperSPHARM basis forms a subset of a wavelet-based multiscale representation of surface-based signals. This wavelet, termed the hyperspherical diffusion wavelet, is a consequence of the equivalence of isotropic heat diffusion smoothing and the diffusion wavelet transform on the hypersphere. Our framework allows for the statistical inference of highly localized anatomical changes, which we demonstrate in the first-ever developmental study on the hyoid bone investigating gender and age effects. We also show that the hyperspherical wavelet successfully picks up group-wise differences that are barely detectable using SPHARM.
Hosseinbor, A. Pasha; Kim, Won Hwa; Adluru, Nagesh; Acharya, Amit; Vorperian, Houri K.; Chung, Moo K.
2014-01-01
Recently, the HyperSPHARM algorithm was proposed to parameterize multiple disjoint objects in a holistic manner using the 4D hyperspherical harmonics. The HyperSPHARM coefficients are global; they cannot be used to directly infer localized variations in signal. In this paper, we present a unified wavelet framework that links HyperSPHARM to the diffusion wavelet transform. Specifically, we will show that the HyperSPHARM basis forms a subset of a wavelet-based multiscale representation of surface-based signals. This wavelet, termed the hyperspherical diffusion wavelet, is a consequence of the equivalence of isotropic heat diffusion smoothing and the diffusion wavelet transform on the hypersphere. Our framework allows for the statistical inference of highly localized anatomical changes, which we demonstrate in the firstever developmental study on the hyoid bone investigating gender and age effects. We also show that the hyperspherical wavelet successfully picks up group-wise differences that are barely detectable using SPHARM. PMID:25320783
Inferring Characteristics of Sensorimotor Behavior by Quantifying Dynamics of Animal Locomotion
NASA Astrophysics Data System (ADS)
Leung, KaWai
Locomotion is one of the most well-studied topics in animal behavioral studies. Many fundamental and clinical research make use of the locomotion of an animal model to explore various aspects in sensorimotor behavior. In the past, most of these studies focused on population average of a specific trait due to limitation of data collection and processing power. With recent advance in computer vision and statistical modeling techniques, it is now possible to track and analyze large amounts of behavioral data. In this thesis, I present two projects that aim to infer the characteristics of sensorimotor behavior by quantifying the dynamics of locomotion of nematode Caenorhabditis elegans and fruit fly Drosophila melanogaster, shedding light on statistical dependence between sensing and behavior. In the first project, I investigate the possibility of inferring noxious sensory information from the behavior of Caenorhabditis elegans. I develop a statistical model to infer the heat stimulus level perceived by individual animals from their stereotyped escape responses after stimulation by an IR laser. The model allows quantification of analgesic-like effects of chemical agents or genetic mutations in the worm. At the same time, the method is able to differentiate perturbations of locomotion behavior that are beyond affecting the sensory system. With this model I propose experimental designs that allows statistically significant identification of analgesic-like effects. In the second project, I investigate the relationship of energy budget and stability of locomotion in determining the walking speed distribution of Drosophila melanogaster during aging. The locomotion stability at different age groups is estimated from video recordings using Floquet theory. I calculate the power consumption of different locomotion speed using a biomechanics model. In conclusion, the power consumption, not stability, predicts the locomotion speed distribution at different ages.
Emerging Concepts of Data Integration in Pathogen Phylodynamics.
Baele, Guy; Suchard, Marc A; Rambaut, Andrew; Lemey, Philippe
2017-01-01
Phylodynamics has become an increasingly popular statistical framework to extract evolutionary and epidemiological information from pathogen genomes. By harnessing such information, epidemiologists aim to shed light on the spatio-temporal patterns of spread and to test hypotheses about the underlying interaction of evolutionary and ecological dynamics in pathogen populations. Although the field has witnessed a rich development of statistical inference tools with increasing levels of sophistication, these tools initially focused on sequences as their sole primary data source. Integrating various sources of information, however, promises to deliver more precise insights in infectious diseases and to increase opportunities for statistical hypothesis testing. Here, we review how the emerging concept of data integration is stimulating new advances in Bayesian evolutionary inference methodology which formalize a marriage of statistical thinking and evolutionary biology. These approaches include connecting sequence to trait evolution, such as for host, phenotypic and geographic sampling information, but also the incorporation of covariates of evolutionary and epidemic processes in the reconstruction procedures. We highlight how a full Bayesian approach to covariate modeling and testing can generate further insights into sequence evolution, trait evolution, and population dynamics in pathogen populations. Specific examples demonstrate how such approaches can be used to test the impact of host on rabies and HIV evolutionary rates, to identify the drivers of influenza dispersal as well as the determinants of rabies cross-species transmissions, and to quantify the evolutionary dynamics of influenza antigenicity. Finally, we briefly discuss how data integration is now also permeating through the inference of transmission dynamics, leading to novel insights into tree-generative processes and detailed reconstructions of transmission trees. [Bayesian inference; birth–death models; coalescent models; continuous trait evolution; covariates; data integration; discrete trait evolution; pathogen phylodynamics.
Emerging Concepts of Data Integration in Pathogen Phylodynamics
Baele, Guy; Suchard, Marc A.; Rambaut, Andrew; Lemey, Philippe
2017-01-01
Phylodynamics has become an increasingly popular statistical framework to extract evolutionary and epidemiological information from pathogen genomes. By harnessing such information, epidemiologists aim to shed light on the spatio-temporal patterns of spread and to test hypotheses about the underlying interaction of evolutionary and ecological dynamics in pathogen populations. Although the field has witnessed a rich development of statistical inference tools with increasing levels of sophistication, these tools initially focused on sequences as their sole primary data source. Integrating various sources of information, however, promises to deliver more precise insights in infectious diseases and to increase opportunities for statistical hypothesis testing. Here, we review how the emerging concept of data integration is stimulating new advances in Bayesian evolutionary inference methodology which formalize a marriage of statistical thinking and evolutionary biology. These approaches include connecting sequence to trait evolution, such as for host, phenotypic and geographic sampling information, but also the incorporation of covariates of evolutionary and epidemic processes in the reconstruction procedures. We highlight how a full Bayesian approach to covariate modeling and testing can generate further insights into sequence evolution, trait evolution, and population dynamics in pathogen populations. Specific examples demonstrate how such approaches can be used to test the impact of host on rabies and HIV evolutionary rates, to identify the drivers of influenza dispersal as well as the determinants of rabies cross-species transmissions, and to quantify the evolutionary dynamics of influenza antigenicity. Finally, we briefly discuss how data integration is now also permeating through the inference of transmission dynamics, leading to novel insights into tree-generative processes and detailed reconstructions of transmission trees. [Bayesian inference; birth–death models; coalescent models; continuous trait evolution; covariates; data integration; discrete trait evolution; pathogen phylodynamics. PMID:28173504
Ayres, Daniel L; Darling, Aaron; Zwickl, Derrick J; Beerli, Peter; Holder, Mark T; Lewis, Paul O; Huelsenbeck, John P; Ronquist, Fredrik; Swofford, David L; Cummings, Michael P; Rambaut, Andrew; Suchard, Marc A
2012-01-01
Phylogenetic inference is fundamental to our understanding of most aspects of the origin and evolution of life, and in recent years, there has been a concentration of interest in statistical approaches such as Bayesian inference and maximum likelihood estimation. Yet, for large data sets and realistic or interesting models of evolution, these approaches remain computationally demanding. High-throughput sequencing can yield data for thousands of taxa, but scaling to such problems using serial computing often necessitates the use of nonstatistical or approximate approaches. The recent emergence of graphics processing units (GPUs) provides an opportunity to leverage their excellent floating-point computational performance to accelerate statistical phylogenetic inference. A specialized library for phylogenetic calculation would allow existing software packages to make more effective use of available computer hardware, including GPUs. Adoption of a common library would also make it easier for other emerging computing architectures, such as field programmable gate arrays, to be used in the future. We present BEAGLE, an application programming interface (API) and library for high-performance statistical phylogenetic inference. The API provides a uniform interface for performing phylogenetic likelihood calculations on a variety of compute hardware platforms. The library includes a set of efficient implementations and can currently exploit hardware including GPUs using NVIDIA CUDA, central processing units (CPUs) with Streaming SIMD Extensions and related processor supplementary instruction sets, and multicore CPUs via OpenMP. To demonstrate the advantages of a common API, we have incorporated the library into several popular phylogenetic software packages. The BEAGLE library is free open source software licensed under the Lesser GPL and available from http://beagle-lib.googlecode.com. An example client program is available as public domain software.
Ayres, Daniel L.; Darling, Aaron; Zwickl, Derrick J.; Beerli, Peter; Holder, Mark T.; Lewis, Paul O.; Huelsenbeck, John P.; Ronquist, Fredrik; Swofford, David L.; Cummings, Michael P.; Rambaut, Andrew; Suchard, Marc A.
2012-01-01
Abstract Phylogenetic inference is fundamental to our understanding of most aspects of the origin and evolution of life, and in recent years, there has been a concentration of interest in statistical approaches such as Bayesian inference and maximum likelihood estimation. Yet, for large data sets and realistic or interesting models of evolution, these approaches remain computationally demanding. High-throughput sequencing can yield data for thousands of taxa, but scaling to such problems using serial computing often necessitates the use of nonstatistical or approximate approaches. The recent emergence of graphics processing units (GPUs) provides an opportunity to leverage their excellent floating-point computational performance to accelerate statistical phylogenetic inference. A specialized library for phylogenetic calculation would allow existing software packages to make more effective use of available computer hardware, including GPUs. Adoption of a common library would also make it easier for other emerging computing architectures, such as field programmable gate arrays, to be used in the future. We present BEAGLE, an application programming interface (API) and library for high-performance statistical phylogenetic inference. The API provides a uniform interface for performing phylogenetic likelihood calculations on a variety of compute hardware platforms. The library includes a set of efficient implementations and can currently exploit hardware including GPUs using NVIDIA CUDA, central processing units (CPUs) with Streaming SIMD Extensions and related processor supplementary instruction sets, and multicore CPUs via OpenMP. To demonstrate the advantages of a common API, we have incorporated the library into several popular phylogenetic software packages. The BEAGLE library is free open source software licensed under the Lesser GPL and available from http://beagle-lib.googlecode.com. An example client program is available as public domain software. PMID:21963610
Gopinath, Kaundinya; Krishnamurthy, Venkatagiri; Lacey, Simon; Sathian, K
2018-02-01
In a recent study Eklund et al. have shown that cluster-wise family-wise error (FWE) rate-corrected inferences made in parametric statistical method-based functional magnetic resonance imaging (fMRI) studies over the past couple of decades may have been invalid, particularly for cluster defining thresholds less stringent than p < 0.001; principally because the spatial autocorrelation functions (sACFs) of fMRI data had been modeled incorrectly to follow a Gaussian form, whereas empirical data suggest otherwise. Hence, the residuals from general linear model (GLM)-based fMRI activation estimates in these studies may not have possessed a homogenously Gaussian sACF. Here we propose a method based on the assumption that heterogeneity and non-Gaussianity of the sACF of the first-level GLM analysis residuals, as well as temporal autocorrelations in the first-level voxel residual time-series, are caused by unmodeled MRI signal from neuronal and physiological processes as well as motion and other artifacts, which can be approximated by appropriate decompositions of the first-level residuals with principal component analysis (PCA), and removed. We show that application of this method yields GLM residuals with significantly reduced spatial correlation, nearly Gaussian sACF and uniform spatial smoothness across the brain, thereby allowing valid cluster-based FWE-corrected inferences based on assumption of Gaussian spatial noise. We further show that application of this method renders the voxel time-series of first-level GLM residuals independent, and identically distributed across time (which is a necessary condition for appropriate voxel-level GLM inference), without having to fit ad hoc stochastic colored noise models. Furthermore, the detection power of individual subject brain activation analysis is enhanced. This method will be especially useful for case studies, which rely on first-level GLM analysis inferences.
Geng, Zongyu; Yang, Feng; Chen, Xi; Wu, Nianqiang
2016-01-01
It remains a challenge to accurately calibrate a sensor subject to environmental drift. The calibration task for such a sensor is to quantify the relationship between the sensor’s response and its exposure condition, which is specified by not only the analyte concentration but also the environmental factors such as temperature and humidity. This work developed a Gaussian Process (GP)-based procedure for the efficient calibration of sensors in drifting environments. Adopted as the calibration model, GP is not only able to capture the possibly nonlinear relationship between the sensor responses and the various exposure-condition factors, but also able to provide valid statistical inference for uncertainty quantification of the target estimates (e.g., the estimated analyte concentration of an unknown environment). Built on GP’s inference ability, an experimental design method was developed to achieve efficient sampling of calibration data in a batch sequential manner. The resulting calibration procedure, which integrates the GP-based modeling and experimental design, was applied on a simulated chemiresistor sensor to demonstrate its effectiveness and its efficiency over the traditional method. PMID:26924894
p-Curve and p-Hacking in Observational Research.
Bruns, Stephan B; Ioannidis, John P A
2016-01-01
The p-curve, the distribution of statistically significant p-values of published studies, has been used to make inferences on the proportion of true effects and on the presence of p-hacking in the published literature. We analyze the p-curve for observational research in the presence of p-hacking. We show by means of simulations that even with minimal omitted-variable bias (e.g., unaccounted confounding) p-curves based on true effects and p-curves based on null-effects with p-hacking cannot be reliably distinguished. We also demonstrate this problem using as practical example the evaluation of the effect of malaria prevalence on economic growth between 1960 and 1996. These findings call recent studies into question that use the p-curve to infer that most published research findings are based on true effects in the medical literature and in a wide range of disciplines. p-values in observational research may need to be empirically calibrated to be interpretable with respect to the commonly used significance threshold of 0.05. Violations of randomization in experimental studies may also result in situations where the use of p-curves is similarly unreliable.
Etzioni, Ruth; Gulati, Roman
2013-04-01
In our article about limitations of basing screening policy on screening trials, we offered several examples of ways in which modeling, using data from large screening trials and population trends, provided insights that differed somewhat from those based only on empirical trial results. In this editorial, we take a step back and consider the general question of whether randomized screening trials provide the strongest evidence for clinical guidelines concerning population screening programs. We argue that randomized trials provide a process that is designed to protect against certain biases but that this process does not guarantee that inferences based on empirical results from screening trials will be unbiased. Appropriate quantitative methods are key to obtaining unbiased inferences from screening trials. We highlight several studies in the statistical literature demonstrating that conventional survival analyses of screening trials can be misleading and list a number of key questions concerning screening harms and benefits that cannot be answered without modeling. Although we acknowledge the centrality of screening trials in the policy process, we maintain that modeling constitutes a powerful tool for screening trial interpretation and screening policy development.
Inferring the parameters of a Markov process from snapshots of the steady state
NASA Astrophysics Data System (ADS)
Dettmer, Simon L.; Berg, Johannes
2018-02-01
We seek to infer the parameters of an ergodic Markov process from samples taken independently from the steady state. Our focus is on non-equilibrium processes, where the steady state is not described by the Boltzmann measure, but is generally unknown and hard to compute, which prevents the application of established equilibrium inference methods. We propose a quantity we call propagator likelihood, which takes on the role of the likelihood in equilibrium processes. This propagator likelihood is based on fictitious transitions between those configurations of the system which occur in the samples. The propagator likelihood can be derived by minimising the relative entropy between the empirical distribution and a distribution generated by propagating the empirical distribution forward in time. Maximising the propagator likelihood leads to an efficient reconstruction of the parameters of the underlying model in different systems, both with discrete configurations and with continuous configurations. We apply the method to non-equilibrium models from statistical physics and theoretical biology, including the asymmetric simple exclusion process (ASEP), the kinetic Ising model, and replicator dynamics.
Suratanee, Apichat; Plaimas, Kitiporn
2017-01-01
The associations between proteins and diseases are crucial information for investigating pathological mechanisms. However, the number of known and reliable protein-disease associations is quite small. In this study, an analysis framework to infer associations between proteins and diseases was developed based on a large data set of a human protein-protein interaction network integrating an effective network search, namely, the reverse k -nearest neighbor (R k NN) search. The R k NN search was used to identify an impact of a protein on other proteins. Then, associations between proteins and diseases were inferred statistically. The method using the R k NN search yielded a much higher precision than a random selection, standard nearest neighbor search, or when applying the method to a random protein-protein interaction network. All protein-disease pair candidates were verified by a literature search. Supporting evidence for 596 pairs was identified. In addition, cluster analysis of these candidates revealed 10 promising groups of diseases to be further investigated experimentally. This method can be used to identify novel associations to better understand complex relationships between proteins and diseases.
Applying a multiobjective metaheuristic inspired by honey bees to phylogenetic inference.
Santander-Jiménez, Sergio; Vega-Rodríguez, Miguel A
2013-10-01
The development of increasingly popular multiobjective metaheuristics has allowed bioinformaticians to deal with optimization problems in computational biology where multiple objective functions must be taken into account. One of the most relevant research topics that can benefit from these techniques is phylogenetic inference. Throughout the years, different researchers have proposed their own view about the reconstruction of ancestral evolutionary relationships among species. As a result, biologists often report different phylogenetic trees from a same dataset when considering distinct optimality principles. In this work, we detail a multiobjective swarm intelligence approach based on the novel Artificial Bee Colony algorithm for inferring phylogenies. The aim of this paper is to propose a complementary view of phylogenetics according to the maximum parsimony and maximum likelihood criteria, in order to generate a set of phylogenetic trees that represent a compromise between these principles. Experimental results on a variety of nucleotide data sets and statistical studies highlight the relevance of the proposal with regard to other multiobjective algorithms and state-of-the-art biological methods. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
The logical primitives of thought: Empirical foundations for compositional cognitive models.
Piantadosi, Steven T; Tenenbaum, Joshua B; Goodman, Noah D
2016-07-01
The notion of a compositional language of thought (LOT) has been central in computational accounts of cognition from earliest attempts (Boole, 1854; Fodor, 1975) to the present day (Feldman, 2000; Penn, Holyoak, & Povinelli, 2008; Fodor, 2008; Kemp, 2012; Goodman, Tenenbaum, & Gerstenberg, 2015). Recent modeling work shows how statistical inferences over compositionally structured hypothesis spaces might explain learning and development across a variety of domains. However, the primitive components of such representations are typically assumed a priori by modelers and theoreticians rather than determined empirically. We show how different sets of LOT primitives, embedded in a psychologically realistic approximate Bayesian inference framework, systematically predict distinct learning curves in rule-based concept learning experiments. We use this feature of LOT models to design a set of large-scale concept learning experiments that can determine the most likely primitives for psychological concepts involving Boolean connectives and quantification. Subjects' inferences are most consistent with a rich (nonminimal) set of Boolean operations, including first-order, but not second-order, quantification. Our results more generally show how specific LOT theories can be distinguished empirically. (PsycINFO Database Record (c) 2016 APA, all rights reserved).
ERIC Educational Resources Information Center
Page, Robert; Satake, Eiki
2017-01-01
While interest in Bayesian statistics has been growing in statistics education, the treatment of the topic is still inadequate in both textbooks and the classroom. Because so many fields of study lead to careers that involve a decision-making process requiring an understanding of Bayesian methods, it is becoming increasingly clear that Bayesian…
Human Inferences about Sequences: A Minimal Transition Probability Model
2016-01-01
The brain constantly infers the causes of the inputs it receives and uses these inferences to generate statistical expectations about future observations. Experimental evidence for these expectations and their violations include explicit reports, sequential effects on reaction times, and mismatch or surprise signals recorded in electrophysiology and functional MRI. Here, we explore the hypothesis that the brain acts as a near-optimal inference device that constantly attempts to infer the time-varying matrix of transition probabilities between the stimuli it receives, even when those stimuli are in fact fully unpredictable. This parsimonious Bayesian model, with a single free parameter, accounts for a broad range of findings on surprise signals, sequential effects and the perception of randomness. Notably, it explains the pervasive asymmetry between repetitions and alternations encountered in those studies. Our analysis suggests that a neural machinery for inferring transition probabilities lies at the core of human sequence knowledge. PMID:28030543
Exploring High School Students Beginning Reasoning about Significance Tests with Technology
ERIC Educational Resources Information Center
García, Víctor N.; Sánchez, Ernesto
2017-01-01
In the present study we analyze how students reason about or make inferences given a particular hypothesis testing problem (without having studied formal methods of statistical inference) when using Fathom. They use Fathom to create an empirical sampling distribution through computer simulation. It is found that most student´s reasoning rely on…
Stan: A Probabilistic Programming Language for Bayesian Inference and Optimization
ERIC Educational Resources Information Center
Gelman, Andrew; Lee, Daniel; Guo, Jiqiang
2015-01-01
Stan is a free and open-source C++ program that performs Bayesian inference or optimization for arbitrary user-specified models and can be called from the command line, R, Python, Matlab, or Julia and has great promise for fitting large and complex statistical models in many areas of application. We discuss Stan from users' and developers'…
Exploiting Sparsity in Hyperspectral Image Classification via Graphical Models
2013-05-01
distribution p by minimizing the Kullback – Leibler (KL) distance D(p‖p̂) = Ep[log(p/p̂)] using first- and second-order statistics, via a maximum-weight...Obtain sparse representations αl, l = 1, . . . , T , in RN from test image. 6: Inference: Classify based on the output of the resulting classifier using ...The public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing
Maximum entropy method applied to deblurring images on a MasPar MP-1 computer
NASA Technical Reports Server (NTRS)
Bonavito, N. L.; Dorband, John; Busse, Tim
1991-01-01
A statistical inference method based on the principle of maximum entropy is developed for the purpose of enhancing and restoring satellite images. The proposed maximum entropy image restoration method is shown to overcome the difficulties associated with image restoration and provide the smoothest and most appropriate solution consistent with the measured data. An implementation of the method on the MP-1 computer is described, and results of tests on simulated data are presented.
On the Period-Amplitude and Amplitude-Period Relationships
NASA Technical Reports Server (NTRS)
Wilson, Robert M.; Hathaway, David H.
2008-01-01
Examined are Period-Amplitude and Amplitude-Period relationships based on the cyclic behavior of the 12-month moving averages of monthly mean sunspot numbers for cycles 0.23, both in terms of Fisher's exact tests for 2x2 contingency tables and linear regression analyses. Concerning the Period-Amplitude relationship (same cycle), because cycle 23's maximum amplitude is known to be 120.8, the inferred regressions (90-percent prediction intervals) suggest that its period will be 131 +/- 24 months (using all cycles) or 131 +/- 18 months (ignoring cycles 2 and 4, which have the extremes of period, 108 and 164 months, respectively). Because cycle 23 has already persisted for 142 months (May 1996 through February 2008), based on the latter prediction, it should end before September 2008. Concerning the Amplitude-Period relationship (following cycle maximum amplitude versus preceding cycle period), because cycle 23's period is known to be at least 142 months, the inferred regressions (90-percent prediction intervals) suggest that cycle 24's maximum amplitude will be about less than or equal to 96.1 +/- 55.0 (using all cycle pairs) or less than or equal to 91.0 +/- 36.7 (ignoring statistical outlier cycle pairs). Hence, cycle 24's maximum amplitude is expected to be less than 151, perhaps even less than 128, unless cycle pair 23/24 proves to be a statistical outlier.
[Bayesian approach for the cost-effectiveness evaluation of healthcare technologies].
Berchialla, Paola; Gregori, Dario; Brunello, Franco; Veltri, Andrea; Petrinco, Michele; Pagano, Eva
2009-01-01
The development of Bayesian statistical methods for the assessment of the cost-effectiveness of health care technologies is reviewed. Although many studies adopt a frequentist approach, several authors have advocated the use of Bayesian methods in health economics. Emphasis has been placed on the advantages of the Bayesian approach, which include: (i) the ability to make more intuitive and meaningful inferences; (ii) the ability to tackle complex problems, such as allowing for the inclusion of patients who generate no cost, thanks to the availability of powerful computational algorithms; (iii) the importance of a full use of quantitative and structural prior information to produce realistic inferences. Much literature comparing the cost-effectiveness of two treatments is based on the incremental cost-effectiveness ratio. However, new methods are arising with the purpose of decision making. These methods are based on a net benefits approach. In the present context, the cost-effectiveness acceptability curves have been pointed out to be intrinsically Bayesian in their formulation. They plot the probability of a positive net benefit against the threshold cost of a unit increase in efficacy.A case study is presented in order to illustrate the Bayesian statistics in the cost-effectiveness analysis. Emphasis is placed on the cost-effectiveness acceptability curves. Advantages and disadvantages of the method described in this paper have been compared to frequentist methods and discussed.
Ghosh, Sujit K
2010-01-01
Bayesian methods are rapidly becoming popular tools for making statistical inference in various fields of science including biology, engineering, finance, and genetics. One of the key aspects of Bayesian inferential method is its logical foundation that provides a coherent framework to utilize not only empirical but also scientific information available to a researcher. Prior knowledge arising from scientific background, expert judgment, or previously collected data is used to build a prior distribution which is then combined with current data via the likelihood function to characterize the current state of knowledge using the so-called posterior distribution. Bayesian methods allow the use of models of complex physical phenomena that were previously too difficult to estimate (e.g., using asymptotic approximations). Bayesian methods offer a means of more fully understanding issues that are central to many practical problems by allowing researchers to build integrated models based on hierarchical conditional distributions that can be estimated even with limited amounts of data. Furthermore, advances in numerical integration methods, particularly those based on Monte Carlo methods, have made it possible to compute the optimal Bayes estimators. However, there is a reasonably wide gap between the background of the empirically trained scientists and the full weight of Bayesian statistical inference. Hence, one of the goals of this chapter is to bridge the gap by offering elementary to advanced concepts that emphasize linkages between standard approaches and full probability modeling via Bayesian methods.
Optimization of space system development resources
NASA Astrophysics Data System (ADS)
Kosmann, William J.; Sarkani, Shahram; Mazzuchi, Thomas
2013-06-01
NASA has had a decades-long problem with cost growth during the development of space science missions. Numerous agency-sponsored studies have produced average mission level cost growths ranging from 23% to 77%. A new study of 26 historical NASA Science instrument set developments using expert judgment to reallocate key development resources has an average cost growth of 73.77%. Twice in history, a barter-based mechanism has been used to reallocate key development resources during instrument development. The mean instrument set development cost growth was -1.55%. Performing a bivariate inference on the means of these two distributions, there is statistical evidence to support the claim that using a barter-based mechanism to reallocate key instrument development resources will result in a lower expected cost growth than using the expert judgment approach. Agent-based discrete event simulation is the natural way to model a trade environment. A NetLogo agent-based barter-based simulation of science instrument development was created. The agent-based model was validated against the Cassini historical example, as the starting and ending instrument development conditions are available. The resulting validated agent-based barter-based science instrument resource reallocation simulation was used to perform 300 instrument development simulations, using barter to reallocate development resources. The mean cost growth was -3.365%. A bivariate inference on the means was performed to determine that additional significant statistical evidence exists to support a claim that using barter-based resource reallocation will result in lower expected cost growth, with respect to the historical expert judgment approach. Barter-based key development resource reallocation should work on spacecraft development as well as it has worked on instrument development. A new study of 28 historical NASA science spacecraft developments has an average cost growth of 46.04%. As barter-based key development resource reallocation has never been tried in a spacecraft development, no historical results exist, and a simulation of using that approach must be developed. The instrument development simulation should be modified to account for spacecraft development market participant differences. The resulting agent-based barter-based spacecraft resource reallocation simulation would then be used to determine if significant statistical evidence exists to prove a claim that using barter-based resource reallocation will result in lower expected cost growth.
Data Acquisition and Preprocessing in Studies on Humans: What Is Not Taught in Statistics Classes?
Zhu, Yeyi; Hernandez, Ladia M; Mueller, Peter; Dong, Yongquan; Forman, Michele R
2013-01-01
The aim of this paper is to address issues in research that may be missing from statistics classes and important for (bio-)statistics students. In the context of a case study, we discuss data acquisition and preprocessing steps that fill the gap between research questions posed by subject matter scientists and statistical methodology for formal inference. Issues include participant recruitment, data collection training and standardization, variable coding, data review and verification, data cleaning and editing, and documentation. Despite the critical importance of these details in research, most of these issues are rarely discussed in an applied statistics program. One reason for the lack of more formal training is the difficulty in addressing the many challenges that can possibly arise in the course of a study in a systematic way. This article can help to bridge this gap between research questions and formal statistical inference by using an illustrative case study for a discussion. We hope that reading and discussing this paper and practicing data preprocessing exercises will sensitize statistics students to these important issues and achieve optimal conduct, quality control, analysis, and interpretation of a study.
Tang, Yongqiang
2017-12-01
Control-based pattern mixture models (PMM) and delta-adjusted PMMs are commonly used as sensitivity analyses in clinical trials with non-ignorable dropout. These PMMs assume that the statistical behavior of outcomes varies by pattern in the experimental arm in the imputation procedure, but the imputed data are typically analyzed by a standard method such as the primary analysis model. In the multiple imputation (MI) inference, Rubin's variance estimator is generally biased when the imputation and analysis models are uncongenial. One objective of the article is to quantify the bias of Rubin's variance estimator in the control-based and delta-adjusted PMMs for longitudinal continuous outcomes. These PMMs assume the same observed data distribution as the mixed effects model for repeated measures (MMRM). We derive analytic expressions for the MI treatment effect estimator and the associated Rubin's variance in these PMMs and MMRM as functions of the maximum likelihood estimator from the MMRM analysis and the observed proportion of subjects in each dropout pattern when the number of imputations is infinite. The asymptotic bias is generally small or negligible in the delta-adjusted PMM, but can be sizable in the control-based PMM. This indicates that the inference based on Rubin's rule is approximately valid in the delta-adjusted PMM. A simple variance estimator is proposed to ensure asymptotically valid MI inferences in these PMMs, and compared with the bootstrap variance. The proposed method is illustrated by the analysis of an antidepressant trial, and its performance is further evaluated via a simulation study. © 2017, The International Biometric Society.
An Artificial Intelligence Approach to Analyzing Student Errors in Statistics.
ERIC Educational Resources Information Center
Sebrechts, Marc M.; Schooler, Lael J.
1987-01-01
Describes the development of an artificial intelligence system called GIDE that analyzes student errors in statistics problems by inferring the students' intentions. Learning strategies involved in problem solving are discussed and the inclusion of goal structures is explained. (LRW)
Network inference using informative priors.
Mukherjee, Sach; Speed, Terence P
2008-09-23
Recent years have seen much interest in the study of systems characterized by multiple interacting components. A class of statistical models called graphical models, in which graphs are used to represent probabilistic relationships between variables, provides a framework for formal inference regarding such systems. In many settings, the object of inference is the network structure itself. This problem of "network inference" is well known to be a challenging one. However, in scientific settings there is very often existing information regarding network connectivity. A natural idea then is to take account of such information during inference. This article addresses the question of incorporating prior information into network inference. We focus on directed models called Bayesian networks, and use Markov chain Monte Carlo to draw samples from posterior distributions over network structures. We introduce prior distributions on graphs capable of capturing information regarding network features including edges, classes of edges, degree distributions, and sparsity. We illustrate our approach in the context of systems biology, applying our methods to network inference in cancer signaling.
A Survey of Statistical Models for Reverse Engineering Gene Regulatory Networks
Huang, Yufei; Tienda-Luna, Isabel M.; Wang, Yufeng
2009-01-01
Statistical models for reverse engineering gene regulatory networks are surveyed in this article. To provide readers with a system-level view of the modeling issues in this research, a graphical modeling framework is proposed. This framework serves as the scaffolding on which the review of different models can be systematically assembled. Based on the framework, we review many existing models for many aspects of gene regulation; the pros and cons of each model are discussed. In addition, network inference algorithms are also surveyed under the graphical modeling framework by the categories of point solutions and probabilistic solutions and the connections and differences among the algorithms are provided. This survey has the potential to elucidate the development and future of reverse engineering GRNs and bring statistical signal processing closer to the core of this research. PMID:20046885
Multiple hot-deck imputation for network inference from RNA sequencing data.
Imbert, Alyssa; Valsesia, Armand; Le Gall, Caroline; Armenise, Claudia; Lefebvre, Gregory; Gourraud, Pierre-Antoine; Viguerie, Nathalie; Villa-Vialaneix, Nathalie
2018-05-15
Network inference provides a global view of the relations existing between gene expression in a given transcriptomic experiment (often only for a restricted list of chosen genes). However, it is still a challenging problem: even if the cost of sequencing techniques has decreased over the last years, the number of samples in a given experiment is still (very) small compared to the number of genes. We propose a method to increase the reliability of the inference when RNA-seq expression data have been measured together with an auxiliary dataset that can provide external information on gene expression similarity between samples. Our statistical approach, hd-MI, is based on imputation for samples without available RNA-seq data that are considered as missing data but are observed on the secondary dataset. hd-MI can improve the reliability of the inference for missing rates up to 30% and provides more stable networks with a smaller number of false positive edges. On a biological point of view, hd-MI was also found relevant to infer networks from RNA-seq data acquired in adipose tissue during a nutritional intervention in obese individuals. In these networks, novel links between genes were highlighted, as well as an improved comparability between the two steps of the nutritional intervention. Software and sample data are available as an R package, RNAseqNet, that can be downloaded from the Comprehensive R Archive Network (CRAN). alyssa.imbert@inra.fr or nathalie.villa-vialaneix@inra.fr. Supplementary data are available at Bioinformatics online.
On Determining the Rise, Size, and Duration Classes of a Sunspot Cycle
NASA Astrophysics Data System (ADS)
Wilson, Robert M.; Hathaway, David H.; Reichmann, Edwin J.
1996-09-01
The behavior of ascent duration, maximum amplitude, and period for cycles 1 to 21 suggests that they are not mutually independent. Analysis of the resultant three-dimensional contingency table for cycles divided according to rise time (ascent duration), size (maximum amplitude), and duration (period) yields a chi-square statistic (= 18.59) that is larger than the test statistic (= 9.49 for 4 degrees-of-freedom at the 5-percent level of significance), thereby, inferring that the null hypothesis (mutual independence) can be rejected. Analysis of individual 2 by 2 contingency tables (based on Fisher's exact test) for these parameters shows that, while ascent duration is strongly related to maximum amplitude in the negative sense (inverse correlation) - the Waldmeier effect, it also is related (marginally) to period, but in the positive sense (direct correlation). No significant (or marginally significant) correlation is found between period and maximum amplitude. Using cycle 22 as a test case, we show that by the 12th month following conventional onset, cycle 22 appeared highly likely to be a fast-rising, larger-than-average-size cycle. Because of the inferred correlation between ascent duration and period, it also seems likely that it will have a period shorter than average length.
Wijeysundera, Duminda N; Austin, Peter C; Hux, Janet E; Beattie, W Scott; Laupacis, Andreas
2009-01-01
Randomized trials generally use "frequentist" statistics based on P-values and 95% confidence intervals. Frequentist methods have limitations that might be overcome, in part, by Bayesian inference. To illustrate these advantages, we re-analyzed randomized trials published in four general medical journals during 2004. We used Medline to identify randomized superiority trials with two parallel arms, individual-level randomization and dichotomous or time-to-event primary outcomes. Studies with P<0.05 in favor of the intervention were deemed "positive"; otherwise, they were "negative." We used several prior distributions and exact conjugate analyses to calculate Bayesian posterior probabilities for clinically relevant effects. Of 88 included studies, 39 were positive using a frequentist analysis. Although the Bayesian posterior probabilities of any benefit (relative risk or hazard ratio<1) were high in positive studies, these probabilities were lower and variable for larger benefits. The positive studies had only moderate probabilities for exceeding the effects that were assumed for calculating the sample size. By comparison, there were moderate probabilities of any benefit in negative studies. Bayesian and frequentist analyses complement each other when interpreting the results of randomized trials. Future reports of randomized trials should include both.
Language learners privilege structured meaning over surface frequency
Culbertson, Jennifer; Adger, David
2014-01-01
Although it is widely agreed that learning the syntax of natural languages involves acquiring structure-dependent rules, recent work on acquisition has nevertheless attempted to characterize the outcome of learning primarily in terms of statistical generalizations about surface distributional information. In this paper we investigate whether surface statistical knowledge or structural knowledge of English is used to infer properties of a novel language under conditions of impoverished input. We expose learners to artificial-language patterns that are equally consistent with two possible underlying grammars—one more similar to English in terms of the linear ordering of words, the other more similar on abstract structural grounds. We show that learners’ grammatical inferences overwhelmingly favor structural similarity over preservation of superficial order. Importantly, the relevant shared structure can be characterized in terms of a universal preference for isomorphism in the mapping from meanings to utterances. Whereas previous empirical support for this universal has been based entirely on data from cross-linguistic language samples, our results suggest it may reflect a deep property of the human cognitive system—a property that, together with other structure-sensitive principles, constrains the acquisition of linguistic knowledge. PMID:24706789
A model-based approach to wildland fire reconstruction using sediment charcoal records
Itter, Malcolm S.; Finley, Andrew O.; Hooten, Mevin B.; Higuera, Philip E.; Marlon, Jennifer R.; Kelly, Ryan; McLachlan, Jason S.
2017-01-01
Lake sediment charcoal records are used in paleoecological analyses to reconstruct fire history, including the identification of past wildland fires. One challenge of applying sediment charcoal records to infer fire history is the separation of charcoal associated with local fire occurrence and charcoal originating from regional fire activity. Despite a variety of methods to identify local fires from sediment charcoal records, an integrated statistical framework for fire reconstruction is lacking. We develop a Bayesian point process model to estimate the probability of fire associated with charcoal counts from individual-lake sediments and estimate mean fire return intervals. A multivariate extension of the model combines records from multiple lakes to reduce uncertainty in local fire identification and estimate a regional mean fire return interval. The univariate and multivariate models are applied to 13 lakes in the Yukon Flats region of Alaska. Both models resulted in similar mean fire return intervals (100–350 years) with reduced uncertainty under the multivariate model due to improved estimation of regional charcoal deposition. The point process model offers an integrated statistical framework for paleofire reconstruction and extends existing methods to infer regional fire history from multiple lake records with uncertainty following directly from posterior distributions.
On Determining the Rise, Size, and Duration Classes of a Sunspot Cycle
NASA Technical Reports Server (NTRS)
Wilson, Robert M.; Hathaway, David H.; Reichmann, Edwin J.
1996-01-01
The behavior of ascent duration, maximum amplitude, and period for cycles 1 to 21 suggests that they are not mutually independent. Analysis of the resultant three-dimensional contingency table for cycles divided according to rise time (ascent duration), size (maximum amplitude), and duration (period) yields a chi-square statistic (= 18.59) that is larger than the test statistic (= 9.49 for 4 degrees-of-freedom at the 5-percent level of significance), thereby, inferring that the null hypothesis (mutual independence) can be rejected. Analysis of individual 2 by 2 contingency tables (based on Fisher's exact test) for these parameters shows that, while ascent duration is strongly related to maximum amplitude in the negative sense (inverse correlation) - the Waldmeier effect, it also is related (marginally) to period, but in the positive sense (direct correlation). No significant (or marginally significant) correlation is found between period and maximum amplitude. Using cycle 22 as a test case, we show that by the 12th month following conventional onset, cycle 22 appeared highly likely to be a fast-rising, larger-than-average-size cycle. Because of the inferred correlation between ascent duration and period, it also seems likely that it will have a period shorter than average length.
Görgen, Kai; Hebart, Martin N; Allefeld, Carsten; Haynes, John-Dylan
2017-12-27
Standard neuroimaging data analysis based on traditional principles of experimental design, modelling, and statistical inference is increasingly complemented by novel analysis methods, driven e.g. by machine learning methods. While these novel approaches provide new insights into neuroimaging data, they often have unexpected properties, generating a growing literature on possible pitfalls. We propose to meet this challenge by adopting a habit of systematic testing of experimental design, analysis procedures, and statistical inference. Specifically, we suggest to apply the analysis method used for experimental data also to aspects of the experimental design, simulated confounds, simulated null data, and control data. We stress the importance of keeping the analysis method the same in main and test analyses, because only this way possible confounds and unexpected properties can be reliably detected and avoided. We describe and discuss this Same Analysis Approach in detail, and demonstrate it in two worked examples using multivariate decoding. With these examples, we reveal two sources of error: A mismatch between counterbalancing (crossover designs) and cross-validation which leads to systematic below-chance accuracies, and linear decoding of a nonlinear effect, a difference in variance. Copyright © 2017 Elsevier Inc. All rights reserved.
Li, Changyang; Wang, Xiuying; Eberl, Stefan; Fulham, Michael; Yin, Yong; Dagan Feng, David
2015-01-01
Automated and general medical image segmentation can be challenging because the foreground and the background may have complicated and overlapping density distributions in medical imaging. Conventional region-based level set algorithms often assume piecewise constant or piecewise smooth for segments, which are implausible for general medical image segmentation. Furthermore, low contrast and noise make identification of the boundaries between foreground and background difficult for edge-based level set algorithms. Thus, to address these problems, we suggest a supervised variational level set segmentation model to harness the statistical region energy functional with a weighted probability approximation. Our approach models the region density distributions by using the mixture-of-mixtures Gaussian model to better approximate real intensity distributions and distinguish statistical intensity differences between foreground and background. The region-based statistical model in our algorithm can intuitively provide better performance on noisy images. We constructed a weighted probability map on graphs to incorporate spatial indications from user input with a contextual constraint based on the minimization of contextual graphs energy functional. We measured the performance of our approach on ten noisy synthetic images and 58 medical datasets with heterogeneous intensities and ill-defined boundaries and compared our technique to the Chan-Vese region-based level set model, the geodesic active contour model with distance regularization, and the random walker model. Our method consistently achieved the highest Dice similarity coefficient when compared to the other methods.
Reaction Time in Grade 5: Data Collection within the Practice of Statistics
ERIC Educational Resources Information Center
Watson, Jane; English, Lyn
2017-01-01
This study reports on a classroom activity for Grade 5 students investigating their reaction times. The investigation was part of a 3-year research project introducing students to informal inference and giving them experience carrying out the practice of statistics. For this activity the focus within the practice of statistics was on introducing…
ERIC Educational Resources Information Center
Bakker, Arthur; Ben-Zvi, Dani; Makar, Katie
2017-01-01
To understand how statistical and other types of reasoning are coordinated with actions to reduce uncertainty, we conducted a case study in vocational education that involved statistical hypothesis testing. We analyzed an intern's research project in a hospital laboratory in which reducing uncertainties was crucial to make a valid statistical…
ERIC Educational Resources Information Center
Honda, Hidehito; Matsuka, Toshihiko; Ueda, Kazuhiro
2017-01-01
Some researchers on binary choice inference have argued that people make inferences based on simple heuristics, such as recognition, fluency, or familiarity. Others have argued that people make inferences based on available knowledge. To examine the boundary between heuristic and knowledge usage, we examine binary choice inference processes in…
Causal inference in biology networks with integrated belief propagation.
Chang, Rui; Karr, Jonathan R; Schadt, Eric E
2015-01-01
Inferring causal relationships among molecular and higher order phenotypes is a critical step in elucidating the complexity of living systems. Here we propose a novel method for inferring causality that is no longer constrained by the conditional dependency arguments that limit the ability of statistical causal inference methods to resolve causal relationships within sets of graphical models that are Markov equivalent. Our method utilizes Bayesian belief propagation to infer the responses of perturbation events on molecular traits given a hypothesized graph structure. A distance measure between the inferred response distribution and the observed data is defined to assess the 'fitness' of the hypothesized causal relationships. To test our algorithm, we infer causal relationships within equivalence classes of gene networks in which the form of the functional interactions that are possible are assumed to be nonlinear, given synthetic microarray and RNA sequencing data. We also apply our method to infer causality in real metabolic network with v-structure and feedback loop. We show that our method can recapitulate the causal structure and recover the feedback loop only from steady-state data which conventional method cannot.
Gene-network inference by message passing
NASA Astrophysics Data System (ADS)
Braunstein, A.; Pagnani, A.; Weigt, M.; Zecchina, R.
2008-01-01
The inference of gene-regulatory processes from gene-expression data belongs to the major challenges of computational systems biology. Here we address the problem from a statistical-physics perspective and develop a message-passing algorithm which is able to infer sparse, directed and combinatorial regulatory mechanisms. Using the replica technique, the algorithmic performance can be characterized analytically for artificially generated data. The algorithm is applied to genome-wide expression data of baker's yeast under various environmental conditions. We find clear cases of combinatorial control, and enrichment in common functional annotations of regulated genes and their regulators.
Statistical Inference for Big Data Problems in Molecular Biophysics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ramanathan, Arvind; Savol, Andrej; Burger, Virginia
2012-01-01
We highlight the role of statistical inference techniques in providing biological insights from analyzing long time-scale molecular simulation data. Technologi- cal and algorithmic improvements in computation have brought molecular simu- lations to the forefront of techniques applied to investigating the basis of living systems. While these longer simulations, increasingly complex reaching petabyte scales presently, promise a detailed view into microscopic behavior, teasing out the important information has now become a true challenge on its own. Mining this data for important patterns is critical to automating therapeutic intervention discovery, improving protein design, and fundamentally understanding the mech- anistic basis of cellularmore » homeostasis.« less
Tropical geometry of statistical models.
Pachter, Lior; Sturmfels, Bernd
2004-11-16
This article presents a unified mathematical framework for inference in graphical models, building on the observation that graphical models are algebraic varieties. From this geometric viewpoint, observations generated from a model are coordinates of a point in the variety, and the sum-product algorithm is an efficient tool for evaluating specific coordinates. Here, we address the question of how the solutions to various inference problems depend on the model parameters. The proposed answer is expressed in terms of tropical algebraic geometry. The Newton polytope of a statistical model plays a key role. Our results are applied to the hidden Markov model and the general Markov model on a binary tree.
Salehi, Sohrab; Steif, Adi; Roth, Andrew; Aparicio, Samuel; Bouchard-Côté, Alexandre; Shah, Sohrab P
2017-03-01
Next-generation sequencing (NGS) of bulk tumour tissue can identify constituent cell populations in cancers and measure their abundance. This requires computational deconvolution of allelic counts from somatic mutations, which may be incapable of fully resolving the underlying population structure. Single cell sequencing (SCS) is a more direct method, although its replacement of NGS is impeded by technical noise and sampling limitations. We propose ddClone, which analytically integrates NGS and SCS data, leveraging their complementary attributes through joint statistical inference. We show on real and simulated datasets that ddClone produces more accurate results than can be achieved by either method alone.
The Heuristic Value of p in Inductive Statistical Inference
Krueger, Joachim I.; Heck, Patrick R.
2017-01-01
Many statistical methods yield the probability of the observed data – or data more extreme – under the assumption that a particular hypothesis is true. This probability is commonly known as ‘the’ p-value. (Null Hypothesis) Significance Testing ([NH]ST) is the most prominent of these methods. The p-value has been subjected to much speculation, analysis, and criticism. We explore how well the p-value predicts what researchers presumably seek: the probability of the hypothesis being true given the evidence, and the probability of reproducing significant results. We also explore the effect of sample size on inferential accuracy, bias, and error. In a series of simulation experiments, we find that the p-value performs quite well as a heuristic cue in inductive inference, although there are identifiable limits to its usefulness. We conclude that despite its general usefulness, the p-value cannot bear the full burden of inductive inference; it is but one of several heuristic cues available to the data analyst. Depending on the inferential challenge at hand, investigators may supplement their reports with effect size estimates, Bayes factors, or other suitable statistics, to communicate what they think the data say. PMID:28649206
Subjective randomness as statistical inference.
Griffiths, Thomas L; Daniels, Dylan; Austerweil, Joseph L; Tenenbaum, Joshua B
2018-06-01
Some events seem more random than others. For example, when tossing a coin, a sequence of eight heads in a row does not seem very random. Where do these intuitions about randomness come from? We argue that subjective randomness can be understood as the result of a statistical inference assessing the evidence that an event provides for having been produced by a random generating process. We show how this account provides a link to previous work relating randomness to algorithmic complexity, in which random events are those that cannot be described by short computer programs. Algorithmic complexity is both incomputable and too general to capture the regularities that people can recognize, but viewing randomness as statistical inference provides two paths to addressing these problems: considering regularities generated by simpler computing machines, and restricting the set of probability distributions that characterize regularity. Building on previous work exploring these different routes to a more restricted notion of randomness, we define strong quantitative models of human randomness judgments that apply not just to binary sequences - which have been the focus of much of the previous work on subjective randomness - but also to binary matrices and spatial clustering. Copyright © 2018 Elsevier Inc. All rights reserved.
The Heuristic Value of p in Inductive Statistical Inference.
Krueger, Joachim I; Heck, Patrick R
2017-01-01
Many statistical methods yield the probability of the observed data - or data more extreme - under the assumption that a particular hypothesis is true. This probability is commonly known as 'the' p -value. (Null Hypothesis) Significance Testing ([NH]ST) is the most prominent of these methods. The p -value has been subjected to much speculation, analysis, and criticism. We explore how well the p -value predicts what researchers presumably seek: the probability of the hypothesis being true given the evidence, and the probability of reproducing significant results. We also explore the effect of sample size on inferential accuracy, bias, and error. In a series of simulation experiments, we find that the p -value performs quite well as a heuristic cue in inductive inference, although there are identifiable limits to its usefulness. We conclude that despite its general usefulness, the p -value cannot bear the full burden of inductive inference; it is but one of several heuristic cues available to the data analyst. Depending on the inferential challenge at hand, investigators may supplement their reports with effect size estimates, Bayes factors, or other suitable statistics, to communicate what they think the data say.
Statistical inference of seabed sound-speed structure in the Gulf of Oman Basin.
Sagers, Jason D; Knobles, David P
2014-06-01
Addressed is the statistical inference of the sound-speed depth profile of a thick soft seabed from broadband sound propagation data recorded in the Gulf of Oman Basin in 1977. The acoustic data are in the form of time series signals recorded on a sparse vertical line array and generated by explosive sources deployed along a 280 km track. The acoustic data offer a unique opportunity to study a deep-water bottom-limited thickly sedimented environment because of the large number of time series measurements, very low seabed attenuation, and auxiliary measurements. A maximum entropy method is employed to obtain a conditional posterior probability distribution (PPD) for the sound-speed ratio and the near-surface sound-speed gradient. The multiple data samples allow for a determination of the average error constraint value required to uniquely specify the PPD for each data sample. Two complicating features of the statistical inference study are addressed: (1) the need to develop an error function that can both utilize the measured multipath arrival structure and mitigate the effects of data errors and (2) the effect of small bathymetric slopes on the structure of the bottom interacting arrivals.
Advances in Bayesian Modeling in Educational Research
ERIC Educational Resources Information Center
Levy, Roy
2016-01-01
In this article, I provide a conceptually oriented overview of Bayesian approaches to statistical inference and contrast them with frequentist approaches that currently dominate conventional practice in educational research. The features and advantages of Bayesian approaches are illustrated with examples spanning several statistical modeling…
Incorporating GIS and remote sensing for census population disaggregation
NASA Astrophysics Data System (ADS)
Wu, Shuo-Sheng'derek'
Census data are the primary source of demographic data for a variety of researches and applications. For confidentiality issues and administrative purposes, census data are usually released to the public by aggregated areal units. In the United States, the smallest census unit is census blocks. Due to data aggregation, users of census data may have problems in visualizing population distribution within census blocks and estimating population counts for areas not coinciding with census block boundaries. The main purpose of this study is to develop methodology for estimating sub-block areal populations and assessing the estimation errors. The City of Austin, Texas was used as a case study area. Based on tax parcel boundaries and parcel attributes derived from ancillary GIS and remote sensing data, detailed urban land use classes were first classified using a per-field approach. After that, statistical models by land use classes were built to infer population density from other predictor variables, including four census demographic statistics (the Hispanic percentage, the married percentage, the unemployment rate, and per capita income) and three physical variables derived from remote sensing images and building footprints vector data (a landscape heterogeneity statistics, a building pattern statistics, and a building volume statistics). In addition to statistical models, deterministic models were proposed to directly infer populations from building volumes and three housing statistics, including the average space per housing unit, the housing unit occupancy rate, and the average household size. After population models were derived or proposed, how well the models predict populations for another set of sample blocks was assessed. The results show that deterministic models were more accurate than statistical models. Further, by simulating the base unit for modeling from aggregating blocks, I assessed how well the deterministic models estimate sub-unit-level populations. I also assessed the aggregation effects and the resealing effects on sub-unit estimates. Lastly, from another set of mixed-land-use sample blocks, a mixed-land-use model was derived and compared with a residential-land-use model. The results of per-field land use classification are satisfactory with a Kappa accuracy statistics of 0.747. Model Assessments by land use show that population estimates for multi-family land use areas have higher errors than those for single-family land use areas, and population estimates for mixed land use areas have higher errors than those for residential land use areas. The assessments of sub-unit estimates using a simulation approach indicate that smaller areas show higher estimation errors, estimation errors do not relate to the base unit size, and resealing improves all levels of sub-unit estimates.
“Magnitude-based Inference”: A Statistical Review
Welsh, Alan H.; Knight, Emma J.
2015-01-01
ABSTRACT Purpose We consider “magnitude-based inference” and its interpretation by examining in detail its use in the problem of comparing two means. Methods We extract from the spreadsheets, which are provided to users of the analysis (http://www.sportsci.org/), a precise description of how “magnitude-based inference” is implemented. We compare the implemented version of the method with general descriptions of it and interpret the method in familiar statistical terms. Results and Conclusions We show that “magnitude-based inference” is not a progressive improvement on modern statistics. The additional probabilities introduced are not directly related to the confidence interval but, rather, are interpretable either as P values for two different nonstandard tests (for different null hypotheses) or as approximate Bayesian calculations, which also lead to a type of test. We also discuss sample size calculations associated with “magnitude-based inference” and show that the substantial reduction in sample sizes claimed for the method (30% of the sample size obtained from standard frequentist calculations) is not justifiable so the sample size calculations should not be used. Rather than using “magnitude-based inference,” a better solution is to be realistic about the limitations of the data and use either confidence intervals or a fully Bayesian analysis. PMID:25051387
Statistical Inference for Quality-Adjusted Survival Time
2003-08-01
survival functions of QAL. If an influence function for a test statistic exists for complete data case, denoted as ’i, then a test statistic for...the survival function for the censoring variable. Zhao and Tsiatis (2001) proposed a test statistic where O is the influence function of the general...to 1 everywhere until a subject’s death. We have considered other forms of test statistics. One option is to use an influence function 0i that is
Infinite von Mises-Fisher Mixture Modeling of Whole Brain fMRI Data.
Røge, Rasmus E; Madsen, Kristoffer H; Schmidt, Mikkel N; Mørup, Morten
2017-10-01
Cluster analysis of functional magnetic resonance imaging (fMRI) data is often performed using gaussian mixture models, but when the time series are standardized such that the data reside on a hypersphere, this modeling assumption is questionable. The consequences of ignoring the underlying spherical manifold are rarely analyzed, in part due to the computational challenges imposed by directional statistics. In this letter, we discuss a Bayesian von Mises-Fisher (vMF) mixture model for data on the unit hypersphere and present an efficient inference procedure based on collapsed Markov chain Monte Carlo sampling. Comparing the vMF and gaussian mixture models on synthetic data, we demonstrate that the vMF model has a slight advantage inferring the true underlying clustering when compared to gaussian-based models on data generated from both a mixture of vMFs and a mixture of gaussians subsequently normalized. Thus, when performing model selection, the two models are not in agreement. Analyzing multisubject whole brain resting-state fMRI data from healthy adult subjects, we find that the vMF mixture model is considerably more reliable than the gaussian mixture model when comparing solutions across models trained on different groups of subjects, and again we find that the two models disagree on the optimal number of components. The analysis indicates that the fMRI data support more than a thousand clusters, and we confirm this is not a result of overfitting by demonstrating better prediction on data from held-out subjects. Our results highlight the utility of using directional statistics to model standardized fMRI data and demonstrate that whole brain segmentation of fMRI data requires a very large number of functional units in order to adequately account for the discernible statistical patterns in the data.
Nagy, László G; Urban, Alexander; Orstadius, Leif; Papp, Tamás; Larsson, Ellen; Vágvölgyi, Csaba
2010-12-01
Recently developed comparative phylogenetic methods offer a wide spectrum of applications in evolutionary biology, although it is generally accepted that their statistical properties are incompletely known. Here, we examine and compare the statistical power of the ML and Bayesian methods with regard to selection of best-fit models of fruiting-body evolution and hypothesis testing of ancestral states on a real-life data set of a physiological trait (autodigestion) in the family Psathyrellaceae. Our phylogenies are based on the first multigene data set generated for the family. Two different coding regimes (binary and multistate) and two data sets differing in taxon sampling density are examined. The Bayesian method outperformed Maximum Likelihood with regard to statistical power in all analyses. This is particularly evident if the signal in the data is weak, i.e. in cases when the ML approach does not provide support to choose among competing hypotheses. Results based on binary and multistate coding differed only modestly, although it was evident that multistate analyses were less conclusive in all cases. It seems that increased taxon sampling density has favourable effects on inference of ancestral states, while model parameters are influenced to a smaller extent. The model best fitting our data implies that the rate of losses of deliquescence equals zero, although model selection in ML does not provide proper support to reject three of the four candidate models. The results also support the hypothesis that non-deliquescence (lack of autodigestion) has been ancestral in Psathyrellaceae, and that deliquescent fruiting bodies represent the preferred state, having evolved independently several times during evolution. Copyright © 2010 Elsevier Inc. All rights reserved.